Data from online sources is gaining importance across the social sciences: citizens, parties, organisations, and states are present online, communicate there, and leave traces of their activities. Even traditional ‘offline’ data such as election results or press releases are increasingly available online, lowering the costs of data collection. Web and social media data thus provide a rich source for addressing many political science questions. This course will give you an overview of these new sources of data and of the computational tools required to collect the data, clean it, and organise it in a format amenable to statistical analysis.
The course will give you an applied overview of the skills required to automatically collect data from the web, with many exercises to complete during class.
The afternoon of the first day will introduce the basics of webscraping, starting with an introduction to the structure of web pages. We will learn how to download files (PDF, Excel, etc.) and scrape data from simple web pages and tables (e.g. electoral results or Wikipedia tables), and we will practice how to process these files in R. In the second session, we will learn how to select the data we need using regular expressions and web page formatting (CSS selectors and XPath).
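To give a flavour of these first exercises, here is a minimal sketch in R using the rvest package; the Wikipedia page and the CSS selector are purely illustrative:

```r
# Download a page and parse its tables into data frames with rvest.
library(rvest)

url <- "https://en.wikipedia.org/wiki/List_of_current_heads_of_state_and_government"
page <- read_html(url)

# Select tables via a CSS selector and convert them to data frames
tables <- html_elements(page, "table.wikitable") |> html_table()
head(tables[[1]])
```

html_table() returns one data frame per matched table, which can then be cleaned and analysed with the usual R tools.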
The morning of the second day will focus on automating what we learned during the first day: we will discuss how to implement loops and functions in R to gather data more efficiently. This is particularly useful when scraping data that is spread across multiple pages and requires following many hyperlinks. The course will also give you an overview of more advanced webscraping techniques and guidance on when to use them. These include APIs (through which institutions, newspapers, and organisations increasingly make their data available for research), RSS feeds (used primarily by news sites), the APIs of social media platforms, and the scraping of dynamic pages (that is, pages that reload after scrolling or change upon user interaction) with RSelenium.
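As a sketch of the automation step, one might wrap the scraping code in a function and loop over a sequence of result pages; the URL pattern and selector below are assumptions for illustration only:

```r
# Sketch: collect the same table from several numbered pages.
library(rvest)

scrape_page <- function(i) {
  url <- paste0("https://example.org/results?page=", i)
  read_html(url) |>
    html_element("table") |>
    html_table()
}

# Apply the function to each page number, then bind the results together
results <- lapply(1:5, scrape_page)
all_results <- do.call(rbind, results)
```

In practice, a polite delay between requests (e.g. with Sys.sleep()) is usually added inside the loop.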
Finally, in the afternoon, we will practice how to implement reproducible and reusable scraping workflows. Especially when gathering data from different sources or repeatedly updating a data collection, getting organised makes your work more efficient. This session will also give you the opportunity to put your new skills into practice.
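One simple way to keep such a workflow reproducible, sketched here with illustrative names, is to cache the raw downloads on disk and parse from the saved copies:

```r
# Sketch of a reproducible scraping step: store the raw HTML locally,
# then parse from the saved file rather than the live site.
library(rvest)

fetch_and_store <- function(url, file) {
  if (!file.exists(file)) {        # avoid re-downloading on reruns
    download.file(url, file, quiet = TRUE)
  }
  read_html(file)
}
```

Because the raw pages are cached, the parsing code can be rerun and refined without hitting the website again.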
All sessions will combine a lecture with practical exercises that apply the techniques learned to new data sources. While we will not focus on data analysis, we will use simple methods to make sense of the data we gather.
Regardless of your previous experience with automated data collection, you will leave this course with a solid understanding of the web and social media data available for social science research, and you will know how to process it for your own research.