Data from online sources is gaining importance across the social sciences. Citizens, parties, organisations, and states are present online, communicate there, and leave traces of their activities. Even traditional ‘offline’ data such as election results or press releases are increasingly available online, lowering the costs of data collection. Web and social media data therefore provide a rich source of material for addressing many political science questions.
This course gives you an overview of the new sources of data now available and of the computational tools required to collect, clean, and organise them in a format amenable to statistical analysis.
The first day introduces the basics of webscraping, building on an introduction to webpage structures. In the first session, we learn how to download files (PDF, Excel, etc.) and scrape data from simple webpages and tables (e.g. electoral results or Wikipedia tables). We also practice how to process these files in R. In the second session, we learn how to select the data we need, using regular expressions and webpage formatting (CSS selectors and XPath).
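To give a flavour of what this looks like in practice, here is a minimal sketch using the rvest and stringr packages; the URL, selector, and example text are placeholders for illustration, not course materials.

```r
# Minimal scraping sketch with rvest (URL and selector are illustrative)
library(rvest)
library(stringr)

# Download and parse a page, e.g. a Wikipedia article containing a table
page <- read_html("https://en.wikipedia.org/wiki/Example")

# Select the first table via a CSS selector and turn it into a data frame;
# html_element(page, xpath = "//table") would do the same with XPath
results <- page |>
  html_element("table") |>
  html_table()

# Regular expressions then help isolate patterns in scraped text
str_extract("Turnout: 64.2%", "\\d+\\.\\d+")
#> [1] "64.2"
```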
The second day is dedicated to gathering data through requests. Increasingly, institutions, newspapers, and organisations make their data available for research through databases or APIs. In the first session, we learn how to query online databases automatically. In the second session, we focus on APIs and RSS feeds. We learn how to formulate queries to APIs using HTTP and how to read the resulting data, which often come in JSON or XML format. This also lays the groundwork for the fourth day’s focus on social media, since many social media platforms use similar request structures and data formats.
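As a hedged sketch of what an API query looks like with the httr and jsonlite packages (the endpoint and parameters below are invented for illustration):

```r
# Sketch of an HTTP API query (the endpoint and parameters are made up)
library(httr)
library(jsonlite)

# Formulate a GET request; `query` becomes the ?q=...&format=... part of the URL
resp <- GET("https://api.example.org/v1/articles",
            query = list(q = "election", format = "json"))

# Fail early on HTTP errors, then parse the JSON body into R objects
stop_for_status(resp)
articles <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
```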
The third day focuses on automating what we have learned during the first two days. In the first session, we discuss how to implement loops and functions in R to gather data more efficiently. This is particularly useful when scraping data that is spread across multiple pages and requires following many hyperlinks (see the sketches below). In the second session, we focus on scraping dynamic pages with RSelenium; that is, pages that reload after scrolling or change upon user interaction. We learn how to steer a browser from within R to automate these interactions.
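The looping pattern from the first session might look like the following sketch; the URLs are hypothetical and scrape_page() is a helper defined here purely for illustration.

```r
# Wrap the scraping logic in a function, then apply it over many pages
library(rvest)

scrape_page <- function(url) {
  read_html(url) |>
    html_element("table") |>
    html_table()
}

# Hypothetical paginated results, e.g. one page per district
urls <- sprintf("https://example.org/results?page=%d", 1:10)
tables <- lapply(urls, scrape_page)
all_results <- do.call(rbind, tables)
```

For dynamic pages, a minimal RSelenium sketch might look as follows, assuming a working Selenium setup; the browser choice, port, and URL are placeholders.

```r
# Steer a real browser from R to trigger content that loads on interaction
library(RSelenium)

driver <- rsDriver(browser = "firefox", port = 4545L)
remote <- driver$client

remote$navigate("https://example.org/dynamic-page")
# Scroll to the bottom so lazily loaded content appears before we read the page
remote$executeScript("window.scrollTo(0, document.body.scrollHeight);")
html <- remote$getPageSource()[[1]]

remote$close()
driver$server$stop()
```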
The fourth day is devoted to social media data. The first session gives an overview of available social media data, also highlighting lesser-known platforms. The second session focuses on Twitter. We will use the Streaming API, which collects tweets filtered by keywords and locations, as well as the REST API, which collects the tweets of specific users. NB: several social media platforms are revising their rules on data access, so the exact content of this day may change closer to the time.
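For orientation, collecting Twitter data in R has typically looked like the sketch below, using the rtweet package. Given the changing access rules mentioned above, treat this as illustrative only: it presumes valid API credentials, and the keyword and user name are placeholders.

```r
# Illustrative rtweet calls; both require authenticated Twitter API access
library(rtweet)

# Streaming API: collect tweets matching a keyword for 60 seconds
live <- stream_tweets(q = "election", timeout = 60)

# REST API: collect the most recent tweets of a specific user
timeline <- get_timeline("someuser", n = 200)
```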
On the final day, we practice how to implement reproducible and reusable workflows for scraping. Especially when gathering data from different sources or repeatedly updating data collections, getting organised can make your work more efficient. In the second session, we learn how to tackle common data-cleaning problems, deal with encodings, and generally make the data suitable for analysis. If participants are interested, we can also use this final session to discuss specific webscraping challenges in students’ current projects.
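A small sketch of the kind of cleaning covered in the second session; the strings are invented examples of text read in with the wrong encoding.

```r
# Illustrative cleaning steps: re-encoding and whitespace removal
raw <- c("Z\xfcrich  ", " M\xfcnchen\n")   # bytes that arrived as Latin-1

# Convert from the source encoding to UTF-8 so special characters display correctly
utf8 <- iconv(raw, from = "latin1", to = "UTF-8")

# Strip stray whitespace and line breaks
clean <- trimws(utf8)
clean
#> [1] "Zürich"  "München"
```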
Each day includes a lecture as well as practical sessions with exercises that apply the learned techniques to new data sources. While we will not focus on data analysis, we will use simple methods to understand the data we gather.
Regardless of your previous experience with automated data collection, you will leave the course with a comprehensive understanding of the web and social media data available for social science research, and you will be able to use these data in your own research.