Citizens across the globe spend an increasing proportion of their daily lives online. Their activities leave behind granular, time-stamped footprints of human behavior and personal interactions that represent a new and exciting source of data to study standing questions about political and social behavior. This course will provide participants with an overview of the new sources of data that are now available, and the computational tools required to collect the data, clean it, and organize it in a format amenable to statistical analysis.
The course is structured in three blocks. The first part (Day 1) will offer an introduction to the course and then dive into the basics of webscraping; that is, how to automatically collect data from the web. This session will demonstrate the different scenarios for webscraping: when data is in table format (e.g. Wikipedia tables or election results), when it is in an unstructured format (e.g. across multiple parts of a website), and when it is behind a web form (e.g. querying online databases). The tools available in R to achieve these goals – the rvest and RSelenium packages – will be introduced in the context of applied examples in the social sciences. Students are also encouraged to come to class with examples from their own research, and we will leave some time at the end of class to go over one or two.
NGOs, public institutions, and social media companies increasingly rely on Application Programming Interfaces (API) to give researchers and web developers access to their data. The central part of the course (Days 2 and 3) will focus on how we can develop our own set of structured http requests to query an API. In our second session, we will discuss the components of an API request and how to build our own authenticated queries using the httr package in R. We will apply these skills to two examples: the New York Times API (to query newspaper articles) and the Sunlight Congress API (to query parliamentary speeches in the US). In the third session, we will learn how to use the most popular R packages to query social media APIs: rtweet, streamR, and Rfacebook. These packages allow researchers to collect tweets filtered by keywords, location, and language in real time, and to scrape public Facebook pages, including likes and comments. The process of collecting and storing the data will be illustrated with examples from published research on social media.
An underappreciated part of the research process is data manipulation – it is rarely the case that the dataset we want to use in a study is available in a format suitable for analysis. Data “munging” is tedious, but there are ways to make it more efficient and replicable. The last block of the course (Days 4 and 5) teaches good practices to clean and rearrange data obtained from the web. We will start with two data formats that are relatively new to social sciences – text and network data. Through a series of applied examples, the materials here will explain how to convert a corpus of documents into a data matrix that is ready for analysis – from dealing with encoding issues, preprocessing text in different languages, and efficiently building a document-feature matrix with quanteda – and how to work with network data – covering the basics of network analysis, how to identify nodes and edges, and how to create an adjacency matrix with igraph. During this session, we will also learn how to extract data from PDF files, both in text and table formats. The course will conclude with a closer look at best practices in statistical computing, building upon the different examples used during the first four days. By learning how to efficiently parallelize loops and work with lists, students will be able to scale up their data collection processes. We will also cover how to merge datasets from different sources, in cases with both identical and similar merging keys (i.e. when a numeric ID is the same across datasets, but also when only country names with slightly different spellings are the only variable that is common to multiple datasets), and how to efficiently compute summary statistics and other aggregated estimates based on a data frame with dplyr.
The course will follow a "learning-by-doing" approach. Code and data for all the applications will be provided, and students are encouraged to bring their own laptops to follow along. Each session will start with around 45-60 minutes of lecturing and coding led by the instructor, followed by "data challenges" where participants will need to apply what they have learned to new datasets. Although these data challenges will be started in class, participants will be asked to complete them after the end of each session. Solutions for each challenge will be posted by the beginning of the following class, and we will leave time for questions.
After the course, students will have an advanced understanding of the web and social data available for social science research, and will be equipped with the technical skills necessary to collect and clean such datasets on their own.