ECPR Summer School
Central European University, Budapest
27 July - 12 August 2017




SC103 - Automated Collection of Web and Social Data

Instructor Details

Pablo Barberá

Institution:
University of Southern California

Instructor Bio

Pablo Barberá gained his PhD in Politics from New York University.

He is currently an Assistant Professor of International Relations at the University of Southern California, and a former Moore-Sloan Fellow at the Center for Data Science at NYU.

Pablo's primary research interests include quantitative political methodology and computational social science, applied to the study of political and social behaviour.

He is an active contributor to the open source community and has authored several R packages to mine social media data.


Course Dates and Times

Monday 31 July - Friday 4 August

14:00-17:30

Please see Timetable for full details.

Location

Building: N13 Room: 517

Prerequisite Knowledge

The course assumes familiarity with the R statistical programming language. Participants should know how to read datasets into R, work with vectors and data frames, and run basic statistical analyses, such as linear regression. More advanced knowledge of statistical computing, such as writing functions and loops, is helpful but not required.

Short Outline

An increasingly vast wealth of data is freely available on the web -- from election results and legislative speeches to social media posts, newspaper articles, and press releases, among many other examples. Although this data is easily accessible, in most cases it is available in an unstructured format, which makes its analysis challenging. The goal of this course is to gain the skills necessary to automate the process of downloading, cleaning, and reshaping web and social data using the R programming language for statistical computing. We will cover all the most common scenarios: scraping data available in multiple pages or behind web forms, interacting with APIs and RSS feeds such as those provided by most media outlets, collecting data from Facebook and Twitter, extracting text and table data from PDF files, and manipulating datasets into a format ready for analysis. The course will follow a "learning-by-doing" approach, with short theoretical sessions followed by "data challenges" where participants will need to apply new methods.

Long Course Outline

Citizens across the globe spend an increasing proportion of their daily lives online. Their activities leave behind granular, time-stamped footprints of human behavior and personal interactions that represent a new and exciting source of data to study standing questions about political and social behavior. This course will provide participants with an overview of the new sources of data that are now available, and the computational tools required to collect the data, clean it, and organize it in a format amenable to statistical analysis.

The course is structured in three blocks. The first part (Day 1) will offer an introduction to the course and then dive into the basics of webscraping; that is, how to automatically collect data from the web. This session will demonstrate the different scenarios for webscraping: when data is in table format (e.g. Wikipedia tables or election results), when it is in an unstructured format (e.g. across multiple parts of a website), and when it is behind a web form (e.g. querying online databases). The tools available in R to achieve these goals – the rvest and RSelenium packages – will be introduced in the context of applied examples in the social sciences. Students are also encouraged to come to class with examples from their own research, and we will leave some time at the end of class to go over one or two.
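The table-format scenario described above can be sketched in a few lines. The course itself works in R with rvest; the version below is only an illustration in Python's standard library so that it runs without extra packages. The HTML fragment, party names, and vote counts are made up, and the `TableParser` and `scrape_table` names are hypothetical helpers, not part of the course materials.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # row currently being read, or None
        self._cell = None   # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def scrape_table(html):
    """Turn the first row into column names and the rest into records."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Made-up fragment standing in for a downloaded election-results page.
SAMPLE_HTML = """
<table>
  <tr><th>Party</th><th>Votes</th></tr>
  <tr><td>Party A</td><td>1200</td></tr>
  <tr><td>Party B</td><td>950</td></tr>
</table>
"""

records = scrape_table(SAMPLE_HTML)
for record in records:
    print(record["Party"], record["Votes"])
```

Note that every scraped value arrives as text; converting the vote counts to numbers is a separate cleaning step, of the kind covered in the last block of the course.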

NGOs, public institutions, and social media companies increasingly rely on Application Programming Interfaces (APIs) to give researchers and web developers access to their data. The central part of the course (Days 2 and 3) will focus on how we can develop our own set of structured HTTP requests to query an API. In our second session, we will discuss the components of an API request and how to build our own authenticated queries using the httr package in R. We will apply these skills to two examples: the New York Times API (to query newspaper articles) and the Sunlight Congress API (to query parliamentary speeches in the US). In the third session, we will learn how to use the most popular R packages to query social media APIs: rtweet, streamR, and Rfacebook. These packages allow researchers to collect tweets filtered by keywords, location, and language in real time, and to scrape public Facebook pages, including likes and comments. The process of collecting and storing the data will be illustrated with examples from published research on social media.
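To make the components of an API request concrete, here is a minimal sketch of how an authenticated GET request is assembled from an endpoint, an API key, and query parameters. The course does this in R with httr; the Python standard library is used here only to keep the example dependency-free. The endpoint `api.example.org`, the `api-key` parameter name, and the `build_request_url` helper are all assumptions made for illustration and do not describe any real service.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical endpoint and key: placeholders, not a real service.
BASE_URL = "https://api.example.org/v1/articles"
API_KEY = "YOUR-API-KEY"


def build_request_url(base_url, api_key, **params):
    """Return the full URL for a GET request: endpoint plus query string."""
    query = {"api-key": api_key}
    # Drop parameters left as None so they are not sent as the text "None".
    query.update({k: v for k, v in params.items() if v is not None})
    return base_url + "?" + urlencode(query)


url = build_request_url(BASE_URL, API_KEY, q="election results", page=2, sort=None)

# Split the URL again to inspect the parts of the request.
parts = urlsplit(url)
sent_params = parse_qs(parts.query)
print(parts.netloc, parts.path, sent_params)
```

No request is sent here; the sketch stops at the URL. Real APIs differ in where they expect the key (query string, header, or OAuth token), so the documentation of each service remains the authority.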

An underappreciated part of the research process is data manipulation: it is rarely the case that the dataset we want to use in a study is available in a format suitable for analysis. Data “munging” is tedious, but there are ways to make it more efficient and replicable. The last block of the course (Days 4 and 5) teaches good practices to clean and rearrange data obtained from the web. We will start with two data formats that are relatively new to the social sciences: text and network data. Through a series of applied examples, the materials will explain how to convert a corpus of documents into a data matrix that is ready for analysis (dealing with encoding issues, preprocessing text in different languages, and efficiently building a document-feature matrix with quanteda) and how to work with network data (covering the basics of network analysis, how to identify nodes and edges, and how to create an adjacency matrix with igraph). During this session, we will also learn how to extract data from PDF files, both in text and table formats. The course will conclude with a closer look at best practices in statistical computing, building upon the examples used during the first four days. By learning how to efficiently parallelize loops and work with lists, students will be able to scale up their data collection processes. We will also cover how to merge datasets from different sources, both when the merging keys are identical (e.g. a numeric ID that is the same across datasets) and when they are only similar (e.g. when the only variable common to multiple datasets is a country name with slightly different spellings), and how to efficiently compute summary statistics and other aggregated estimates based on a data frame with dplyr.
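Two of the steps named above, building a document-feature matrix and merging on similar keys, can be sketched as follows. The course uses quanteda and dplyr in R for these tasks; this Python standard-library version is an assumption-laden illustration only. The example sentences, the numeric values, the 0.8 similarity cutoff, and the `document_feature_matrix` and `fuzzy_merge` helpers are all hypothetical.

```python
import re
from collections import Counter
from difflib import get_close_matches


def document_feature_matrix(docs):
    """Return (features, matrix): sorted vocabulary and one count row per document."""
    # Naive tokenizer: lowercase, keep runs of ASCII letters only.
    tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in docs]
    features = sorted({tok for toks in tokenized for tok in toks})
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)
        matrix.append([counts[f] for f in features])
    return features, matrix


def fuzzy_merge(left, right, cutoff=0.8):
    """Join two {name: value} dicts whose keys may be spelled slightly differently."""
    merged = {}
    for name, value in left.items():
        # Best candidate in `right` whose similarity ratio is at least `cutoff`.
        match = get_close_matches(name, list(right), n=1, cutoff=cutoff)
        if match:
            merged[name] = (value, right[match[0]])
    return merged


# Made-up documents.
docs = ["Taxes are too high", "Cut taxes, cut spending"]
features, matrix = document_feature_matrix(docs)

# Made-up values; only the key spellings matter for the illustration.
left = {"Côte d'Ivoire": 10, "Slovakia": 20, "Brazil": 30}
right = {"Cote d'Ivoire": "CIV", "Slovakia": "SVK"}
merged = fuzzy_merge(left, right)

print(features)
print(matrix)
print(merged)
```

Fuzzy matching can pair the wrong records when names are short or similar to each other, so in practice the matched pairs should be inspected by hand before the merged dataset is analysed.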

The course will follow a "learning-by-doing" approach. Code and data for all the applications will be provided, and students are encouraged to bring their own laptops to follow along. Each session will start with around 45-60 minutes of lecturing and coding led by the instructor, followed by "data challenges" where participants will need to apply what they have learned to new datasets. Although these data challenges will be started in class, participants will be asked to complete them after the end of each session. Solutions for each challenge will be posted by the beginning of the following class, and we will leave time for questions.

After the course, students will have an advanced understanding of the web and social data available for social science research, and will be equipped with the technical skills necessary to collect and clean such datasets on their own.

 

Day-to-Day Schedule

Monday: Webscraping

Session 1. Basics of webscraping. Scraping web data in table format and unstructured format.

Session 2. Scraping data behind web forms with Selenium.

Tuesday: Working with APIs

Session 1. What is an API? Interacting with an API.

Session 2. Applied examples: the New York Times API and the Sunlight Congress API.

Wednesday: Collecting social media data

Session 1. Collecting Twitter data: tweets filtered by keyword and location; user profiles and tweets.

Session 2. Collecting data from Facebook pages.

Thursday: Cleaning unstructured data

Session 1. Cleaning text data using regular expressions; building network datasets.

Session 2. Extracting data from PDF files: text, tables, metadata.

Friday: Data manipulation

Session 1. Scaling up data collection: for loops, vectorization, working with lists, merging multiple datasets.

Session 2. Reshaping data with dplyr.

Software Requirements

The course will use the open-source software R, which is freely available for download at https://www.r-project.org/. We will interact with R through RStudio, which can be downloaded at https://www.rstudio.com/products/rstudio/download/. Students should download the most recent version available at the time of the course (currently R 3.3.3 and RStudio 1.0.136). We will also use the following R packages: rvest, jsonlite, httr, igraph, rtweet, streamR, quanteda, Rfacebook, dplyr.

Hardware Requirements

Students are expected to bring their own laptops to class. A laptop with a standard setup (4GB of RAM or more, at least 10GB of free disk space) should be sufficient.

Literature

Munzert, S., Rubba, C., Meißner, P., & Nyhuis, D. (2014). Automated data collection with R: A practical guide to web scraping and text mining. John Wiley & Sons.

Matloff, N. (2011). The art of R programming: A tour of statistical software design. No Starch Press.

Wickham, H., & Grolemund, G. (2016). R for Data Science. O’Reilly.

Ravindran, S. K., & Garg, V. (2015). Mastering social media mining with R. Packt Publishing Ltd.

The following other ECPR Methods School courses could be useful in combination with this one in a ‘training track’.

Recommended Courses Before

SA102 - R Basics

SA111 - Effective Data Management with R

Recommended Courses After

SC104 - Big Data Analysis in the Social Sciences

SA110 - Introduction to Exploratory Network Analysis

SD103 - Introduction to Manual and Computer-Assisted Content Analysis

Additional Information

Disclaimer

The information contained in this course description form may be subject to subsequent adaptations (e.g. taking into account new developments in the field, specific participant demands, group size etc.). Registered participants will be informed in due time in case of adaptations.

Note from the Academic Convenors

By registering for this course, you certify that you possess the prerequisite knowledge required to follow it. The instructor will not teach these prerequisite items. If you are not sure whether you possess this knowledge to a sufficient level, we suggest you contact the instructor before you proceed with your registration.

