Automated Collection of Web and Social Data

Course Dates and Times

Monday 31 July - Friday 4 August

14:00-17:30

Please see the Timetable below for full details.

Pablo Barberá

pablobarbera@gmail.com

University of Southern California

An increasingly vast wealth of data is freely available on the web: election results, legislative speeches, social media posts, newspaper articles, and press releases, among many other examples. Although this data is easy to access, in most cases it is available in an unstructured format, which makes its analysis challenging. The goal of this course is to equip participants with the skills necessary to automate the process of downloading, cleaning, and reshaping web and social data using the R programming language for statistical computing. We will cover the most common scenarios: scraping data spread across multiple pages or hidden behind web forms, interacting with APIs and RSS feeds such as those provided by most media outlets, collecting data from Facebook and Twitter, extracting text and tables from PDF files, and manipulating datasets into a format ready for analysis. The course will follow a "learning-by-doing" approach, with short theoretical sessions followed by "data challenges" in which participants apply the new methods.


Instructor Bio

Pablo Barberá gained his PhD in Politics from New York University. He currently works in LSE's Methodology Department as an Assistant Professor of Computational Social Science.

He was previously an Assistant Professor at the University of Southern California and a Moore-Sloan Postdoctoral Fellow at the Center for Data Science at New York University.

His primary research interests include quantitative political methodology and computational social science, applied to the study of political and social behaviour.

Pablo is an active contributor to the open source community and has authored several R packages to mine social media data.

Twitter: @p_barbera

Citizens across the globe spend an increasing proportion of their daily lives online. Their activities leave behind granular, time-stamped footprints of human behaviour and personal interactions that represent a new and exciting source of data for studying long-standing questions about political and social behaviour. This course will provide participants with an overview of the new sources of data that are now available, and of the computational tools required to collect the data, clean it, and organize it in a format amenable to statistical analysis.

The course is structured in three blocks. The first part (Day 1) will offer an introduction to the course and then dive into the basics of webscraping; that is, how to automatically collect data from the web. This session will demonstrate the different scenarios for webscraping: when data is in table format (e.g. Wikipedia tables or election results), when it is in an unstructured format (e.g. across multiple parts of a website), and when it is behind a web form (e.g. querying online databases). The tools available in R to achieve these goals – the rvest and RSelenium packages – will be introduced in the context of applied examples in the social sciences. Students are also encouraged to come to class with examples from their own research, and we will leave some time at the end of class to go over one or two.
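
As an illustration of the first scenario, here is a minimal sketch with rvest for scraping a page that contains HTML tables; the Wikipedia URL and the table index are only examples and may need adjusting for other pages:

    # Parse a page and extract every <table> node as a data frame
    library(rvest)

    url <- "https://en.wikipedia.org/wiki/List_of_sovereign_states"
    page <- read_html(url)
    tables <- html_table(html_nodes(page, "table"), fill = TRUE)

    # Inspect the first table; the right index depends on the page layout
    head(tables[[1]])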

NGOs, public institutions, and social media companies increasingly rely on Application Programming Interfaces (APIs) to give researchers and web developers access to their data. The central part of the course (Days 2 and 3) will focus on how we can develop our own set of structured HTTP requests to query an API. In our second session, we will discuss the components of an API request and how to build our own authenticated queries using the httr package in R. We will apply these skills to two examples: the New York Times API (to query newspaper articles) and the Sunlight Congress API (to query parliamentary speeches in the US). In the third session, we will learn how to use the most popular R packages for querying social media APIs: rtweet, streamR, and Rfacebook. These packages allow researchers to collect tweets filtered by keywords, location, and language in real time, and to scrape public Facebook pages, including likes and comments. The process of collecting and storing the data will be illustrated with examples from published research on social media.
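
To preview what an authenticated query with httr looks like, here is a minimal sketch against the New York Times Article Search API; the endpoint and parameter names follow the NYT's public documentation, and "YOUR_KEY" is a placeholder for a personal API key:

    library(httr)
    library(jsonlite)

    # Build a GET request, passing the search term and key as URL parameters
    r <- GET("https://api.nytimes.com/svc/search/v2/articlesearch.json",
             query = list(q = "elections", `api-key` = "YOUR_KEY"))
    stop_for_status(r)  # fail with an informative error on a non-200 response

    # Parse the JSON body into nested data frames
    articles <- fromJSON(content(r, as = "text"), flatten = TRUE)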

An underappreciated part of the research process is data manipulation – it is rarely the case that the dataset we want to use in a study is available in a format suitable for analysis. Data “munging” is tedious, but there are ways to make it more efficient and replicable. The last block of the course (Days 4 and 5) teaches good practices for cleaning and rearranging data obtained from the web. We will start with two data formats that are relatively new to the social sciences: text and network data. Through a series of applied examples, the materials will explain how to convert a corpus of documents into a data matrix ready for analysis – dealing with encoding issues, preprocessing text in different languages, and efficiently building a document-feature matrix with quanteda – and how to work with network data, covering the basics of network analysis, how to identify nodes and edges, and how to create an adjacency matrix with igraph. During this session we will also learn how to extract data from PDF files, in both text and table formats.

The course will conclude with a closer look at best practices in statistical computing, building on the examples used during the first four days. By learning how to efficiently parallelize loops and work with lists, students will be able to scale up their data collection processes. We will also cover how to merge datasets from different sources, both when the merging keys match exactly (e.g. a numeric ID shared across datasets) and when they match only approximately (e.g. when country names with slightly different spellings are the only variable common to multiple datasets), and how to efficiently compute summary statistics and other aggregated estimates from a data frame with dplyr.
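
To make the text-as-data step concrete, here is a minimal sketch of building a document-feature matrix with quanteda; note that quanteda's interface has changed across versions, and this follows the dfm() call on a character vector:

    library(quanteda)

    # A toy corpus; in practice these would be the scraped documents
    docs <- c(d1 = "The first document, scraped from the web.",
              d2 = "And a second document, with more text!")

    # Drop punctuation and English stopwords while building the matrix
    dfmat <- dfm(docs, remove_punct = TRUE, remove = stopwords("english"))
    topfeatures(dfmat)  # most frequent features across the corpus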

The course will follow a "learning-by-doing" approach. Code and data for all the applications will be provided, and students are encouraged to bring their own laptops to follow along. Each session will start with around 45-60 minutes of lecturing and coding led by the instructor, followed by "data challenges" where participants will need to apply what they have learned to new datasets. Although these data challenges will be started in class, participants will be asked to complete them after the end of each session. Solutions for each challenge will be posted by the beginning of the following class, and we will leave time for questions.

After the course, students will have an advanced understanding of the web and social data available for social science research, and will be equipped with the technical skills necessary to collect and clean such datasets on their own.

The course will assume familiarity with the R statistical programming language. Participants should know how to read datasets into R, work with vectors and data frames, and run basic statistical analyses, such as linear regression. More advanced knowledge of statistical computing, such as writing functions and loops, is helpful but not required.
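
As a rough self-check of that baseline, participants should be comfortable reading and writing code along these lines (the file and column names are placeholders):

    # Read a dataset, inspect the data frame, and fit a linear regression
    df <- read.csv("my_data.csv")              # placeholder file name
    str(df)                                    # column types and dimensions
    fit <- lm(outcome ~ predictor, data = df)  # placeholder column names
    summary(fit)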

Timetable

Monday: Webscraping

Session 1. Basics of webscraping: scraping web data in table format and in unstructured format.

Session 2. Scraping data behind web forms with Selenium.

Tuesday: Working with APIs

Session 1. What is an API? Interacting with an API.

Session 2. Applied examples: the New York Times API and the Sunlight Congress API.

Wednesday: Collecting social media data

Session 1. Collecting Twitter data: tweets filtered by keyword and location; user profiles and tweets.

Session 2. Collecting data from Facebook pages.

Thursday: Cleaning unstructured data

Session 1. Cleaning text data using regular expressions; building network datasets.

Session 2. Extracting data from PDF files: text, tables, metadata.

Friday: Data manipulation

Session 1. Scaling up data collection: for loops, vectorization, working with lists, merging multiple datasets.

Session 2. Reshaping data with dplyr.

Software Requirements

The course will use the open-source software R, which is freely available for download at https://www.r-project.org/. We will interact with R through RStudio, which can be downloaded at https://www.rstudio.com/products/rstudio/download/. Students should download the most recent versions at the time of the course (currently R 3.3.3 and RStudio 1.0.136). We will also use the following R packages: rvest, jsonlite, httr, igraph, rtweet, streamR, quanteda, Rfacebook, dplyr.
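
Assuming all of these packages are available on CRAN (they were at the time of writing), they can be installed in a single call:

    # Install the packages used throughout the course
    install.packages(c("rvest", "jsonlite", "httr", "igraph", "rtweet",
                       "streamR", "quanteda", "Rfacebook", "dplyr"))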

Hardware Requirements

Students are expected to bring their own laptops to class. A laptop with a standard setup (4GB of RAM or more, at least 10GB of free disk space) should be sufficient.

Literature

Munzert, S., Rubba, C., Meißner, P., & Nyhuis, D. (2014). Automated data collection with R: A practical guide to web scraping and text mining. John Wiley & Sons.

Matloff, N. (2011). The art of R programming: A tour of statistical software design. No Starch Press.

Wickham, H., & Grolemund, G. (2016). R for Data Science. O’Reilly Media.

Ravindran, S. K., & Garg, V. (2015). Mastering social media mining with R. Packt Publishing Ltd.

Recommended Courses to Cover Before this One

SA102 - R Basics

SA111 - Effective Data Management with R

Recommended Courses to Cover After this One

SC104 - Big Data Analysis in the Social Sciences

SA110 - Introduction to Exploratory Network Analysis

SD103 - Introduction to Manual and Computer-Assisted Content Analysis