ECPR Summer School
Central European University, Budapest
26 July - 10 August 2018




SC104 - Automated Collection of Web and Social Data

Instructor Details

Pablo Barberá

Institution:
University of Southern California

Instructor Bio

 

Pablo Barberá gained his PhD in Politics from New York University. He currently works in LSE's Methodology Department as an Assistant Professor of Computational Social Science.

Previously, Pablo has been Assistant Professor at the University of Southern California and Moore-Sloan Postdoctoral Fellow at the Center for Data Science in New York University.

His primary research interests include quantitative political methodology and computational social science, applied to the study of political and social behaviour.

Pablo is an active contributor to the open source community and has authored several R packages to mine social media data.

  @p_barbera


Course Dates and Times

Monday 30 July - Friday 3 August

14:00-15:30 / 16:00-17:30

Prerequisite Knowledge

The course will assume intermediate familiarity with the R statistical programming language. Participants should know how to read datasets into R, work with vectors and data frames, and run basic statistical analyses, such as linear regression. More advanced knowledge of statistical computing, such as writing functions and loops, is helpful but not required.

Short Outline

An increasingly vast wealth of data is freely available on the web -- from election results and legislative speeches to social media posts, newspaper articles, and press releases, among many other examples. Although this data is easily accessible, in most cases it is available in an unstructured format, which makes its analysis challenging. The goal of this course is to gain the skills necessary to automate the process of downloading, cleaning, and reshaping web and social data using the R programming language for statistical computing. We will cover all the most common scenarios: scraping data available in multiple pages or behind web forms, interacting with APIs and RSS feeds such as those provided by most media outlets, collecting data from Facebook and Twitter, extracting text and table data from PDF files, and manipulating datasets into a format ready for analysis. The course will follow a "learning-by-doing" approach, with short theoretical sessions followed by "data challenges" where participants will need to apply new methods.

Tasks for ECTS Credits

  • Participants attending the course: 2 credits (pass/fail grade). The workload for the calculation of ECTS credits is based on the assumption that students attend classes and carry out the necessary reading and/or other work prior to, and after, classes.
  • Participants attending the course and completing one task (see below): 3 credits (to be graded)
  • Participants attending the course, and completing two tasks (see below): 4 credits (to be graded)

For an additional credit, participants will be required to submit 3 out of the 4 challenges listed in the Day-to-Day Reading List section below.

For an additional 2 credits, participants will also need to submit a 5-page project applying techniques covered in the course to a substantive social science project.

Long Course Outline

Citizens across the globe spend an increasing proportion of their daily lives online. Their activities leave behind granular, time-stamped footprints of human behavior and personal interactions that represent a new and exciting source of data to study standing questions about political and social behavior. This course will provide participants with an overview of the new sources of data that are now available, and the computational tools required to collect the data, clean it, and organize it in a format amenable to statistical analysis.

The course is structured in three blocks. The first part (Days 1 and 2) will offer an introduction to the course and then dive into the basics of webscraping; that is, how to automatically collect data from the web. This session will demonstrate the different scenarios for webscraping: when data is in table format (e.g. Wikipedia tables or election results), when it is in an unstructured format (e.g. across multiple parts of a website), and when it is behind a web form (e.g. querying online databases). The tools available in R to achieve these goals – the rvest and RSelenium packages – will be introduced in the context of applied examples in the social sciences. Students are also encouraged to come to class with examples from their own research.
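As a rough illustration of the first scenario (data in table format), the sketch below pulls the rows out of an HTML table and converts them into records. The course itself covers this in R with rvest; Python's standard library is used here only to keep the example self-contained. The HTML snippet, the party names, and the helper names `TableParser` and `scrape_table` are all hypothetical.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows
        self._row = None    # row being built, or None outside a <tr>
        self._cell = None   # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical election-results snippet, standing in for a downloaded page.
HTML = """
<table>
  <tr><th>Party</th><th>Votes</th></tr>
  <tr><td>Party A</td><td>1,200</td></tr>
  <tr><td>Party B</td><td>950</td></tr>
</table>
"""


def scrape_table(html):
    """Return one dict per data row, with vote counts as integers."""
    parser = TableParser()
    parser.feed(html)
    header, *body = parser.rows
    return [
        {header[0]: name, header[1]: int(votes.replace(",", ""))}
        for name, votes in body
    ]


results = scrape_table(HTML)
```

In practice the HTML would come from a downloaded page rather than a string literal, and real tables often need extra cleaning (merged cells, footnote markers) before the numbers can be converted.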

NGOs, public institutions, and social media companies increasingly rely on Application Programming Interfaces (APIs) to give researchers and web developers access to their data. The second part of the course (Day 3) will focus on how we can develop our own set of structured HTTP requests to query an API. We will discuss the components of an API request and how to build our own authenticated queries using the httr package in R. We will apply these skills to two examples: the New York Times API (to query newspaper articles) and the Clarifai API (to automatically tag visual content with machine learning).
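To make the components of an API request concrete, the sketch below assembles a query URL from a base endpoint, a set of query parameters, and an API key, and then splits it back into its parts. It is an illustration under stated assumptions, not the interface of any real service: the endpoint `api.example.com`, the parameter names `q`, `page`, and `api-key`, and the helper `build_request_url` are hypothetical. The course builds such requests in R with httr; Python's standard library is used here only for a self-contained example, and no request is actually sent.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical endpoint; a real API documents its own base URL and parameters.
BASE_URL = "https://api.example.com/v1/articles"


def build_request_url(query, api_key, page=0):
    """Combine the endpoint with URL-encoded query parameters."""
    params = {"q": query, "page": page, "api-key": api_key}
    return f"{BASE_URL}?{urlencode(params)}"


url = build_request_url("european parliament", api_key="YOUR_KEY")

# Split the URL back into its components to see what the server receives.
parts = urlsplit(url)
params = parse_qs(parts.query)
```

Keeping the key as a function argument rather than hard-coding it in the URL makes it easier to keep credentials out of shared scripts.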

The third part (Days 4 and 5) will teach how to collect and analyze data from social media sites. We will begin with an overview of the research opportunities and challenges of using social media data in the social sciences. We will then discuss the data available through Twitter’s REST and Streaming APIs. As part of the guided coding exercises, we will learn how to collect tweets filtered by keywords, location, and language in real time using different R packages, and how to analyze the data to find the most mentioned hashtags and users and to map the location of the tweets. Our last session will demonstrate how to scrape public Facebook pages through the Graph API using the Rfacebook package. As an illustration of how to analyze tweets and Facebook posts collected with these methods, we will use a dictionary method to characterize politicians’ rhetoric on social media.
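The two analysis steps mentioned above, finding the most mentioned hashtags and applying a dictionary method, can be sketched in a few lines. This is a minimal illustration, not the course's own material: the example tweets, the two-category word list in `DICTIONARY`, and the helper names are invented, and Python is used only to keep the example self-contained (the course works in R).

```python
import re
from collections import Counter

# Invented example texts standing in for collected tweets.
tweets = [
    "We must protect our #economy and create jobs #growth",
    "Proud to vote for the new #climate bill today",
    "The crisis is a threat to our #economy",
]

HASHTAG = re.compile(r"#(\w+)")


def top_hashtags(texts, n=3):
    """Count hashtags case-insensitively and return the n most frequent."""
    counts = Counter(
        tag.lower() for text in texts for tag in HASHTAG.findall(text)
    )
    return counts.most_common(n)


# Hypothetical dictionary: each category maps to a set of indicator words.
DICTIONARY = {
    "positive": {"protect", "proud", "create"},
    "negative": {"crisis", "threat"},
}


def dictionary_scores(text):
    """Count how many tokens of the text fall into each category."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return {
        category: sum(token in words for token in tokens)
        for category, words in DICTIONARY.items()
    }


hashtags = top_hashtags(tweets)
scores = [dictionary_scores(text) for text in tweets]
```

A dictionary method of this kind simply counts matches, so its results depend entirely on the quality of the word lists; validated dictionaries are preferable to ad hoc ones.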

An underappreciated part of the research process is data manipulation – it is rarely the case that the dataset we want to use in a study is available in a format suitable for analysis. Data “munging” is tedious, but there are ways to make it more efficient and replicable. Our last session will also discuss some good practices to clean and rearrange data obtained from the web.

The course will follow a "learning-by-doing" approach. Code and data for all the applications will be provided, and students are encouraged to bring their own laptops to follow along. Each session will start with around 45-60 minutes of lecturing and coding led by the instructor, followed by "data challenges" where participants will need to apply what they have learned to new datasets. Although these data challenges will be started in class, participants will be asked to complete them after the end of each session. Solutions for each challenge will be posted by the beginning of the following class, and we will leave time for questions.

After the course, students will have an advanced understanding of the web and social data available for social science research and will be equipped with the technical skills necessary to collect and clean such datasets on their own.

Day-to-Day Schedule

Day 1: Webscraping (part 1)

Session 1. Basics of webscraping. Scraping web data in table format and unstructured format.

Session 2. Scraping data in unstructured format. Loops, vectorised functions, and lists in R.

Day 2: Webscraping (part 2)

Session 1. Scraping data behind web forms with Selenium.

Session 2. Extracting media text from newspaper articles using RSS feeds.

Day 3: Working with APIs

Session 1. What is an API? Interacting with the New York Times and Clarifai APIs.

Session 2. Extracting data from PDF files: text, tables, metadata.

Day 4: Collecting and analysing social media data (part 1)

Session 1. Collecting Twitter data from the Streaming API: tweets filtered by keyword and location.

Session 2. Collecting Twitter data from the REST API: user profiles and recent tweets.

Day 5: Collecting and analysing social media data (part 2)

Session 1. Collecting data from Facebook pages.

Session 2. Dealing with character encoding issues. Merging and reshaping datasets. Exception handling.

Day-to-Day Reading List

Day 1: No challenges assigned.

Day 2: Challenge: scraping the American Presidency Project website

Day 3: Challenge: scraping articles from The Guardian

Day 4: Challenge: analysing the Amnesty International Annual Report

Day 5: Challenge: content analysis of a politician’s Twitter feed

Software Requirements

This course will use R, which is a free and open-source programming language primarily used for statistics and data analysis. We will also use RStudio, which is an easy-to-use interface to R.

Installing R or RStudio prior to the course is not necessary. The instructor will provide individual login details to an RStudio Server that all workshop participants can access to run their code.

Hardware Requirements

Students are expected to bring their own laptops to class. There are no specific requirements other than being able to use a browser (Google Chrome is the recommended option, but others should work too).

Literature

Klašnja, M., Barberá, P., Beauchamp, N., Nagler, J., & Tucker, J. (2017). “Measuring public opinion with social media data.” In The Oxford Handbook of Polling and Survey Methods.

Lazer, D., & Radford, J. (2017). “Data ex Machina: Introduction to Big Data.” Annual Review of Sociology.

Matloff, N. (2011). The art of R programming: A tour of statistical software design. No Starch Press.

Munzert, S., Rubba, C., Meißner, P., & Nyhuis, D. (2014). Automated data collection with R: A practical guide to web scraping and text mining. John Wiley & Sons.

Ravindran, S. K., & Garg, V. (2015). Mastering social media mining with R. Packt Publishing Ltd.

Ruths, D., & Pfeffer, J. (2014). “Social media for large studies of behavior.” Science, 346(6213), 1063-1064.

Salganik, M. (2017). Bit by Bit: Social Research in the Digital Age. Princeton, NJ: Princeton University Press.

Steinert-Threlkeld, Z. (2018). Twitter as Data. Cambridge University Press.

Theocharis, Y., Barberá, P., Fazekas, Z., Popa, S. A., & Parnet, O. (2016). “A Bad Workman Blames His Tweets: The Consequences of Citizens’ Uncivil Twitter Use When Interacting With Party Candidates.” Journal of Communication, 66, 1007–1031.

Tucker, J. A., Theocharis, Y., Roberts, M. E., & Barberá, P. (2017). “From Liberation to Turmoil: Social Media And Democracy.” Journal of Democracy, 28(4), 46-59.

Wickham, H., & Grolemund, G. (2016). R for Data Science. O’Reilly.

The following other ECPR Methods School courses could be useful in combination with this one in a ‘training track’.
Recommended Courses Before

Summer School

R Basics

Effective Data Management with R

Recommended Courses After

Summer School

Big Data Analysis in the Social Sciences

Introduction to Exploratory Network Analysis

Introduction to Manual and Computer-Assisted Content Analysis

Quantitative Text Analysis

Advanced Social Network Analysis and Visualisation with R

Winter School

Inferential Network Analysis

Additional Information

Disclaimer

This course description may be subject to subsequent adaptations (e.g. taking into account new developments in the field, participant demands, group size, etc). Registered participants will be informed in due time.

Note from the Academic Convenors

By registering for this course, you confirm that you possess the knowledge required to follow it. The instructor will not teach these prerequisite items. If in doubt, contact the instructor before registering.

