
Automated Collection of Web and Social Data

Course Dates and Times

Monday 29 July – Friday 2 August

14:00–15:30 / 16:00–17:30 (ending slightly earlier on Friday)

Theresa Gessler

gessler@europa-uni.de

Europa-Universität Viadrina

The increasing availability of large amounts of data is changing research in political science. In recent years, a variety of data – whether election results, press releases, parliamentary speeches or social media posts – has become available online. Although this data has become easier to find, it usually comes in an unstructured format, which makes collecting, cleaning and analysing it challenging.

The goal of this class is to equip you to gather online data and process it in R for your own research.

During the course, you will learn to scrape content from different types of webpages, gather information from web interfaces and collect social media data. The course uses R throughout the complete process of downloading, cleaning and reshaping web and social media data for analysis.

While we introduce tools and techniques that help with data collection more generally, the focus will be on three common scenarios:

  • automating the collection of data spread over multiple pages or available behind forms
  • interacting with APIs and RSS feeds provided by webpages of institutions, companies and organisations
  • collecting social media data.

The course is hands-on, with lectures followed by in-class exercises where you will apply and practice the new methods. If possible, please bring examples from your own research projects for in-class exercises.

ECTS Credits for this course (tasks for additional credits below)

3 ECTS (graded) – as above, plus complete small daily assignments, due before the start of the next class.

4 ECTS (graded) – as above, plus submit a small project applying techniques covered in the course to a substantive research question.


Instructor Bio

Theresa Gessler works at the University of Zurich, where she is part of the Digital Democracy Lab. She is also a co-organiser of the Zurich Summer School for Women in Political Methodology, where she teaches webscraping.

In her research, she uses text analysis and computational methods, based on data collected from different online and offline sources.

Besides her interest in computational social science, Theresa works on party conflict on the issue of democracy, as well as the transformation of democratic processes through digitalisation.

Twitter: @th_ges

Data from online sources is gaining importance in many fields of social science. Citizens, parties, organisations and states are present online, where they communicate and leave traces of their activities. Even traditionally ‘offline’ data, such as election results or press releases, is increasingly available online, decreasing the costs of data collection. Web and social media data therefore provide a rich source for addressing many political science questions.

This course gives you an overview of the new sources of data now available, and the computational tools required to collect it, clean it, and organise it in a format amenable to statistical analysis.


Day 1
I introduce the basics of webscraping, based on an introduction to webpage structures. In the first session, we learn how to download files (PDF, Excel, etc.) and scrape data from simple webpages and tables (e.g. electoral results or Wikipedia tables). We also practice how to process these files in R. In the second session, we learn how to select the data we need, based on regular expressions and web page formatting (CSS selectors and XPath).
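To give a flavour of what this day covers, here is a minimal sketch using the rvest package (named in the software requirements below); the Wikipedia URL and the footnote-stripping pattern are illustrative examples, not course material.

```r
library(rvest)

# Download and parse a page that contains HTML tables
page <- read_html("https://en.wikipedia.org/wiki/List_of_sovereign_states")

# Extract all tables on the page as data frames and keep the first
tables <- html_table(page, fill = TRUE)
states <- tables[[1]]

# Select elements by CSS selector (here: all second-level headings)
headings <- html_text(html_nodes(page, "h2"), trim = TRUE)

# Select elements by XPath instead (here: all link targets)
links <- html_attr(html_nodes(page, xpath = "//a"), "href")

# Clean extracted text with a regular expression:
# strip Wikipedia-style footnote markers such as "[1]"
gsub("\\[[0-9]+\\]", "", "Afghanistan[1]")
```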

Day 2
This day is dedicated to gathering data through HTTP requests. Increasingly, institutions, newspapers and organisations make their data available for research through databases or APIs. In the first session, we learn how to query online databases automatically. In the second session, we focus on APIs and RSS feeds. We learn how to formulate queries to APIs using HTTP and how to read the resulting data, which is often in JSON or XML format. This will also be the basis for the focus on social media on the fourth day, given that many social media platforms use similar request structures and data formats.
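As a hedged sketch of what an API query looks like, the following uses the httr and jsonlite packages to send an HTTP GET request and parse a JSON response; the Wikipedia API endpoint is just an illustration, not necessarily the service used in class.

```r
library(httr)
library(jsonlite)

# Formulate an HTTP GET request with query parameters
resp <- GET(
  "https://en.wikipedia.org/w/api.php",
  query = list(
    action = "query",
    titles = "Web scraping",
    prop   = "info",
    format = "json"
  )
)

# Stop with an informative error if the request failed
stop_for_status(resp)

# Read the JSON body into R lists and data frames
parsed <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
str(parsed, max.level = 3)
```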

Day 3
We focus on automating what we have learned during the first two days. In the first session, we discuss how to implement loops and functions in R to gather data more effectively. This is particularly useful when scraping data that is spread across multiple pages and requires following many hyperlinks. In the second session, we focus on scraping dynamic pages with RSelenium; that is, pages that reload after scrolling or change upon user interaction. We learn how to navigate a browser from within R to automate these interactions.
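A minimal sketch of this kind of automation, again assuming rvest; the URL pattern and the .title selector are hypothetical placeholders for whatever site you are scraping.

```r
library(rvest)

# A function that scrapes one page of a (hypothetical) paginated archive
scrape_page <- function(page_number) {
  url  <- paste0("https://example.com/archive?page=", page_number)
  page <- read_html(url)
  data.frame(
    title = html_text(html_nodes(page, ".title"), trim = TRUE),
    stringsAsFactors = FALSE
  )
}

# Loop over the first five pages, pausing between requests
# to avoid overloading the server, then combine the results
results <- lapply(1:5, function(i) {
  Sys.sleep(1)
  scrape_page(i)
})
all_titles <- do.call(rbind, results)
```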

Day 4
Social media data. The first session gives an overview of available social media data, also highlighting lesser-known platforms. The second session focuses on Twitter. We will use the Streaming API, which collects tweets filtered by keywords and locations, as well as the REST API, which collects the tweets of specific users. NB: several social media platforms are revising their rules on data access, so the exact content of this day may change closer to the time.
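As one hedged illustration, the rtweet package wraps both APIs in R; the exact functions and the authentication they require may change as platforms revise their data-access rules, and the account and keywords below are placeholders.

```r
library(rtweet)  # assumes Twitter API credentials are already set up

# REST API: collect recent tweets from a specific user's timeline
user_tweets <- get_timeline("some_politician", n = 200)

# REST API: search recent tweets matching a keyword
keyword_tweets <- search_tweets("democracy", n = 100, include_rts = FALSE)

# Streaming API: collect live tweets matching a keyword for 60 seconds
live_tweets <- stream_tweets("election", timeout = 60)
```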

Day 5
In the first session, we practice how to implement reproducible and reusable workflows for scraping. Especially when gathering data from different sources or repeatedly updating data collections, getting organised can make your work more efficient. In the second session, we learn how to tackle common problems related to data cleaning, dealing with encodings and generally making the data suitable for analysis. If participants are interested, we can also use this final session to discuss specific webscraping challenges in students’ current projects.
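A small sketch of the kind of cleaning this day covers, using only base R; the strings are illustrative.

```r
# A string whose bytes are latin1-encoded but not yet declared as such
x <- "Caf\xe9"

# Declare the encoding, or convert the bytes to UTF-8 directly
Encoding(x) <- "latin1"
x_utf8 <- iconv(x, from = "latin1", to = "UTF-8")

# Typical cleaning steps before analysis: trim whitespace, normalise case
trimws("  Müller  ")
tolower("MÜLLER")
```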


All sessions will include a lecture as well as practical sessions with exercises that apply the learned techniques to new data sources. While we will not focus on data analysis, we will use simple methods to understand the data we gather.

Independent of your previous experience with automated data collection, you will leave the course with a comprehensive understanding of the web and social data available for social science research, and you will be able to use this data for your own research.

The course requires some familiarity with the R statistical programming language. You should know how to do the following (a short self-check snippet appears after this list):

  • read datasets into R
  • work with data frames
  • access help files
  • run basic statistical analyses.
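As a rough self-check, you should be comfortable with a snippet like this one; the file name and the variables y and x are placeholders.

```r
dat <- read.csv("my_data.csv")    # read a dataset into R
head(dat)                         # inspect a data frame
?read.csv                         # access a help file
summary(lm(y ~ x, data = dat))    # run a basic statistical analysis
```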

If you do not know any of these things, take Akos Mate's R Basics.

More advanced knowledge of R, such as writing functions and loops or familiarity with tidyverse, is helpful but not essential.

Day Topic Details

Day 1 – Introduction to Webscraping

Session 1: Introduction to the course and basics of webscraping (scraping simple pages and tables, downloading and processing files).

Session 2: Extraction of content with regular expressions, CSS selectors and XPath.

Day 2 – Collecting Data with Requests

Session 1: Collecting data behind forms.

Session 2: Using APIs and RSS feeds.

Day 3 – Automation

Session 1: Scraping data from multiple pages using loops and functions in R.

Session 2: Scraping dynamic pages with RSelenium.

Day 4 – Social Media Data

Session 1: Introduction to and overview of types of social media data.

Session 2: Focus on Twitter.

Day 5 – Workflow and Advanced Topics

Session 1: How to write reproducible and reusable code.

Session 2: Common challenges (data cleaning, dealing with encodings).

Day Readings

Day 1: No challenges assigned.

Day 2: Challenge – Scraping The American Presidency Project website.

Day 3: Challenge – Scraping articles from The Guardian.

Day 4: Challenge – Analysing the Amnesty International Annual Report.

Day 5: Challenge – Content analysis of a politician’s Twitter feed.

Software Requirements

  • R and an internet browser.
  • Most required packages (e.g. rvest) can be installed in advance or during the course. A definitive list will be sent around shortly before the course to allow you to prepare (a tentative installation snippet follows this list).
  • For running RSelenium, you need Java installed on your computer.
  • If you use Windows, I recommend installing Rtools, which allows you to build new versions of packages from GitHub.
  • If you want to follow the part on scraping Twitter, create an account and a Twitter App beforehand.
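As a hedged head start, the snippet below installs rvest (named above) together with packages commonly used for the other topics on the syllabus; treat it as a guess until the definitive list arrives.

```r
# rvest is named in the requirements; the rest are assumptions based on
# the course topics (HTTP requests, JSON, dynamic pages, Twitter)
install.packages(c("rvest", "httr", "jsonlite", "RSelenium", "rtweet"))
```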

Hardware Requirements

Please bring your own laptop.

Literature

Munzert, S., Rubba, C., Meißner, P., & Nyhuis, D. (2014)
Automated data collection with R: A practical guide to web scraping and text mining
John Wiley & Sons.

González-Bailón, S. (2017)
Decoding the Social World
MIT Press.

Klašnja, M., Barberá, P., Beauchamp, N., Nagler, J., & Tucker, J. (2017)
Measuring public opinion with social media data
In The Oxford Handbook of Polling and Survey Methods, Oxford University Press.

Salganik, M. (2017)
Bit by Bit: Social Research in the Digital Age
Princeton University Press.

Steinert-Threlkeld, Z. (2018)
Twitter as Data
Cambridge University Press.

Recommended Courses to Cover Before this One

Summer School

R Basics

Effective Data Management with R

Recommended Courses to Cover After this One

Summer School

Big Data Analysis in the Social Sciences

Introduction to Exploratory Network Analysis

Introduction to Manual and Computer-Assisted Content Analysis

Quantitative Text Analysis

Advanced Social Network Analysis and Visualisation with R

Winter School

Inferential Network Analysis