ECPR


Automated Web Data Collection with R

Course Dates and Times

Friday 14 February 13:00–15:00 and 15:30–18:00

Saturday 15 February 09:00–12:30 and 14:00–17:30

Theresa Gessler

gessler@europa-uni.de

Europa-Universität Viadrina

The increasing availability of large amounts of data is changing research in political science. In recent years, a wide variety of data – election results, press releases, parliamentary speeches, social media posts – has become available online. Although this data has become easier to find, it usually comes in an unstructured format, which makes collecting, cleaning and analysing it challenging.

The goal of this class is to equip you to gather online data and process it in R for your own research. The course is based on R, whose web-scraping functionality has expanded considerably in recent years. The major advantage of R for web scraping is that every step of a research project – data collection, data processing, data analysis and visualisation – can be carried out within a single coherent framework.

The course is hands-on, with lectures followed by in-class exercises where you will be able to apply and practice the new methods. If possible, please bring examples from your own research projects to work on during in-class exercises.

Tasks for ECTS Credits

1 credit (pass/fail grade). Attend at least 90% of course hours, participate fully in in-class activities, and carry out the necessary reading and/or other work prior to, and after, class.


Instructor Bio

Theresa Gessler works at the University of Zurich, where she is part of the Digital Democracy Lab and a co-organiser of the Zurich Summer School for Women in Political Methodology, where she also teaches web scraping.

In her research, she uses text analysis and computational methods, based on data collected from different online and offline sources.

Besides her interest in computational social science, Theresa works on party conflict on the issue of democracy, as well as the transformation of democratic processes through digitalisation.

Twitter: @th_ges

Data from online sources is gaining importance across the social sciences: citizens, parties, organisations and states are present online, where they communicate and leave traces of their activities. Even traditional 'offline' data like election results or press releases are increasingly available online, decreasing the costs of data collection. Web and social media data therefore provide a rich source for addressing many political science questions. This course will provide you with an overview of the new sources of data now available, and the computational tools required to collect the data, clean it, and organise it in a format amenable to statistical analysis.

The course will give you an applied overview of the skills required to automatically collect data from the web, including many exercises to do during class.

The afternoon of the first day will introduce the basics of web scraping, starting with an introduction to webpage structures. We will learn how to download files (PDF, Excel, etc.) and scrape data from simple webpages and tables (e.g. electoral results or Wikipedia tables), and we will also practice how to process these files in R. In the second session, we will learn how to select the data we need using regular expressions and webpage formatting (CSS selectors and XPath).
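As a flavour of these first-day techniques, here is a minimal sketch using rvest, the package most commonly used for this in R. It parses an inline HTML snippet rather than a live page (so it runs without a network connection); the table contents and the `p.note` selector are invented for illustration:

```r
library(rvest)

# An inline HTML snippet standing in for a real page (e.g. a Wikipedia
# table of electoral results).
page <- minimal_html('
  <table>
    <tr><th>Party</th><th>Votes</th></tr>
    <tr><td>A</td><td>1200</td></tr>
    <tr><td>B</td><td>800</td></tr>
  </table>
  <p class="note">Source: hypothetical results</p>
')

# Tables can be parsed directly into a data frame
results <- page |> html_element("table") |> html_table()

# CSS selectors pick out specific elements; html_text2() extracts clean text
note <- page |> html_element("p.note") |> html_text2()
```

For a live page you would replace `minimal_html(...)` with `read_html("https://...")`; everything downstream stays the same.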

The morning of the second day will focus on automating what we learned on the first day: we will discuss how to implement loops and functions in R to gather data more efficiently. This is particularly useful when scraping data that is spread across multiple pages and requires following many hyperlinks. The course will also give you an overview of more advanced web-scraping techniques and guidance on when to use them, including APIs (through which institutions, newspapers and organisations increasingly make their data available for research), RSS feeds (used primarily by news sites), social media APIs, and the scraping of dynamic pages (that is, pages that reload after scrolling or change upon user interaction) with RSelenium.
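The loop-and-function pattern described here can be sketched as follows. This is illustrative only: the "pages" are inline HTML strings so the example is self-contained; in real scraping you would build a vector of URLs, pass each to `read_html()`, and pause between requests (e.g. with `Sys.sleep()`) to be polite to the server:

```r
library(rvest)

# Stand-ins for several pages you would normally reach via hyperlinks
pages <- c(
  '<html><body><h1>Press release 1</h1></body></html>',
  '<html><body><h1>Press release 2</h1></body></html>'
)

# Wrap the extraction logic in a function...
scrape_title <- function(html) {
  read_html(html) |> html_element("h1") |> html_text2()
}

# ...then apply it to every page instead of copy-pasting the code
titles <- sapply(pages, scrape_title, USE.NAMES = FALSE)
```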

Finally, in the afternoon, we will practice how to implement reproducible and reusable workflows for scraping. Especially when gathering data from different sources or repeatedly updating data collections, getting organised can make your work more efficient. This will also give you the opportunity to put your new skills into practice.
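One possible way to organise such a workflow (an illustration, not the course's prescribed setup) is to store the raw HTML once and parse from disk, so the analysis can be rerun even if the page later changes or disappears:

```r
library(rvest)

# Step 1 (run once): archive the raw HTML to a dated file.
# tempdir() is used here so the example is self-contained;
# a real project would use a folder under version control or backup.
raw_dir <- file.path(tempdir(), "raw_html")
dir.create(raw_dir, showWarnings = FALSE)

html <- '<html><body><h1>Archived page</h1></body></html>'
path <- file.path(raw_dir, "page_2020-02-14.html")
writeLines(html, path)

# Step 2 (rerun any time): parse from the stored copy, not the live page
title <- read_html(path) |> html_element("h1") |> html_text2()
```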

All sessions will include a lecture as well as practical sessions with exercises that apply the learned techniques to new data sources. While we will not focus on data analysis, we will use simple methods to understand the data we gather.

Regardless of your previous experience with automated data collection, you will leave this course with a solid understanding of the web and social media data available for social science research, and you will know how to process it for your own research.

Basic familiarity with R is required. You should know how to:

•    read datasets into R
•    work with data frames
•    access help files
•    run basic statistical analyses.

At the beginning of the course, we will briefly review some of the more advanced R concepts we rely on, such as loops and functions, but you should already know the basics.
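For orientation, these are the kinds of constructs that review covers. The cleaning task here (stripping "%" signs from vote shares) is a hypothetical example, not course material:

```r
# A small function: turn strings like "34.5%" into numbers
clean_percent <- function(x) {
  as.numeric(sub("%", "", x, fixed = TRUE))
}

shares <- c("34.5%", "21.0%", "12.3%")

# A for loop applying the function element by element
cleaned <- numeric(length(shares))
for (i in seq_along(shares)) {
  cleaned[i] <- clean_percent(shares[i])
}
```

If constructs like these look unfamiliar, an introductory R course is recommended first.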

Day Topic Details

Friday afternoon
Session 1: Introduction to the course and basics of web scraping (HTML, scraping simple pages and tables, downloading and processing files)
Session 2: Extraction of content with CSS selectors

Saturday morning
Scraping data from multiple pages using loops and functions in R; overview of other techniques (RSS feeds, APIs, Selenium, social media data)

Saturday afternoon
Exercises – these may be examples provided by the instructor or examples from students' own research projects

Software Requirements

R and RStudio

Most required packages (e.g. rvest) can be installed in advance or during the course. A definitive list will be emailed to you shortly before the course so you can prepare.

Hardware Requirements

Please bring your own laptop.

Literature

Munzert, S., Rubba, C., Meißner, P., & Nyhuis, D. (2015). Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining. John Wiley & Sons.

González-Bailón, S. (2017). Decoding the Social World. MIT Press.

Klašnja, M., Barberá, P., Beauchamp, N., Nagler, J., & Tucker, J. (2017). Measuring Public Opinion with Social Media Data. In The Oxford Handbook of Polling and Survey Methods. Oxford University Press.

Salganik, M. (2017). Bit by Bit: Social Research in the Digital Age. Princeton University Press.

Steinert-Threlkeld, Z. (2018). Twitter as Data. Cambridge University Press.

Recommended Courses to Cover Before this One

Introduction to R

Recommended Courses to Cover After this One

Summer School 

Big Data Analysis in the Social Sciences

Introduction to Exploratory Network Analysis

Introduction to Manual and Computer-Assisted Content Analysis

Quantitative Text Analysis

Advanced Social Network Analysis and Visualisation with R

Winter School

Introduction to Quantitative Text Analysis

Introduction to Applied Social Network Analysis

Introduction to Machine Learning for the Social Sciences