Monday 29 July – Friday 2 August
14:00–15:30 / 16:00–17:30 (ending slightly earlier on Friday)
The increasing availability of large amounts of data is changing research in political science. In recent years, a variety of data – whether election results, press releases, parliamentary speeches or social media posts – has become available online. Although this data has become easier to find, it usually comes in an unstructured format, which makes collecting, cleaning and analysing it challenging.
The goal of this class is to equip you to gather online data and process it in R for your own research.
During the course, you will learn to scrape content from different types of webpages, gather information from web interfaces and collect social media data. The course uses R throughout the complete process of downloading, cleaning and reshaping web and social media data for analysis.
While we introduce tools and techniques that help with data collection more generally, the focus will be on three common scenarios: scraping content from webpages, gathering data through web interfaces and APIs, and collecting social media data.
The course is hands-on, with lectures followed by in-class exercises where you will apply and practice the new methods. If possible, please bring examples from your own research projects for in-class exercises.
ECTS credits for this course, with the tasks required for additional credits listed below.
3 ECTS (graded): complete small daily assignments, due before the start of the next class.
4 ECTS (graded): as above, plus submit a small project applying techniques covered in the course to a substantive research question.
Theresa Gessler works at the University of Zurich, where she is part of the Digital Democracy Lab and a co-organiser of the Zurich Summer School for Women in Political Methodology, at which she also teaches webscraping.
In her research, she uses text analysis and computational methods, based on data collected from different online and offline sources.
Besides her interest in computational social science, Theresa works on party conflict on the issue of democracy, as well as the transformation of democratic processes through digitalisation.
Data from online sources is gaining importance across the social sciences. Citizens, parties, organisations and states are present online, where they communicate and leave traces of their activities. Even traditionally ‘offline’ data like election results or press releases are increasingly available online, decreasing the costs of data collection. Web and social media data thus provides a rich source for addressing many political science questions.
This course gives you an overview of the new sources of data now available, and the computational tools required to collect it, clean it, and organise it in a format amenable to statistical analysis.
Day 1
I introduce the basics of webscraping, starting from how webpages are structured. In the first session, we learn how to download files (PDF, Excel, etc.) and scrape data from simple webpages and tables (e.g. electoral results or Wikipedia tables). We also practice how to process these files in R. In the second session, we learn how to select the data we need, based on regular expressions and webpage formatting (CSS selectors and XPath).
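As a taste of the day-one material, the following is a minimal sketch of scraping a table with the rvest package; the URL and the CSS selector are example choices you would adapt to your own target page.

```r
# Minimal sketch: download a Wikipedia page and turn its first
# "wikitable" into a data frame. URL and selector are examples only.
library(rvest)

url  <- "https://en.wikipedia.org/wiki/List_of_sovereign_states"
page <- read_html(url)                 # fetch and parse the HTML

tbl <- page |>
  html_element("table.wikitable") |>   # first matching table via CSS selector
  html_table()                         # convert <table> markup to a data frame

head(tbl)
```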
Day 2
Day 2 is dedicated to gathering data through requests. Increasingly, institutions, newspapers and organisations make their data available for research through databases or APIs. In the first session, we learn how to query online databases automatically. In the second session, we focus on APIs and RSS feeds. We learn how to formulate queries to APIs using HTTP and how to read the resulting data, which is often in JSON or XML format. This also lays the groundwork for the focus on social media on day four, since many social media platforms use similar request structures and data formats.
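To illustrate the API session, a sketch along these lines sends an HTTP GET request with httr and parses the JSON response with jsonlite; the endpoint and query parameters below are hypothetical placeholders.

```r
# Hedged sketch of an HTTP GET request to a JSON API.
# The endpoint and query parameters are placeholders.
library(httr)
library(jsonlite)

resp <- GET(
  "https://api.example.org/v1/documents",     # hypothetical endpoint
  query = list(q = "elections", page = 1)     # sent as ?q=elections&page=1
)
stop_for_status(resp)                          # abort on HTTP error codes

parsed <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
str(parsed)                                    # inspect the resulting R object
```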
Day 3
We focus on automating what we have learned during the first two days. In the first session, we discuss how to implement loops and functions in R to gather data more effectively. This is particularly useful when scraping data that is spread across multiple pages and requires following many hyperlinks. In the second session, we focus on scraping dynamic pages with RSelenium; that is, pages that reload after scrolling or change upon user interaction. We learn how to navigate a browser from within R to automate these interactions.
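The pattern for the first session looks roughly like the sketch below: wrap the scraping logic in a function, then loop over page numbers. The URL pattern and CSS selector are hypothetical; adapt them to the real site.

```r
# Sketch: one function per page, applied across many pages with a loop.
library(rvest)

scrape_page <- function(page_number) {
  url <- paste0("https://example.org/articles?page=", page_number)
  read_html(url) |>
    html_elements("h2.headline") |>  # one node per article headline (assumed)
    html_text2()                     # clean, whitespace-normalised text
}

results <- lapply(1:5, function(p) {
  Sys.sleep(1)                       # pause between requests: be polite
  scrape_page(p)
})
headlines <- unlist(results)
```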
Day 4
Day 4 is devoted to social media data. The first session gives an overview of available social media data, also highlighting lesser-known platforms. The second session focuses on Twitter. We will use the Streaming API, which collects tweets filtered by keywords and locations, as well as the REST API, which retrieves the tweets of specific users. NB: several social media platforms are revising their rules on data access, so the exact content of this day may change closer to the time.
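For orientation, this is roughly how the two access modes looked with the rtweet package against Twitter's v1.1 APIs; given the changing access rules noted above, treat it as illustrative only, and note that authentication with a developer account is required first.

```r
# Illustrative sketch only: rtweet calls as they worked against the
# v1.1 APIs. Current access rules may differ (see NB in the text).
library(rtweet)

# REST API: recent tweets from one user's timeline
timeline <- get_timeline("BBCWorld", n = 100)

# Streaming API: tweets matching keywords, collected for 60 seconds
stream <- stream_tweets("election,vote", timeout = 60)
```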
Day 5
In the first session, we practice how to implement reproducible and reusable workflows for scraping. Especially when gathering data from different sources or repeatedly updating data collections, getting organised can make your work more efficient. In the second session, we learn how to tackle common problems related to data cleaning, dealing with encodings and generally making the data suitable for analysis. If participants are interested, we can also use this final session to discuss specific webscraping challenges in students’ current projects.
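Two of the routine cleaning steps from the second session look like this minimal sketch: re-encoding Latin-1 bytes to UTF-8 and squashing messy whitespace. The input string is a made-up example of a typical scraping artefact.

```r
# Sketch of common post-scraping cleaning: fix encoding, tidy whitespace.
library(stringr)

raw   <- "caf\xe9   con  leche "                     # Latin-1 bytes, stray spaces
utf8  <- iconv(raw, from = "latin1", to = "UTF-8")   # re-encode to UTF-8
clean <- str_squish(utf8)                            # trim and collapse whitespace
clean                                                # "café con leche"
```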
Each day includes a lecture as well as practical sessions with exercises that apply the learned techniques to new data sources. While we will not focus on data analysis, we will use simple methods to understand the data we gather.
Regardless of your previous experience with automated data collection, you will leave the course with a comprehensive understanding of the web and social media data available for social science research, and you will be able to use this data in your own research.
The course requires some familiarity with the R statistical programming language. You should know how to:
If you do not know any of these things, take Akos Mate's R Basics.
More advanced knowledge of R, such as writing functions and loops or familiarity with tidyverse, is helpful but not essential.
Each course includes pre-course assignments, such as readings and pre-recorded videos, as well as daily live lectures totalling at least three hours. The instructor will conduct live Q&A sessions and offer designated office hours for one-to-one consultations.
Please check your course format before registering.
Live classes will be held daily for three hours on a video meeting platform, allowing you to interact with both the instructor and the other participants in real time. To avoid online fatigue, the course uses a pedagogy that includes small-group work, short and focused tasks, and troubleshooting exercises, drawing on a variety of online applications to facilitate collaboration and engagement with the course content.
In-person courses will consist of daily three-hour classroom sessions, featuring a range of interactive in-class activities including short lectures, peer feedback, group exercises, and presentations.
This course description may be updated (e.g. to take into account new developments in the field, participant demands, or group size). Registered participants will be informed of any changes.
By registering for this course, you confirm that you possess the knowledge required to follow it. The instructor will not teach these prerequisite items. If in doubt, please contact us before registering.
| Day | Topic | Details |
|---|---|---|
| 1 | Introduction to Webscraping | Session 1: Introduction to the course & basics of webscraping (scraping simple pages, tables, downloading and processing files). Session 2: Extraction of content with regular expressions, CSS selectors and XPath. |
| 2 | Collecting data with requests | Session 1: Collecting data behind forms. Session 2: Using APIs and RSS feeds. |
| 3 | Automation | Session 1: Scraping data from multiple pages using loops and functions in R. Session 2: Scraping dynamic pages with RSelenium. |
| 4 | Social Media Data | Session 1: Introduction to and overview of types of social media data. Session 2: Focus on Twitter. |
| 5 | Workflow and Advanced Topics | Session 1: How to write reproducible & reusable code. Session 2: Common challenges (data cleaning, dealing with encodings). |
| Day | Readings / Challenges |
|---|---|
| 1 | No challenges assigned. |
| 2 | Challenge: Scraping the American Presidency Project website |
| 3 | Challenge: Scraping articles from The Guardian |
| 4 | Challenge: Analysing the Amnesty International Annual Report |
| 5 | Challenge: Content analysis of a politician’s Twitter feed |
Please bring your own laptop.
Munzert, S., Rubba, C., Meißner, P., & Nyhuis, D. (2014)
Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining
John Wiley & Sons.
González-Bailón, S. (2017)
Decoding the Social World
MIT Press.
Klašnja, M., Barberá, P., Beauchamp, N., Nagler, J., & Tucker, J. (2017)
Measuring Public Opinion with Social Media Data
In The Oxford Handbook of Polling and Survey Methods. Oxford University Press.
Salganik, M. (2017)
Bit by Bit: Social Research in the Digital Age
Princeton University Press.
Steinert-Threlkeld, Z. (2018)
Twitter as Data
Cambridge University Press.
Summer School
R Basics
Effective Data Management with R
Summer School
Big Data Analysis in the Social Sciences
Introduction to Exploratory Network Analysis
Introduction to Manual and Computer-Assisted Content Analysis
Quantitative Text Analysis
Advanced Social Network Analysis and Visualisation with R
Winter School
Inferential Network Analysis