ECPR

Automated Web Data Collection with R

Course Dates and Times

Friday 22 February 13:00–15:00 and 15:30–18:00

Saturday 23 February 09:00–12:30 and 14:00–17:30

Dominic Nyhuis

dominic.nyhuis@gmail.com

Universität Hannover

Web data is rapidly changing empirical social science. Whereas scholars used to be confronted with severe data scarcity, web data has completely changed the rules of the game.

Countless data sources are easily accessible, and the primary challenge is now one of collecting and managing the available information. What is more, it is not only well-funded research projects that can harness this type of data: individual researchers and students can also generate large-scale web-based datasets with little effort.

This course introduces the most important tools for collecting and managing web data. It does so by relying on R, which has vastly expanded its web-scraping functionality in recent years. The major advantage of R for web scraping is that all the steps in a research project can be taken in a coherent framework – data collection, data processing, data analysis and visualisation – ensuring a more friction-free research process.

Tasks for ECTS Credits

1 credit (pass/fail grade). Attend at least 90% of course hours, participate fully in in-class activities, and carry out the necessary reading and/or other work prior to, and after, class.


Instructor Bio

Dominic Nyhuis is a Postdoctoral Researcher at Leibniz University Hannover, Chair for Comparative Politics and the Political System of Germany.

Prior to this, he was affiliated with the Universities of Frankfurt, Vienna, and Mainz.

Dominic received his PhD in Political Science from the University of Mannheim. His research focuses on comparative legislative studies, party politics, and municipal politics. Methodologically, his work relies on quantitative methods, web scraping, and automated text analysis.

 @dominic_nyhuis

Data on the web is fundamentally changing research practices in the social sciences. These changes are evidenced not least by the emergence of entirely new research fields associated with terms like computational social science and data science. These developments carry enormous potential, particularly for young scholars without access to major research funds.

By mastering the tools needed for automated web data collection, a single researcher can construct a dataset that would have required tremendous effort and expense not too long ago. What is more, while the tools are not too difficult to master, they are still sufficiently rare that practitioners can claim a unique, highly sought-after skillset – in academia and beyond.

This course will give you an applied overview of the skills required to automatically collect data from the web.

It will provide an introduction to some of the most important skills and techniques. In particular, it introduces the basic structure of HTML so that you understand the underlying architecture and mechanics of websites. XPath is introduced as a syntax for addressing specific elements of a website, together with tools for extracting them as needed. Regular expressions are covered as a means of further processing textual data gathered from the web. Client-server interactions via HTTP and the structure of URLs are discussed to show how web interactions work in practice. APIs are discussed to equip you with the knowledge of how to collect data from dedicated data access points.
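
As a rough illustration of how these pieces fit together, the sketch below parses a small page, selects elements with an XPath expression, and pulls a year out of the text with a regular expression. The course itself works in R; Python's standard library is used here only to keep the example self-contained. The page content, the names in it, and the helper extract_members are invented for illustration. ElementTree supports only a subset of XPath and requires well-formed markup, which real websites often do not provide.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page used only for illustration (not a real site).
PAGE = """
<html>
  <body>
    <h1>Members</h1>
    <ul id="members">
      <li class="member"><a href="/member/1">Ada Example</a> (born 1975)</li>
      <li class="member"><a href="/member/2">Ben Sample</a> (born 1982)</li>
    </ul>
    <p class="note">Last updated in 2019.</p>
  </body>
</html>
"""


def extract_members(page):
    """Return one dict per <li class="member"> element in the page."""
    root = ET.fromstring(page.strip())
    rows = []
    # XPath: every <li> anywhere in the tree whose class attribute is "member".
    for item in root.findall(".//li[@class='member']"):
        link = item.find("a")
        # Join all text inside the <li>, including the part after the link.
        text = "".join(item.itertext())
        # Regular expression: the word "born" followed by a four-digit year.
        match = re.search(r"born (\d{4})", text)
        rows.append({
            "name": link.text,
            "href": link.get("href"),
            "born": int(match.group(1)) if match else None,
        })
    return rows


members = extract_members(PAGE)
for member in members:
    print(member)
```

The XPath expression narrows the document down to the elements of interest, and the regular expression then handles the free text inside them. The paragraph with class "note" is ignored because it does not match the XPath expression, even though it also contains a four-digit year.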

The course's applied elements make use of the programming language R. Although several other languages are still more common for the purposes of web scraping, R has come into its own in recent years with the publication of several extensions to support even complex web-scraping tasks. The main advantage of R for web scraping is that the whole research process – from data collection all the way to data wrangling, analysis, visualisation and publication – can be achieved within the same framework. Many social scientists might already have some familiarity with the language from a different context, making the initial steps in web scraping a little less daunting.

By the end of the course, you should have a basic understanding of fundamental web-scraping techniques, and know how to conduct simple web-scraping tasks from static websites you encounter in your research.

If you aim to accomplish more complex tasks, the course should give you a sense of the self-study techniques required to build on what you have already learned here.

Basic familiarity with R is required.

We will briefly review some of the most important concepts in advanced R at the beginning of the course, but you should already know the basics.

Day | Topic | Details
Friday afternoon | 1 Introduction | Overview of the course; Introduction to the most important web technologies; Typical web interactions; Accessing web resources from R
Friday afternoon | 2 Very brief advanced R refresher | Ensuring that all participants have the necessary background in R to follow along for the rest of the course
Friday afternoon | 3 HTML + Exercises | Understanding the structure of HTML source code; Exercise: Building a simple website from scratch
Friday afternoon | 4 XPath + Exercises | Learning a syntax for extracting elements from the HTML source code
Saturday morning | 5 Regular expressions + Exercises | Learning a syntax for extracting pieces of natural text; Introducing the functionality in R for conducting text operations
Saturday morning | 6 HTTP/URLs + Exercises | Understanding client-server interactions; Understanding HTTP methods and their relationship to the URL; Understanding the components of URLs; URL manipulation
Saturday morning | 7 API + Exercises | Understanding application programming interfaces; The practical aspects of querying an API
Saturday afternoon | 8 Application 1: Scraping the European Parliament website | Extracting a dataset with information on legislators from the European Parliament website
Saturday afternoon | 9 Application 2: Scraping a news website | Extracting a small corpus of news articles

Friday afternoon | Basics; Regular Expressions | The first session provides an overview of web technologies. It will explore the base capabilities of R to gather data from the web, and to store data and manipulate files. It will also introduce Regular Expressions and how to use them to handle text and extract information.
Saturday morning | HTTP; HTML and XML; XPath | Frequently used web formats and tools that help extract specific pieces of information from a website.
Saturday afternoon | Web services and APIs | The third session introduces web services and APIs, what they offer and how we might incorporate them into R.
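
To make the sessions on URLs and APIs a little more concrete, here is a minimal sketch of splitting a URL into its components, changing a query parameter, and reading a JSON response. It is written in Python's standard library purely for self-containment, whereas the course does this in R. The endpoint api.example.org, its parameters, the canned response, and the helper names are all hypothetical. No request is sent; the JSON string stands in for what such an API might return.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

# Hypothetical API endpoint; example.org is reserved for documentation.
URL = "https://api.example.org/v1/members?country=DE&page=2"

# Canned JSON standing in for an API response (assumed structure).
RESPONSE = (
    '{"page": 3, "results": ['
    '{"name": "Ada Example", "party": "A"}, '
    '{"name": "Ben Sample", "party": "B"}]}'
)


def describe_url(url):
    """Split a URL into scheme, host, path and parsed query parameters."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "host": parts.netloc,
        "path": parts.path,
        "query": parse_qs(parts.query),
    }


def with_query(url, **params):
    """Return the URL with the given query parameters added or replaced."""
    parts = urlsplit(url)
    # Keep the first value of each existing parameter, then apply overrides.
    query = {key: values[0] for key, values in parse_qs(parts.query).items()}
    query.update({key: str(value) for key, value in params.items()})
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(query), parts.fragment)
    )


def parse_results(body):
    """Turn the JSON body into a list of (name, party) tuples."""
    payload = json.loads(body)
    return [(row["name"], row["party"]) for row in payload["results"]]


info = describe_url(URL)
next_url = with_query(URL, page=3)
rows = parse_results(RESPONSE)

print(info)
print(next_url)
print(rows)
```

Paging through an API usually amounts to this kind of URL manipulation in a loop: request a page, parse the structured response, then build the URL for the next page.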

Day | Readings
Friday afternoon | N/A
Saturday morning | N/A
Saturday afternoon | N/A

Software Requirements

R and RStudio

Hardware Requirements

Please bring your own laptop.

Literature

Khalil, Salim and Mohamed Fakir. 2017. RCrawler: An R package for parallel web crawling and scraping. SoftwareX, 6, 98-106.

Lawson, Richard. 2015. Web scraping with Python: Scrape data from any website with the Power of Python. Birmingham: Packt.

Mitchell, Ryan. 2015. Web scraping with Python: Collecting more data from the modern web. Sebastopol: O’Reilly.

Munzert, Simon, Christian Rubba, Peter Meißner and Dominic Nyhuis. 2015. Automated web data collection with R: A practical guide to web scraping and text mining. Hoboken: Wiley.

Nolan, Deborah and Duncan Temple Lang. 2014. XML and web technologies for data sciences with R. New York: Springer.

Recommended Courses to Cover Before this One

Introduction to R

Recommended Courses to Cover After this One

Summer School Programming in the Social Sciences: Web Scraping, Social Media, and New (Big) Data with Python

Winter School Programming in the Social Sciences: Web Scraping, Social Media, and New (Big) Data with Python

Quantitative Text Analysis