ECPR Winter School
University of Bamberg, Bamberg
22 February - 1 March 2019




WA105 - Automated Web Data Collection with R

Instructor Details

Dominic Nyhuis

Institution:
Universität Hannover

Instructor Bio

Dominic Nyhuis is a Postdoctoral Researcher at Leibniz University Hannover, Chair for Comparative Politics and the Political System of Germany.

Prior to this, he was affiliated with the Universities of Frankfurt, Vienna, and Mainz.

Dominic received his PhD in Political Science from the University of Mannheim. His research focuses on comparative legislative studies, party politics, and municipal politics. Methodologically, his work relies on quantitative methods, web scraping, and automated text analysis.

 @dominic_nyhuis


Course Dates and Times

Friday 22 February 13:00–15:00 and 15:30–18:00

Saturday 23 February 09:00–12:30 and 14:00–17:30

Prerequisite Knowledge

Basic familiarity with R is required.

We will briefly review some of the most important concepts in advanced R at the beginning of the course, but you should already know the basics.

Short Outline

Web data is rapidly changing empirical social science. Whereas scholars used to be confronted with severe data scarcity, web data has completely changed the rules of the game.

Countless data sources are easily accessible, and the primary challenge is merely one of collecting and managing the available information. What is more, this type of data is no longer reserved for well-funded research projects: individual researchers and students can also generate large-scale web-based datasets with little effort.

This course introduces the most important tools for collecting and managing web data. It does so by relying on R, which has vastly expanded its web-scraping functionality in recent years. The major advantage of R for web scraping is that all the steps in a research project can be taken in a coherent framework – data collection, data processing, data analysis and visualisation – ensuring a more friction-free research process.

Tasks for ECTS Credits

1 credit (pass/fail grade). Attend at least 90% of course hours, participate fully in in-class activities, and carry out the necessary reading and/or other work prior to, and after, class.

Long Course Outline

Data on the web is fundamentally changing research practices in the social sciences. These changes are evidenced not least by the inception of entirely new research fields, developments associated with terms like computational social science and data science. These developments carry enormous potential, particularly for young scholars without access to major research funds.

By mastering the tools needed for automated web data collection, a single researcher can construct a dataset that would have required tremendous effort and expense not too long ago. What is more, while the tools are not too difficult to master, they are still sufficiently rare that practitioners can claim a unique, highly sought-after skillset – in academia and beyond.

This course will give you an applied overview of the skills required to automatically collect data from the web.

It will provide an introduction to some of the most important skills and techniques. In particular, it introduces the basic structure of HTML to enable an understanding of the underlying architecture and mechanics of websites. XPath will be introduced as a syntax for addressing specific elements of websites, along with tools to extract them as needed. Regular expressions are covered as a means of further processing textual data gathered from the web. Client-server interactions via HTTP and the structure of URLs are discussed to understand web interactions in practice. APIs are discussed to equip you with the knowledge of how to collect data from dedicated data access points.
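As a rough illustration of how these pieces fit together, the following minimal sketch parses a small HTML fragment, selects elements with an XPath expression, cleans the extracted text with a regular expression, and takes a URL apart. It is not course material: the course itself works in R, whereas the sketch uses only Python's standard library. The page content, the example.org address, and the function name extract_members are hypothetical. Note also that xml.etree.ElementTree supports only a limited XPath subset and requires well-formed markup, which real websites often do not provide.

```python
import re
import xml.etree.ElementTree as ET
from urllib.parse import parse_qs, urljoin, urlsplit

# Hypothetical, well-formed HTML fragment (not taken from a real website).
PAGE = """<html>
  <body>
    <h1>Members</h1>
    <ul>
      <li class="member"><a href="/member?id=101&amp;term=9">Ada Example (born 1970)</a></li>
      <li class="member"><a href="/member?id=102&amp;term=9">Ben Sample (born 1982)</a></li>
      <li class="note">Updated weekly</li>
    </ul>
  </body>
</html>"""


def extract_members(page, base_url="https://www.example.org"):
    """Return one record per link inside a list item with class 'member'."""
    root = ET.fromstring(page)
    records = []
    # XPath: every <a> that is a child of an <li> whose class is 'member'.
    for link in root.findall(".//li[@class='member']/a"):
        text = link.text or ""
        # Regular expression: split "Name (born YYYY)" into name and year.
        match = re.search(r"^(.*?)\s*\(born (\d{4})\)$", text)
        # URL handling: resolve the relative link, then read its query string.
        url = urljoin(base_url, link.get("href"))
        query = parse_qs(urlsplit(url).query)
        records.append({
            "name": match.group(1) if match else text,
            "born": int(match.group(2)) if match else None,
            "url": url,
            "id": query.get("id", [None])[0],
        })
    return records


if __name__ == "__main__":
    for record in extract_members(PAGE):
        print(record)
```

Calling extract_members(PAGE) yields one record per list item with the class "member"; the third list item is skipped because the XPath predicate does not match it.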

The course's applied elements make use of the programming language R. Although several other languages are still more common for the purposes of web scraping, R has come into its own in recent years with the publication of several extensions to support even complex web-scraping tasks. The main advantage of R for web scraping is that the whole research process – from data collection all the way to data wrangling, analysis, visualisation and publication – can be achieved within the same framework. Many social scientists might already have some familiarity with the language from a different context, making the initial steps in web scraping a little less daunting.

By the end of the course, you should have a basic understanding of fundamental web-scraping techniques and know how to carry out simple web-scraping tasks on the static websites you encounter in your research.

If you aim to accomplish more complex tasks, the course should give you a sense of the self-study techniques required to build on what you have already learned here.

Day-to-Day Schedule

Friday afternoon

Topics:
1 Introduction
2 Very brief advanced R refresher
3 HTML + Exercises
4 XPath + Exercises

Details:
1 Overview of the course; Introduction to the most important web technologies; Typical web interactions; Accessing web resources from R
2 Ensuring that all participants have the necessary background in R to follow along for the rest of the course
3 Understanding the structure of HTML source code; Exercise: Building a simple website from scratch
4 Learning a syntax for extracting elements from the HTML source code

Saturday morning

Topics:
5 Regular expressions + Exercises
6 HTTP/URLs + Exercises
7 API + Exercises

Details:
5 Learning a syntax for extracting pieces of natural text; Introducing the functionality in R for conducting text operations
6 Understanding client-server interactions; Understanding HTTP methods and their relationship to the URL; Understanding the components of URLs; URL manipulation
7 Understanding application programming interfaces; The practical aspects of querying an API

Saturday afternoon

Topics:
8 Application 1: Scraping the European Parliament website
9 Application 2: Scraping a news website

Details:
8 Extracting a dataset with information on legislators from the European Parliament website
9 Extracting a small corpus of news articles

Day-to-Day Reading List

Friday afternoon: N/A

Saturday morning: N/A

Saturday afternoon: N/A

Software Requirements

R and RStudio

Hardware Requirements

Please bring your own laptop.

Literature

Khalil, Salim and Mohamed Fakir. 2017. RCrawler: An R package for parallel web crawling and scraping. SoftwareX, 6, 98-106.

Lawson, Richard. 2015. Web scraping with Python: Scrape data from any website with the Power of Python. Birmingham: Packt.

Mitchell, Ryan. 2015. Web scraping with Python: Collecting more data from the modern web. Sebastopol: O’Reilly.

Munzert, Simon, Christian Rubba, Peter Meißner and Dominic Nyhuis. 2015. Automated web data collection with R: A practical guide to web scraping and text mining. Hoboken: Wiley.

Nolan, Deborah and Duncan Temple Lang. 2014. XML and web technologies for data sciences with R. New York: Springer.

The following other ECPR Methods School courses could be useful in combination with this one in a ‘training track’.

Recommended Courses Before

Introduction to R

Recommended Courses After

Summer School Programming in the Social Sciences: Web Scraping, Social Media, and New (Big) Data with Python

Winter School Programming in the Social Sciences: Web Scraping, Social Media, and New (Big) Data with Python
Quantitative Text Analysis

Additional Information

Disclaimer

This course description may be subject to subsequent adaptations (e.g. taking into account new developments in the field, participant demands, or group size). Registered participants will be informed in due time.

Note from the Academic Convenors

By registering for this course, you confirm that you possess the knowledge required to follow it. The instructor will not teach these prerequisite items. If in doubt, contact the instructor before registering.

