ECPR

Automated Web Data Collection with R

Course Dates and Times

Friday 2 March
13:00–15:00 and 15:30–17:00

Saturday 3 March
09:00–10:30 / 11:00–12:00 and 13:00–14:30

Dominic Nyhuis

dominic.nyhuis@gmail.com

Universität Hannover

The increasing availability of online data is rapidly changing empirical social science.

Whereas scholars used to be confronted with severe data sparsity problems, web data has completely changed the rules of the game. Countless sources of data on human behaviour are now easily accessible, and the primary challenge has become collecting and managing the available information. What is more, not only can large research projects harness this type of data, but even individual researchers and students can easily generate large-scale web-based data sets.

This course introduces some of the techniques necessary to collect and manage web data.

It does so by relying on R, which has vastly extended its functionality for performing web scraping in recent years. The major advantage of R in web scraping is that all the steps in a research project can be taken in a common software solution – data collection, data processing, data analysis and visualisation – ensuring a more friction-free research process.

At the end of the course, you should be able to extract information from websites to build your own data set of (for example) press releases, news articles, or parliamentary speeches.


Instructor Bio

Dominic Nyhuis is a Postdoctoral Researcher at Leibniz University Hannover, Chair for Comparative Politics and the Political System of Germany.

Prior to this, he was affiliated with the Universities of Frankfurt, Vienna, and Mainz.

Dominic received his PhD in Political Science from the University of Mannheim. His research focuses on comparative legislative studies, party politics, and municipal politics. Methodologically, his work relies on quantitative methods, web scraping, and automated text analysis.

 @dominic_nyhuis

This course is a primer on web scraping with R, i.e. the automated gathering of data and text from the web.

Web scraping can enable new perspectives on social science research problems, and it can even help to open up entirely new subjects by making available data that only recently came into being, such as social media and blog posts or webpage linkage networks. It also greatly increases the amount of data available to researchers.

Although web scraping is not new, it is only starting to be recognised by a broad social science audience. In the past, statistical software used by social scientists, such as SPSS, Stata or R, did not have web scraping capabilities, while languages that could scrape the web, such as PHP, Perl or Python, were alien to social scientists.

What's more, web scraping tends to be presented either in purely technical form, as ‘from programmers for programmers’ manuals, or as specialised single-issue case studies that make it hard to get the general picture.

Being well aware that social scientists usually want to do substantive research instead of learning yet another piece of software, the course builds on a thorough but gentle, hands-on web scraping textbook written by social scientists for social scientists: Automated Data Collection with R – A Practical Guide to Web Scraping and Text Mining.

The book rests on three pillars:

  1. providing an introduction to web technologies and associated tools
  2. presenting handy R packages and guidance for web scraping
  3. illustrating real-life applications with case studies.

Our main goal is to provide a solid overview of the most important web technologies, how they are connected and which tools are available in R to handle them.

After completing this course, you should be able to handle simple scraping scenarios: downloading a variety of file formats, extracting lists and tables from HTML documents, and retrieving easily accessible information.

Prerequisites: intermediate knowledge of R.

  • Do you know how to subset vectors, data frames and lists?
  • Can you transform lists into data frames?
  • Are you familiar with the apply functionality?
  • Can you write a loop?
  • Can you write a function?
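
As a rough self-check, the skills in the list above might look like the following in base R. This is only an illustrative sketch: the toy data and all object names (speeches, speech_df, words_per_minute and so on) are invented here and do not come from the course materials.

```r
# Hypothetical toy data, invented for illustration only.
speeches <- list(
  list(speaker = "A", words = 120),
  list(speaker = "B", words = 85),
  list(speaker = "C", words = 240)
)

# Subsetting a vector and a list
word_counts <- c(120, 85, 240)
long_counts <- word_counts[word_counts > 100]  # logical subsetting
first_speaker <- speeches[[1]]$speaker         # element by position, then by name

# Transforming a list of records into a data frame
speech_df <- do.call(rbind, lapply(speeches, as.data.frame))

# Subsetting a data frame by a condition on one column
long_speeches <- speech_df[speech_df$words > 100, ]

# The apply family: one result per list element
speakers <- sapply(speeches, function(x) x$speaker)

# A loop
total_words <- 0
for (n in speech_df$words) {
  total_words <- total_words + n
}

# A function
words_per_minute <- function(words, minutes) {
  words / minutes
}
```

If each of these lines reads naturally to you, the R part of the prerequisites is probably covered.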

If the answer to all of these is yes, you are well prepared. If the answer to any of them is no, there are many well-written introductions to R, for example:

  • Crawley, Michael J. 2012. The R Book, 2nd Edition. Hoboken, NJ: John Wiley & Sons.
  • Adler, Joseph. 2009. R in a Nutshell. A Desktop Quick Reference. Sebastopol, CA: O’Reilly.
  • Teetor, Paul. 2011. R Cookbook. Sebastopol, CA: O’Reilly.

Online resources

Schedule: day, topic, details

Friday afternoon: Basics; Regular Expressions

An overview of web technologies. We will explore the base capabilities of R to gather data from the web, and to store data and manipulate files.

I will introduce Regular Expressions and how to use them to handle text and extract information.
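
To give a flavour of what regular expressions do, here is a minimal sketch using only base R functions (grepl, regexpr, regmatches, sub). The example strings and the date pattern are assumptions made up for illustration; real data usually needs a more careful pattern.

```r
# Hypothetical press-release snippets, invented for illustration.
releases <- c(
  "Press release, 12 March 2015: Minister visits Hannover",
  "Press release, 3 April 2015: New budget presented",
  "Statement without a date"
)

# Pattern: one or two digits, a space, a capitalised word (the month),
# a space, and four digits (the year)
date_pattern <- "[0-9]{1,2} [A-Z][a-z]+ [0-9]{4}"

# Which strings contain a date at all?
has_date <- grepl(date_pattern, releases)

# regmatches() returns only the successful matches, so they are written
# back into a vector of NAs to keep one entry per input string
dates <- rep(NA_character_, length(releases))
dates[has_date] <- regmatches(releases, regexpr(date_pattern, releases))

# Remove everything up to and including the colon to keep the headline;
# strings without a colon are returned unchanged
headlines <- sub("^.*: ", "", releases)
```

The same two steps, detecting a pattern and then extracting or replacing the matched text, carry over to most text-cleaning tasks in a scraping project.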

Saturday morning: HTTP; HTML and XML; XPath

Frequently used web formats and tools that help extract specific pieces of information from a website.
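
As an illustration of how XPath picks specific pieces out of an HTML document, here is a small sketch. It assumes the xml2 package is installed, which the course description does not name, so treat the package choice as an assumption. The HTML snippet, class names and links are invented.

```r
library(xml2)  # assumed to be installed; not named in the course description

# Hypothetical HTML page, invented for illustration.
page <- read_html('<html><body>
  <h1>Press office</h1>
  <ul class="releases">
    <li><a href="/pr/1">Budget presented</a></li>
    <li><a href="/pr/2">Minister visits Hannover</a></li>
  </ul>
  <ul class="links">
    <li><a href="/contact">Contact</a></li>
  </ul>
</body></html>')

# XPath: every <a> element inside a list item of the list whose
# class attribute is "releases"; the "links" list is ignored
nodes <- xml_find_all(page, "//ul[@class='releases']/li/a")

# Collect the link text and the link target into a data frame
releases <- data.frame(
  title = xml_text(nodes),
  url = xml_attr(nodes, "href"),
  stringsAsFactors = FALSE
)
```

The XPath expression does the selecting; the remaining lines only turn the selected nodes into a data set.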

Saturday afternoon: Web services and APIs

Web services and APIs, what they offer, and how we might incorporate them into R.
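
Working with a web API typically involves building a request URL and parsing the structured response. The sketch below shows both steps without touching the network. The endpoint https://api.example.org/speeches, its parameters, the JSON response and the helper build_query are all hypothetical, and the jsonlite package is an assumption not mentioned in the course description.

```r
library(jsonlite)  # assumed to be installed; not named in the course description

# Hypothetical helper: turn a named list of parameters into a query URL
build_query <- function(base_url, params) {
  values <- vapply(
    params,
    function(p) URLencode(as.character(p), reserved = TRUE),
    character(1)
  )
  pairs <- paste0(names(params), "=", values)
  paste0(base_url, "?", paste(pairs, collapse = "&"))
}

# Hypothetical endpoint and parameters, invented for illustration
request_url <- build_query(
  "https://api.example.org/speeches",
  list(q = "climate policy", year = 2015)
)

# A made-up JSON response of the kind such an API might return
response <- '[{"speaker": "A", "words": 120}, {"speaker": "B", "words": 85}]'

# fromJSON() simplifies an array of records into a data frame
speeches <- fromJSON(response)
```

In a real project the response string would come from an HTTP request to request_url rather than from a literal.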

Software Requirements

R and RStudio

Hardware Requirements

Please bring your own laptop.

Literature

Munzert, Simon, Christian Rubba, Peter Meißner, and Dominic Nyhuis, 2014: Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining. Wiley.

Barberá, Pablo, 2015: Birds of the Same Feather Tweet Together: Bayesian Ideal Point Estimation Using Twitter Data. Political Analysis 23:76–91.

Barberá, Pablo, and Gonzalo Rivero, 2014: Understanding the Political Representativeness of Twitter users. Social Science Computer Review 1–18.

Chadefaux, Thomas, 2014: Early warning signals for war in the news. Journal of Peace Research 51:5–18.

Gayo-Avello, Daniel, 2013: A Meta-Analysis of State-of-the-Art Electoral Prediction From Twitter Data. Social Science Computer Review 31:649–679.

Gill, Michael, and Arthur Spirling, 2015: Estimating the Severity of WikiLeaks U.S. Diplomatic Cables Disclosure. Political Analysis 23:299–305.

Gohdes, Anita R., 2015: Pulling the plug: Network disruptions and violence in civil conflict. Journal of Peace Research 52:1–16.

Hassanpour, Navid, 2013: Tracking the Semantics of Politics: A Case for Online Data Research in Political Science. PS: Political Science & Politics 46:299–306.

King, Gary, Jennifer Pan, and Margaret E. Roberts, 2013: How Censorship in China Allows Government Criticism but Silences Collective Expression. American Political Science Review 107:326–343.

Rød, Espen Geelmuyden, and Nils B. Weidmann, 2015: Empowering activists or autocrats? The Internet in authoritarian regimes. Journal of Peace Research 52:1–14.

Sagi, Eyal, and Morteza Dehghani, 2014: Measuring Moral Rhetoric in Text. Social Science Computer Review 32:132–144.

Shaw, Aaron, and Benjamin Mako Hill, 2014: Laboratories of Oligarchy? How the Iron Law Extends to Peer Production. Journal of Communication 64:215–238.

Street, Alex, Thomas A. Murray, John Blitzer, and Rajan S. Patel, 2015: Estimating Voter Registration Deadline Effects with Web Search Data. Political Analysis 1–2.

Zeitzoff, Thomas, 2011: Using Social Media to Measure Conflict Dynamics: An Application to the 2008–2009 Gaza Conflict. Journal of Conflict Resolution 55:938–969.

Recommended Courses to Cover Before this One

Introduction to R

Recommended Courses to Cover After this One

Summer School

Programming in the Social Sciences: Web Scraping, Social Media, and New (Big) Data with Python

Winter School

Programming in the Social Sciences: Web Scraping, Social Media, and New (Big) Data with Python

Quantitative Text Analysis