ECPR Winter School
University of Bamberg, Bamberg
2 - 9 March 2018




WA105 - Automated Web Data Collection with R

Instructor Details

Dominic Nyhuis

Institution:
Johann Wolfgang Goethe-Universität Frankfurt

Instructor Bio

Dominic Nyhuis is a researcher at the Department of Social Sciences, Goethe University Frankfurt. Prior to this, he was affiliated with the Universities of Vienna, Mannheim, and Mainz. His research focuses on comparative legislative studies with a particular emphasis on questions of representation, small-area policy preferences, and municipal politics. He is also interested in techniques for automated data collection and quantitative methods for the social sciences.


Course Dates and Times

Friday 2 March
13:00-15:00 and 15:30-17:00

Saturday 3 March
09:00-10:30 / 11:00-12:00 and 13:00-14:30
 

Prerequisite Knowledge

An intermediate knowledge of R is a prerequisite for the course, as there is no time for an R refresher. The following questions can help you assess whether your familiarity with R suffices to follow the course:

  • Do you know how to subset vectors, data frames and lists?
  • Can you transform lists into data frames?
  • Are you familiar with the apply family of functions?
  • Can you write a loop?
  • Can you write a function?
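
As a rough self-check, the questions above correspond to operations like the ones in the following base R sketch. It is only an illustration, not course material: the data and all object names (speeches, word_counts, share_of_total) are invented for this purpose.

```r
# Invented example data: a list of records, as often produced by scraping.
speeches <- list(
  list(speaker = "A", words = 120),
  list(speaker = "B", words = 95),
  list(speaker = "C", words = 210)
)

# 1. Subsetting vectors and lists
x <- c(10, 20, 30, 40)
x[x > 15]                # logical subsetting of a vector
speeches[[2]]$speaker    # extracting an element from a nested list

# 2. Transforming a list into a data frame (one row per list element)
df <- do.call(rbind, lapply(speeches, as.data.frame, stringsAsFactors = FALSE))
df[df$words > 100, "speaker"]   # subsetting a data frame by condition

# 3. The apply family: pull one field out of every list element
word_counts <- sapply(speeches, function(s) s$words)

# 4. A loop: sum the word counts step by step
total <- 0
for (w in word_counts) {
  total <- total + w
}

# 5. A function: each count as a share of the total
share_of_total <- function(counts) {
  counts / sum(counts)
}
shares <- share_of_total(word_counts)
```

If each of these lines reads naturally to you, the level of R assumed in the course should not pose a problem.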

If the answer to all of these questions is yes, you are well prepared for the course. If the answer to any of them is no, the following well-written introductions to R can help you catch up:

  • Crawley, Michael J. 2012. The R Book, 2nd Edition. Hoboken, NJ: John Wiley & Sons.
  • Adler, Joseph. 2009. R in a Nutshell. A Desktop Quick Reference. Sebastopol, CA: O'Reilly.
  • Teetor, Paul. 2011. R Cookbook. Sebastopol, CA: O'Reilly.

Besides these, there are introductions and manuals at http://cran.r-project.org/manuals.html or http://cran.r-project.org/other-docs. A fine online tutorial is available at http://tryr.codeschool.com/. Additionally, Quick-R (http://www.statmethods.net/) is a good reference for many basic commands. You can also find a lot of free resources and examples at http://www.ats.ucla.edu/stat/r/ and if you want to dive deeper into R, have a look at Hadley Wickham’s page at http://adv-r.had.co.nz/.

Short Outline

The increasing availability of online data is rapidly changing empirical social science. Whereas scholars used to be confronted with severe data sparsity problems, web data has completely changed the rules of the game. Countless sources of human behaviour are easily accessible and the primary challenge is one of collecting and managing the available information. What is more, not only are large research projects able to harness this type of data, but even individual researchers and students can easily generate large-scale web-based data sets.

This course introduces some of the techniques necessary to collect and manage web data. It does so by relying on R, whose web scraping functionality has been vastly extended in recent years. The major advantage of R in web scraping is that all the steps in a research project (data collection, data processing, data analysis and visualization) can be taken in a single software environment, which makes for a smoother research process. At the end of the course, you should be able to extract information from websites to build your own data set of, e.g., press releases, news articles, or parliamentary speeches.

Long Course Outline

Automated web data collection with R is a course designed to provide a primer on web scraping with R, i.e. the automated gathering of data and text from the web. Web scraping can enable new perspectives on social science research problems, and it can even help to develop new subjects altogether by making data available that only recently came into being, e.g., social media and blog posts or webpage linkage networks. Similarly, web scraping greatly increases the amount of data available to researchers.

Although web scraping is not new, it is only starting to be recognized by a broad social science audience. In the past, statistical software used by social scientists like SPSS, Stata or R did not have web scraping capabilities, while software that had the means for web scraping like PHP, Perl or Python was alien to social scientists. Furthermore, web scraping tends to be presented either in purely technical terms, in the form of ‘from programmers for programmers’ manuals, or as specialized single-issue case studies that make it hard to get the general picture. Being well aware that social scientists usually want to do substantive research instead of learning yet another piece of software, the course builds on a thorough but gentle, hands-on web scraping textbook written by social scientists for social scientists: Automated Data Collection with R. A Practical Guide to Web Scraping and Text Mining. The book rests on three pillars: (1) providing an introduction to web technologies and associated tools, (2) presenting handy R packages and guidance for web scraping and (3) illustrating real-life applications with case studies.

The main goal of the course is to provide a solid overview of the most important web technologies, how they are connected and which tools are available in R to handle them. After completing the course, you should be able to handle simple scraping scenarios: downloading various file formats, extracting lists and tables from HTML documents or retrieving easily accessible information.

Day-to-Day Schedule

Friday afternoon: Basics; Regular Expressions

The first session provides an overview of web technologies. It will explore the base capabilities of R to gather data from the web, and to store data and manipulate files. It will also introduce Regular Expressions and how to use them to handle text and extract information.
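
To give a flavour of this kind of text extraction, the following is a minimal base R sketch. It is an assumption about what such an exercise might look like, not an excerpt from the session; the example text and the pattern are invented for illustration.

```r
# Invented snippet of scraped text.
raw_text <- "Press release, 2 March 2018: The school runs until 9 March 2018."

# Pattern: one or two digits, a space, a capitalised word (the month),
# a space, and a four-digit year.
date_pattern <- "[0-9]{1,2} [A-Z][a-z]+ [0-9]{4}"

# gregexpr() locates all matches; regmatches() extracts the matched text.
dates <- regmatches(raw_text, gregexpr(date_pattern, raw_text))[[1]]

# sub() with a capture group keeps only the year from each extracted date.
years <- sub(".* ([0-9]{4})$", "\\1", dates)
```

The same two-step pattern, locating matches and then extracting or rewriting them, carries over to most text clean-up tasks on scraped pages.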

Saturday morning: HTTP; HTML and XML; XPath

The second session deals with frequently used web formats and tools that help extract specific pieces of information from a website.

Saturday afternoon: Web services and APIs

The third session introduces web services and APIs, what they offer and how we might incorporate them into R.

Software Requirements

R and RStudio

Hardware Requirements

Participants should bring their own laptops.

Literature

Munzert, Simon, Christian Rubba, Peter Meißner, and Dominic Nyhuis, 2014: Automated Data Collection with R. A Practical Guide to Web Scraping and Text Mining. Wiley.

Barberá, Pablo, 2015: Birds of the Same Feather Tweet Together. Bayesian Ideal Point Estimation Using Twitter Data. Political Analysis 23:76–91.

Barberá, Pablo, and Gonzalo Rivero, 2014: Understanding the Political Representativeness of Twitter users. Social Science Computer Review 1–18.

Chadefaux, Thomas, 2014: Early warning signals for war in the news. Journal of Peace Research 51:5–18.

Gayo-Avello, Daniel, 2013: A Meta-Analysis of State-of-the-Art Electoral Prediction From Twitter Data. Social Science Computer Review 31:649–679.

Gill, Michael, and Arthur Spirling, 2015: Estimating the Severity of WikiLeaks U.S. Diplomatic Cables Disclosure. Political Analysis 23:299–305.

Gohdes, Anita R., 2015: Pulling the plug: Network disruptions and violence in civil conflict. Journal of Peace Research 52:1–16.

Hassanpour, Navid, 2013: Tracking the Semantics of Politics: A Case for Online Data Research in Political Science. PS: Political Science & Politics 46:299–306.

King, Gary, Jennifer Pan, and Margaret E. Roberts, 2013: How Censorship in China Allows Government Criticism but Silences Collective Expression. American Political Science Review 107:326–343.

Rød, Espen Geelmuyden, and Nils B. Weidmann, 2015: Empowering activists or autocrats? The Internet in authoritarian regimes. Journal of Peace Research 52:1–14.

Sagi, Eyal, and Morteza Dehghani, 2014: Measuring Moral Rhetoric in Text. Social Science Computer Review 32:132–144.

Shaw, Aaron, and Benjamin Mako Hill, 2014: Laboratories of Oligarchy? How The Iron Law Extends to Peer Production. Journal of Communication 64:215–238.

Street, Alex, Thomas A. Murray, John Blitzer, and Rajan S. Patel, 2015: Estimating Voter Registration Deadline Effects with Web Search Data. Political Analysis 1–2.

Zeitzoff, Thomas, 2011: Using Social Media to Measure Conflict Dynamics: An Application to the 2008 - 2009 Gaza Conflict. Journal of Conflict Resolution 55:938–969.

The following other ECPR Methods School courses could be useful in combination with this one in a ‘training track’.

Recommended Courses Before

Introduction to R

Recommended Courses After

Summer School

Programming in the Social Sciences: Web Scraping, Social Media, and New (Big) Data with Python

 

Winter School

Programming in the Social Sciences: Web Scraping, Social Media, and New (Big) Data with Python

Quantitative Text Analysis

Additional Information

Disclaimer

The information contained in this course description form may be subject to subsequent adaptations (e.g. taking into account new developments in the field, specific participant demands, group size etc.). Registered participants will be informed in due time in case of adaptations.

Note from the Academic Convenors

By registering for this course, you certify that you possess the prerequisite knowledge required to follow it. The instructor will not teach these prerequisite items. If you are not sure whether you possess this knowledge to a sufficient level, we suggest you contact the instructor before you proceed with your registration.

