ECPR Winter School
University of Bamberg, Bamberg
26 February - 4 March 2016




WC104 - Automated Web Data Collection with R

Instructor Details

Peter Meissner

Institution:
Universität Konstanz

Instructor Bio

Peter is a political scientist with a strong methodological and technical background. He currently works for the “Comparative Parliamentary Politics” working group at the University of Konstanz within the “Institutional Design in European Parliaments” project, where he takes care of software development, quality assurance and data management. His publications cover political careers, parliamentary rule changes, and web and text data collection. He has also written software packages for R, namely wikipediatrend, for accessing page view statistics of Wikipedia articles, and diffr, a package for comparing texts and assessing their differences. Peter tweets at https://twitter.com/marvin_dpr and can be visited online at http://pmeissner.com.


Course Dates and Times

Monday 29 February to Friday 4 March 2016
Generally classes are either 09:00-12:30 or 14:00-17:30
15 hours over 5 days

Prerequisite Knowledge

At least intermediate practical knowledge of R is an indispensable prerequisite for this course, as there is no time for an R refresher. The following questions help you find out whether your level of R knowledge suffices to follow the course.

  • Do you know how to subset vectors, data frames and lists?
  • Can you transform lists into data frames, matrices into vectors and factors into character vectors?
  • Are you familiar with the apply functionality - apply, lapply, sapply, mapply?
  • Can you write a loop?
  • Can you write a function?
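
As a rough self-check, the questions above correspond to base R operations along the following lines. This is only an illustrative sketch: the objects (`scores`, `people`, `records`) and the helper `rescale()` are made up for this example and are not course material.

```r
# Subsetting vectors, data frames and lists
scores <- c(a = 3, b = 7, c = 5)
scores[scores > 4]                  # logical subsetting of a vector

people <- data.frame(name = c("Ann", "Bo", "Cy"),
                     age  = c(31, 45, 28),
                     stringsAsFactors = FALSE)
people[people$age > 30, "name"]     # rows by condition, column by name

records <- list(list(id = 1, tag = "x"), list(id = 2, tag = "y"))
records[[2]]$tag                    # element of a nested list

# Transforming between data structures
records_df <- do.call(rbind, lapply(records, as.data.frame,
                                    stringsAsFactors = FALSE))
m <- matrix(1:6, nrow = 2)
m_vec <- as.vector(m)               # matrix to vector (column by column)
f <- factor(c("low", "high", "low"))
f_chr <- as.character(f)            # factor to character vector

# The apply family
row_sums <- apply(m, 1, sum)                       # over matrix rows
lengths_list <- lapply(list(1:3, 1:5), length)     # always returns a list
squares <- sapply(1:3, function(x) x^2)            # simplifies to a vector
pasted <- mapply(function(x, y) paste(x, y),       # iterates over two inputs
                 c("a", "b"), c("c", "d"))

# A loop
total <- 0
for (s in scores) {
  total <- total + s
}

# A function
rescale <- function(x) {
  (x - min(x)) / (max(x) - min(x))  # map values onto the range [0, 1]
}
rescale(scores)
```

If every line here reads naturally to you, your R knowledge is probably at the level the course assumes.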

If the answer to all of these questions is yes, you are well prepared to get the best out of this course. If you are not too familiar with R yet but still want to join the course, there are many well-written books on the market that provide great introductions to R:

Crawley, Michael J. 2012. The R Book, 2nd Edition. Hoboken, NJ: John Wiley & Sons.

Adler, Joseph. 2009. R in a Nutshell. A Desktop Quick Reference. Sebastopol, CA: O’Reilly.

Teetor, Paul. 2011. R Cookbook. Sebastopol, CA: O’Reilly.

Besides these commercial sources, there is also a lot of free information on the internet. Visit http://cran.r-project.org/manuals.html and http://cran.r-project.org/other-docs.html for R introductions and manuals. A truly amazing online tutorial for absolute beginners by Code School is available at http://tryr.codeschool.com/. Additionally, Quick-R (http://www.statmethods.net/) is a good reference site for many basic commands. You can also find a lot of free resources and examples at http://www.ats.ucla.edu/stat/r/ and, if you want to dive deeper into R, have a look at Hadley Wickham’s page at http://adv-r.had.co.nz/. To get the most out of the course, it is recommended that you bring a small real-life problem you would like to work on.

Short Outline

Are you interested in the analysis of social media data? Do you want to extract information from websites to build your own data set of, e.g., press releases, online newspaper headlines, parliamentary speeches, or politicians’ biographies? With the rapid growth of the World Wide Web over the past two decades, firms, public institutions and private users now provide every imaginable type of information online, and new channels of communication generate vast amounts of data on human behaviour. Along with the rise of the World Wide Web, we have witnessed a second trend: the increasing popularity and power of open source software like R. For quantitative social scientists, R is among the most important statistical software packages. It is growing rapidly thanks to an active community that constantly publishes new packages. An extraordinarily useful feature of R is that we can use it for collecting, extracting, cleaning and storing data from web sources. Combined with R’s well-known strengths in data analysis, data manipulation and data visualization, this lets researchers stay in a familiar programming environment throughout a research project and focus on substantive problems instead of investing too much time in learning other software like PHP, Python or Perl.

Long Course Outline

Automated Web Data Collection with R is a course designed to give you a primer on web scraping with R. Web scraping is a general term for all kinds of activities that involve the (automated) gathering of data and texts from the web: starting with tiny bits of information – like the current time or geographical location of, say, Ulan Bator, or the current headline of The Sun – up to retrieving hundreds of speeches, texts and other documents from dozens of different web pages. Around the world. Each day. Every 5 minutes. For the next ten months.

Web scraping can give new insights into your social science research project or indeed help you develop a completely new subject by making data available that only recently started to exist – e.g. Twitter and blog posts or webpage linkage networks. It may also simply push the amount of data that can be used from traditional sources to new limits – e.g. working with thousands of speeches, news articles, or law proposals instead of just a handful.

Although web scraping is not a new technique, it is only starting to be recognized by a broad social science audience. So far, statistical software used by social scientists like SPSS, Stata or R did not have web scraping capabilities, while software that had the means for web scraping like PHP, Perl or Python was alien to social scientists. Furthermore, web scraping tends to be presented either in purely technical form, as ‘from programmers for programmers’ manuals, or as specialized single-issue case studies that make it hard to get the general picture. Being well aware that social scientists usually want to do substantive research instead of learning yet another software, the course builds on a thorough but gentle, hands-on web scraping textbook written by social scientists – my colleagues and me – for social scientists: Automated Data Collection with R - A Practical Guide to Web Scraping and Text Mining. The book rests on three essential pillars: (1) providing introductions to web technologies and associated tools, (2) presenting handy R packages and how-tos for web scraping and (3) illustrating real-life applications with case studies. In addition, each chapter (except the case studies) comes with a set of exercises and solutions, making it ‘the book we would have wanted to have when starting to scrape the web ourselves’.

The main goal of the course is to give you a solid overview of the most important web technologies, how they are connected and which tools are available in R to handle them. Having completed the course, you should be equipped to handle simple scraping scenarios, such as downloading various file formats, extracting lists and tables from HTML documents or retrieving information that is easily accessible, and to start your own large-scale project.

Day-to-Day Schedule

Day 1
Topic: Introduction, Basics, Regular Expressions
Details: In the first part of the session we will work on getting an overview of the universe of web technologies and how they are interrelated. Furthermore, we will explore R’s base capabilities to gather data from the web as well as R’s means to store data and manipulate files. In the second part we will start to learn about Regular Expressions and how to use them to handle text and extract information.

Day 2
Topic: HTML and XML; XPath, CSS-Selectors; JSON
Details: The second session is all about frequently used web formats and tools that help to extract information from them, like XPath and CSS-Selectors.

Day 3
Topic: HTTP, web forms, Cookies, JavaScript
Details: In the third session we will cover more advanced topics, the problems we might encounter with them and how to solve them.

Day 4
Topic: Web services and APIs
Details: Having learned a great deal about web formats and how to handle them, in the fourth session we will broaden our view to ‘ready made’ web services, what they might offer and how we might incorporate them into R.

Day 5
Topic: Managing projects and doing research
Details: The last session is reserved for less technical and more research-focused problems, like how to best approach web data gathering projects and which problems might come with using web data instead of more traditional forms and sources of data.

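To give a flavour of the regular-expression material from Day 1, here is a minimal sketch in base R. The HTML fragment and all variable names are invented for illustration and are not taken from the course; on real pages, the XPath and CSS selector tools from Day 2 are usually more robust than regular expressions.

```r
# A made-up HTML fragment standing in for a downloaded page
page <- c(
  '<h2 class="headline">Parliament passes budget</h2>',
  '<p>Published: 2016-02-29</p>',
  '<h2 class="headline">Minister resigns</h2>',
  '<p>Published: 2016-03-01</p>'
)

# Keep only the lines that contain a headline element
headline_lines <- grep('<h2 class="headline">', page,
                       value = TRUE, fixed = TRUE)

# Strip the surrounding tags to obtain the headline text
headlines <- gsub("<[^>]+>", "", headline_lines)

# Extract ISO-formatted dates (YYYY-MM-DD) from the whole page
date_pattern <- "[0-9]{4}-[0-9]{2}-[0-9]{2}"
dates <- unlist(regmatches(page, gregexpr(date_pattern, page)))

# Combine the pieces into a data frame
result <- data.frame(date = as.Date(dates), headline = headlines,
                     stringsAsFactors = FALSE)
print(result)
```

The sketch relies on each headline being followed by exactly one date, which holds for this toy input but would need checking on real data.
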
Day-to-Day Reading List

Day 0
Readings: All readings, if not indicated otherwise, are based on: Munzert, Rubba, Meißner, Nyhuis (2014): Automated Data Collection with R. A Practical Guide to Web Scraping and Text Mining. Wiley. Numbers in parentheses indicate chapter numbers.

Day 1
Readings: Preface, Introduction (1), Regular Expressions and String Functions (8), all introductions to chapters 2-7 (everything until the first headline)

Day 2
Readings: HTML (2), XML (3), XPath (4), https://cran.rstudio.com/web/packages/httr/vignettes/quickstart.html

Day 3
Readings: HTTP (5), AJAX (6), https://cran.r-project.org/web/packages/RSelenium/vignettes/RSelenium-basics.html, https://cran.r-project.org/web/packages/RSelenium/vignettes/RSelenium-headless.html

Day 4
Readings: Scraping the Web (9), https://zapier.com/learn/apis/

Day 5
Readings: Managing Projects (11)

Software Requirements

The latest versions of R and RStudio.

Hardware Requirements

Students must bring their own laptops.

Literature


Munzert, Simon, Christian Rubba, Peter Meißner, and Dominic Nyhuis, 2014: Automated Data Collection with R. A Practical Guide to Web Scraping and Text Mining. Wiley.

Barberá, Pablo, 2015: Birds of the Same Feather Tweet Together. Bayesian Ideal Point Estimation Using Twitter Data. Political Analysis 23:76–91.

Barberá, Pablo, 2014: How Social Media Reduces Mass Political Polarization. Evidence from Germany, Spain, and the U.S. Unpublished Manuscript.

Barberá, Pablo, and Gonzalo Rivero, 2014: Understanding the Political Representativeness of Twitter users. Social Science Computer Review 1–18.

Chadefaux, Thomas, 2014: Early warning signals for war in the news. Journal of Peace Research 51:5–18.

Enos, Ryan D., and Anthony Fowler, 2014: The Effects of Large-Scale Campaigns on Voter Turnout: Evidence from 400 Million Voter Contacts. Unpublished Manuscript.

Gayo-Avello, Daniel, 2013: A Meta-Analysis of State-of-the-Art Electoral Prediction From Twitter Data. Social Science Computer Review 31:649–679.

Gill, Michael, and Arthur Spirling, 2015: Estimating the Severity of WikiLeaks U.S. Diplomatic Cables Disclosure. Political Analysis 23:299–305.

Gohdes, Anita R., 2015: Pulling the plug: Network disruptions and violence in civil conflict. Journal of Peace Research 52:1–16.

Hassanpour, Navid, 2013: Tracking the Semantics of Politics: A Case for Online Data Research in Political Science. PS: Political Science & Politics 46:299–306.

King, Gary, Jennifer Pan, and Margaret E. Roberts, 2013: How Censorship in China Allows Government Criticism but Silences Collective Expression. American Political Science Review 107:326–343.

Rød, Espen Geelmuyden, and Nils B. Weidmann, 2015: Empowering activists or autocrats? The Internet in authoritarian regimes. Journal of Peace Research 52:1–14.

Sagi, Eyal, and Morteza Dehghani, 2014: Measuring Moral Rhetoric in Text. Social Science Computer Review 32:132–144.

Shaw, Aaron, and Benjamin Mako Hill, 2014: Laboratories of Oligarchy? How The Iron Law Extends to Peer Production. Journal of Communication 64:215–238.

Slapin, Jonathan B., and Sven-Oliver Proksch, 2008: A Scaling Model for Estimating Time-Series Party Positions from Texts. American Journal of Political Science 52:705–722.

Street, Alex, Thomas A. Murray, John Blitzer, and Rajan S. Patel, 2015: Estimating Voter Registration Deadline Effects with Web Search Data. Political Analysis 1–2.

Wu, Shaomei, Jake M. Hofman, Winter A. Mason, and Duncan J. Watts, 2011: Who Says What to Whom on Twitter. In: Proceedings of the 20th International Conference on World Wide Web, WWW ‘11, 705–714. New York, NY, USA: ACM.

Zeitzoff, Thomas, 2011: Using Social Media to Measure Conflict Dynamics: An Application to the 2008 - 2009 Gaza Conflict. Journal of Conflict Resolution 55:938–969.

The following other ECPR Methods School courses could be useful in combination with this one in a ‘training track’.
Recommended Courses Before

Summer School:

  • Introduction to the Use of R

Winter School:

  • Introduction to R
Recommended Courses After

Summer School:

  • Research Data Management and Open Data
  • Geographic Information Systems (GIS) for the Social Sciences

Winter School:

  • Inferential Network Analysis
  • Introduction to Applied Social Network Analysis
  • Quantitative Text Analysis

Additional Information

Disclaimer

The information contained in this course description form may be subject to subsequent adaptations (e.g. taking into account new developments in the field, specific participant demands, group size etc.). Registered participants will be informed in due time in case of adaptations.

Note from the Academic Convenors

By registering for this course, you certify that you possess the prerequisite knowledge required to follow it. The instructor will not teach these prerequisite items. If you are not sure whether you possess this knowledge to a sufficient level, we suggest you contact the instructor before you proceed with your registration.

