ECPR


Automated Web Data Collection with R - FULLY BOOKED - contact afoley@ecpr.eu to be added to a waiting list

Dominic Nyhuis
d.nyhuis@ipw.uni-hannover.de

Universität Hannover

Dominic Nyhuis is a Postdoctoral Researcher at Leibniz University Hannover, Chair for Comparative Politics and the Political System of Germany.

Prior to this, he was affiliated with the Universities of Frankfurt, Vienna, and Mainz.

Dominic received his PhD in Political Science from the University of Mannheim. His research focuses on comparative legislative studies, party politics, and municipal politics. Methodologically, his work relies on quantitative methods, web scraping, and automated text analysis.

 @dominic_nyhuis


Course Dates and Times

Friday 3 March: 13:00-15:00 and 15:30-17:00
Saturday 4 March: 09:30-12:00 and 13:00-14:30
7.5 hours over two days

Prerequisite Knowledge

An intermediate knowledge of R is a prerequisite for the course, as there is no time for an R refresher. The following questions can help you assess whether your familiarity with R suffices to follow the course:

  • Do you know how to subset vectors, data frames and lists?
  • Can you transform lists into data frames?
  • Are you familiar with the apply functionality?
  • Can you write a loop?
  • Can you write a function?
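
As a rough self-check, the questions above can be tried out on a few lines of base R. The snippet below is only an illustrative sketch, not course material: all object names (scores, people, records, mean_age) are made up for this example. If every line reads naturally to you, you are probably at the expected level.

```r
# Self-check for the prerequisite questions, using only base R.
# All object names here are hypothetical and chosen for illustration.

# 1. Subsetting vectors, data frames and lists
scores <- c(a = 10, b = 25, c = 40)
high_scores <- scores[scores > 20]          # logical subsetting of a vector

people <- data.frame(name = c("Ann", "Ben", "Cem"),
                     age  = c(31, 45, 28),
                     stringsAsFactors = FALSE)
older <- people[people$age > 30, "name"]    # rows by condition, one column

records <- list(list(id = 1, party = "A"),
                list(id = 2, party = "B"))
second_party <- records[[2]]$party          # [[ ]] extracts a list element

# 2. Transforming a list into a data frame
records_df <- do.call(rbind,
                      lapply(records, as.data.frame, stringsAsFactors = FALSE))

# 3. The apply family: number of characters in each name
name_lengths <- sapply(people$name, nchar)

# 4. A loop: sum the scores step by step
total <- 0
for (s in scores) {
  total <- total + s
}

# 5. A function: mean of the age column of a data frame
mean_age <- function(df) {
  mean(df$age)
}
mean_age(people)
```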

If the answer to all of these is yes, you are well prepared for the course. If the answer is no, but you still want to attend the course, there are many well-written books that provide great introductions to R:

  • Crawley, Michael J. 2012. The R Book, 2nd Edition. Hoboken, NJ: John Wiley & Sons.
  • Adler, Joseph. 2009. R in a Nutshell. A Desktop Quick Reference. Sebastopol, CA: O’Reilly.
  • Teetor, Paul. 2011. R Cookbook. Sebastopol, CA: O’Reilly.

Besides these, there is also a lot of free information on the Web. You’ll find introductions and manuals at http://cran.r-project.org/manuals.html or http://cran.r-project.org/other-docs. A fine online tutorial for beginners is available at http://tryr.codeschool.com/. Additionally, Quick-R (http://www.statmethods.net/) is a good reference site for many basic commands. You can also find a lot of free resources and examples at http://www.ats.ucla.edu/stat/r/ and, if you want to dive deeper into R, have a look at Hadley Wickham’s page at http://adv-r.had.co.nz/.


Short Outline

The increasing availability of online data is rapidly changing empirical social science. Whereas scholars used to be confronted with severe data sparsity problems, web data has completely changed the rules of the game. Countless sources of human behaviour are easily accessible and the primary challenge is one of collecting and managing the available information. What is more, not only are large research projects able to harness this type of data, but even individual researchers and students can easily generate large-scale web-based data sets.

This course introduces some of the techniques necessary to collect and manage web data. It does so by relying on R, which has vastly extended its functionality for performing web scraping in recent years. The major advantage of R in web scraping is that all the steps in a research project (data collection, data processing, data analysis and visualization) can be taken in a common software solution, ensuring a more friction-free research process. At the end of the course, you should be able to extract information from websites to build your own data set of, e.g., press releases, news articles, or parliamentary speeches.


Long Course Outline

Automated web data collection with R is a course designed to provide a primer on web scraping with R, i.e. the automated gathering of data and text from the web. Web scraping can enable new perspectives on social science research problems, and it can even help to develop new subjects altogether by making data available that only recently came into being, e.g., social media and blog posts or webpage linkage networks. Similarly, web scraping greatly increases the amount of data available to researchers.

Although web scraping is not new, it is only starting to be recognized by a broad social science audience. In the past, statistical software used by social scientists like SPSS, Stata or R did not have web scraping capabilities, while software that had the means for web scraping like PHP, Perl or Python was alien to social scientists. Furthermore, web scraping tends to be presented either in purely technical form, as ‘from programmers for programmers’ manuals, or as specialized single-issue case studies that make it hard to get the general picture. Being well aware that social scientists usually want to do substantive research instead of learning yet another piece of software, the course builds on a thorough but gentle, hands-on web scraping textbook written by social scientists for social scientists: Automated Data Collection with R - A Practical Guide to Web Scraping and Text Mining. The book rests on three pillars: (1) providing an introduction to web technologies and associated tools, (2) presenting handy R packages and guidance for web scraping and (3) illustrating real-life applications with case studies.

The main goal of the course is to provide a solid overview of the most important web technologies, how they are connected and which tools are available in R to handle them. After completing the course, you should be able to handle simple scraping scenarios: downloading various file formats, extracting lists and tables from HTML documents or retrieving easily accessible information.

Day | Topic | Details
Friday afternoon | Basics; Regular Expressions | The first session provides an overview of web technologies. It will explore the base capabilities of R to gather data from the web, and to store data and manipulate files. It will also introduce Regular Expressions and how to use them to handle text and extract information.
Saturday morning | HTTP; HTML and XML; XPath | The second session deals with frequently used web formats and tools that help extract specific pieces of information from a website.
Saturday afternoon | Web services and APIs | The third session introduces web services and APIs, what they offer and how we might incorporate them into R.
3 | HTTP, web forms, Cookies, JavaScript | In the third session we will handle more advanced topics, the problems we might encounter and how to solve them.
4 | Web services and APIs | Having learned a great deal about web formats and how to handle them, in the fourth session we will broaden our view to ‘ready-made’ web services, what they might offer and how we might incorporate them into R.
5 | Managing projects and doing research | The last session is reserved for less technical and more research-focused problems, like how to best approach web data gathering projects and which problems might come with using web data instead of more traditional forms and sources of data.
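
To give a flavour of the regular expressions topic from the first session, here is a minimal sketch in base R. It is not taken from the course materials: the HTML snippet, the link paths and all object names (page, press_lines, links, dates, titles, press) are invented for illustration. In a real project the lines would typically come from a downloaded page rather than a hand-written vector.

```r
# A made-up HTML snippet standing in for a downloaded page (hypothetical content).
page <- c(
  '<ul>',
  '<li><a href="/press/2017-03-01-budget.html">Budget statement</a></li>',
  '<li><a href="/press/2017-03-02-election.html">Election results</a></li>',
  '<li><a href="/about.html">About us</a></li>',
  '</ul>'
)

# Keep only the lines that link to a press release
press_lines <- grep('href="/press/', page, value = TRUE)

# Extract the link target: everything between href=" and the closing quote
links <- sub('.*href="([^"]+)".*', "\\1", press_lines)

# Extract dates of the form YYYY-MM-DD from the link targets
dates <- regmatches(links, regexpr("[0-9]{4}-[0-9]{2}-[0-9]{2}", links))

# Extract the link text between the opening tag and </a>
titles <- sub('.*>([^<]+)</a>.*', "\\1", press_lines)

# Combine the pieces into a small data set
press <- data.frame(date  = as.Date(dates),
                    title = titles,
                    link  = links,
                    stringsAsFactors = FALSE)
print(press)
```

Regular expressions work for a small, regular snippet like this one, but they become brittle on real pages; the tools covered in the second session (HTML and XML parsing, XPath) are usually the more robust choice for extracting elements from full documents.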

Day | Readings
Friday afternoon | N/A
Saturday morning | N/A
Saturday afternoon | N/A
3 | HTTP (5), AJAX (6), https://cran.r-project.org/web/packages/RSelenium/vignettes/RSelenium-basics.html, https://cran.r-project.org/web/packages/RSelenium/vignettes/RSelenium-headless.html
4 | Scraping the Web (9), https://zapier.com/learn/apis/
5 | Managing Projects (11)

Software Requirements

R and RStudio

Hardware Requirements

N/A

Literature

Munzert, Simon, Christian Rubba, Peter Meißner, and Dominic Nyhuis, 2014: Automated Data Collection with R. A Practical Guide to Web Scraping and Text Mining. Wiley.

Barberá, Pablo, 2015: Birds of the Same Feather Tweet Together. Bayesian Ideal Point Estimation Using Twitter Data. Political Analysis 23:76–91.

Barberá, Pablo, and Gonzalo Rivero, 2014: Understanding the Political Representativeness of Twitter users. Social Science Computer Review 1–18.

Chadefaux, Thomas, 2014: Early warning signals for war in the news. Journal of Peace Research 51:5–18.

Gayo-Avello, Daniel, 2013: A Meta-Analysis of State-of-the-Art Electoral Prediction From Twitter Data. Social Science Computer Review 31:649–679.

Gill, Michael, and Arthur Spirling, 2015: Estimating the Severity of WikiLeaks U.S. Diplomatic Cables Disclosure. Political Analysis 23:299–305.

Gohdes, Anita R., 2015: Pulling the plug: Network disruptions and violence in civil conflict. Journal of Peace Research 52:1–16.

Hassanpour, Navid, 2013: Tracking the Semantics of Politics: A Case for Online Data Research in Political Science. PS: Political Science & Politics 46:299–306.

King, Gary, Jennifer Pan, and Margaret E. Roberts, 2013: How Censorship in China Allows Government Criticism but Silences Collective Expression. American Political Science Review 107:326–343.

Rød, Espen Geelmuyden, and Nils B. Weidmann, 2015: Empowering activists or autocrats? The Internet in authoritarian regimes. Journal of Peace Research 52:1–14.

Sagi, Eyal, and Morteza Dehghani, 2014: Measuring Moral Rhetoric in Text. Social Science Computer Review 32:132–144.

Shaw, Aaron, and Benjamin Mako Hill, 2014: Laboratories of Oligarchy? How The Iron Law Extends to Peer Production. Journal of Communication 64:215–238.

Street, Alex, Thomas A. Murray, John Blitzer, and Rajan S. Patel, 2015: Estimating Voter Registration Deadline Effects with Web Search Data. Political Analysis 1–2.

Zeitzoff, Thomas, 2011: Using Social Media to Measure Conflict Dynamics: An Application to the 2008 - 2009 Gaza Conflict. Journal of Conflict Resolution 55:938–969.

Recommended Courses to Cover Before this One

Summer School

Introduction to R

Winter School

Introduction to R

Recommended Courses to Cover After this One

Summer School

Programming in the Social Sciences: Web Scraping, Social Media, and New (Big) Data with Python

Winter School

Programming in the Social Sciences: Web Scraping, Social Media, and New (Big) Data with Python

Quantitative Text Analysis


Additional Information

Disclaimer

This course description may be subject to subsequent adaptations (e.g. taking into account new developments in the field, participant demands, group size, etc.). Registered participants will be informed in due time.

Note from the Academic Conveners

By registering for this course, you confirm that you possess the knowledge required to follow it. The instructor will not teach these prerequisite items. If in doubt, contact the instructor before registering.