Friday 22 February 13:00–15:00 and 15:30–18:00
Saturday 23 February 09:00–12:30 and 14:00–17:30
Web data is rapidly changing empirical social science. Whereas scholars used to be confronted with severe data sparsity problems, web data has completely changed the rules of the game.
Countless data sources are easily accessible, and the primary challenge is now one of collecting and managing the available information. What is more, it is not only well-funded research projects that can harness this type of data: individual researchers and students can also generate large-scale web-based datasets.
This course introduces the most important tools for collecting and managing web data. It does so by relying on R, which has vastly expanded its web-scraping functionality in recent years. The major advantage of R for web scraping is that all the steps in a research project can be taken in a coherent framework – data collection, data processing, data analysis and visualisation – ensuring a more friction-free research process.
1 credit (pass/fail grade). Attend at least 90% of course hours, participate fully in in-class activities, and carry out the necessary reading and/or other work prior to, and after, class.
Dominic Nyhuis is a Postdoctoral Researcher at Leibniz University Hannover, Chair for Comparative Politics and the Political System of Germany.
Prior to this, he was affiliated with the Universities of Frankfurt, Vienna, and Mainz.
Dominic received his PhD in Political Science from the University of Mannheim. His research focuses on comparative legislative studies, party politics, and municipal politics. Methodologically, his work relies on quantitative methods, web scraping, and automated text analysis.
Data on the web is fundamentally changing research practices in the social sciences. These changes are evidenced not least by the emergence of entirely new research fields associated with terms like computational social science and data science. These developments carry with them an enormous potential, particularly for young scholars without access to major research funds.
By mastering the tools needed for automated web data collection, a single researcher can construct a dataset that would have required tremendous effort and expense not too long ago. What is more, while the tools are not too difficult to master, they are still sufficiently rare that practitioners can claim a unique, highly sought-after skillset – in academia and beyond.
This course will give you an applied overview of the skills required to automatically collect data from the web.
It will provide an introduction to some of the most important skills and techniques. In particular, it introduces the basic structure of HTML so that you understand the underlying architecture and mechanics of websites. XPath will be introduced as a syntax for addressing specific elements of a website, together with tools for extracting them as needed. Regular expressions are covered as a means of further processing textual data gathered from the web. Client-server interactions via HTTP and the structure of URLs are discussed to show how web interactions work in practice. APIs are discussed to equip you with the knowledge of how to collect data from dedicated data access points.
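To make these building blocks concrete, here is a minimal sketch that combines an XPath query, a regular expression, and URL parsing on a small page. The course itself works in R; this sketch uses Python's standard library purely for illustration. The HTML fragment, the `example.org` address, and the `scrape_members` helper are all hypothetical, and the fragment is assumed to be well-formed so that an XML parser can read it.

```python
import re
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit, parse_qs

# Hypothetical, well-formed HTML fragment standing in for a downloaded page.
PAGE = """
<html>
  <body>
    <h1>Members</h1>
    <ul id="members">
      <li class="member"><a href="/member?id=101">Ada Example</a> (born 1975)</li>
      <li class="member"><a href="/member?id=102">Ben Sample</a> (born 1982)</li>
    </ul>
  </body>
</html>
"""


def scrape_members(page, base_url="https://example.org"):
    """Return one dict per list item: name, birth year, id and URL path."""
    tree = ET.fromstring(page.strip())
    rows = []
    # XPath (ElementTree supports only a subset): every <li> with
    # class="member", anywhere below the root element.
    for item in tree.findall(".//li[@class='member']"):
        link = item.find("./a")
        name = link.text
        # The text following the <a> element is stored in link.tail.
        # Regular expression: capture the four-digit year after "born".
        match = re.search(r"born (\d{4})", link.tail or "")
        year = int(match.group(1)) if match else None
        # URL structure: split the address and read the "id" query parameter.
        url = urlsplit(base_url + link.get("href"))
        member_id = parse_qs(url.query)["id"][0]
        rows.append(
            {"name": name, "born": year, "id": member_id, "path": url.path}
        )
    return rows


if __name__ == "__main__":
    for row in scrape_members(PAGE):
        print(row)
```

Real pages are rarely well-formed XML, so in practice a dedicated HTML parser would take the place of `ET.fromstring`; the XPath and regular-expression steps stay conceptually the same.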
The course's applied elements make use of the programming language R. Although several other languages are still more common for the purposes of web scraping, R has come into its own in recent years with the publication of several extensions to support even complex web-scraping tasks. The main advantage of R for web scraping is that the whole research process – from data collection all the way to data wrangling, analysis, visualisation and publication – can be achieved within the same framework. Many social scientists might already have some familiarity with the language from a different context, making the initial steps in web scraping a little less daunting.
By the end of the course, you should have a basic understanding of fundamental web-scraping techniques, and know how to carry out simple web-scraping tasks on static websites you encounter in your research.
If you aim to accomplish more complex tasks, the course should give you a sense of the self-study techniques required to build on what you have already learned here.
Basic familiarity with R is required.
We will briefly review some of the most important concepts in advanced R at the beginning of the course, but you should already know the basics.
Each course includes pre-course assignments, including readings and pre-recorded videos, as well as daily live lectures totalling at least three hours. The instructor will conduct live Q&A sessions and offer designated office hours for one-to-one consultations.
Please check your course format before registering.
Live classes will be held daily for three hours on a video meeting platform, allowing you to interact with both the instructor and other participants in real-time. To avoid online fatigue, the course employs a pedagogy that includes small-group work, short and focused tasks, as well as troubleshooting exercises that utilise a variety of online applications to facilitate collaboration and engagement with the course content.
In-person courses will consist of daily three-hour classroom sessions, featuring a range of interactive in-class activities including short lectures, peer feedback, group exercises, and presentations.
This course description may be subject to subsequent adaptations (e.g. taking into account new developments in the field, participant demands, group size, etc.). Registered participants will be informed at the time of change.
By registering for this course, you confirm that you possess the knowledge required to follow it. The instructor will not teach these prerequisite items. If in doubt, please contact us before registering.
Day | Topic | Details |
---|---|---|
Friday afternoon | 1 Introduction; 2 Very brief advanced R refresher; 3 HTML + Exercises; 4 XPath + Exercises | 1 Overview of the course; Introduction to the most important web technologies; Typical web interactions; Accessing web resources from R |
Saturday morning | 5 Regular expressions + Exercises; 6 HTTP/URLs + Exercises; 7 API + Exercises | 5 Learning a syntax for extracting pieces of natural text; Introducing the functionality in R for conducting text operations |
Saturday afternoon | 8 Application 1: Scraping the European Parliament website; 9 Application 2: Scraping a news website | 8 Extracting a dataset with information on legislators from the European Parliament website |
Friday afternoon | Basics; Regular Expressions | The first session provides an overview of web technologies. It will explore the base capabilities of R to gather data from the web, and to store data and manipulate files. It will also introduce Regular Expressions and how to use them to handle text and extract information. |
Saturday morning | HTTP; HTML and XML; XPath | Frequently used web formats and tools that help extract specific pieces of information from a website. |
Saturday afternoon | Web services and APIs | The third session introduces web services and APIs, what they offer and how we might incorporate them into R. |
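
As a rough illustration of what working with a web service involves, the sketch below builds a request URL from query parameters and parses a JSON response. The course itself uses R; this example uses Python's standard library only to stay self-contained. The endpoint `api.example.org`, the parameter names, and the response layout are all hypothetical, and the response is a canned string rather than the result of a live request.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint; a real API documents its own base URL and parameters.
API_BASE = "https://api.example.org/v1/legislators"

# Canned JSON body standing in for what such an API might return.
SAMPLE_RESPONSE = """
{
  "results": [
    {"name": "Ada Example", "party": "A"},
    {"name": "Ben Sample", "party": "B"}
  ],
  "next_page": null
}
"""


def build_request_url(base, **params):
    """Append URL-encoded query parameters (sorted by name) to the base URL."""
    return base + "?" + urlencode(sorted(params.items()))


def parse_response(body):
    """Turn the JSON body into a list of (name, party) tuples."""
    payload = json.loads(body)
    return [(record["name"], record["party"]) for record in payload["results"]]


if __name__ == "__main__":
    print(build_request_url(API_BASE, country="DE", page=1))
    print(parse_response(SAMPLE_RESPONSE))
```

In a real project the canned string would be replaced by the body of an HTTP response, and a non-null `next_page` value would typically signal that further requests are needed.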
Day | Readings |
---|---|
Friday afternoon | N/A |
Saturday morning | N/A |
Saturday afternoon | N/A |
R and RStudio
Please bring your own laptop.
Khalil, Salim and Mohamed Fakir. 2017. RCrawler: An R package for parallel web crawling and scraping. SoftwareX, 6, 98-106.
Lawson, Richard. 2015. Web scraping with Python: Scrape data from any website with the Power of Python. Birmingham: Packt.
Mitchell, Ryan. 2015. Web scraping with Python: Collecting more data from the modern web. Sebastopol: O’Reilly.
Munzert, Simon, Christian Rubba, Peter Meißner and Dominic Nyhuis. 2015. Automated web data collection with R: A practical guide to web scraping and text mining. Hoboken: Wiley.
Nolan, Deborah and Duncan Temple Lang. 2014. XML and web technologies for data sciences with R. New York: Springer.
Introduction to R
Summer School Programming in the Social Sciences: Web Scraping, Social Media, and New (Big) Data with Python
Winter School Programming in the Social Sciences: Web Scraping, Social Media, and New (Big) Data with Python
Quantitative Text Analysis