Data on the web is fundamentally changing research practices in the social sciences. These changes are not least evidenced by the inception of entirely new research fields, developments associated with terms like computational social science, and data science. These developments carry with them an enormous potential, particularly for young scholars without access to major research funds.
By mastering the tools needed for automated web data collection, a single researcher can construct a dataset that would have required tremendous effort and expense not too long ago. What is more, while the tools are not too difficult to master, they are still sufficiently rare that practitioners can claim a unique, highly sought-after skillset – in academia and beyond.
This course will give you an applied overview of the skills required to automatically collect data from the web.
It will provide an introduction to some of the most important skills and techniques. In particular, it introduces the basic structure of HTML to enable an understanding of the underlying architecture and mechanics of websites. XPath will be introduced as a syntax to address specific elements of websites and tools to extract them as needed. Regular expressions are covered which allow further processing textual data gathered from the web. Client-server interactions via HTTP and the structure of URLs are discussed to understand web interactions in practice. APIs are discussed to equip you with the knowledge of how to collect data from dedicated data access points.
The course's applied elements make use of the programming language R. Although several other languages are still more common for the purposes of web scraping, R has come into its own in recent years with the publication of several extensions to support even complex web-scraping tasks. The main advantage of R for web scraping is that the whole research process – from data collection all the way to data wrangling, analysis, visualisation and publication – can be achieved within the same framework. Many social scientists might already have some familiarity with the language from a different context, making the initial steps in web scraping a little less daunting.
By the end of the course, you should have a basic understanding of fundamental web-scraping techniques, and know how to conduct simple web-scraping tasks from static websites you encounter in your research.
If you aim to accomplish more complex tasks, the course should give you a sense of the self-study techniques required to build on what you have already learned here.