Monday 31 July - Friday 4 August
14:00-17:30
Please see Timetable for full details.
An increasingly vast wealth of data is freely available on the web, from election results and legislative speeches to social media posts, newspaper articles, and press releases, among many other examples. Although this data is easily accessible, in most cases it is available in an unstructured format, which makes its analysis challenging. The goal of this course is to give participants the skills necessary to automate the process of downloading, cleaning, and reshaping web and social data using the R programming language for statistical computing. We will cover the most common scenarios: scraping data available in multiple pages or behind web forms, interacting with APIs and RSS feeds such as those provided by most media outlets, collecting data from Facebook and Twitter, extracting text and table data from PDF files, and manipulating datasets into a format ready for analysis. The course will follow a "learning-by-doing" approach, with short theoretical sessions followed by "data challenges" where participants will need to apply new methods.
Pablo Barberá gained his PhD in Politics from New York University. He currently works in LSE's Methodology Department as an Assistant Professor of Computational Social Science.
Previously, Pablo was an Assistant Professor at the University of Southern California and a Moore-Sloan Postdoctoral Fellow at the Center for Data Science at New York University.
His primary research interests include quantitative political methodology and computational social science, applied to the study of political and social behaviour.
Pablo is an active contributor to the open source community and has authored several R packages to mine social media data.
Citizens across the globe spend an increasing proportion of their daily lives online. Their activities leave behind granular, time-stamped footprints of human behavior and personal interactions that represent a new and exciting source of data to study standing questions about political and social behavior. This course will provide participants with an overview of the new sources of data that are now available, and the computational tools required to collect the data, clean it, and organize it in a format amenable to statistical analysis.
The course is structured in three blocks. The first part (Day 1) will offer an introduction to the course and then dive into the basics of webscraping; that is, how to automatically collect data from the web. This session will demonstrate the different scenarios for webscraping: when data is in table format (e.g. Wikipedia tables or election results), when it is in an unstructured format (e.g. across multiple parts of a website), and when it is behind a web form (e.g. querying online databases). The tools available in R to achieve these goals – the rvest and RSelenium packages – will be introduced in the context of applied examples in the social sciences. Students are also encouraged to come to class with examples from their own research, and we will leave some time at the end of class to go over one or two.
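The simplest scenario above, data already in table format, can be sketched in a few lines of rvest. This is a minimal illustration rather than course material: the HTML is an inline, made-up results table standing in for a real page, and the `#results` id is an assumption chosen for the example. It also assumes rvest 1.0 or later, where `html_element()` is available.

```r
library(rvest)

# Inline HTML standing in for a downloaded page (made-up figures).
# With a real site, read_html() would be given the page URL instead.
page <- read_html('<html><body>
  <table id="results">
    <tr><th>Party</th><th>Votes</th></tr>
    <tr><td>Party A</td><td>1200</td></tr>
    <tr><td>Party B</td><td>950</td></tr>
  </table>
</body></html>')

# Select the table by its (assumed) id and convert it to a data frame.
results <- page %>%
  html_element("#results") %>%
  html_table()

print(results)
```

Unstructured pages follow the same pattern, except that the CSS selector targets individual elements and `html_text()` replaces `html_table()`.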
NGOs, public institutions, and social media companies increasingly rely on Application Programming Interfaces (APIs) to give researchers and web developers access to their data. The central part of the course (Days 2 and 3) will focus on how we can develop our own set of structured HTTP requests to query an API. In our second session, we will discuss the components of an API request and how to build our own authenticated queries using the httr package in R. We will apply these skills to two examples: the New York Times API (to query newspaper articles) and the Sunlight Congress API (to query parliamentary speeches in the US). In the third session, we will learn how to use the most popular R packages to query social media APIs: rtweet, streamR, and Rfacebook. These packages allow researchers to collect tweets filtered by keywords, location, and language in real time, and to scrape public Facebook pages, including likes and comments. The process of collecting and storing the data will be illustrated with examples from published research on social media.
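As a rough sketch of what the components of an API request look like in httr, the example below assembles a query URL without sending it. The endpoint `https://api.example.org/v1/articles`, the parameter names (`q`, `page`, `api-key`), and the key value are all hypothetical placeholders, not the interface of any API named above.

```r
library(httr)

# Hypothetical endpoint and parameters, for illustration only.
base_url <- "https://api.example.org/v1/articles"
query <- list(q = "election", page = 1, `api-key` = "YOUR_KEY")

# Compose the full request URL without contacting any server.
request_url <- modify_url(base_url, query = query)

# Sending the request would look roughly like this
# (commented out because it needs a real endpoint and key):
# response <- GET(base_url, query = query)
# stop_for_status(response)
# articles <- content(response, as = "parsed")

# Split the URL back into its components to inspect them.
parsed <- parse_url(request_url)
print(parsed$hostname)
print(parsed$query)
```

Keeping the base URL and the query list separate makes it easy to loop over pages or search terms by changing a single element of `query`.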
An underappreciated part of the research process is data manipulation – it is rarely the case that the dataset we want to use in a study is available in a format suitable for analysis. Data “munging” is tedious, but there are ways to make it more efficient and replicable. The last block of the course (Days 4 and 5) teaches good practices to clean and rearrange data obtained from the web. We will start with two data formats that are relatively new to the social sciences – text and network data. Through a series of applied examples, the materials here will explain how to convert a corpus of documents into a data matrix that is ready for analysis – dealing with encoding issues, preprocessing text in different languages, and efficiently building a document-feature matrix with quanteda – and how to work with network data – covering the basics of network analysis, how to identify nodes and edges, and how to create an adjacency matrix with igraph. During this session, we will also learn how to extract data from PDF files, in both text and table formats. The course will conclude with a closer look at best practices in statistical computing, building upon the different examples used during the first four days. By learning how to efficiently parallelize loops and work with lists, students will be able to scale up their data collection processes. We will also cover how to merge datasets from different sources, in cases with both identical and similar merging keys (i.e. when a numeric ID is the same across datasets, but also when country names with slightly different spellings are the only variable common to multiple datasets), and how to efficiently compute summary statistics and other aggregated estimates from a data frame with dplyr.
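The harder merging case mentioned above, where country names with slightly different spellings are the only shared variable, can be sketched with base R alone. This is one possible approach under stated assumptions, not necessarily the method taught in the course: both data frames and their values are made up, and each name is matched to its nearest neighbour by edit distance using `adist()`.

```r
# Two made-up datasets that share only a country name, spelled differently.
gdp <- data.frame(
  country = c("United States", "South Korea", "Czech Republic"),
  gdp = c(100, 50, 20),
  stringsAsFactors = FALSE
)
votes <- data.frame(
  country = c("United States of America", "S. Korea", "Czech Rep."),
  turnout = c(0.60, 0.70, 0.65),
  stringsAsFactors = FALSE
)

# Edit distance between every pair of names (rows: votes, columns: gdp).
d <- adist(votes$country, gdp$country)

# For each name in `votes`, pick the closest name in `gdp`.
votes$match <- gdp$country[apply(d, 1, which.min)]

# Merge on the matched name.
merged <- merge(votes, gdp, by.x = "match", by.y = "country")
print(merged)
```

Nearest-neighbour matching always returns some match, so in practice the matched pairs should be inspected by hand, or a maximum distance imposed, before the merged data is trusted.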
The course will follow a "learning-by-doing" approach. Code and data for all the applications will be provided, and students are encouraged to bring their own laptops to follow along. Each session will start with around 45-60 minutes of lecturing and coding led by the instructor, followed by "data challenges" where participants will need to apply what they have learned to new datasets. Although these data challenges will be started in class, participants will be asked to complete them after the end of each session. Solutions for each challenge will be posted by the beginning of the following class, and we will leave time for questions.
After the course, students will have an advanced understanding of the web and social data available for social science research, and will be equipped with the technical skills necessary to collect and clean such datasets on their own.
The course will assume familiarity with the R statistical programming language. Participants should know how to read datasets into R, work with vectors and data frames, and run basic statistical analyses, such as linear regression. More advanced knowledge of statistical computing, such as writing functions and loops, is helpful but not required.
Each course includes pre-course assignments, including readings and pre-recorded videos, as well as daily live lectures totalling at least two hours. The instructor will conduct live Q&A sessions and offer designated office hours for one-to-one consultations.
Please check your course format before registering.
Live classes will be held daily for two hours on a video meeting platform, allowing you to interact with both the instructor and other participants in real-time. To avoid online fatigue, the course employs a pedagogy that includes small-group work, short and focused tasks, as well as troubleshooting exercises that utilise a variety of online applications to facilitate collaboration and engagement with the course content.
In-person courses will consist of daily three-hour classroom sessions, featuring a range of interactive in-class activities including short lectures, peer feedback, group exercises, and presentations.
This course description may be subject to subsequent adaptations (e.g. taking into account new developments in the field, participant demands, group size, etc.). Registered participants will be informed at the time of change.
By registering for this course, you confirm that you possess the knowledge required to follow it. The instructor will not teach these prerequisite items. If in doubt, please contact us before registering.
Day | Topic | Details |
---|---|---|
Monday | Webscraping | Session 1. Basics of webscraping. Scraping web data in table format and unstructured format. Session 2. Scraping data behind web forms with Selenium. |
Tuesday | Working with APIs | Session 1. What is an API? Interacting with an API. Session 2. Applied examples: the New York Times API and the Sunlight Congress API. |
Wednesday | Collecting social media data | Session 1. Collecting Twitter data: tweets filtered by keyword and location; user profiles and tweets. Session 2. Collecting data from Facebook pages. |
Thursday | Cleaning unstructured data | Session 1. Cleaning text data using regular expressions; building network datasets. Session 2. Extracting data from PDF files: text, tables, metadata. |
Friday | Data manipulation | Session 1. Scaling up data collection: for loops, vectorization, working with lists, merging multiple datasets. Session 2. Reshaping data with dplyr. |
The course will use the open-source software R, which is freely available for download at https://www.r-project.org/. We will interact with R through RStudio, which can be downloaded at https://www.rstudio.com/products/rstudio/download/. Students should download the most recent version at the time of the course (currently R 3.3.3 and RStudio 1.0.136). We will also use the following R packages: rvest, jsonlite, httr, igraph, rtweet, streamR, quanteda, Rfacebook, dplyr.
Students are expected to bring their own laptops to class. A laptop with a standard setup (4GB of RAM or more, at least 10GB of free disk space) should be sufficient.
Munzert, S., Rubba, C., Meißner, P., & Nyhuis, D. (2014). Automated data collection with R: A practical guide to web scraping and text mining. John Wiley & Sons.
Matloff, N. (2011). The art of R programming: A tour of statistical software design. No Starch Press.
Wickham, H., & Grolemund, G. (2016). R for Data Science. O’Reilly.
Ravindran, S. K., & Garg, V. (2015). Mastering social media mining with R. Packt Publishing Ltd.
SA102 - R Basics
SA111 - Effective Data Management with R
SC104 - Big Data Analysis in the Social Sciences
SA110 - Introduction to Exploratory Network Analysis
SD103 - Introduction to Manual and Computer-Assisted Content Analysis