Monday 30 July - Friday 3 August
14:00-15:30 / 16:00-17:30
An increasingly vast wealth of data is freely available on the web, from election results and legislative speeches to social media posts, newspaper articles, and press releases, among many other examples. Although this data is easily accessible, in most cases it is available in an unstructured format, which makes its analysis challenging. The goal of this course is to equip participants with the skills necessary to automate the process of downloading, cleaning, and reshaping web and social data using the R programming language for statistical computing. We will cover the most common scenarios: scraping data spread across multiple pages or hidden behind web forms, interacting with APIs and RSS feeds such as those provided by most media outlets, collecting data from Facebook and Twitter, extracting text and tables from PDF files, and manipulating datasets into a format ready for analysis. The course will follow a "learning-by-doing" approach, with short theoretical sessions followed by "data challenges" in which participants apply the new methods.
Tasks for ECTS Credits
For one additional credit, participants will be required to submit three of the four challenges listed in the Readings section below.
For two additional credits, participants will also need to submit a 5-page project applying techniques covered in the course to a substantive social science question.
Pablo Barberá gained his PhD in Politics from New York University. He currently works in LSE's Methodology Department as an Assistant Professor of Computational Social Science.
Previously, Pablo was Assistant Professor at the University of Southern California and Moore-Sloan Postdoctoral Fellow at the Center for Data Science at New York University.
His primary research interests include quantitative political methodology and computational social science, applied to the study of political and social behaviour.
Pablo is an active contributor to the open source community and has authored several R packages to mine social media data.
Citizens across the globe spend an increasing proportion of their daily lives online. Their activities leave behind granular, time-stamped footprints of human behavior and personal interactions that represent a new and exciting source of data to study standing questions about political and social behavior. This course will provide participants with an overview of the new sources of data that are now available, and the computational tools required to collect the data, clean it, and organize it in a format amenable to statistical analysis.
The course is structured in three blocks. The first part (Days 1 and 2) will offer an introduction to the course and then dive into the basics of webscraping; that is, how to automatically collect data from the web. This session will demonstrate the different scenarios for webscraping: when data is in table format (e.g. Wikipedia tables or election results), when it is in an unstructured format (e.g. across multiple parts of a website), and when it is behind a web form (e.g. querying online databases). The tools available in R to achieve these goals – the rvest and RSelenium packages – will be introduced in the context of applied examples in the social sciences. Students are also encouraged to come to class with examples from their own research.
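As a taste of the first scenario above (data already in table format), a minimal rvest sketch is shown below; the Wikipedia URL is only an illustrative example, not part of the course materials.

```r
# Minimal rvest sketch: download a page and parse its HTML tables.
# The URL is an illustrative example; any page with <table> elements works.
library(rvest)

url <- "https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)"
page <- read_html(url)

# html_table() converts every <table> on the page into a data frame
tables <- html_table(page)
head(tables[[1]])
```

Unstructured pages require more work: selecting nodes with `html_elements()` and CSS selectors rather than relying on a ready-made table.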
NGOs, public institutions, and social media companies increasingly rely on Application Programming Interfaces (API) to give researchers and web developers access to their data. The second part of the course (Day 3) will focus on how we can develop our own set of structured http requests to query an API. We will discuss the components of an API request and how to build our own authenticated queries using the httr package in R. We will apply these skills to two examples: the New York Times API (to query newspaper articles) and the Clarifai API (to automatically tag visual content with machine learning).
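The shape of such a request can be sketched with httr as follows; the search term is arbitrary and the API key is a placeholder you would obtain from the NYT developer portal.

```r
# Sketch of a structured, authenticated API request with httr.
# "YOUR_API_KEY" is a placeholder; register at developer.nytimes.com for a real key.
library(httr)

endpoint <- "https://api.nytimes.com/svc/search/v2/articlesearch.json"
res <- GET(endpoint, query = list(
  q = "climate change",       # free-text search term
  `api-key` = "YOUR_API_KEY"  # authentication via query parameter
))

stop_for_status(res)                      # fail loudly on an HTTP error
articles <- content(res, as = "parsed")   # JSON response parsed into nested lists
```

The same pattern (endpoint, query parameters, authentication, parsed response) carries over to most REST APIs, including Clarifai.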
The third part (Days 4 and 5) will teach how to collect and analyze data from social media sites. We will begin with an overview of the research opportunities and challenges of using social media data in the social sciences. We will then discuss the data available through Twitter’s REST and Streaming API. As part of the guided coding exercises, we will learn how to collect tweets filtered by keywords, location, and language in real time using different R packages; and how to analyze the data to find the most mentioned hashtags and users and to map the location of the tweets. Our last session will demonstrate how to scrape public Facebook pages through the Graph API using the Rfacebook package. As an illustration of how to analyze tweets and Facebook posts collected with these methods, we will use a dictionary method to characterize politicians’ rhetoric on social media.
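A hedged sketch of keyword-filtered collection from the Streaming API, using streamR (one of several R packages for this task); `my_oauth` stands for an OAuth token you would create beforehand, and the keyword is arbitrary.

```r
# Collect 60 seconds of tweets matching a keyword via Twitter's Streaming API.
# my_oauth is an OAuth credential object created in advance (e.g. with ROAuth).
library(streamR)

filterStream(file.name = "tweets.json",  # raw tweets saved as JSON, one per line
             track = "election",         # keyword filter
             timeout = 60,               # stop after 60 seconds
             oauth = my_oauth)

tweets <- parseTweets("tweets.json")     # JSON -> data frame for analysis
```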
An underappreciated part of the research process is data manipulation – it is rarely the case that the dataset we want to use in a study is available in a format suitable for analysis. Data “munging” is tedious, but there are ways to make it more efficient and replicable. Our last session will also discuss some good practices to clean and rearrange data obtained from the web.
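As an illustration of the kind of reshaping this session covers, the sketch below converts a "wide" table of yearly counts into the long format most analyses expect; the data are made up.

```r
# Reshape a wide table (one column per year) into long format with tidyr/dplyr.
# The data frame is hypothetical, purely for illustration.
library(tidyr)
library(dplyr)

wide <- data.frame(country = c("DE", "FR"),
                   y2016 = c(10, 12),
                   y2017 = c(14, 13))

long <- wide %>%
  pivot_longer(cols = starts_with("y"),
               names_to = "year", names_prefix = "y",
               values_to = "count") %>%
  mutate(year = as.integer(year))

long  # one row per country-year, ready for analysis
```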
The course will follow a "learning-by-doing" approach. Code and data for all the applications will be provided, and students are encouraged to bring their own laptops to follow along. Each session will start with around 45-60 minutes of lecturing and coding led by the instructor, followed by "data challenges" where participants will need to apply what they have learned to new datasets. Although these data challenges will be started in class, participants will be asked to complete them after the end of each session. Solutions for each challenge will be posted by the beginning of the following class, and we will leave time for questions.
After the course, students will have an advanced understanding of the web and social data available for social science research and will be equipped with the technical skills necessary to collect and clean such datasets on their own.
The course will assume intermediate familiarity with the R statistical programming language. Participants should know how to read datasets into R, work with vectors and data frames, and run basic statistical analyses, such as linear regression. More advanced knowledge of statistical computing, such as writing functions and loops, is helpful but not required.
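Concretely, something like the following should feel familiar before the course starts (the file and variable names here are hypothetical):

```r
# Roughly the level of R assumed at the outset: read a dataset,
# inspect it, and fit a basic linear model.
dat <- read.csv("survey.csv")    # hypothetical dataset
str(dat)                         # inspect variables and types

model <- lm(support ~ age + education, data = dat)
summary(model)                   # coefficients, standard errors, fit
```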
Each course includes pre-course assignments, including readings and pre-recorded videos, as well as daily live lectures totalling at least two and a half hours. The instructor will conduct live Q&A sessions and offer designated office hours for one-to-one consultations.
Please check your course format before registering.
Live classes will be held daily for two and a half hours on a video meeting platform, allowing you to interact with both the instructor and other participants in real time. To avoid online fatigue, the course employs a pedagogy that includes small-group work, short and focused tasks, as well as troubleshooting exercises that utilise a variety of online applications to facilitate collaboration and engagement with the course content.
This course description may be subject to subsequent adaptations (e.g. taking into account new developments in the field, participant demands, group size, etc.). Registered participants will be informed at the time of change.
By registering for this course, you confirm that you possess the knowledge required to follow it. The instructor will not teach these prerequisite items. If in doubt, please contact us before registering.
Day | Topic | Details
---|---|---
1 | Webscraping (part 1) | Session 1. Basics of webscraping. Scraping web data in table format and unstructured format. Session 2. Scraping data in unstructured format. Loops, vectorised functions, and lists in R.
2 | Webscraping (part 2) | Session 1. Scraping data behind web forms with Selenium. Session 2. Extracting media text from newspaper articles using RSS feeds.
3 | Working with APIs | Session 1. What is an API? Interacting with the New York Times and Clarifai APIs. Session 2. Extracting data from PDF files: text, tables, metadata.
4 | Collecting and analysing social media data (part 1) | Session 1. Collecting Twitter data from the Streaming API: tweets filtered by keyword and location. Session 2. Collecting Twitter data from the REST API: user profiles and recent tweets.
5 | Collecting and analysing social media data (part 2) | Session 1. Collecting data from Facebook pages. Session 2. Dealing with character encoding issues. Merging and reshaping datasets. Exception handling.
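For Day 3's PDF session, a common starting point in R is the pdftools package; a minimal sketch, where "report.pdf" is a placeholder file:

```r
# Extract text and metadata from a PDF with pdftools ("report.pdf" is a placeholder).
library(pdftools)

txt  <- pdf_text("report.pdf")   # character vector: one string per page
info <- pdf_info("report.pdf")   # metadata: author, creation date, page count, ...

cat(substr(txt[1], 1, 200))      # preview the first page
```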
Day | Readings
---|---
1 | No challenges assigned.
2 | Challenge: scraping the American President Project website
3 | Challenge: scraping articles from The Guardian
4 | Challenge: analysing the Amnesty International Annual Report
5 | Challenge: content analysis of a politician’s Twitter feed
This course will use R, which is a free and open-source programming language primarily used for statistics and data analysis. We will also use RStudio, which is an easy-to-use interface to R.
Installing R or RStudio prior to the course is not necessary. The instructor will provide individual login details to an RStudio Server that all workshop participants can access to run their code.
Students are expected to bring their own laptops to class. There are no specific requirements other than being able to use a browser (Google Chrome is the recommended option, but others should work too).
Klašnja, M., Barberá, P., Beauchamp, N., Nagler, J., & Tucker, J. (2017). “Measuring public opinion with social media data.” In The Oxford Handbook of Polling and Survey Methods.
Lazer, D., & Radford, J. (2017). “Data ex Machina: Introduction to Big Data.” Annual Review of Sociology.
Matloff, N. (2011). The art of R programming: A tour of statistical software design. No Starch Press.
Munzert, S., Rubba, C., Meißner, P., & Nyhuis, D. (2014). Automated data collection with R: A practical guide to web scraping and text mining. John Wiley & Sons.
Ravindran, S. K., & Garg, V. (2015). Mastering social media mining with R. Packt Publishing Ltd.
Ruths, D., & Pfeffer, J. (2014). “Social media for large studies of behavior.” Science, 346(6213), 1063-1064.
Salganik, M. (2017). Bit by Bit: Social Research in the Digital Age. Princeton, NJ: Princeton University Press.
Steinert-Threlkeld, Z. (2018) Twitter as Data. Cambridge University Press.
Theocharis, Y., Barberá, P., Fazekas, Z., Popa, S. A. and Parnet, O. (2016), “A Bad Workman Blames His Tweets: The Consequences of Citizens’ Uncivil Twitter Use When Interacting With Party Candidates.” Journal of Communication, 66: 1007–1031.
Tucker, J. A., Theocharis, Y., Roberts, M. E., & Barberá, P. (2017). “From Liberation to Turmoil: Social Media And Democracy.” Journal of Democracy, 28(4), 46-59.
Wickham, H., & Grolemund, G. (2016). R for Data Science. O’Reilly Media.
Summer School
R Basics
Effective Data Management with R
Summer School
Big Data Analysis in the Social Sciences
Introduction to Exploratory Network Analysis
Introduction to Manual and Computer-Assisted Content Analysis
Quantitative Text Analysis
Advanced Social Network Analysis and Visualisation with R
Winter School
Inferential Network Analysis