Install the app

Install this application on your home screen for quick and easy access when you’re on the go.

Just tap Share then “Add to Home Screen”

Strategies of Secession and Counter-Secession

SC104 - Big Data Analysis in the Social Sciences

Instructor Details

Instructor Photo

Pablo Barberá

University of Southern California

Instructor Bio


Pablo Barberá gained his PhD in Politics from New York University. He currently works in LSE's Methodology Department as an Assistant Professor of Computational Social Science.

Previously, Pablo has been Assistant Professor at the University of Southern California and Moore-Sloan Postdoctoral Fellow at the Center for Data Science in New York University.

His primary research interests include quantitative political methodology and computational social science, applied to the study of political and social behaviour.

Pablo is an active contributor to the open source community and has authored several R packages to mine social media data.


Course Dates and Times

Monday 7 - Friday 11 August


Please see Timetable for full details.

Building: N13 Room: 517
Prerequisite Knowledge

The course will assume familiarity with the R statistical programming language. Participants should be able to know how to read datasets in R, work with vectors and data frames, and run basic statistical analyses, such as linear regression. More advanced knowledge of statistical computing, such as writing functions and loops, is helpful but not required.

Short Outline

Massive-scale datasets from web sources and social media, newly digitized text sources, and large longitudinal survey studies present exciting opportunities for the study of social and political behaviour, but at the same time its size and heterogeneity present significant challenges. This course will introduce participants to new computational methods and tools required to explore and analyse Big Data in the social sciences using the R programming language. It will be structured around techniques to deal with the 3 V's of Big Data: volume, variety, and veracity. First, we will cover the basics of parallel programming and cloud computing to analyse large-scale datasets. Second, we will learn how to scale human tasks through the use of machine learning methods. Finally, we will discuss how to automatically discover insights from large text and network datasets and validate the output of this analysis. The course will follow a "learning-by-doing" approach, with short theoretical sessions followed by "data challenges" where participants will need to apply new methods.

Long Course Outline

The volume and heterogeneity of the new datasets available in the digital age present unprecedented opportunities for social scientists, but also new methodological challenges. Computing a simple average for a variable across groups can take minutes when a researcher is working with government records, large-scale survey studies or social media datasets with millions of rows. A corpus of legislative speeches can often include thousands of documents, possibly in multiple languages, making it impossible to rely on human annotators to classify the issue to which each speech refers. The goal of this course is to learn how to overcome these challenges by learning the computational methods required to explore and analyze “Big Data” from a variety of sources using the R programming language.

The course will follow a "learning-by-doing" approach. Code and data for all the applications will be provided, and students are required to bring their own laptops to follow along. Each session will start with around 45-60 minutes of lecturing and coding led by the instructor, followed by "data challenges" where participants will need to apply what they have learned to new datasets. Although these data challenges will be started in class, participants will be asked to complete them after the end of each session. Solutions for each challenge will be posted by the beginning of the following class, and we will leave time for questions.

The course will begin with a discussion of the concept of "Big Data" and the research opportunities and challenges of the use of massive-scale datasets in the social sciences. The first session will also provide a foundation of R coding skills upon which we will rely during the rest of the course. Here, we will go over existing packages to efficiently analyze large-scale datasets in R, how to parallelize for loops, and how to read and write large files.

The second session will focus on the most common application of Big Data in the social sciences: large-scale text classification. After a quick overview of the basics of machine learning, we will discuss specific details of the implementation of supervised learning algorithms in massive-scale datasets, and in particular recently-developed methods in computer science such as stochastic gradient descent, xgboost, and ensemble classifiers. Our emphasis will lie on the practical aspects: we will study these methods in the context of an application of sentiment analysis to newspaper articles, and will go through the entire research process, from the creation of a training dataset labeled by humans using crowd-sourcing platforms, to the application and validation of the machine learning algorithm, and passing through all the intermediate steps, such as cleaning and preprocessing the corpus of documents.

Exploratory data analysis can be a powerful tool for social scientists when they are interested in analyzing a new dataset. The third and fourth sessions will go over the existing tools for large-scale discovery in “Big Data” using R, applied to both text and network datasets. We will start with topic models, which allow researchers to automatically identify latent classes of documents in a corpus, with an application to the classification of Facebook posts by politicians into relevant political issues. Then, we will turn our attention to social networks, and in particular the detection of communities of individuals with shared interests or political preferences. The running example in this part of the course will be the classification of Twitter users along a latent ideological dimension based on the structure of the networks in which they are embedded. A common theme to both sessions will be the emphasis on validation: once an unsupervised model is completed, how can we measure the quality of the results? We will discuss basics concepts of measurement theory, and best practices in the validation of the results from unsupervised statistical models.

The course will conclude in its fifth session with an introduction to cloud computing and database management for social scientists. Most available online resources and courses on these topics assume students are proficient in UNIX or have a background in programming. Here, however, we will start from scratch and focus on the coding skills required to conduct statistical analyses with data hosted in the “cloud”, while at the same time helping participants become familiar with programming concepts that can facilitate future collaborations with computer scientists. We will cover the most important commands in UNIX – the language required to interact with High-Performance Clusters (HPC), for example, which are now available in most universities – and test our skills in an online virtual machine hosted on Amazon Elastic Compute Cloud (EC2). In the second half of this session, we will learn the basics of SQL, and run our own queries in a dataset with over a billion geolocated tweets hosted in Google BigQuery.

Day-to-Day Schedule

Day-to-Day Reading List

Software Requirements

The course will use the open-source software R, which is freely available for download at . We will interact with R through RStudio, which can be downloaded at Students should download the most recent version at the time of the course (currently R 3.3.3 and RStudio 1.0.136).

Hardware Requirements

Students are expected to bring their own laptops to class. The preferred configuration would be at least 8GB of RAM or more and 20GB or more of free disk space.


Barberá, P., Jost, J. T., Nagler, J., Tucker, J. A., & Bonneau, R. (2015). Tweeting From Left to Right Is Online Political Communication More Than an Echo Chamber?. Psychological science.

González-Bailón, S., Borge-Holthoefer, J., & Moreno, Y. (2013). Broadcasters and hidden influentials in online protest diffusion. American Behavioral Scientist.

González-Bailón, S., & Paltoglou, G. (2015). Signals of public opinion in online communication: A comparison of methods and data sources. The ANNALS of the American Academy of Political and Social Science659(1), 95-107.

Leskovec, J., Rajaraman, A., & Ullman, J. D. (2014). Mining of massive datasets. Cambridge University Press.

Matloff, N. (2011). The art of R programming: A tour of statistical software design. No Starch Press.

Monroe, B. L., Pan, J., Roberts, M. E., Sen, M., & Sinclair, B. (2015). No! Formal theory, causal inference, and big data are not contradictory trends in political science. PS: Political Science & Politics48(01), 71-74.

Nagler, J., & Tucker, J. A. (2015). Drawing inferences and testing theories with big data. PS: Political Science & Politics48(01), 84-88

Lazer, D., Pentland, A. S., Adamic, L., Aral, S., Barabasi, A. L., Brewer, D., ... & Jebara, T. (2009). Life in the network: the coming age of computational social science. Science (New York, NY)323(5915), 721.

Roberts, M. E., Stewart, B. M., Tingley, D., Lucas, C., Leder-Luis, J., Albertson, B., ... & Rand, D. (2014). Topic models for open ended survey responses with applications to experiments. American Journal of Political Science58, 1064-1082.

Steinert-Threlkeld, Z. C., Mocanu, D., Vespignani, A., & Fowler, J. (2015). Online social networks and offline protest. EPJ Data Science, 4(1), 1.

Wickham, H., & Grolemund, G. (2016). R for Data Science. O’Reilly

The following other ECPR Methods School courses could be useful in combination with this one in a ‘training track .
Recommended Courses Before

SC103 - Automated Collection of Web and Social Data

SD304 - Gentle Introduction to Machine Learning for the Social Sciences

Additional Information


This course description may be subject to subsequent adaptations (e.g. taking into account new developments in the field, participant demands, group size, etc). Registered participants will be informed in due time.

Note from the Academic Convenors

By registering for this course, you confirm that you possess the knowledge required to follow it. The instructor will not teach these prerequisite items. If in doubt, contact the instructor before registering.

Share this page