ECPR Summer School
Central European University, Budapest
26 July - 10 August 2018




SC105 - Big Data Analysis in the Social Sciences

Instructor Details

Pablo Barberá

Institution:
University of Southern California

Instructor Bio

 

Pablo Barberá gained his PhD in Politics from New York University. He currently works in LSE's Methodology Department as an Assistant Professor of Computational Social Science.

Previously, Pablo was Assistant Professor at the University of Southern California and Moore-Sloan Postdoctoral Fellow at the Center for Data Science at New York University.

His primary research interests include quantitative political methodology and computational social science, applied to the study of political and social behaviour.

Pablo is an active contributor to the open source community and has authored several R packages to mine social media data.

Twitter: @p_barbera


Course Dates and Times

Monday 6 August - Friday 10 August

14:00-15:30 / 16:00-17:30

Prerequisite Knowledge

The course will assume intermediate familiarity with the R statistical programming language. Participants should know how to read datasets into R, work with vectors and data frames, and run basic statistical analyses, such as linear regression. More advanced knowledge of statistical computing, such as writing functions and loops, is helpful but not required.

Short Outline

Massive-scale datasets from web sources and social media, newly digitized text sources, and large longitudinal survey studies present exciting opportunities for the study of social and political behaviour, but at the same time their size and heterogeneity present significant challenges. This course will introduce participants to new computational methods and tools required to explore and analyse Big Data in the social sciences using the R programming language. It will be structured around techniques to deal with the 3 V's of Big Data: volume, variety, and veracity. First, we will cover the basics of parallel programming and cloud computing to analyse large-scale datasets. Second, we will learn how to scale human tasks through the use of machine learning methods. Finally, we will discuss how to automatically discover insights from large text and network datasets and validate the output of this analysis. The course will follow a "learning-by-doing" approach, with short theoretical sessions followed by "data challenges" where participants will need to apply new methods.

Tasks for ECTS Credits

  • Participants attending the course: 2 credits (pass/fail grade). The workload for the calculation of ECTS credits is based on the assumption that students attend classes and carry out the necessary reading and/or other work prior to, and after, classes.
  • Participants attending the course and completing one task (see below): 3 credits (to be graded)
  • Participants attending the course and completing two tasks (see below): 4 credits (to be graded)

For an additional credit, participants will be required to submit 3 out of the 4 challenges listed in the Day-to-Day Reading List section below.

For an additional 2 credits, participants will also need to submit a 5-page project applying techniques covered in the course to a substantive social science question.

Long Course Outline

The volume and heterogeneity of the new datasets available in the digital age present unprecedented opportunities for social scientists, but also new methodological challenges. Computing a simple average for a variable across groups can take minutes when a researcher is working with government records, large-scale survey studies or social media datasets with millions of rows. A corpus of legislative speeches can often include thousands of documents, possibly in multiple languages, making it impossible to rely on human annotators to classify the issue to which each speech refers. The goal of this course is to overcome these challenges by learning the computational methods required to explore and analyze “Big Data” from a variety of sources using the R programming language.

The course will follow a "learning-by-doing" approach. Code and data for all the applications will be provided, and students are required to bring their own laptops to follow along. Each session will start with around 45-60 minutes of lecturing and coding led by the instructor, followed by "data challenges" where participants will need to apply what they have learned to new datasets. Although these data challenges will be started in class, participants will be asked to complete them after the end of each session. Solutions for each challenge will be posted by the beginning of the following class, and we will leave time for questions.

The course will begin with a discussion of the concept of "Big Data" and the research opportunities and challenges of the use of massive-scale datasets in the social sciences. The first session will also provide a foundation of R coding skills upon which we will rely during the rest of the course. Here, we will go over existing packages to efficiently analyze large-scale datasets in R, how to parallelize for loops, and how to read and write large files.
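
As a rough illustration of what parallelizing a for loop can look like, the sketch below distributes a bootstrap computation across worker processes. It assumes base R's parallel package (the outline does not name a specific package), and the simulated data and the boot_mean helper are hypothetical, introduced only for this example.

```r
# Minimal sketch: run independent iterations in parallel instead of a for loop.
# Assumes base R's 'parallel' package; data and helper names are hypothetical.
library(parallel)

set.seed(123)
x <- rnorm(10000)  # simulated variable standing in for a large dataset

# One iteration of the loop body: the mean of a bootstrap resample
boot_mean <- function(i, data) {
  mean(sample(data, replace = TRUE))
}

n_cores <- 2                      # small number of workers for illustration
cl <- makeCluster(n_cores)        # start the worker processes
clusterSetRNGStream(cl, 123)      # reproducible random numbers on the workers

# parLapply splits the 200 iterations across the workers
res <- parLapply(cl, 1:200, boot_mean, data = x)
stopCluster(cl)                   # always release the workers

boot_means <- unlist(res)
print(length(boot_means))
print(sd(boot_means))             # bootstrap estimate of the standard error
```

Because each iteration is independent of the others, the work can be split across cores without changing the result; the gain is larger the more expensive each iteration is.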

The second session will focus on the most common application of Big Data in the social sciences: large-scale text classification. After a quick overview of the basics of machine learning, we will discuss the implementation of supervised learning algorithms for massive-scale datasets, in particular methods recently developed in computer science such as xgboost and ensemble classifiers. Our emphasis will lie on the practical aspects: we will study these methods through an application to rhetorical styles in tweets by political candidates. We will go through the entire research process, from the creation of a training dataset labeled by humans using crowd-sourcing platforms, through intermediate steps such as cleaning and preprocessing the corpus of documents, to the application and validation of the machine learning algorithm.

Exploratory data analysis can be a powerful tool for social scientists when they are interested in analyzing a new dataset. The third and fourth sessions will go over the existing tools for large-scale discovery in “Big Data” using R, applied to both text and network datasets. We will start with topic models, which allow researchers to automatically identify latent classes of documents in a corpus, with an application to the classification of newspaper articles about the economy into different categories. Then, we will turn our attention to social networks, and in particular the detection of communities of individuals with shared interests or political preferences. The running example in this part of the course will be the classification of Twitter users along a latent ideological dimension based on the structure of the networks in which they are embedded. A common theme to both sessions will be the emphasis on validation: once an unsupervised model is completed, how can we measure the quality of the results? We will discuss basic concepts of measurement theory, and best practices in the validation of the results from unsupervised scaling models.

The course will conclude in its fifth session with an introduction to database management for social scientists. We will learn the basics of SQL, a language designed to query relational databases that is currently used by most tech companies, and how to use it from R with the DBI package. Among the available options for hosting databases online, we will focus on BigQuery, which relies on Google’s infrastructure to efficiently store and query databases at scale. We will learn how to process, upload, and query databases of up to a billion rows in a matter of seconds, and how to export the results of our queries.
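
To give a flavour of querying a relational database from R, the sketch below runs a grouped SQL query through the DBI package. It assumes the RSQLite package and a local in-memory database rather than BigQuery, so that it can run without an account; the tweets table and its values are toy data invented for this example.

```r
# Minimal sketch: SQL from R through DBI.
# Assumes the RSQLite backend (not BigQuery); table and values are toy data.
library(DBI)

# Open a temporary in-memory SQLite database
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical table: one row per tweet
tweets <- data.frame(
  party    = c("A", "A", "B", "B", "B"),
  retweets = c(10, 30, 5, 15, 40)
)
dbWriteTable(con, "tweets", tweets)

# The aggregation is computed inside the database; only the summary returns to R
res <- dbGetQuery(con, "
  SELECT party,
         COUNT(*)      AS n_tweets,
         AVG(retweets) AS mean_retweets
  FROM tweets
  GROUP BY party
  ORDER BY party
")

dbDisconnect(con)
print(res)
```

The same dbGetQuery pattern applies to other DBI backends: only the connection call changes, which is what makes it practical to move from a small local database to a hosted one.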

Day-to-Day Schedule

Day 1. Introduction to Big Data in the social sciences
  • Session 1. What is Big Data? Applications to the social sciences.
  • Session 2. Efficient data analysis with R. Parallel computing with R.

Day 2. Automated text classification
  • Session 1. Supervised machine learning: regularized regression, support vector machines, random forests.
  • Session 2. Large-scale text classification.

Day 3. Large-scale discovery (I)
  • Session 1. Exploratory analysis of large-scale text datasets.
  • Session 2. Topic models.

Day 4. Large-scale discovery (II)
  • Session 1. Large-scale discovery in networks: community detection.
  • Session 2. Latent space network models.

Day 5. Querying large-scale databases using SQL and BigQuery
  • Session 1. Introduction to SQL.
  • Session 2. Querying billion-row datasets with Google BigQuery.

Day-to-Day Reading List

Day 1. No challenges assigned
Day 2. Challenge: parallel computing
Day 3. Challenge: large-scale text classification of newspaper articles
Day 4. Challenge: topic models applied to Facebook posts
Day 5. Challenge: latent space models of Twitter networks

Software Requirements

This course will use R, which is a free and open-source programming language primarily used for statistics and data analysis. We will also use RStudio, which is an easy-to-use interface to R.

Installing R or RStudio prior to the course is not necessary. The instructor will provide individual login details to an RStudio Server that all workshop participants can access to run their code.

Hardware Requirements

Students are expected to bring their own laptops to class. There are no specific requirements other than being able to use a browser (Google Chrome is the recommended option, but others should work too).

Literature

Barberá, P., Jost, J. T., Nagler, J., Tucker, J. A., & Bonneau, R. (2015). Tweeting From Left to Right: Is Online Political Communication More Than an Echo Chamber? Psychological Science.

González-Bailón, S., Borge-Holthoefer, J., & Moreno, Y. (2013). Broadcasters and hidden influentials in online protest diffusion. American Behavioral Scientist.

González-Bailón, S., & Paltoglou, G. (2015). Signals of public opinion in online communication: A comparison of methods and data sources. The ANNALS of the American Academy of Political and Social Science, 659(1), 95-107.

Leskovec, J., Rajaraman, A., & Ullman, J. D. (2014). Mining of massive datasets. Cambridge University Press.

Matloff, N. (2011). The art of R programming: A tour of statistical software design. No Starch Press.

Monroe, B. L., Pan, J., Roberts, M. E., Sen, M., & Sinclair, B. (2015). No! Formal theory, causal inference, and big data are not contradictory trends in political science. PS: Political Science & Politics, 48(01), 71-74.

Nagler, J., & Tucker, J. A. (2015). Drawing inferences and testing theories with big data. PS: Political Science & Politics, 48(01), 84-88.

Lazer, D., Pentland, A. S., Adamic, L., Aral, S., Barabasi, A. L., Brewer, D., ... & Jebara, T. (2009). Life in the network: the coming age of computational social science. Science, 323(5915), 721.

Roberts, M. E., Stewart, B. M., Tingley, D., Lucas, C., Leder-Luis, J., Albertson, B., ... & Rand, D. (2014). Topic models for open ended survey responses with applications to experiments. American Journal of Political Science, 58, 1064-1082.

Steinert-Threlkeld, Z. C., Mocanu, D., Vespignani, A., & Fowler, J. (2015). Online social networks and offline protest. EPJ Data Science, 4(1), 1.

Wickham, H., & Grolemund, G. (2016). R for Data Science. O’Reilly.

The following other ECPR Methods School courses could be useful in combination with this one in a ‘training track’.

Recommended Courses Before

Summer School

Automated Collection of Web and Social Data

Introduction to Quantitative Text Analysis

Social Networks: Theoretically Informed Analysis with UCINET

Winter School

Inferential Network Analysis

Recommended Courses After

Summer School

Quantitative Text Analysis

Advanced Social Network Analysis and Visualisation with R

Winter School

Methods of Modern Causal Analysis Based on Observational Data

Analysing Political and Social Sequences

Additional Information

Disclaimer

This course description may be subject to subsequent adaptations (e.g. taking into account new developments in the field, participant demands, group size, etc). Registered participants will be informed in due time.

Note from the Academic Convenors

By registering for this course, you confirm that you possess the knowledge required to follow it. The instructor will not teach these prerequisite items. If in doubt, contact the instructor before registering.

