ECPR Summer School
Central European University, Budapest
27 July - 12 August 2017




SC104 - Big Data Analysis in the Social Sciences

Instructor Details

Pablo Barberá

Institution:
University of Southern California

Instructor Bio

Pablo Barberá gained his PhD in Politics from New York University.

He is currently an Assistant Professor of International Relations at the University of Southern California, and a former Moore-Sloan Fellow at the Center for Data Science at NYU.

Pablo's primary research interests include quantitative political methodology and computational social science, applied to the study of political and social behaviour.

He is an active contributor to the open source community and has authored several R packages to mine social media data.


Course Dates and Times

Monday 7 - Friday 11 August

14:00-17:30

Please see Timetable for full details.

Location

Building: N13 Room: 517

Prerequisite Knowledge

The course will assume familiarity with the R statistical programming language. Participants should know how to read datasets into R, work with vectors and data frames, and run basic statistical analyses, such as linear regression. More advanced knowledge of statistical computing, such as writing functions and loops, is helpful but not required.

Short Outline

Massive-scale datasets from web sources and social media, newly digitized text sources, and large longitudinal survey studies present exciting opportunities for the study of social and political behaviour, but at the same time their size and heterogeneity present significant challenges. This course will introduce participants to new computational methods and tools required to explore and analyse Big Data in the social sciences using the R programming language. It will be structured around techniques to deal with the 3 V's of Big Data: volume, variety, and veracity. First, we will cover the basics of parallel programming and cloud computing to analyse large-scale datasets. Second, we will learn how to scale human tasks through the use of machine learning methods. Finally, we will discuss how to automatically discover insights from large text and network datasets and validate the output of this analysis. The course will follow a "learning-by-doing" approach, with short theoretical sessions followed by "data challenges" where participants will need to apply new methods.

Long Course Outline

The volume and heterogeneity of the new datasets available in the digital age present unprecedented opportunities for social scientists, but also new methodological challenges. Computing a simple average for a variable across groups can take minutes when a researcher is working with government records, large-scale survey studies or social media datasets with millions of rows. A corpus of legislative speeches can often include thousands of documents, possibly in multiple languages, making it impossible to rely on human annotators to classify the issue to which each speech refers. The goal of this course is to overcome these challenges by learning the computational methods required to explore and analyze “Big Data” from a variety of sources using the R programming language.

The course will follow a "learning-by-doing" approach. Code and data for all the applications will be provided, and students are required to bring their own laptops to follow along. Each session will start with around 45-60 minutes of lecturing and coding led by the instructor, followed by "data challenges" where participants will need to apply what they have learned to new datasets. Although these data challenges will be started in class, participants will be asked to complete them after the end of each session. Solutions for each challenge will be posted by the beginning of the following class, and we will leave time for questions.

The course will begin with a discussion of the concept of "Big Data" and the research opportunities and challenges of the use of massive-scale datasets in the social sciences. The first session will also provide a foundation of R coding skills upon which we will rely during the rest of the course. Here, we will go over existing packages to efficiently analyze large-scale datasets in R, how to parallelize for loops, and how to read and write large files.

The second session will focus on the most common application of Big Data in the social sciences: large-scale text classification. After a quick overview of the basics of machine learning, we will discuss specific details of the implementation of supervised learning algorithms on massive-scale datasets, in particular methods recently developed in computer science such as stochastic gradient descent, xgboost, and ensemble classifiers. Our emphasis will lie on the practical aspects: we will study these methods in the context of an application of sentiment analysis to newspaper articles. We will go through the entire research process, from the creation of a training dataset labeled by humans using crowd-sourcing platforms, through intermediate steps such as cleaning and preprocessing the corpus of documents, to the application and validation of the machine learning algorithm.

Exploratory data analysis can be a powerful tool for social scientists when they are interested in analyzing a new dataset. The third and fourth sessions will go over the existing tools for large-scale discovery in “Big Data” using R, applied to both text and network datasets. We will start with topic models, which allow researchers to automatically identify latent classes of documents in a corpus, with an application to the classification of Facebook posts by politicians into relevant political issues. Then, we will turn our attention to social networks, and in particular the detection of communities of individuals with shared interests or political preferences. The running example in this part of the course will be the classification of Twitter users along a latent ideological dimension based on the structure of the networks in which they are embedded. A common theme of both sessions will be the emphasis on validation: once an unsupervised model is completed, how can we measure the quality of the results? We will discuss basic concepts of measurement theory, and best practices in the validation of the results from unsupervised statistical models.

The course will conclude in its fifth session with an introduction to cloud computing and database management for social scientists. Most available online resources and courses on these topics assume students are proficient in UNIX or have a background in programming. Here, however, we will start from scratch and focus on the coding skills required to conduct statistical analyses with data hosted in the “cloud”, while at the same time helping participants become familiar with programming concepts that can facilitate future collaborations with computer scientists. We will cover the most important UNIX commands, which are needed, for example, to interact with the High-Performance Clusters (HPC) now available in most universities, and test our skills in an online virtual machine hosted on Amazon Elastic Compute Cloud (EC2). In the second half of this session, we will learn the basics of SQL, and run our own queries on a dataset with over a billion geolocated tweets hosted in Google BigQuery.

Day-to-Day Schedule

Monday: Introduction to Big Data in the social sciences

Session 1. What is Big Data? Applications to the social sciences.

Session 2. Parallel programming with R.

Tuesday: Large-scale classification

Session 1. Supervised machine learning: regularized regression, support vector machines, random forests.

Session 2. Large-scale text classification. Crowd-sourcing the creation of training datasets.

Wednesday: Large-scale discovery (I)

Session 1. Unsupervised machine learning: PCA, topic models (LDA, STM).

Session 2. Reducing high-dimensional data for visualization and analysis.

Thursday: Large-scale discovery (II). Measurement theory

Session 1. Large-scale discovery in networks: latent space models.

Session 2. Measurement theory: validity and reliability.

Friday: Cloud computing. Basics of SQL

Session 1. Basics of UNIX. Cloud computing using an online virtual machine.

Session 2. Querying billion-row datasets with SQL and Google BigQuery.

Day-to-Day Reading List


As the course progresses, the reading load will decrease and students will be required to work on the coding challenges introduced in class. The readings and coding challenges should be completed before the beginning of the class under which they are listed below.

Monday

- Lazer, D., & Radford, J. (2017). Data ex Machina: Introduction to Big Data. Annual Review of Sociology, 43(7):1-21.

- Golder, S. A., & Macy, M. W. (2014). Digital footprints: Opportunities and challenges for online social research. Sociology, 40(1), 129.

- Ruths, D., & Pfeffer, J. (2014). Social media for large studies of behavior. Science, 346(6213), 1063-1064.

Tuesday

- James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning. New York: Springer. (Chapters 1, 3, and 5)

- Benoit, K., Conway, D., Lauderdale, B. E., Laver, M., & Mikhaylov, S. (2015). Crowd-sourced text analysis: reproducible and agile production of political data. American Political Science Review.

Wednesday

- Grimmer, J., & Stewart, B. M. (2013). Text as data: The promise and pitfalls of automatic content analysis methods for political texts. Political Analysis, mps028.

- Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4), 77-84.

Thursday

- Barberá, P. (2015). Birds of the same feather tweet together: Bayesian ideal point estimation using Twitter data. Political Analysis, 23(1), 76-91.

Friday

No readings assigned

Software Requirements

The course will use the open-source software R, which is freely available for download at https://www.r-project.org/. We will interact with R through RStudio, which can be downloaded at https://www.rstudio.com/products/rstudio/download/. Students should download the most recent version at the time of the course (currently R 3.3.3 and RStudio 1.0.136).

Hardware Requirements

Students are expected to bring their own laptops to class. The preferred configuration is at least 8GB of RAM and at least 20GB of free disk space.

Literature

Barberá, P., Jost, J. T., Nagler, J., Tucker, J. A., & Bonneau, R. (2015). Tweeting from left to right: Is online political communication more than an echo chamber? Psychological Science.

González-Bailón, S., Borge-Holthoefer, J., & Moreno, Y. (2013). Broadcasters and hidden influentials in online protest diffusion. American Behavioral Scientist.

González-Bailón, S., & Paltoglou, G. (2015). Signals of public opinion in online communication: A comparison of methods and data sources. The ANNALS of the American Academy of Political and Social Science, 659(1), 95-107.

Leskovec, J., Rajaraman, A., & Ullman, J. D. (2014). Mining of massive datasets. Cambridge University Press.

Matloff, N. (2011). The art of R programming: A tour of statistical software design. No Starch Press.

Monroe, B. L., Pan, J., Roberts, M. E., Sen, M., & Sinclair, B. (2015). No! Formal theory, causal inference, and big data are not contradictory trends in political science. PS: Political Science & Politics, 48(01), 71-74.

Nagler, J., & Tucker, J. A. (2015). Drawing inferences and testing theories with big data. PS: Political Science & Politics, 48(01), 84-88.

Lazer, D., Pentland, A. S., Adamic, L., Aral, S., Barabasi, A. L., Brewer, D., ... & Jebara, T. (2009). Life in the network: the coming age of computational social science. Science (New York, NY), 323(5915), 721.

Roberts, M. E., Stewart, B. M., Tingley, D., Lucas, C., Leder-Luis, J., Albertson, B., ... & Rand, D. (2014). Topic models for open ended survey responses with applications to experiments. American Journal of Political Science, 58, 1064-1082.

Steinert-Threlkeld, Z. C., Mocanu, D., Vespignani, A., & Fowler, J. (2015). Online social networks and offline protest. EPJ Data Science, 4(1), 1.

Wickham, H., & Grolemund, G. (2016). R for Data Science. O’Reilly.

The following other ECPR Methods School courses could be useful in combination with this one in a ‘training track’.

Recommended Courses Before

SC103 - Automated Collection of Web and Social Data

SD304 - Gentle Introduction to Machine Learning for the Social Sciences

Additional Information

Disclaimer

The information contained in this course description form may be subject to subsequent adaptations (e.g. taking into account new developments in the field, specific participant demands, group size etc.). Registered participants will be informed in due time in case of adaptations.

Note from the Academic Convenors

By registering for this course, you certify that you possess the prerequisite knowledge required to follow it. The instructor will not teach these prerequisite items. If you are not sure whether you possess this knowledge to a sufficient level, we suggest you contact the instructor before you proceed with your registration.

