
Introduction to Machine Learning for the Social Sciences

Course Dates and Times

Monday 5 – Friday 9 August

14:00–15:30 / 16:00–17:30 (ending slightly earlier on Friday)

Bruno Castanho Silva

b.paula.castanho.e.silva@fu-berlin.de

Freie Universität Berlin

Machine Learning (ML) is an analytical approach in which users can build statistical models that 'learn' from data to make accurate predictions and decisions.

From customer-recommendation systems to policy design and implementation, machine learning algorithms are becoming ubiquitous in a big data world, and their potential is now starting to be explored in the social sciences.

In this course, you will learn the fundamentals of machine learning as a data analysis approach, and get an overview of the most common and versatile classes of ML techniques in use today.

By the end of this course
You will be able to identify the technique best suited to your question and data, and will know how to design, test, and interpret your models. You will also know how to proceed independently towards more advanced algorithms and problems.

This is an introductory course, so maths and programming technicalities will be kept to a minimum. If you can run and interpret multivariate regressions in R, you can (and should!) take this course.

ECTS credits for this course, with the tasks required for each level:

2 credits: Attend at least 90% of course hours, participate fully in in-class activities, and carry out the necessary reading and/or other work prior to and after classes.

3 credits: As above, plus deliver one take-home data analysis assignment.

4 credits: As above, plus a final short research paper analysing your own data, to be delivered a week after the course.


Instructor Bio

Bruno Castanho is a postdoctoral researcher at the Cologne Center for Comparative Politics, University of Cologne.

He holds a PhD in political science from Central European University and has experience teaching various topics in research design and methodology, including causal inference, machine learning, and structural equation modelling. 

Bruno has written a textbook on Multilevel Structural Equation Modelling in collaboration with fellow Winter School instructors.

Twitter: @b_castanho

Machine learning (ML) in the social sciences is still in its infancy. It has recently been used in areas such as policy evaluation, electoral fraud detection, election forecasting, and quantitative text and social media analysis, but it is still not widely known or understood even by many skilled, quantitatively oriented social scientists. Given its rise in related disciplines such as economics, however, machine learning can be expected to become a major trend in quantitative social science in the coming years.

This course gives you the basic toolkit to understand, evaluate, and build your own machine learning models for research and applied problems. I will introduce ML algorithms conceptually; we will not go deeply into the maths behind their workings. All the algorithms are available as off-the-shelf packages and functions in R, so you won't need to code them from scratch, only call ready-made R functions.


Day 1
A general introduction to the logic of machine learning, explaining how ML relates to more familiar statistical methods. We discuss the ML emphasis on prediction (and how this is simply the other side of the explanation coin) and on substantive effects over statistically significant ones; the problem of overfitting and potential solutions to it; and the bias/variance trade-off. We will also talk about the building blocks of machine learning models and algorithms: the role of hyperparameters, tuning, cross-validation, and supervised vs. unsupervised learning. Throughout our discussions we will refer to (and apply in R) the K-nearest-neighbour classifier to illustrate these conceptual points.
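To give a taste of the lab work, here is a minimal R sketch of these Day 1 ideas, not taken from the course materials: it fits a K-nearest-neighbour classifier and tunes the hyperparameter k on held-out data. It uses the built-in iris data and the class package (bundled with standard R installations); the seed, the 70/30 split, and the grid of k values are illustrative choices, and a simple hold-out split stands in for full k-fold cross-validation.

# K-nearest-neighbour classification with hyperparameter tuning (illustrative)
library(class)

set.seed(42)
n <- nrow(iris)
train_idx <- sample(n, size = round(0.7 * n))  # illustrative 70/30 hold-out split

x_train <- iris[train_idx, 1:4]
x_test  <- iris[-train_idx, 1:4]
y_train <- iris$Species[train_idx]
y_test  <- iris$Species[-train_idx]

# Tune k by comparing accuracy on the held-out test set across candidate values
for (k in c(1, 3, 5, 7, 9)) {
  pred <- knn(train = x_train, test = x_test, cl = y_train, k = k)
  cat(sprintf("k = %d: accuracy = %.3f\n", k, mean(pred == y_test)))
}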

The next three days deal with supervised learning, the class of problems in which we know the value of the dependent variable for at least part of our data, meaning we can train models and evaluate the performance of their predictions against actually observed values.

Day 2
We begin with classification problems, those in which the dependent variable is categorical; these are among the most common problems to which ML is applied. It is especially useful, for example, as a substitute for human coding when some raw data are available: think of classifying political parties as left or right, or countries as democratic or not. We start briefly with logistic regression as a common baseline, then cover some of the most widely used algorithms, such as support vector machines (SVM) and decision trees.
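As an illustration of this Day 2 workflow, the sketch below compares a logistic regression baseline with an SVM and a decision tree on a binary outcome. It assumes the e1071 and rpart packages are installed (install.packages(c("e1071", "rpart"))); the iris-based example and the split are illustrative, not course material.

# Classification: logistic regression baseline vs. SVM vs. decision tree
library(e1071)
library(rpart)

set.seed(42)
dat <- droplevels(subset(iris, Species != "setosa"))  # binary outcome
idx <- sample(nrow(dat), size = round(0.7 * nrow(dat)))
train <- dat[idx, ]
test  <- dat[-idx, ]

# Logistic regression baseline (glm may warn about fitted probabilities
# near 0 or 1 on this small, nearly separable example)
glm_fit  <- glm(Species ~ ., data = train, family = binomial)
glm_pred <- ifelse(predict(glm_fit, test, type = "response") > 0.5,
                   levels(dat$Species)[2], levels(dat$Species)[1])

# Support vector machine
svm_fit  <- svm(Species ~ ., data = train)
svm_pred <- predict(svm_fit, test)

# Decision tree
tree_fit  <- rpart(Species ~ ., data = train, method = "class")
tree_pred <- predict(tree_fit, test, type = "class")

# Compare predictive accuracy on the held-out set
sapply(list(logit = glm_pred, svm = svm_pred, tree = tree_pred),
       function(p) mean(p == test$Species))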

Day 3
We move to regression problems, i.e. continuous outcomes. We look at tree-based models such as random forests and boosting. These methods tend to have excellent predictive performance and can generate interpretable models: they allow users to identify the variables most important for predicting the outcome (substantive importance, not statistical significance), which means they can be applied to any problem for which linear regression could be used.
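A minimal sketch of the Day 3 approach, assuming the randomForest package is installed (install.packages("randomForest")); the mtcars example, seed, and number of trees are illustrative choices rather than the course's own examples.

# Tree-based regression with a random forest, plus variable importance
library(randomForest)

set.seed(42)
idx   <- sample(nrow(mtcars), size = round(0.7 * nrow(mtcars)))
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]

# Random forest predicting fuel efficiency (mpg) from the other variables
rf_fit <- randomForest(mpg ~ ., data = train, ntree = 500, importance = TRUE)

# Predictive performance on held-out data: root mean squared error
pred <- predict(rf_fit, test)
sqrt(mean((pred - test$mpg)^2))

# Which variables matter most for prediction (substantive importance)
importance(rf_fit)
varImpPlot(rf_fit)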

Day 4
We advance to variable-selection methods: regressions in which independent variables with small coefficients (meaning low predictive power) are penalised and effectively removed, retaining only those that substantively help predict or explain the outcome. These models are especially useful when there are many more variables than observations (p > N), a situation very common in text analysis. We also discuss and demonstrate the basic aspects of neural networks and deep learning, the most widely applied type of machine learning in industry today.
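The following sketch illustrates the Day 4 idea of penalised regression with the lasso, which shrinks small coefficients exactly to zero and thereby performs variable selection. It assumes the glmnet package is installed (install.packages("glmnet")); the simulated data, in which only 5 of 50 predictors truly matter, are illustrative.

# Lasso regression with cross-validated penalty selection
library(glmnet)

set.seed(42)
n <- 100
p <- 50                                  # many predictors relative to n
x <- matrix(rnorm(n * p), n, p)
beta <- c(rep(2, 5), rep(0, p - 5))      # only the first 5 variables matter
y <- drop(x %*% beta + rnorm(n))

# Cross-validation chooses the penalty strength lambda;
# alpha = 1 gives the lasso (alpha = 0 is ridge, in between is elastic net)
cv_fit <- cv.glmnet(x, y, alpha = 1)

# Coefficients at the CV-selected lambda: most are exactly zero
coef(cv_fit, s = "lambda.min")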

Day 5
Dedicated to unsupervised learning. There are cases where we know the characteristics of our observations on the independent variables but not on the dependent variable, so it is not possible to train a model on known outcome values. We will discuss and apply clustering methods, which identify groups of observations in the data given the independent variables, as well as outlier detection. We also cover principal components analysis and topic models, a variation of unsupervised learning for text-as-data problems. At the end, we review some applications of machine learning in the social sciences, to give you an idea of how to design your workflow and present your results to the academic community.
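A minimal sketch of the Day 5 material using only base R: k-means clustering followed by principal components analysis. The iris example and the choice of 3 clusters are illustrative; we drop the species labels and then check whether the clusters recover them.

# Unsupervised learning: k-means clustering and PCA (base R only)
set.seed(42)
x <- scale(iris[, 1:4])        # standardise: clustering is scale-sensitive

# K-means with 3 clusters and several random restarts
km <- kmeans(x, centers = 3, nstart = 25)

# How well do the found clusters line up with the (withheld) species labels?
table(cluster = km$cluster, species = iris$Species)

# PCA for dimensionality reduction: project onto the first two components
pca <- prcomp(x)
summary(pca)                   # variance explained by each component
plot(pca$x[, 1:2], col = km$cluster,
     main = "Iris observations on the first two principal components")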


By the end of this course
You will be able to identify which kind of machine learning method best suits the data and question you have at hand (whether it is a classification or a regression problem, supervised or unsupervised, etc.), and to fit and interpret the appropriate models covered in class.

This course is taught at a beginner's level of statistical learning; you are expected only to be familiar with the concepts and practice of traditional multivariate regression analysis.

All examples will be given in R, so you should also have a working knowledge of it: basic data manipulation, plotting, and running statistical models.

Advanced knowledge of another programming language, such as Python, can substitute for R knowledge if you are prepared to adapt the code and exercises on your own.

NB: This is an introductory course for students with no prior knowledge of machine learning or algorithms. If you are already familiar with these methods, or have experience with computational sciences, programming, and software development, you are likely to get frustrated with the basic level of instruction on this course.

Day-by-Day Topic Details

Day 1: General Introduction to Machine Learning

A general introduction to the logic of machine learning, explaining how ML relates to more familiar statistical methods.

We will discuss the ML emphasis on prediction (and how this is simply the other side of the explanation coin) and on substantive effects over statistically significant ones; the problem of overfitting and potential solutions to it; and the bias/variance trade-off.

We also talk about the building blocks of machine learning models and algorithms: the role of hyperparameters, tuning, cross-validation, and supervised vs. unsupervised learning.

Throughout our discussions we will refer to (and apply in R) the K-nearest-neighbour classifier to illustrate these conceptual points.

Lecture + Lab

Conceptual issues (and how ML relates to conventional statistical methods in social sciences):

  • Prediction and explanation;
  • Supervised and unsupervised learning;
  • Cross-validation and overfitting;
  • Bias/variance trade-off;
  • Hyperparameters and tuning;
  • K-nearest-neighbour classifier to illustrate these issues

Mandatory readings

James, G., Witten, D., Hastie, T., Tibshirani, R. 2013
An Introduction to Statistical Learning with Applications in R
New York: Springer, Ch. 2 and 5

Cranmer, S. J., and Desmarais, B. 2017
What Can We Learn from Predictive Modeling?
Political Analysis 25(2): 145–166

Day 2: Supervised Learning 1 – Classification Problems

We discuss classification problems, some of the most common problems to which ML is applied in the social sciences.

It is especially useful, for example, as a substitute for human coding when some raw data are available: think of classifying political parties as left or right, or countries as democratic or not.

We cover some of the most common algorithms in use, such as support vector machines (SVM) and decision trees.

Lecture + Lab

  • How to evaluate model performance and predictive accuracy with categorical outcomes
  • Support vector machine
  • Logistic regression
  • Decision trees

Mandatory readings

James, G., Witten, D., Hastie, T., Tibshirani, R. 2013
An Introduction to Statistical Learning with Applications in R
New York: Springer, Ch. 8

Day 3: Supervised Learning 2 – Regression Problems

We move to regression problems – i.e., continuous outcomes.

We look again at decision trees and proceed to tree-based ensemble models such as random forests and boosting. These methods tend to have excellent predictive performance and are also able to generate interpretable models. They allow users to identify which variables are most important for predicting the outcome (substantive importance, not statistical significance), which means they can be applied to any problem in which linear regression could be used.

Lecture + Lab

  • How to evaluate model performance and predictive accuracy with continuous outcomes
  • Tree-based methods
    • Boosting
    • Random forest

Mandatory readings

James, G., Witten, D., Hastie, T., Tibshirani, R. 2013
An Introduction to Statistical Learning with Applications in R
New York: Springer, Ch. 9

Day 4: Supervised Learning 3 – Advanced Methods and Deep Learning

We cover other, more flexible methods for supervised learning.

We introduce variable-selection models, which are especially useful when there are many more variables than observations (p > N), as is common in text analysis.

We also explain and demonstrate deep learning, the state-of-the-art tool in artificial intelligence today.

Lecture + Lab

  • Further supervised methods
  • Variable selection methods
    • Lasso, ridge, and elastic nets
  • Deep Learning
  • Ensembles

Mandatory readings

James, G., Witten, D., Hastie, T., Tibshirani, R. 2013
An Introduction to Statistical Learning with Applications in R
New York: Springer, p. 214–228

Chollet, F., and Allaire, J. J. 2017
Deep Learning with R
Manning, Ch. 1: What Is Deep Learning?

Day 5: Unsupervised Learning

These are cases where we know the characteristics of our observations on the independent variables but not on the dependent variable, so it is not possible to train the model on known outcome values.

We discuss and apply clustering methods, which identify groups of observations in the data given the independent variables; outlier detection; and latent-variable models for dimensionality reduction. Here, familiarity with principal component analysis and structural equation modelling is an advantage, but not a necessity.

We also cover topic models, a class of unsupervised learning for text data.

Lecture + Lab

  • Clustering
    • K-means
    • Hierarchical clustering
    • Model-based and partial clustering
  • Dimensionality Reduction
    • Principal Components Analysis
  • Topic Models

Mandatory readings

James, G., Witten, D., Hastie, T., Tibshirani, R. 2013
An Introduction to Statistical Learning with Applications in R
New York: Springer, Ch. 10, p. 373–400

Conway, D., and White, J. M. 2012
Machine Learning for Hackers
Sebastopol, CA: O'Reilly, Ch. 8, p. 205–214

Day-by-Day Readings

Day 1

Cranmer, S. J., and Desmarais, B. 2017. “What Can We Learn from Predictive Modeling?” Political Analysis 25(2): 145–166.

James, G., Witten, D., Hastie, T., Tibshirani, R. 2013. An Introduction to Statistical Learning with Applications in R. New York: Springer, Ch. 2 and 5.

Day 2

James, G., Witten, D., Hastie, T., Tibshirani, R. 2013. An Introduction to Statistical Learning with Applications in R. New York: Springer, Ch. 8.

Day 3

James, G., Witten, D., Hastie, T., Tibshirani, R. 2013. An Introduction to Statistical Learning with Applications in R. New York: Springer, Ch. 9.

Day 4

James, G., Witten, D., Hastie, T., Tibshirani, R. 2013. An Introduction to Statistical Learning with Applications in R. New York: Springer, p. 214–228.

Hastie, T., Tibshirani, R., Friedman, J. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. New York: Springer, Ch. 11 and 16.

Chollet, F., and Allaire, J. J. 2017. Deep Learning with R. Manning, Ch. 1 (What Is Deep Learning?).

Day 5

James, G., Witten, D., Hastie, T., Tibshirani, R. 2013. An Introduction to Statistical Learning with Applications in R. New York: Springer, Ch. 10, p. 373–400.

Conway, D., and White, J. M. 2012. Machine Learning for Hackers. Sebastopol, CA: O'Reilly, Ch. 8, p. 205–214.

Software Requirements

Please install R and RStudio on your laptop prior to the first class.

If you have trouble installing, contact Bruno via email (see the bio at the top of this page), or via Moodle, Facebook, or Twitter, whichever means of communication you prefer.

You are encouraged to use only open-source software throughout the course. All examples and tutorials provided will be in R, but those familiar with other programming languages are welcome to use them.

Hardware Requirements

Please bring your own laptop.

Literature

ML textbooks you can consult alongside the main textbook (James et al. 2013), and articles applying such models in political science.

Chollet, F., and Allaire, J. J. 2017. Deep Learning with R. Manning, Ch 1 (What Is Deep Learning?).

Conn, D., and Ramirez, C. M. 2016. Random Forests and Fuzzy Forests in Biomedical Research. In: Alvarez, R. M. (ed.), Computational Social Science: Discovery and Prediction. Cambridge: Cambridge University Press, p. 168–197.

Conway, D., and White, J. M. 2012. Machine Learning for Hackers. Sebastopol, CA: O’Reilly.

Cranmer, S. J., and Desmarais, B. 2017. “What Can We Learn from Predictive Modeling?” Political Analysis 25(2): 145–166.

Grimmer, J., and Stewart, B. M. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis 21(3): 267–297.

Hastie, T., Tibshirani, R., Friedman, J. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. New York: Springer.

Lantz, B. 2015. Machine Learning with R. 2nd ed. Birmingham: Packt Publishing.

Levin, I., Pomares, J., Alvarez, R. M. 2016. Using Machine Learning Algorithms to Detect Election Fraud. In: Alvarez, R. M. (ed.), Computational Social Science: Discovery and Prediction. Cambridge: Cambridge University Press, p. 266–294.

Samii, C., Paler, L., Daly, S. Z. 2016. Retrospective Causal Inference with Machine Learning Ensembles: An Application to Anti-recidivism Policies in Colombia. Political Analysis, online first, doi:10.1093/pan/mpw019.

Witten, I., and Frank, E. 2005. Data Mining: Practical Machine Learning Tools and Techniques. 2nd ed. San Francisco: Elsevier.

Recommended Courses to Cover Before this One

Introduction to R
Python
Linear Regression
GLM
Advanced Topics in Regression

Recommended Courses to Cover After this One

Automated Web Data Collection with R
Quantitative text analysis
Causal inference with observational data