ECPR


Introduction to Machine Learning for the Social Sciences

Bruno Castanho Silva
bcsilva@wiso.uni-koeln.de

University of Cologne

Bruno Castanho is a postdoctoral researcher at the Cologne Center for Comparative Politics, University of Cologne.

He holds a PhD in political science from Central European University and has experience teaching various topics in research design and methodology, including causal inference, machine learning, and structural equation modelling. 

Bruno has written a textbook on Multilevel Structural Equation Modelling in collaboration with fellow Winter School instructors Constantin Manuel Bosancianu and Levi Littvay.

  @b_castanho


Course Dates and Times

Monday 25 February – Friday 1 March, 14:00 – 17:30 (finishing slightly earlier on Friday)
15 hours over five days

Prerequisite Knowledge

This is a beginner's course. I expect you to be familiar only with concepts and practices of traditional multivariate regression analysis.

All examples will be given in R, so you should also have a minimal working knowledge of this software (advanced knowledge of another programming language, such as Python, can substitute for this if you are prepared to adapt the code and exercises on your own).

By registering for this course, you confirm that you possess the knowledge required to follow it. The instructor will not teach these prerequisite items. If in doubt, contact the instructor before registering.


Short Outline

Machine Learning is an analytical approach in which users can build statistical models that 'learn' from data to make accurate predictions and decisions.

From customer-recommendation systems (think of Netflix suggesting what movies you should watch) to policy design and implementation, machine learning algorithms are becoming ubiquitous in a big data world.

Their potential, however, is only starting to be explored in the social sciences, and only in a few specific areas.

In this course, you will learn the fundamentals of machine learning as a data analysis approach, and will get an overview of the most common and versatile classes of ML algorithms in use today.

By the end of the course, you will be able to identify which kind of technique is most suitable for your research question and data, and how to design, test and interpret your models. You will also be equipped with sufficient basic knowledge to proceed independently to more advanced algorithms and problems.

This is an introductory course, so math and programming technicalities will be kept to a minimum. If you can run and interpret multivariate regressions in R, you can (and should!) take this course.

Tasks for ECTS Credits

2 credits (pass/fail grade) Attend at least 90% of course hours, participate fully in in-class activities, and carry out the necessary reading and/or other work prior to, and after, class.

3 credits (to be graded) As above, plus complete one task (tbc).

4 credits (to be graded) As above, plus complete two tasks (tbc).


Long Course Outline

Machine learning in the social sciences is still in its infancy. It has recently been used in areas such as policy evaluation, electoral fraud detection, election forecasting, and quantitative text and social media analysis, but is still not widely known or understood – even by many skilled, quantitatively oriented social scientists.

Given its rise in related disciplines such as economics (as The Economist reports), however, we should expect machine learning to become a new trend in quantitative social science analysis in the coming years.

This course offers the basic toolkit to enable you to understand, evaluate, and build your own machine learning models for research and applied problems. I will introduce ML algorithms conceptually, and we will not go deeply into the math behind their workings. All are available as off-the-shelf packages in R, so you will not be required to code them from scratch, but only to use ready-made R functions.


Day 1
A general introduction to the logic of machine learning. You will learn how ML relates to more familiar statistical methods. We will discuss the ML emphasis on prediction (and how this is just the other side of the explanation coin) and on substantive effects over statistically significant ones; the problems of overfitting and potential solutions to it; and the bias/variance trade-off. We will also cover the building blocks of machine learning models and algorithms: the role of hyperparameters, tuning, cross-validation, supervised vs. unsupervised learning, and the Bayes classifier as an ideal goal. Throughout our discussions we will refer to (and apply in R) the K-nearest-neighbour classifier to illustrate these conceptual points.
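To give a flavour of what the lab work will look like, here is a minimal sketch of the K-nearest-neighbour classifier in R, using the built-in iris data and the `knn()` function from the standard `class` package (the data set, the train/test split, and k = 5 are arbitrary choices for illustration; in class we will tune k via cross-validation):

```r
# K-nearest-neighbour classification with a simple train/test split,
# illustrating out-of-sample prediction.
library(class)  # provides knn()

set.seed(42)
train_idx <- sample(nrow(iris), 100)               # hold out the rest for testing
train_x <- iris[train_idx, 1:4]
test_x  <- iris[-train_idx, 1:4]
train_y <- iris$Species[train_idx]
test_y  <- iris$Species[-train_idx]

pred <- knn(train_x, test_x, cl = train_y, k = 5)  # k is a tuning hyperparameter
mean(pred == test_y)                               # out-of-sample accuracy
```

Changing `k` and comparing the resulting test accuracy is exactly the kind of hyperparameter tuning we will discuss on day 1.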

The following two days deal with supervised learning, the class of problems in which we know the value of the dependent variable for at least part of our data, meaning we can train models and evaluate the performance of their prediction against actually observed values.

Day 2
We discuss classification problems – those in which the dependent variable is categorical – some of the most common problems to which ML is applied. It is especially useful, for example, as a substitute for human coding when some raw data is available – think of classifying political parties as left or right, or countries as democratic or not, etc.

We will cover some of the most common algorithms in use, such as support vector machines (SVM) and tree-based methods (random forests and boosting). Tree-based algorithms have often been used in diagnosis and effective decision making, but are also excellent for social scientists because they give interpretable explanatory models. They allow users to identify the most important variables for predicting the outcome (substantive importance, not statistical significance), which means they can be applied to any problem in which linear regression could be used.
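As a sketch of the interpretability point, the following R snippet (assuming the `randomForest` package is installed; the iris data again stands in for a substantive example) fits a random forest and extracts variable-importance scores:

```r
# Random forest classifier with variable-importance scores, illustrating
# how tree-based methods identify the predictors that matter most.
library(randomForest)

set.seed(42)
fit <- randomForest(Species ~ ., data = iris, ntree = 500, importance = TRUE)
print(fit)       # confusion matrix and out-of-bag error estimate
importance(fit)  # substantive importance of each predictor
```

The importance table, not a table of p-values, is what a social scientist would report from such a model.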

Day 3
We move to regression problems – i.e. continuous outcomes. We look again at tree-based models and introduce variable-selection methods. Variable-selection models are regressions in which independent variables with small coefficients (meaning low predictive power) are penalised and effectively removed, retaining only those that substantively help predict or explain the outcome. These models are especially useful when there are many more variables than observations (p > n), a situation very common in text analysis.
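The penalisation idea can be sketched with the lasso (assuming the `glmnet` package is installed; the simulated data, with only two of twenty predictors truly mattering, is a toy illustration):

```r
# Lasso regression with a cross-validated penalty, illustrating how
# variables with low predictive power are shrunk to exactly zero.
library(glmnet)

set.seed(42)
x <- matrix(rnorm(100 * 20), nrow = 100)  # 20 candidate predictors
y <- x[, 1] - 2 * x[, 2] + rnorm(100)     # only the first two matter

cv_fit <- cv.glmnet(x, y, alpha = 1)      # alpha = 1 gives the lasso penalty
coef(cv_fit, s = "lambda.min")            # most coefficients shrunk to zero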

Day 4
We cover unsupervised learning: cases where we know the values of our observations on the independent variables but not on the dependent variable, so the model cannot be trained on known outcome values. We will discuss and apply clustering methods, which identify groups of observations in the data given the independent variables, as well as outlier detection. Familiarity with Principal Component Analysis and Structural Equation Modelling is an advantage, but not essential.
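Clustering needs nothing beyond base R; as a sketch, k-means applied to the iris measurements (deliberately ignoring the species labels) recovers groupings that can then be compared with the known species:

```r
# K-means clustering of the iris measurements, illustrating unsupervised
# learning: no outcome labels are used when fitting the model.
set.seed(42)
km <- kmeans(scale(iris[, 1:4]), centers = 3, nstart = 25)

# Cross-tabulate the recovered clusters against the (held-back) species
table(cluster = km$cluster, species = iris$Species)
```

In a real unsupervised problem there is no label column to compare against; the cross-tabulation here is only possible because iris happens to have one.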

Day 5
A brief introduction to more advanced issues in machine learning, which will depend to some extent on students' interests and needs. Topics we will cover, initially, include neural networks and deep learning as extremely efficient predictors, the power of ensembles (training models with several algorithms and then combining their joint prediction), and how to use machine learning for causal inference in the social sciences.
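The ensemble idea can be sketched very simply: average the predicted probabilities of two different algorithms (a toy illustration, not the ensemble procedure used in class; it assumes the `randomForest` package is installed):

```r
# A minimal two-model ensemble: average the predicted probabilities of a
# logistic regression and a random forest for a binary outcome.
library(randomForest)

dat <- droplevels(iris[iris$Species != "setosa", ])  # binary: versicolor vs virginica
set.seed(42)
idx   <- sample(nrow(dat), 70)
train <- dat[idx, ]
test  <- dat[-idx, ]

glm_fit <- glm(Species ~ ., data = train, family = binomial)
rf_fit  <- randomForest(Species ~ ., data = train)

p_glm <- predict(glm_fit, test, type = "response")           # P(virginica), model 1
p_rf  <- predict(rf_fit, test, type = "prob")[, "virginica"] # P(virginica), model 2
p_ens <- (p_glm + p_rf) / 2                                  # joint prediction

pred <- ifelse(p_ens > 0.5, "virginica", "versicolor")
mean(pred == test$Species)                                   # ensemble accuracy
```

Averaging predictions from models with different strengths and weaknesses is the intuition behind the more sophisticated ensemble methods discussed on day 5.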


By the end of this course, you will be able to identify which kind of machine learning method best suits the data and question you have at hand (whether it is a classification or regression problem, supervised or unsupervised, etc.), and to fit and interpret the appropriate models covered in class. Students who want full credits must submit daily take-home exercises.

This is an introductory course for social science students with no prior knowledge of machine learning or algorithms at large. Those who are familiar with and have used these methods already, or who have experience with computational sciences, programming, and software development, are likely to get frustrated with the basic level of teaching.

Day Topic Details
Day 1 General Introduction to Machine Learning

The first day will be a general introduction to the logic of machine learning, and will show students how ML relates to more familiar statistical methods. We will discuss the ML emphasis on prediction (and how this is just the other side of the explanation coin) and on substantive effects over statistically significant ones; the problems of overfitting and potential solutions to it; and the bias/variance trade-off. We will also cover the building blocks of machine learning models and algorithms: the role of hyperparameters, tuning, cross-validation, supervised vs. unsupervised learning, and the Bayes classifier as an ideal goal. Throughout our discussions we will refer to (and apply in R) the K-nearest-neighbor classifier to illustrate these conceptual points.

Lecture + Lab

  • Conceptual issues (and how ML relates to conventional statistical methods used in social sciences):
    • Prediction and explanation;
    • Supervised and unsupervised learning;
    • Cross-validation and overfitting;
    • Bias/variance trade-off;
    • Hyperparameters and tuning;
    • The Bayes classifier;
  • K-nearest neighbor classifier to illustrate these issues;

 

Day 2 Supervised learning 1: Classification problems

Day 2 will discuss classification problems, some of the most common problems to which ML is applied in the social sciences. It is especially useful, for example, as a substitute for human coding when some raw data is available – think of classifying political parties as left or right, or countries as democratic or not, etc. We will cover some of the most common algorithms in use, such as support vector machines (SVM), and tree-based methods.

Lecture + Lab

  • Support vector machine
  • Logistic regression and boosting
  • Decision trees and tree-based methods
Day 3 Supervised learning 2: Regression problems

On day 3 we move to regression problems – i.e., continuous outcomes. We look again at tree-based models and introduce variable-selection methods. Variable-selection models are regressions in which independent variables with small coefficients (meaning low predictive power) are penalized and effectively removed, retaining only those that substantively help predict or explain the outcome. These models are especially useful when there are many more variables than observations (p > n), a situation very common in text analysis.

Lecture + Lab

  • Tree-based methods
    • Boosting
    • Random forest
  • Variable selection methods
    • Lasso, ridge, and elastic nets
Day 4 Unsupervised learning

On the fourth day we cover unsupervised learning. These are cases where we know the values of our observations on the independent variables but not on the dependent variable, so the model cannot be trained on known outcome values. We will discuss and apply clustering methods, which identify groups of observations in the data given the independent variables, outlier detection, and latent variable models for dimensionality reduction – here, familiarity with Principal Component Analysis and Structural Equation Modeling is an advantage, but not essential.

Lecture + Lab

  • Clustering
    • K-means
    • Hierarchical clustering
  • Dimensionality Reduction
    • Principal Components Analysis

  • Outlier detection

Day 5 Advanced topics

The last day will be dedicated to a brief introduction to more advanced issues in machine learning, and will depend to some extent on students' interests and needs from the course. Topics to be covered, initially, include neural networks and deep learning as extremely efficient predictors, the power of ensembles (training models with several algorithms and then combining their predictions) for more accurate prediction, and how to use machine learning for causal inference in the social sciences.

Lecture + Lab

Depends on the progress made so far, and on specific students’ interests. A non-exhaustive list of possible topics is:

  • Ensemble learning
  • Causal inference
  • Neural networks

  • Deep learning

Day Readings
Day 1

James, G., Witten, D., Hastie, T., Tibshirani, R. 2013. An Introduction to Statistical Learning with Applications in R. New York: Springer, Ch. 2 (p. 15-51).

Day 2

James, G., Witten, D., Hastie, T., Tibshirani, R. 2013. An Introduction to Statistical Learning with Applications in R. New York: Springer, Ch. 8 and 9.

Day 3

James, G., Witten, D., Hastie, T., Tibshirani, R. 2013. An Introduction to Statistical Learning with Applications in R. New York: Springer, p. 214-228.

Day 4

James, G., Witten, D., Hastie, T., Tibshirani, R. 2013. An Introduction to Statistical Learning with Applications in R. New York: Springer, Ch. 10 (373-400).

Conway, D., and White, J. M. 2012. Machine Learning for Hackers. Sebastopol, CA: O'Reilly, Ch. 8 (p. 205-214).

Day 5

Hastie, T., Tibshirani, R., Friedman, J. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2 ed. New York: Springer, Ch. 11, 16.

Software Requirements

Please install R and RStudio on your laptop prior to the first class.

If you have trouble with installation, contact Bruno via email (or Moodle, Facebook, Twitter).

Hardware Requirements

Please bring your own laptop.

Literature

The following are textbooks you can consult along with the main textbook (James et al. 2013), and articles with applications of such models in political science.

Conn, D. and Ramirez, C. M. 2016
Random Forests and Fuzzy Forests in Biomedical Research
In: Alvarez, R. M. (ed) Computational Social Science: Discovery and Prediction
Cambridge: Cambridge University Press, p. 168–97

Conway, D., and White, J. M. 2012
Machine Learning for Hackers
Sebastopol, CA: O’Reilly.

Grimmer, J., and B. M. Stewart. 2013
Text as Data: the Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts
Political Analysis 21(3): 267–97.

Hastie, T., Tibshirani, R., Friedman, J. 2009
The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2 ed
New York: Springer

Lantz, B. 2015
Machine Learning with R, 2 ed
Birmingham: Packt Publishing

Levin, I., Pomares, J., Alvarez, R. M. 2016
Using Machine Learning Algorithms to Detect Election Fraud
In: Alvarez, R. M. (ed) Computational Social Science: Discovery and Prediction
Cambridge: Cambridge University Press, p. 266–94

Samii, C., Paler, L., Daly, S. Z. 2016
Retrospective Causal Inference with Machine Learning Ensembles: An Application to Anti-recidivism Policies in Colombia
Political Analysis, online first, doi:10.1093/pan/mpw019

Witten, I., and Frank, E. 2005
Data Mining: Practical Machine Learning Tools and Techniques, 2 ed
San Francisco: Elsevier

Recommended Courses to Cover Before this One

Summer School

Intro to R
Python
Linear regression
General Linear Models

Winter School

Intro to R
Python
Linear regression
General Linear Models

Recommended Courses to Cover After this One

Summer School

Automated Web Data Collection with R
Quantitative Text Analysis
Causal Inference with Observational Data

Winter School

Automated Web Data Collection with R
Quantitative Text Analysis
Causal Inference with Observational Data


Additional Information

Disclaimer

This course description may be subject to subsequent adaptations (e.g. taking into account new developments in the field, participant demands, group size, etc). Registered participants will be informed in due time.

Note from the Academic Conveners

By registering for this course, you confirm that you possess the knowledge required to follow it. The instructor will not teach these prerequisite items. If in doubt, contact the instructor before registering.