ECPR


Gentle Introduction to Machine Learning for the Social Sciences

Course Dates and Times

Monday 7 - Friday 11 August

09:00 - 12:30

Please see Timetable for full details.

Bruno Castanho Silva

b.paula.castanho.e.silva@fu-berlin.de

Freie Universität Berlin

Machine learning is an analytical approach in which statistical models “learn” from data to make accurate predictions and decisions. From customer-recommendation systems (think of Netflix suggesting which movies you should watch) to policy design and implementation, machine learning algorithms are becoming ubiquitous in a big-data world. Their potential, however, is only starting to be explored in the social sciences, and still only in a few specific areas. In this course students will learn the fundamentals of machine learning as a data analysis approach and get an overview of the most common and versatile classes of ML algorithms in use today. By the end, students will be able to identify which kind of technique is most suitable for their question and data, and to design, test, and interpret their models. They will also be equipped with enough basic knowledge to proceed independently to more advanced algorithms and problems. This is an introductory course, so mathematical and programming technicalities will be kept to a minimum. If you can run and interpret multivariate regressions in R, you can (and should!) take this course.


Instructor Bio

Bruno Castanho is a postdoctoral researcher at the Cologne Center for Comparative Politics, University of Cologne.

He holds a PhD in political science from Central European University and has experience teaching various topics in research design and methodology, including causal inference, machine learning, and structural equation modelling. 

Bruno has written a textbook on Multilevel Structural Equation Modelling in collaboration with fellow Winter School instructors.

  @b_castanho

Machine learning in the social sciences is still in its infancy. It has recently been used in specific areas such as policy evaluation, electoral fraud detection, election forecasting, and quantitative text and social media analysis, but it is still not widely known or understood, even by many skilled quantitatively oriented social scientists. Given its rise in related disciplines (such as economics, as The Economist reports here: http://econ.st/2g8g8Nk), machine learning can be expected to become a major trend in quantitative social science in the coming years. This course offers the basic toolkit students need to understand, evaluate, and build their own machine learning models for research and applied problems. ML algorithms will be introduced conceptually; we will not go over the mathematics behind their workings. All are available as off-the-shelf packages in R, so students will not be required to code them from scratch, only to use ready-made R functions.

The first day is a general introduction to the logic of machine learning, situating it relative to the statistical methods students already know. We will discuss the ML emphasis on prediction (and how this is simply the other side of the explanation coin) and on substantive effects over statistically significant ones; the problem of overfitting and potential solutions to it; and the bias/variance trade-off. We will also cover the building blocks of machine learning models and algorithms: the role of hyperparameters, tuning, cross-validation, supervised vs. unsupervised learning, and the Bayes classifier as an ideal goal. Throughout our discussions we will refer to (and apply in R) the K-nearest-neighbor classifier to illustrate these conceptual points.
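To give a flavor of the lab sessions, here is a minimal sketch of the K-nearest-neighbor classifier in R. The choice of the built-in iris data, the train/test split, and k = 5 are illustrative assumptions, not course materials; the `class` package ships with standard R installations.

```r
# K-nearest-neighbor classification with a held-out test set,
# illustrating out-of-sample evaluation and the hyperparameter k
library(class)

set.seed(42)                          # reproducible train/test split
idx   <- sample(nrow(iris), 100)      # 100 flowers for training
train <- iris[idx, 1:4]               # the four numeric predictors
test  <- iris[-idx, 1:4]              # 50 held-out flowers

# Predict species for unseen observations; k is tuned in practice
pred <- knn(train, test, cl = iris$Species[idx], k = 5)

# Out-of-sample accuracy: share of correct predictions on new data
accuracy <- mean(pred == iris$Species[-idx])
accuracy
```

Changing `k` and watching the held-out accuracy move is exactly the tuning/cross-validation logic discussed above, in miniature.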

Days 2 and 3 deal with supervised learning: the class of problems in which we know the value of the dependent variable for at least part of our data, so we can train models and evaluate their predictions against actually observed values. Day 2 discusses classification problems, among the most common applications of ML. Classification is especially useful as a substitute for human coding when some raw data are available – think of classifying political parties as left or right, or countries as democratic or not. We will cover some of the most common algorithms in use, such as linear discriminant analysis (LDA), support vector machines (SVM), and logistic regression with boosting.
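As a sketch of the Day 2 workflow, the snippet below fits an LDA classifier with the MASS package (shipped with R) and checks its held-out accuracy. The iris data and split size are illustrative assumptions, not taken from the course.

```r
# Linear discriminant analysis: train on part of the data,
# then classify the held-out observations
library(MASS)

set.seed(1)
idx <- sample(nrow(iris), 100)                 # training rows
fit <- lda(Species ~ ., data = iris[idx, ])    # fit LDA on 100 flowers

# Predict the class of the 50 held-out flowers
pred <- predict(fit, iris[-idx, ])$class
acc  <- mean(pred == iris$Species[-idx])       # out-of-sample accuracy
acc
```

The same train/predict/evaluate pattern carries over to SVMs and boosted logistic regression; only the fitting function changes.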

On Day 3 we move to classification and regression problems using two other very common and versatile classes of algorithms: tree-based methods and variable selection methods. Tree-based algorithms (decision trees, random forests, and bagging) have often been used in diagnosis and decision making, but they are also excellent for social scientists because they yield interpretable explanatory models: they let users identify which variables matter most for predicting the outcome (substantive importance, not statistical significance), which means they can be applied to any problem where linear regression could be used. Variable selection models are regression models in which independent variables with small coefficients (that is, low predictive power) are penalized and effectively removed, retaining only those that substantively help predict or explain the outcome.
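The interpretability of tree-based models can be sketched with rpart (shipped with R). The mtcars example and the `minsplit` setting are our own illustrative choices, not course materials.

```r
# A regression tree predicting fuel consumption (mpg) from car
# characteristics; the fitted tree is a readable set of if/else splits
library(rpart)

fit <- rpart(mpg ~ ., data = mtcars,
             control = rpart.control(minsplit = 10))

# Print the splits the tree chose
print(fit)

# variable.importance ranks predictors by how much they improve
# predictions -- substantive importance, not statistical significance
fit$variable.importance
```

This ranking is what makes trees attractive as explanatory models: it answers "which variables matter most?" directly from the fitted object.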

The fourth day covers unsupervised learning. These are cases where we know the characteristics of our observations on the independent variables but not on the dependent variable, so it is not possible to train the model on known outcome values. We will discuss and apply clustering methods, which identify groups of observations in the data given the independent variables, along with outlier detection and latent variable models from an ML perspective – here, familiarity with Principal Component Analysis and Structural Equation Modeling is an advantage, but not a requirement.
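A minimal unsupervised sketch in base R, again using iris as an illustrative dataset: k-means clustering and PCA applied to the four numeric measurements, with the species labels deliberately ignored.

```r
# Unsupervised learning: no outcome variable is used in fitting
x <- scale(iris[, 1:4])        # standardize so no variable dominates

# K-means: look for 3 clusters; nstart restarts avoid poor local optima
set.seed(7)
km <- kmeans(x, centers = 3, nstart = 25)
table(km$cluster, iris$Species)   # compare clusters to the true species

# Principal Component Analysis: the first two components capture
# most of the variation in the four measurements
pc <- prcomp(x)
summary(pc)
```

Note that choosing the number of clusters (here 3) is itself a tuning decision; with real data it is not handed to you.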

The last day is dedicated to a brief introduction to more advanced issues in machine learning, and will depend to some extent on students’ interests and needs. Topics initially planned include how these models are applied when there are many more variables than observations (p > N, common in text analysis), neural networks and deep learning as extremely efficient predictors, and the power of ensembles (training several algorithms and combining their predictions) for more accurate results.
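The ensemble idea can be illustrated in a few lines: fit two different models and average their predictions. The mtcars data, the equal weighting, and the particular pair of models are illustrative assumptions, not the course's own example.

```r
# A toy ensemble: average the predictions of a linear model and a
# regression tree on held-out data
library(rpart)

set.seed(3)
idx   <- sample(nrow(mtcars), 22)            # simple train/test split
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]

m1 <- lm(mpg ~ wt + hp, data = train)        # linear regression
m2 <- rpart(mpg ~ ., data = train,
            control = rpart.control(minsplit = 5))   # regression tree

p1  <- predict(m1, test)
p2  <- predict(m2, test)
ens <- (p1 + p2) / 2                         # equal-weight ensemble

# Root-mean-squared error of each model and of the ensemble
rmse <- function(p) sqrt(mean((test$mpg - p)^2))
c(lm = rmse(p1), tree = rmse(p2), ensemble = rmse(ens))
```

Real ensemble methods weight and combine many learners more cleverly, but the core logic is this: pooled predictions often beat any single model.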

By the end, students will be able to identify which kind of machine learning method best suits the data and question they have at hand (whether it is a classification or regression problem, supervised or unsupervised, etc.), and to fit and interpret the models covered in class. Students who want full credits will have to submit daily take-home exercises.

This is an introductory course for social science students with no prior knowledge of machine learning or algorithms in general. Those who are already familiar with these methods, or who have experience in computational science, programming, or software development, are likely to find the classes too basic.

This course is taught from a beginner’s perspective on statistical learning. Students are only expected to be familiar with the concepts and practice of traditional multivariate regression analysis. All examples will be given in R, so students should also have a minimal working knowledge of it (advanced knowledge of another programming language, such as Python, can substitute if you adapt the code and exercises on your own).

Day Topic Details
Monday General introduction to ML

Lecture + Lab

  • Conceptual issues (and how ML relates to conventional statistical methods used in social sciences):
    • Prediction and explanation;
    • Supervised and unsupervised learning;
    • Cross-validation and overfitting;
    • Bias/variance trade-off;
    • Hyperparameters and tuning;
    • The Bayes classifier;
  • K-nearest neighbor classifier to illustrate these issues;
Tuesday Classification problems

Lecture + Lab

  • Linear Discriminant Analysis
  • Support vector machine
  • Logistic regression and boosting
  • Introduction to decision trees
Wednesday Classification and regression problems

Lecture + Lab

  • Tree-based methods
    • Regression trees
    • Fast-and-frugal trees
    • Bagging
    • Random forests
  • Variable selection methods
    • Lasso, ridge, and elastic nets
Thursday Unsupervised learning

Lecture + Lab

  • Clustering
    • K-means
    • Hierarchical clustering
  • Dimensionality Reduction
    • Principal Components Analysis
  • Outlier detection
Friday Advanced topics

Lecture + Lab

Coverage depends on the progress made so far and on students’ specific interests. A non-exhaustive list of possible topics:

  • Ensemble learning
  • Applications in text analysis (and other p > N problems)
  • Neural networks and perceptrons
  • Deep learning
Day Readings
Monday

James, G., Witten, D., Hastie, T., Tibshirani, R. 2013. An Introduction to Statistical Learning with Applications in R. New York: Springer, Ch. 2 (p. 15-51).

Domingos, P. 2012. A Few Useful Things to Know about Machine Learning. Communications ACM 55(10): 78-87.

Tuesday

James, G., Witten, D., Hastie, T., Tibshirani, R. 2013. An Introduction to Statistical Learning with Applications in R. New York: Springer, Ch. 4 (p. 127-153) and Ch. 9 (p. 337-355).

Wednesday

James, G., Witten, D., Hastie, T., Tibshirani, R. 2013. An Introduction to Statistical Learning with Applications in R. New York: Springer, p. 214-228 and Ch. 8 (p. 303-323)

Thursday

James, G., Witten, D., Hastie, T., Tibshirani, R. 2013. An Introduction to Statistical Learning with Applications in R. New York: Springer, Ch. 10 (p. 373-400).

Conway, D., and White, J. M. 2012. Machine Learning for Hackers. Sebastopol, CA: O’Reilly, Ch. 8 (p. 205-214).

Friday

Hastie, T., Tibshirani, R., Friedman, J. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2 ed. New York: Springer, Ch. 11, 16, and 18

Software Requirements

Students are encouraged to use only open-source software throughout the course. All examples and tutorials will be provided in R, but those familiar with other programming languages are welcome to use them instead.

R 3.2.3 (or higher)

RStudio 1.0.44 (or higher)

Hardware Requirements

All students should bring their own laptops to class.

Literature

The following are textbooks on machine learning that students can consult alongside the main textbook used in class (James et al. 2013), together with articles applying such models in political science.

Conn, D., and Ramirez., C. M. 2016. Random Forests and Fuzzy Forests in Biomedical Research. In: Alvarez, R. M. (ed.). Computational Social Science: Discovery and Prediction. Cambridge: Cambridge University Press, p. 168-97.

Conway, D., and White, J. M. 2012. Machine Learning for Hackers. Sebastopol, CA: O’Reilly.

Grimmer, J., and B. M. Stewart. 2013. “Text as Data: the Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis 21(3): 267–97.

Hastie, T., Tibshirani, R., Friedman, J. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2 ed. New York: Springer.

Lantz, B. 2015. Machine Learning with R, 2 ed. Birmingham: Packt Publishing.

Levin, I., Pomares, J., Alvarez, R. M. 2016. Using Machine Learning Algorithms to Detect Election Fraud. In: Alvarez, R. M. (ed.). Computational Social Science: Discovery and Prediction. Cambridge: Cambridge University Press, p. 266-94.

Samii, C., Paler, L., Daly, S. Z. 2016. Retrospective Causal Inference with Machine Learning Ensembles: An Application to Anti-recidivism Policies in Colombia. Political Analysis, online first, doi:10.1093/pan/mpw019.

Witten, I., and Frank, E. 2005. Data Mining: Practical Machine Learning Tools and Techniques, 2 ed. San Francisco: Elsevier.

Recommended Courses to Cover Before this One

Summer School

  • Introduction to the Use of R
  • Python Programming for Social Scientists: Web Data, Scraping and Other Useful Programming Tricks

Winter School

  • Linear Regression with R/Stata: Estimation, Interpretation and Presentation

Recommended Courses to Cover After this One

Winter School

  • Automated Web Data Collection with R
  • Introduction to Quantitative Text Analysis
  • Methods of Modern Causal Analysis Based on Observational Data