Monday 25 February – Friday 1 March, 14:00 – 17:30 (finishing slightly earlier on Friday)
15 hours over five days
Machine Learning is an analytical approach in which users can build statistical models that 'learn' from data to make accurate predictions and decisions.
From customer-recommendation systems (think of Netflix suggesting what movies you should watch) to policy design and implementation, machine learning algorithms are becoming ubiquitous in a big data world.
Their potential, however, is only starting to be explored in the social sciences, and in few and specific areas.
In this course, you will learn the fundamentals of machine learning as a data analysis approach, and will get an overview of the most common and versatile classes of ML algorithms in use today.
By the end of the course, you will be able to identify what kind of technique is most suitable for your research question and data, and how to design, test and interpret your models. You will also be equipped with sufficient basic knowledge to proceed independently for more advanced algorithms and problems.
This is an introductory course, so math and programming technicalities will be kept to a minimum. If you can run and interpret multivariate regressions in R, you can (and should!) take this course.
Tasks for ECTS Credits
2 credits (pass/fail grade): attend at least 90% of course hours, participate fully in in-class activities, and carry out the necessary reading and/or other work prior to, and after, class.
3 credits (graded): as above, plus complete one task (tbc).
4 credits (graded): as above, plus complete two tasks (tbc).
Bruno Castanho is a postdoctoral researcher at the Cologne Center for Comparative Politics, University of Cologne.
He holds a PhD in political science from Central European University and has experience teaching various topics in research design and methodology, including causal inference, machine learning, and structural equation modelling.
Bruno has written a textbook on Multilevel Structural Equation Modelling in collaboration with fellow Winter School instructors.
Machine learning in the social sciences is still in its infancy. It has recently been used in areas such as policy evaluation, electoral fraud detection, election forecasting, and quantitative text and social media analysis, but it is still not widely known or understood – even by many skilled, quantitatively oriented social scientists.
Given its rise in related disciplines (such as economics, as The Economist reports), we should expect machine learning to become a new trend in quantitative social science analysis in the coming years.
This course offers the basic toolkit to enable you to understand, evaluate, and build your own machine learning models for research and applied problems. I will introduce ML algorithms conceptually; we will not go deeply into the maths behind their workings. All are available as off-the-shelf packages in R, so you will not be required to code them from scratch, but only to use ready-made R functions.
Day 1
A general introduction to the logic of machine learning. You will learn how ML relates to more familiar statistical methods. We will discuss the ML emphasis on prediction (and how this is simply the other side of the explanation coin) and on substantive effects over statistically significant ones; the problem of overfitting and its potential solutions; and the bias/variance trade-off. We will also cover the building blocks of machine learning models and algorithms: the role of hyperparameters, tuning, cross-validation, supervised vs. unsupervised learning, and the Bayes classifier as an ideal goal. Throughout, we will refer to (and apply in R) the K-nearest-neighbour classifier to illustrate these conceptual points.
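To make these ideas concrete, here is a minimal sketch (a hypothetical illustration, not course material) of the K-nearest-neighbour workflow, using R's built-in iris data and the `class` package that ships with standard R installations:

```r
# K-nearest-neighbour classification on the built-in iris data.
# Hold out a test set, fit KNN on the rest, and measure predictive accuracy.
library(class)  # recommended package, ships with standard R installations

set.seed(42)
test_idx <- sample(nrow(iris), 50)          # hold out 50 of 150 observations
train_x  <- iris[-test_idx, 1:4]            # the four flower measurements
test_x   <- iris[test_idx, 1:4]
train_y  <- iris$Species[-test_idx]
test_y   <- iris$Species[test_idx]

# k (the number of neighbours) is a hyperparameter, normally tuned by cross-validation
pred     <- knn(train_x, test_x, cl = train_y, k = 5)
accuracy <- mean(pred == test_y)
```

Varying `k` and comparing held-out accuracy gives a first taste of hyperparameter tuning; in class this would be done with proper cross-validation.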
The following two days deal with supervised learning, the class of problems in which we know the value of the dependent variable for at least part of our data, meaning we can train models and evaluate the performance of their prediction against actually observed values.
Day 2
We discuss classification problems – those in which the dependent variable is categorical – among the most common problems to which ML is applied. It is especially useful, for example, as a substitute for human coding when some raw data are available – think of classifying political parties as left or right, or countries as democratic or not.
We will cover some of the most common algorithms in use, such as support vector machines (SVM) and tree-based methods (random forests and boosting). Tree-based algorithms have often been used in diagnosis and effective decision making, but are also excellent for social scientists because they give interpretable explanatory models. They allow users to identify the most important variables for predicting the outcome (substantive importance, not statistical significance), which means they can be applied to any problem in which linear regression could be used.
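To give a flavour of the tree-based approach, the sketch below (a hypothetical illustration on the built-in iris data, not course code) grows a single classification tree with the `rpart` package, which ships with R; random forests and boosting build on many such trees:

```r
# Grow one classification tree and inspect which variables matter most.
library(rpart)  # recommended package, ships with R

fit  <- rpart(Species ~ ., data = iris, method = "class")
pred <- predict(fit, iris, type = "class")     # in-sample predictions
acc  <- mean(pred == iris$Species)

# Variable importance: the substantive contribution of each predictor to the
# splits -- not a statistical-significance measure
imp <- fit$variable.importance
```

The importance scores illustrate the point above: trees rank predictors by how much they help predict the outcome, which is what makes them readable as explanatory models.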
Day 3
We move to regression problems – i.e. continuous outcomes. We look again at tree-based models and introduce variable-selection methods. Variable-selection models are regressions in which independent variables with small coefficients (that is, low predictive power) are penalised and effectively removed, retaining only those that substantively help predict/explain the outcome. These models are especially useful when there are many more variables than observations (p > N), which is very common in text analysis.
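A minimal sketch of a regression tree on the built-in mtcars data (a hypothetical illustration, not course code; the lasso itself needs the CRAN package glmnet, so it is only mentioned in a comment):

```r
# A regression tree: the same tree logic, but predicting a continuous outcome.
library(rpart)  # recommended package, ships with R

fit  <- rpart(mpg ~ ., data = mtcars, method = "anova")
pred <- predict(fit, mtcars)
rmse <- sqrt(mean((pred - mtcars$mpg)^2))   # should beat the no-model baseline

# For penalised variable selection, the CRAN package glmnet fits the lasso:
# glmnet::glmnet(x, y, alpha = 1)   # alpha = 1 selects the lasso penalty
```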
Day 4
We cover unsupervised learning. These are cases where we know the values of our observations on the independent variables but not on the dependent variable, so it is not possible to train the model on known outcome values. We will discuss and apply clustering methods, which identify groups of observations given the independent variables, as well as outlier detection and latent-variable models for dimensionality reduction. Familiarity with Principal Component Analysis and Structural Equation Modelling is an advantage, but not essential.
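As a taste of unsupervised learning, here is a minimal k-means sketch in base R (a hypothetical illustration using the built-in iris data, with the species labels held back from the algorithm):

```r
# k-means clustering: find 3 groups using only the four measurements,
# then compare the recovered clusters with the (withheld) species labels.
set.seed(1)
X  <- scale(iris[, 1:4])        # standardise so no variable dominates the distances
km <- kmeans(X, centers = 3, nstart = 25)

table(km$cluster, iris$Species) # cross-tabulate clusters against species
```

Note that the number of clusters (`centers`) is itself a hyperparameter: unlike in supervised learning, there is no observed outcome to validate against, which is exactly the difficulty discussed above.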
Day 5
A brief introduction to more advanced issues in machine learning, which will depend to some extent on students’ interests and needs. Topics we will initially cover include neural networks and deep learning as extremely efficient predictors, the power of ensembles (training models with several algorithms and then combining their predictions), and how to use machine learning for causal inference in the social sciences.
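To illustrate the ensemble idea, here is a toy majority-vote ensemble (a hypothetical sketch, not course code) combining two KNN classifiers and a classification tree on the built-in iris data:

```r
# Majority-vote ensemble: three simple learners each vote on every test case.
library(class)   # for knn(); ships with R
library(rpart)   # for classification trees; ships with R

set.seed(7)
test_idx <- sample(nrow(iris), 50)
train <- iris[-test_idx, ]
test  <- iris[test_idx, ]

votes <- data.frame(
  knn1 = as.character(knn(train[, 1:4], test[, 1:4], cl = train$Species, k = 1)),
  knn9 = as.character(knn(train[, 1:4], test[, 1:4], cl = train$Species, k = 9)),
  tree = as.character(predict(rpart(Species ~ ., data = train), test, type = "class"))
)

# Each row holds one test case's three votes; take the most frequent label
pred_ens <- apply(votes, 1, function(v) names(which.max(table(v))))
acc_ens  <- mean(pred_ens == test$Species)
```

Combining several weak or diverse learners in this way often yields more stable predictions than any single model, which is the intuition behind the ensemble methods covered on Day 5.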
By the end of this course, you will be able to identify which kind of machine learning method best suits your data and research question (whether it is a classification or regression problem, supervised or unsupervised, and so on), and to fit and interpret the models covered in class. Students who want full credits must submit daily take-home exercises.
This is an introductory course for social science students with no prior knowledge of machine learning or algorithms at large. Those who are familiar with and have used these methods already, or who have experience with computational sciences, programming, and software development, are likely to get frustrated with the basic level of teaching.
This is a beginner's course. I expect you to be familiar only with concepts and practices of traditional multivariate regression analysis.
All examples will be given in R, so you should also have a minimal working knowledge of this software (advanced knowledge of another programming language, such as Python, may substitute, if you are willing to adapt the code and exercises on your own).
By registering for this course, you confirm that you possess the knowledge required to follow it. The instructor will not teach these prerequisite items. If in doubt, contact the instructor before registering.
Each course includes pre-course assignments, including readings and pre-recorded videos, as well as daily live lectures totalling at least three hours. The instructor will conduct live Q&A sessions and offer designated office hours for one-to-one consultations.
Please check your course format before registering.
Live classes will be held daily for three hours on a video meeting platform, allowing you to interact with both the instructor and other participants in real-time. To avoid online fatigue, the course employs a pedagogy that includes small-group work, short and focused tasks, as well as troubleshooting exercises that utilise a variety of online applications to facilitate collaboration and engagement with the course content.
In-person courses will consist of daily three-hour classroom sessions, featuring a range of interactive in-class activities including short lectures, peer feedback, group exercises, and presentations.
This course description may be subject to subsequent adaptations (e.g. taking into account new developments in the field, participant demands, group size, etc.). Registered participants will be informed at the time of change.
Day | Topic | Format
---|---|---
Day 1 | General introduction to machine learning | Lecture + lab
Day 2 | Supervised learning 1: classification problems | Lecture + lab
Day 3 | Supervised learning 2: regression problems | Lecture + lab
Day 4 | Unsupervised learning: clustering, outlier detection, dimensionality reduction | Lecture + lab
Day 5 | Advanced topics (depending on progress and students’ interests): neural networks, deep learning, ensembles, ML for causal inference | Lecture + lab
Day | Readings
---|---
Day 1 | James, G., Witten, D., Hastie, T., and Tibshirani, R. 2013. An Introduction to Statistical Learning with Applications in R. New York: Springer, Ch. 2 (pp. 15–51).
Day 2 | James et al. 2013, Ch. 8 and 9.
Day 3 | James et al. 2013, pp. 214–228.
Day 4 | James et al. 2013, Ch. 10 (pp. 373–400); Conway, D., and White, J. M. 2012. Machine Learning for Hackers. Sebastopol, CA: O’Reilly, Ch. 8 (pp. 205–214).
Day 5 | Hastie, T., Tibshirani, R., and Friedman, J. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. New York: Springer, Ch. 11 and 16.
Please install R and RStudio on your laptop prior to the first class.
If you have trouble with installation, contact Bruno via email (or Moodle, Facebook, Twitter).
Please bring your own laptop.
The following are textbooks you can consult alongside the main textbook (James et al. 2013), together with articles applying such models in political science.

Conn, D., and Ramirez, C. M. 2016. Random Forests and Fuzzy Forests in Biomedical Research. In: Alvarez, R. M. (ed.) Computational Social Science: Discovery and Prediction. Cambridge: Cambridge University Press, pp. 168–97.

Conway, D., and White, J. M. 2012. Machine Learning for Hackers. Sebastopol, CA: O’Reilly.

Grimmer, J., and Stewart, B. M. 2013. Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts. Political Analysis 21(3): 267–97.

Hastie, T., Tibshirani, R., and Friedman, J. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. New York: Springer.

Lantz, B. 2015. Machine Learning with R. 2nd ed. Birmingham: Packt Publishing.

Levin, I., Pomares, J., and Alvarez, R. M. 2016. Using Machine Learning Algorithms to Detect Election Fraud. In: Alvarez, R. M. (ed.) Computational Social Science: Discovery and Prediction. Cambridge: Cambridge University Press, pp. 266–94.

Samii, C., Paler, L., and Daly, S. Z. 2016. Retrospective Causal Inference with Machine Learning Ensembles: An Application to Anti-recidivism Policies in Colombia. Political Analysis, online first, doi:10.1093/pan/mpw019.

Witten, I., and Frank, E. 2005. Data Mining: Practical Machine Learning Tools and Techniques. 2nd ed. San Francisco: Elsevier.
Summer School
Intro to R
Python
Linear regression
General Linear Models
Winter School
Intro to R
Python
Linear regression
General Linear Models
Summer School
Automated Web Data Collection with R
Quantitative Text Analysis
Causal Inference with Observational Data
Winter School
Automated Web Data Collection with R
Quantitative Text Analysis
Causal Inference with Observational Data