
Advanced Topics in Applied Regression

Course Dates and Times

Monday 5 – Friday 9 August

14:00–15:30 and 16:00–17:30 (ending slightly earlier on Friday)

Martin Mölder

martin.molder@ut.ee

University of Tartu

Linear regression is at the heart of quantitative methods. Take this course to expand your capacity to use this robust workhorse beyond its standard applications, while gaining valuable knowledge of the R statistical programming language.

The course begins with an overview of regression assumptions and what happens when they are violated.

We then move on to understanding interactions between continuous and categorical variables.

Thereafter, we consider what to do when one of the main assumptions of OLS regression – the linearity of relationships – is violated.

The fourth day is devoted to regression and resampling (bootstrap and jackknife) to estimate uncertainty.

The class ends with a focus on models as a whole – we look into how the BIC (Bayesian Information Criterion) can be applied for model selection and how the results from different models can be combined.

Throughout, we will focus on using R to solve these problems, thereby gaining knowledge about this essential statistical tool.

ECTS Credits for this course

2 credits – Be active in class and go through the required materials for each day.

3 credits – Complete and submit short daily exercises for each class, which reflect essential practical knowledge of the topic.

4 credits – As above, plus submit one complete data analysis task in R that addresses some of the issues and uses some of the techniques introduced in this course. Choose an actual data set and a theoretically interesting and relevant problem, perform the analysis, and discuss the potential problems of applying regression and their solutions. Ideally, this should be something you are or have been working on. The submission deadline will be several weeks after the end of the course.

More detailed instructions about the tasks will be provided through the e-learning site for the course.


Instructor Bio

Martin Mölder (PhD in comparative politics) is a researcher at the Johan Skytte Institute of Political Studies at the University of Tartu, Estonia.

His main research focus is political parties, their ideological and political positions, and the functioning of party systems. He also teaches, among other things, quantitative methods.

Martin has an extensive background in the use of R for data management and statistical analysis in the social sciences.

He has taught the following courses at the ECPR Summer School in Methods & Techniques:

  • R Basics 2016 & 2017
  • Intermediate R: Capacities for Analysis and Visualisation 2017, 2018 & 2019
  • Advanced Topics in Applied Regression 2019

  @martinmolder

As an estimation framework, OLS has clear advantages over alternatives when it comes to ease of understanding, speed, elegance and robustness in the face of distributional 'irregularities' in data. At the same time, the instances when OLS can safely be employed in a standard way with social science data are few and far between. This course aims to expand your statistical toolbox by presenting a set of regression tools that can be applied in situations when standard OLS would lead to suboptimal results.


Day 1
We begin with an overview of regression assumptions, such as measurement scales, collinearity, homoscedasticity, autocorrelation and linearity. If you are familiar with regression, you probably know what most of these mean. But what happens if they are violated to some extent? What happens when they are violated extensively? Where should we draw the line? We use R to simulate examples of these assumption violations and, through them, test how each one affects the results of our regressions.
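To give a flavour of what such a simulation looks like, here is a minimal sketch (an illustration only, not taken from the course materials) that generates data violating the homoscedasticity assumption and inspects the consequences for an OLS fit:

    # Simulate data whose error variance grows with x (heteroscedasticity)
    set.seed(42)
    n <- 500
    x <- runif(n, 0, 10)
    y <- 2 + 0.5 * x + rnorm(n, sd = 0.5 * x)

    fit <- lm(y ~ x)
    summary(fit)          # conventional standard errors, here too optimistic
    plot(fit, which = 1)  # residuals fan out against the fitted values

Rerunning the sketch with a constant error standard deviation shows, by contrast, how the residual plot looks when the assumption holds.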

Day 2
The effect of one variable often depends on the level of another, or there may be a certain group structure in the data. We focus on specifying and interpreting interactions, including three-way interactions. We will see how, most often, plotting is the only way to make such interactions interpretable and meaningful, and we explore some of the possibilities R offers here. We also look at how interactions between dummy variables can be used to model groups in data, and we conclude with the fixed-effects model – the simplest, but often inadequate, approach to modelling multilevel structure in data.
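As a taste of the syntax involved, here is a minimal sketch (using the built-in mtcars data rather than the course examples) of a two-way interaction between a continuous and a categorical predictor, together with the kind of plot that makes it readable:

    # Interaction between a continuous (wt) and a categorical (am) predictor
    mtcars$am <- factor(mtcars$am, labels = c("automatic", "manual"))
    fit <- lm(mpg ~ wt * am, data = mtcars)  # slope of wt may differ by am
    summary(fit)

    # Plotting is usually the clearest way to read an interaction
    library(ggplot2)
    ggplot(mtcars, aes(wt, mpg, colour = am)) +
      geom_point() +
      geom_smooth(method = "lm", se = FALSE)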

Day 3
Linear regression, as its name implies, assumes that relationships are linear – a very strong and often erroneous assumption. We look into how this problem can be resolved by transforming variables so that relationships become linear, chiefly through polynomial regression, which in most cases means applying quadratic or cubic functions to regressors. We also briefly look at the usefulness of regression splines for modelling non-linear relationships.
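For orientation, a minimal sketch (on simulated data, not a course example) of fitting the same non-linear relationship with a polynomial and with a natural spline:

    library(splines)

    # Simulate a clearly non-linear relationship
    set.seed(1)
    x <- runif(200, -3, 3)
    y <- sin(x) + rnorm(200, sd = 0.3)

    fit_poly   <- lm(y ~ poly(x, 3))     # cubic polynomial in x
    fit_spline <- lm(y ~ ns(x, df = 4))  # natural spline, 4 degrees of freedom
    AIC(fit_poly, fit_spline)            # quick comparison of the two fits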

Day 4
We look at resampling as a tool for hypothesis testing and the evaluation of bias and uncertainty. While applicable beyond regression, resampling can also be used in this context for estimating the uncertainty of our results when some of the assumptions of regression are not met. We look at bootstrapping – sampling with replacement – and the jackknife – leaving one case out of the analysis at a time – and their implementations in R. The first is especially useful for estimating confidence intervals in a non-parametric manner; the latter can be used to estimate bias in the model and in some cases also provides a simple way to evaluate how much the results of an analysis depend on single cases. For bootstrapping, we consider both case resampling and residual resampling.
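As an indication of how little code this takes in R, here is a minimal sketch (again using the built-in mtcars data) of a case-resampling bootstrap with the boot package, alongside a hand-rolled jackknife:

    library(boot)

    # Bootstrap the slope of wt by resampling cases with replacement
    slope_fn <- function(data, idx) coef(lm(mpg ~ wt, data = data[idx, ]))["wt"]
    set.seed(7)
    bt <- boot(mtcars, slope_fn, R = 2000)
    boot.ci(bt, type = "perc")           # non-parametric percentile interval

    # Jackknife: refit the model leaving out one observation at a time
    n <- nrow(mtcars)
    jack <- sapply(seq_len(n),
                   function(i) coef(lm(mpg ~ wt, data = mtcars[-i, ]))["wt"])
    sd(jack) * (n - 1) / sqrt(n)         # jackknife standard error of the slope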

Day 5
We examine model selection and model averaging, starting from the question of how to assess the overall quality of a model, not just the statistical significance (which is often meaningless) of the coefficients of the variables it includes. Much of the class will focus on the Bayesian Information Criterion (BIC), a common measure of model fit for generalised linear models, among others, and thus also applicable to linear regression. Through its connection to the Bayes factor, it allows us to compare models to each other, given certain assumptions, in terms of their relative probability of being 'true'. By extension, this also allows us to fit several models, all of which are likely to be true, and to combine their results. This can be useful when we do not have a well-defined theory to test (i.e. the assumption of full model specification is violated) and want to give a good, empirically grounded summary of the possible associations in the data.
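A minimal sketch (with arbitrary illustrative models on the built-in mtcars data) of how BIC values can be compared and converted into approximate model weights:

    # Three candidate models of increasing complexity
    m1 <- lm(mpg ~ wt,             data = mtcars)
    m2 <- lm(mpg ~ wt + hp,        data = mtcars)
    m3 <- lm(mpg ~ wt + hp + qsec, data = mtcars)

    bics  <- BIC(m1, m2, m3)$BIC
    delta <- bics - min(bics)            # BIC differences from the best model
    w     <- exp(-delta / 2) / sum(exp(-delta / 2))
    round(w, 3)                          # approximate posterior model weights

Weights of this kind are what Bayesian model averaging then uses to combine estimates across models.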


These regression assumption violations, and the ways of handling them, are tied together through working with R, using it to exemplify and simulate many of the problems we cover. We will use materials prepared in R specifically for this course, which we will go through and discuss together. Whenever possible, we look at how R can be used to present problems and results visually (using ggplot), i.e. to visualise our models and results. The course therefore has a much more hands-on focus than some of the readings might lead you to believe. The equations are only for the connoisseurs, not for the practitioners.

The classes will not have a strict 'theory versus application' divide. We begin each day with a brief discussion of the theoretical/methodological nature of the problem, but for most of the days we will look at how these problems manifest themselves in actual and simulated data, and at how we can use R to overcome such problems and find adequate solutions.

This course addresses the many common problems encountered in linear OLS (Ordinary Least Squares) regression that are often ignored or overlooked.

You should have knowledge of linear regression. If you are reasonably familiar with running OLS, interpreting coefficients and standard errors, assessing model fit, and diagnosing problems with the estimation procedure, then you are more than ready for this course.  

You also need basic knowledge of the R statistical environment (R as a language and the use of an IDE like RStudio). This refers to common procedures such as reading in data, cleaning and recoding variables, as well as running a regression in R.
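Concretely, you should be comfortable with steps of roughly this kind (the file and variable names below are placeholders, not course data):

    # Read in a data set, recode a variable and fit a simple OLS model
    dat <- read.csv("mydata.csv")
    dat$female <- factor(dat$female, levels = c(0, 1),
                         labels = c("male", "female"))
    dat <- na.omit(dat[, c("income", "education", "female")])
    fit <- lm(income ~ education + female, data = dat)
    summary(fit)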

Day Topic Details

Day 1 – OLS regression assumptions: theory and reality
  • Review of regression assumptions and diagnostic procedures.
  • Simulations of assumption violations.

Day 2 – Interactions and fixed effects
  • Two-way and three-way interactions of continuous and categorical variables.
  • Interpretation of interactions.
  • Plotting interactions.
  • Fixed-effects models.

Day 3 – Non-linear regression: polynomials and splines
  • Types of non-linearity.
  • Data transformations as a solution to non-linearity.
  • Polynomials as a solution to non-linearity.
  • Regression splines.

Day 4 – Regression and resampling: bootstrap and jackknife
  • Resampling and statistical inference.
  • Bootstrapping: case and residual resampling.
  • Jackknife resampling.

Day 5 – Model selection and averaging
  • Evaluating model fit.
  • The logic and the usefulness of the Bayesian Information Criterion.
  • Comparing models through BIC.
  • Bayesian model averaging.

Day Readings

Day 1
Fox, J. (1991). Regression Diagnostics: An Introduction. Sage Publications.

Day 2
Brambor, T., Clark, W. R., & Golder, M. (2006). Understanding Interaction Models: Improving Empirical Analyses. Political Analysis, 14(1), 63–82.

Hardy, M. A. (1993). Regression with Dummy Variables. Sage Publications.

Day 3
Fox, J. (2016). Applied Regression Analysis and Generalized Linear Models (3rd ed.). Sage Publications. Chapter 17, Non-Linear Regression.

Day 4
Fox (2016), Chapter 21, Bootstrapping Regression Models.

Day 5
Fox (2016), Chapter 22, Model Selection, Averaging, and Validation.

Software Requirements

R and RStudio.

Hardware Requirements

Please bring your own laptop. No special hardware requirements. Any laptop bought after 2010 should be able to perform the tasks that are part of this class.

Literature

The following is a non-comprehensive list of additional literature, where you can find guidance on the topics covered in class.

Some topics addressed in this literature will also be discussed in class, but in order to keep the day-to-day reading load manageable, they are not part of the required readings.

Allison, P. D. (2009). Fixed Effects Regression Models (Vol. 160). Sage Publications.

Belsley, D. A., Kuh, E., & Welsch, R. E. (2004). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. New York: Wiley.

Berry, W. D. (1993). Understanding Regression Assumptions. Quantitative Applications in the Social Sciences. Thousand Oaks, CA: Sage Publications.

Braumoeller, B. F. (2004). Hypothesis Testing and Multiplicative Interaction Terms. International Organization, 58(4), 807–820.

Carmines, E. G., & Zeller, R. A. (1979). Reliability and Validity Assessment (Vol. 17). Sage Publications.

Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences. 3rd ed. Mahwah, NJ: Lawrence Erlbaum Associates.

Hao, L., & Naiman, D. Q. (2007). Quantile Regression. London: Sage Publications.

Jaccard, J., & Turrisi, R. (2003). Interaction Effects in Multiple Regression (2nd ed.). London: Sage Publications.

Kaufman, R. L. (2013). Heteroskedasticity in regression: Detection and correction (Vol. 172). Sage Publications.

Marsh, L. C., & Cormier, D. R. (2001). Spline regression models (Vol. 137). Sage.

Mooney, C. Z., & Duval, R. D. (1993). Bootstrapping: A nonparametric approach to statistical inference (No. 94-95). Sage.

Motulsky, H. J., & Ransnas, L. A. (1987). Fitting curves to data using nonlinear regression: a practical and nonmathematical review. The FASEB Journal, 1(5), 365–374.

Ritz, C., & Streibig, J. C. (2008). Nonlinear Regression with R. New York: Springer.              

Ryan, T. P. (2008). Modern Regression Methods (2nd ed.). Hoboken, NJ: Wiley.

Sheather, S. J. (2009). A Modern Approach to Regression with R. New York: Springer.

Weisberg, S. (2005). Applied Linear Regression (3rd ed.). Hoboken, NJ: Wiley-Interscience.

Recommended Courses to Cover Before this One

Summer School

  • Introduction to the use of R
  • Introduction to regression analysis

Winter School

  • Introduction to the use of R

Recommended Courses to Cover After this One

Summer School

  • Intro to GLM: Binary, Ordered and Multinomial Logistic, and Count Regression Models
  • Applied Multilevel Regression Modeling

Winter School

  • Applied Multilevel Regression Modelling
  • Handling Missing Data
  • Interpreting Binary Logistic Regression Models