Advanced Topics in Applied Regression

Course Dates and Times

Monday 31 July - Friday 4 August

09:00-12:30

Please see the Timetable for full details.

Constantin Manuel Bosancianu

manuel.bosancianu@outlook.com

Central European University

The course provides a set of tools that can be employed when standard OLS estimation does not produce adequate estimates. Weighted Least Squares and cluster-corrected standard errors are offered as solutions to heteroskedasticity, which can severely impact estimates of uncertainty in standard OLS. Interactions can help in instances when we have reason to suspect that effects vary across subgroups in the population. Nonlinear regression can handle situations where the relationship between two variables has a more complex form than a simple straight line. Finally, certain types of robust regression produce proper estimates even in the presence of outliers. All these topics are discussed both theoretically, in the lectures, and practically, with the use of actual social science data sets and the R statistical environment. The topics represent a middle ground between the standard linear regression framework and more advanced GLM procedures or the multilevel modeling framework.


Instructor Bio

Constantin Manuel Bosancianu is a postdoctoral researcher in the Institutions and Political Inequality unit at Wissenschaftszentrum Berlin.

His work focuses on the intersection of political economy and electoral behaviour: how to measure political inequalities between citizens of developed and developing countries, and what the linkages between political and economic inequalities are.

He is interested in statistics, data visualisation, and the history of Leftist parties. Occasionally, he teaches methods workshops on regression, multilevel modelling, or R.

@cmbosancianu

As an estimation framework, OLS (Ordinary Least Squares) has clear advantages over alternatives when it comes to ease of understanding, speed, elegance and robustness in the face of distributional “irregularities” in data. At the same time, the instances when OLS can safely be employed in a standard way with social science data are rather few and far between. This class aims to expand the statistical toolbox of participants by presenting a set of regression tools that can be applied in situations when standard OLS would lead to suboptimal results.

We start with in-depth coverage of OLS assumptions, emphasizing in particular the need for continuous (or dichotomous) measures in our regressions, a normal distribution of residuals, homoskedasticity, and linear relationships. We discuss, in turn, how OLS estimates of effect and uncertainty are impacted by violations of these assumptions, and what tools we have available in R to diagnose these problems. I make the point that these assumptions are frequently not met in applied analyses, leading to biased estimates and, therefore, shaky conclusions. The topics that follow in the course represent a few of the modeling strategies available to researchers when such assumptions are clearly invalid. While more complex than OLS, they are nevertheless still part of the “least squares” framework, and somewhat easier to apply than more complex procedures, such as multilevel modeling.

We first discuss heteroskedasticity: what its implications are for estimates, how it can be detected in the course of a standard analysis, and how commonly it appears as a problem. To address this issue, I present two potential solutions. The first is cluster-corrected standard errors, which can remedy estimates of uncertainty if the underlying problem is generated by clustering in the data. Cluster-corrected SEs remain a very popular approach across a variety of disciplines, which is why they are covered in depth here. The second, more general, solution is Weighted Least Squares (WLS). Both subtopics are discussed from a theoretical perspective, as well as in a practical setting, in the laboratory.

The third day of the course is taken up by the issue of effect heterogeneity across different subpopulations in the sample. In practice, this involves an in-depth discussion of interactions in linear models. We cover two-way and three-way interactions, for both continuous and dichotomous predictors, as well as how to present marginal effects graphically. As we will see, interactions are a frequent source of confusion in published work and continue to be misinterpreted. In the final part of the day, we discuss how the interaction framework can be applied when the subpopulations are samples from different countries, through the use of fixed effects. As in the previous days, the theoretical coverage is followed by applied lab work, using R and empirical data.

We continue, in the fourth session, with how to model non-linear relationships between variables. Rather than using statistical transformations of the predictors, we rely in this section on polynomials and regression splines. I show how these tools allow the researcher to model increasingly complex relationships between predictors and outcomes. We conclude the section with a presentation of nonlinear least squares as an estimation method.

The final session is taken up by the topic of robust regression. Unlike the second day, though, we refer here to regression estimates robust to data outliers, rather than to violations of homoskedasticity. Outliers can severely impact both the magnitude of the regression coefficients and the standard errors produced. We cover three types of procedures in this session: M estimation, bounded-influence regression, and quantile regression. We take up this discussion again in the laboratory, with practical examples of the differences between standard OLS estimation and robust regression in the presence of outliers.

By the end of the course participants should be able to recognize the situations where OLS does not produce adequate estimates, and to identify the specific cause(s) for this breakdown. After this diagnosis, they should be able to either re-estimate using a more appropriate model specification, or to apply the needed corrections to the initial estimates. Finally, they should feel capable of interpreting the estimates from the revised models, and of summarizing the procedures they have implemented in a concise way.

The topic of how to go beyond the standard assumptions of OLS is much broader than what is covered here. Due to lack of time, I cannot offer even an overview of resampling methods (bootstrapping, jackknifing), or of how to correct for estimation problems arising from the absence of full information on some variables (missing data). Those interested in these topics for their research should aim for one of the other courses offered at the Winter or Summer School that deal with them specifically. Despite its close connection with the issues of heteroskedasticity and of varying effects across subpopulations, I also cannot cover multilevel modeling. Suitable MLM courses are offered in both the Winter and Summer editions of the School, and I encourage those interested in the topic to take either of them.

Due to the advanced nature of the course, I expect participants to have good knowledge of linear regression. They ought to be familiar with running such a regression, interpreting coefficients and standard errors, assessing model fit, and diagnosing problems with the estimation procedure. In addition to this statistical foundation, participants are also required to have good knowledge of the R statistical environment: common procedures such as reading in data, cleaning and recoding variables, running a regression in R, and manipulating the output object (e.g. extracting coefficients or standard errors).

Day-by-Day Topics and Details

Monday: OLS assumptions

We discuss regression assumptions, with particular emphasis on four: continuous predictors, normal distribution of errors, homoskedasticity, and linear relationships.

Diagnostic tools for each of these four assumptions.

The effect of assumption violations on estimates.

In the lab, we cover these points in R, with particular focus on the diagnostic tools available (a short R sketch follows below).
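
A minimal sketch of the kind of diagnostic workflow covered in the lab, using base R plus the lmtest package; the data set dat and its variables (income, education, age) are hypothetical placeholders.

```r
# Fit a standard OLS model; 'dat', 'income', 'education', and 'age'
# are hypothetical placeholders for any social science data set.
fit <- lm(income ~ education + age, data = dat)

# The four standard diagnostic plots: residuals vs. fitted (linearity),
# normal Q-Q (normality of residuals), scale-location (homoskedasticity),
# and residuals vs. leverage (influential observations).
par(mfrow = c(2, 2))
plot(fit)

# Numerical check of residual normality.
shapiro.test(residuals(fit))

# Breusch-Pagan test for heteroskedasticity.
library(lmtest)
bptest(fit)
```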

Tuesday: Addressing heteroskedasticity (cluster-corrected SEs and WLS)

Discussion of the impact of heteroskedasticity on OLS estimates.

Cluster-corrected SEs (Huber–White) as a solution to heteroskedasticity.

Where cluster-corrected SEs do not work.

Weighted Least Squares, in cases of either known or unknown variance structure.

In the lab we cover both strategies in R, with special attention given to cluster-corrected SEs (see the R sketch below).
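
A minimal sketch of both strategies, assuming the sandwich and lmtest packages; the clustering variable country and the weighting variable variance_proxy are hypothetical.

```r
library(sandwich)  # (cluster-)robust variance estimators
library(lmtest)    # coeftest() for inference with a revised vcov

fit <- lm(income ~ education + age, data = dat)

# Cluster-corrected (Huber-White) SEs, clustering on a hypothetical
# 'country' variable; the coefficients stay the same, only the
# uncertainty estimates change.
coeftest(fit, vcov = vcovCL(fit, cluster = ~ country))

# Weighted Least Squares: when the error variance is proportional to
# a known quantity, weight each observation by its inverse.
fit_wls <- lm(income ~ education + age, data = dat,
              weights = 1 / variance_proxy)
summary(fit_wls)
```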

Wednesday: Effect heterogeneity (interactions and fixed effects)

Interactions in linear regression: two-way and three-way specifications.

Overview of interpretation for different types of interaction: continuous × continuous, dichotomous × continuous, dichotomous × dichotomous.

Graphical methods of presenting marginal effects from interactions.

Interpreting main effects in linear models with interactions.

Special case: fixed effects to model effect heterogeneity (an R sketch of interaction models follows below).
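
A minimal sketch of a two-way interaction between a continuous and a dichotomous predictor, with the marginal effect of the continuous variable computed at both levels of the moderator; the variable names (turnout, education, female) are hypothetical, and female is assumed to be coded 0/1.

```r
# Two-way interaction; '*' expands to both main effects plus their product.
fit_int <- lm(turnout ~ education * female, data = dat)
summary(fit_int)

# Marginal effect of education at each level of the 0/1 moderator:
# the slope for female == 0 is b_education; for female == 1 it is
# b_education + b_interaction.
b <- coef(fit_int)
me_female0 <- unname(b["education"])
me_female1 <- unname(b["education"] + b["education:female"])
c(me_female0, me_female1)
```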

Thursday: Nonlinear regression (polynomials and splines)

Non-linear relationships in OLS.

Data transformations as a solution to non-linearity.

Modeling non-linearity directly: (I) Polynomials in regression.

Modeling non-linearity directly: (II) Regression splines (see the R sketch below).
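
A minimal sketch of these approaches, using base R and the splines package (shipped with R); the variable names, knot locations, and starting values are hypothetical.

```r
library(splines)  # bs() for B-spline bases

# (I) Quadratic polynomial: poly() builds orthogonal polynomial terms.
fit_poly <- lm(wage ~ poly(age, 2), data = dat)

# (II) Cubic regression spline with knots chosen by the researcher.
fit_spline <- lm(wage ~ bs(age, knots = c(30, 50)), data = dat)

# Nonlinear least squares for an explicit functional form, here an
# exponential curve; starting values must be supplied by hand.
fit_nls <- nls(wage ~ a * exp(b * age), data = dat,
               start = list(a = 1, b = 0.01))
```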

Friday: Robust regression

The impact of outliers on OLS estimates.

Diagnostics for outliers.

Robust regression: (I) M estimation.

Robust regression: (II) bounded-influence regression.

Robust regression: (III) quantile regression (see the R sketch below).
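
A minimal sketch of M estimation and quantile regression, assuming the MASS and quantreg packages; bounded-influence regression (II) uses dedicated estimators not shown here, and all variable names are hypothetical.

```r
library(MASS)      # rlm() for M estimation
library(quantreg)  # rq() for quantile regression

# (I) M estimation with Huber weights: observations with large
# residuals are downweighted instead of dominating the fit.
fit_m <- rlm(income ~ education + age, data = dat)

# (III) Median (tau = 0.5) regression: resistant to outliers in the
# outcome variable.
fit_q <- rq(income ~ education + age, tau = 0.5, data = dat)

# Compare against standard OLS on the same data to see how the
# estimates shift in the presence of outliers.
fit_ols <- lm(income ~ education + age, data = dat)
cbind(OLS = coef(fit_ols), M = coef(fit_m), Median = coef(fit_q))
```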

Day-by-Day Readings
Monday

Fox, J. (2008). Applied Regression Analysis and Generalized Linear Models. New York: Sage. Chapter 12: “Diagnosing non-normality, nonconstant error variance, and nonlinearity” (pp. 267–306).

Tuesday

Wooldridge, J. M. (2013). Introductory Econometrics: A Modern Approach, 5th edition. Mason, OH: Cengage Learning. Chapter 8: “Heteroskedasticity” (pp. 268–302).

Wednesday

Kam, C. D., & Franzese Jr., R. J. (2007). Modeling and Interpreting Interactive Hypotheses in Regression Analysis. Ann Arbor, MI: The University of Michigan Press. Chapter 3 (“Theory to practice”) and Chapter 4 (“The meaning, use, and abuse of some common general-practice rules”), pp. 13–102.

Thursday

Fox, J. (2008). Applied Regression Analysis and Generalized Linear Models. New York: Sage. Chapter 17: “Nonlinear regression” (pp. 451–475).

Motulsky, H. J., & Ransnas, L. A. (1987). “Fitting curves to data using nonlinear regression: a practical and nonmathematical review.” The FASEB Journal, 1(5), 365–374.

Friday

Fox, J. (2008). Applied Regression Analysis and Generalized Linear Models. New York: Sage. Chapter 11: “Unusual and influential data” (pp. 241–266).

Fox, J. (2008). Applied Regression Analysis and Generalized Linear Models. New York: Sage. Chapter 19: “Robust regression” (pp. 530–547).

Software Requirements

R version 3.3.2 (or newer).

RStudio version 1.0.136 (or newer).

Hardware Requirements

At least an Intel Core 2 Duo processor, and a machine with a minimum of 2 GB of RAM. Around 300–400 MB of free disk space, for installing additional R packages and storing data. Any laptop bought after 2011 ought to be fine in terms of these minimum requirements.

Literature

Belsley, D. A., Kuh, E., & Welsch, R. E. (2004). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. New York: Wiley.

Berry, W. D. (1993). Understanding Regression Assumptions. Quantitative Applications in the Social Sciences. Thousand Oaks, CA: Sage Publications.

Brambor, T., Clark, W. R., & Golder, M. (2006). Understanding Interaction Models: Improving Empirical Analyses. Political Analysis, 14(1), 63–82.

Braumoeller, B. F. (2004). Hypothesis Testing and Multiplicative Interaction Terms. International Organization, 58(4), 807–820.

Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences. 3rd ed. Mahwah, NJ: Lawrence Erlbaum Associates. See particularly chapters 4, 6, 7, 9, and 10.

Hao, L., & Naiman, D. Q. (2007). Quantile Regression. London: Sage Publications.

Jaccard, J., & Turrisi, R. (2003). Interaction Effects in Multiple Regression (2nd ed.). London: Sage Publications.

Ritz, C., & Streibig, J. C. (2008). Nonlinear Regression with R. New York: Springer.

Ryan, T. P. (2008). Modern Regression Methods (2nd ed.). Hoboken, NJ: Wiley. See particularly chapters 2, 6, 8, 11, and 13.

Sheather, S. J. (2009). A Modern Approach to Regression with R. New York: Springer. See chapters 3 and 4.

Weisberg, S. (2005). Applied Linear Regression (3rd ed.). Hoboken, NJ: Wiley-Interscience. See chapter 11.

Recommended Courses to Cover Before this One

Summer School

  • Introduction to the use of R
  • Introduction to regression analysis

Winter School

  • Introduction to the use of R

Recommended Courses to Cover After this One

Summer School

  • Intro to GLM: Binary, Ordered and Multinomial Logistic, and Count Regression Models
  • Applied Multilevel Regression Modeling

Winter School

  • Applied Multilevel Regression Modelling
  • Handling Missing Data
  • Interpreting Binary Logistic Regression Models