ECPR

Linear Regression with R/Stata: Estimation, Interpretation and Presentation

Constantin Manuel Bosancianu
manuel.bosancianu@outlook.com

Central European University

Constantin Manuel Bosancianu is a postdoctoral researcher in the Institutions and Political Inequality unit at Wissenschaftszentrum Berlin.

His work focuses on the intersection of political economy and electoral behaviour: how to measure political inequalities between citizens of developed and developing countries, and what the linkages between political and economic inequalities are.

He is interested in statistics, data visualisation, and the history of Leftist parties. Occasionally, he teaches methods workshops on regression, multilevel modelling, or R.

  @cmbosancianu

Course Dates and Times

Monday 6 to Friday 10 March 2017
Generally classes are either 09:00-12:30 or 14:00-17:30
15 hours over 5 days

Prerequisite Knowledge

In order to guarantee that we can progress at a steady and firm pace, participants will need to have a thorough understanding of basic statistical concepts such as mean, median, variance, standard deviation and standard error. Additionally, very basic statistical tests and analyses, such as t tests and ANOVA should be familiar to participants at least at a theoretical level. The class will be carried out in both R and Stata, so participants are expected to have basic knowledge of either one of these two software packages: reading in data, basic data recoding skills, and very basic plotting commands.

 


Short Outline

This course will expose participants to the rigorous application of linear regression models. Over five days we go through estimating these models, interpreting their results and judging how well the models fit the data. We will gradually explore more complex specifications, learning how to deal with dichotomous predictors and interactions. We also focus on the assumptions which OLS models are based on, how to check for these in the data at hand, and how to handle situations when they are not met. Finally, we briefly explore how to extend these models to situations of non-linearity and non-continuous outcomes. Throughout the class, we emphasize presenting results as intuitively as possible for our audience, either through graphs or predicted values. This format should serve participants who are interested in a thorough coverage of linear models, both for immediate use and as a stepping stone for more advanced statistical procedures.


Long Course Outline

It is frequently quipped that regression is the most used and abused method in the social sciences. My goal in this class is to expose participants to the abuse-free application of linear regression models to social science data. By the end of the sessions participants will have all the required theoretical and practical skills to responsibly run multivariate linear regressions on a variety of data configurations. This includes estimating multiple model specifications in R or Stata, presenting results in tables, in a graphical format or as informative quantities, and interpreting the coefficients for the reader. It also means assessing the appropriateness of OLS regression for certain kinds of data distributions, and learning to make suitable corrections and adjustments when there is a mismatch between model requirements and data characteristics.

The course should appeal most to participants who have had statistical training at an introductory level as part of their undergraduate studies, and now wish to deepen it with a rigorous coverage of linear models. Although we will not be computing quantities by hand, there will be a few simple formulas as part of the lectures. In this sense, the class is also suitable for those of you who have briefly encountered OLS regression as part of a statistics class, but now wish to better understand how it works, where it breaks down, and how it can be applied in a thorough way. Because the class needs to focus constantly on the application of linear models, it is unsuitable for those who want an introductory course in general statistics. During one of the sessions we briefly cover some basic statistical concepts and tests, but only so that we can all approach the topic of linear models on an equal footing. This cannot be considered a substitute for a good coverage of introductory statistics.

We start off the class with a condensed review of some fundamental concepts in basic statistics: the z and t distributions, hypothesis testing, confidence intervals and correlation. This short overview intends to provide a solid foundation from which to advance in the following days. In the lab session, we will go through a few of the basic data manipulation procedures that are commonly required before running any regression: data cleaning and recoding, transformations of data etc. This is also a good opportunity for participants to get familiar, if need be, with working with syntax files in R and Stata.
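As a rough illustration of the review concepts named above (standard error, confidence interval, correlation), the following sketch computes them from first principles. The course labs use R and Stata; plain Python is used here only to make the arithmetic explicit. The data and the variable names (`hours`, `score`) are invented for this example and do not come from the course materials.

```python
import math
from statistics import NormalDist, mean, stdev

# Hypothetical data, invented purely for illustration.
hours = [2.0, 4.0, 6.0, 8.0, 10.0]
score = [1.0, 3.0, 2.0, 5.0, 4.0]

n = len(score)
score_mean = mean(score)             # sample mean
score_sd = stdev(score)              # sample standard deviation (n - 1 denominator)
score_se = score_sd / math.sqrt(n)   # standard error of the mean

# 95% confidence interval based on the z distribution. With a sample this
# small, the t distribution would give a wider, more appropriate interval.
z = NormalDist().inv_cdf(0.975)
ci = (score_mean - z * score_se, score_mean + z * score_se)

# Pearson correlation: co-variation divided by the product of the spreads.
hours_mean = mean(hours)
cross = sum((h - hours_mean) * (s - score_mean) for h, s in zip(hours, score))
ss_hours = sum((h - hours_mean) ** 2 for h in hours)
ss_score = sum((s - score_mean) ** 2 for s in score)
r = cross / math.sqrt(ss_hours * ss_score)

print(f"mean = {score_mean:.2f}, SE = {score_se:.3f}")
print(f"95% CI (z-based) = ({ci[0]:.2f}, {ci[1]:.2f})")
print(f"correlation = {r:.2f}")
```

The z-based interval is used only because the standard library provides the normal quantile directly; a t-based interval is the more appropriate choice for a sample of five.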

The second session delves fully into the fundamentals of Ordinary Least Squares (OLS) regression: how the estimation is carried out, how we interpret the coefficients and their associated uncertainty, and the best way in which to present results from these analyses. Some basic formulas will be presented, but the goal will be to gain an intuitive understanding of how the estimation process functions, and what the results mean. In the lab session we put this newly gained knowledge to the test by running a few examples of linear models in R or Stata. We will interpret the output and the model fit, generate predictions based on the model, and display them graphically. For this, we will rely on the functionality provided by the Zelig package in R, and the Clarify add-on in Stata.
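To make the estimation step concrete, here is a minimal sketch of a bivariate OLS fit computed by hand: slope, intercept, model fit, the standard error of the slope, and one predicted value. It is not the course's lab code, which relies on R or Stata (with Zelig or Clarify); the data are invented and plain Python is used only to expose the arithmetic.

```python
import math

# Hypothetical data, invented purely for illustration.
x = [2.0, 4.0, 6.0, 8.0, 10.0]
y = [1.0, 3.0, 2.0, 5.0, 4.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sums of squares and cross-products around the means.
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# OLS estimates: the line that minimises the sum of squared residuals.
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Model fit: share of the variance in y accounted for by the line.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
rss = sum(e ** 2 for e in residuals)
tss = sum((yi - y_bar) ** 2 for yi in y)
r_squared = 1.0 - rss / tss

# Uncertainty: residual variance with n - 2 degrees of freedom,
# then the standard error of the slope.
sigma2 = rss / (n - 2)
se_slope = math.sqrt(sigma2 / sxx)

# A predicted value for a hypothetical observation with x = 7.
predicted_at_7 = intercept + slope * 7.0

print(f"intercept = {intercept:.2f}, slope = {slope:.2f} (SE {se_slope:.3f})")
print(f"R-squared = {r_squared:.2f}, prediction at x = 7: {predicted_at_7:.2f}")
```

Reporting the prediction at a substantively meaningful value of the predictor, rather than the slope alone, is in line with the emphasis on predicted values described in this outline.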

In the third session we advance in our coverage of OLS by tackling some model specifications that almost always appear in empirical research: dummy indicators, and the exploration of effect heterogeneity through interactions. We learn how to interpret results obtained from interaction terms, and how to present them in a graphical way. In the lab component we learn how to run these model specifications in R or Stata.

The fourth session is devoted to preventing abuse in the estimation of linear models. As with the vast majority of statistical procedures, a series of assumptions underpin OLS regression. If these are not met we have little reason to put our faith in the results we obtain. In this session we go over these assumptions, how violations of them influence the results, and what strategies we have available in the toolbox to overcome this situation. In the lab we turn to these issues from a practical perspective. We use R and Stata to run a test regression, assess whether the assumptions are met, and correct for the assumption violations that exist. Through a step-by-step process, participants have the opportunity to see how their estimates and model fit change when engaging in such a process.
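One way to see how estimates and model fit shift once an assumption violation is addressed is to compare a fit before and after a data transformation. The sketch below is a deliberately artificial case, not the course's lab exercise: the outcome is an exact exponential function of the predictor, so a straight line fits the raw outcome poorly and the logged outcome perfectly. The helper `fit_line` and the data are assumptions made for this example.

```python
import math

def fit_line(x, y):
    """Return (intercept, slope, r_squared) for a bivariate OLS fit."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    tss = sum((yi - y_bar) ** 2 for yi in y)
    return intercept, slope, 1.0 - rss / tss

# Hypothetical data: y grows exponentially in x, with no noise added.
x = [float(i) for i in range(1, 11)]
y = [math.exp(0.5 + 0.3 * xi) for xi in x]

# Fit 1: linear model on the raw outcome (the linearity assumption is violated).
raw = fit_line(x, y)

# Fit 2: the same model after log-transforming the outcome.
logged = fit_line(x, [math.log(yi) for yi in y])

print(f"raw outcome:    slope = {raw[1]:.3f}, R-squared = {raw[2]:.3f}")
print(f"logged outcome: slope = {logged[1]:.3f}, R-squared = {logged[2]:.3f}")
```

After the transformation the slope has a different meaning: it describes the change in the log of the outcome per unit of the predictor, so the two slopes are not directly comparable.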

Finally, we devote the last session to extensions of the framework: non-linear and generalized linear models. While OLS is ubiquitous in the social sciences, a considerable number of phenomena are not linear in how their effect is manifested. Getting a taste of non-linear models opens up a new world of hypotheses and results for researchers. In the second half, we discuss generalized linear models (GLMs), which can be used to analyze non-continuous dependent variables: dichotomous, ordinal, or multinomial. Using the test case of a dichotomous outcome variable, we explore the way in which GLMs are a simple extension of the linear framework. In the lab we practice these models and focus on how to interpret their results.
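The sense in which a GLM for a dichotomous outcome extends the linear framework can be sketched as follows: the familiar linear predictor is kept, and an inverse-logit transformation maps it onto the 0 to 1 probability scale. The coefficients below are hypothetical values chosen for illustration, not estimates from any data set, and the function names are likewise assumptions.

```python
import math

def inv_logit(eta):
    """Map a value on the linear-predictor scale to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical logistic regression coefficients (not estimated from data).
intercept, slope = -2.0, 0.5

def predicted_probability(x):
    eta = intercept + slope * x   # the same linear predictor as in OLS
    return inv_logit(eta)         # ... passed through the inverse link

# The coefficient is constant on the log-odds scale; exponentiating it
# gives the odds ratio for a one-unit increase in x.
odds_ratio = math.exp(slope)

# On the probability scale the effect of a one-unit increase is NOT
# constant: it depends on where along x the increase happens.
change_low = predicted_probability(1.0) - predicted_probability(0.0)
change_mid = predicted_probability(4.0) - predicted_probability(3.0)
p_at_4 = predicted_probability(4.0)

print(f"odds ratio = {odds_ratio:.3f}")
print(f"P(y = 1 | x = 4) = {p_at_4:.3f}")
print(f"change from x=0 to x=1: {change_low:.3f}")
print(f"change from x=3 to x=4: {change_mid:.3f}")
```

Because the effect on the probability scale varies with the starting point, predicted probabilities at chosen values of the predictor are usually easier to communicate than the raw coefficients.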

Throughout the class, we will focus on graphical methods and intuitive quantities when presenting results from linear models. This means graphs rather than tables, and predicted values with the uncertainty around them rather than coefficients and standard errors. While tables of coefficients are still the dominant way of presenting results in academic journals, graphs and predicted values tend to be preferred in reports and analyses done for larger, non-technical audiences. I strongly believe that participants should be exposed to both types, and should tailor the delivery of their results to the audience at hand.

Day Topic Details
1 From correlation to regression: revisiting the basics

We cover a few foundational concepts in statistics: correlation, standard error, t test, t and z distributions.

In the lab part, we get familiar with R or Stata, and try a few basic data manipulation and transformation tasks. All of these tasks habitually have to be performed before running a regression.

2 OLS fundamentals, coefficients, and graphical displays

We go through the estimation of OLS models, the interpretation of coefficients and their associated uncertainty, and the graphical presentation of results.

In the lab session, we run a few regressions in R or Stata, based on the code supplied by the instructor, and go through interpreting coefficients, uncertainty and model fit once more.

3 Dummy variables, interactions, and graphical displays

We discuss slightly more complex model specifications, including dummy variables, and simple interactions. We learn how to visualize the results from these interactions through standard plots.

In the lab, we go through these more advanced specifications, and do some more plotting of results relying on the Zelig package in R, and the Clarify add-on in Stata. Learning how to properly interpret and present interactions is a vital skill when considering the sheer number of hypotheses in the social sciences that can be tested through the judicious use of interactions.

4 Regression assumptions, violations and remedies

This session covers the assumptions underpinning OLS regression, what the implications of assumption violations are, and how to correct for them.

The lab session will offer practical strategies for identifying assumption violations, and for overcoming some of them through data transformations. We also see how estimates and model fit statistics change when correcting for some of these violations.

5 Beyond linear models: non-linear specifications and GLMs

In this last session, we review a few of the most important ideas covered in the past four days and extend the linear framework to non-linear specifications and non-continuous outcome variables. This is meant to offer participants a few glimpses into the potential of statistical models not extremely different from OLS regression to cover a variety of more complex hypotheses and data configurations.

In the lab we run a few very simple non-linear specifications, and a straightforward logistic regression.
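For the Day 3 material on interactions, the key point of interpretation is that the effect of one predictor is no longer a single number: it depends on the value of the other predictor, and so does its uncertainty. The sketch below assumes a model of the form y = b0 + b1*x + b2*z + b3*x*z and applies the standard formula for the variance of a linear combination of coefficients. All coefficient, variance and covariance values are hypothetical, chosen only for illustration.

```python
import math

# Hypothetical estimates from y = b0 + b1*x + b2*z + b3*x*z.
b1, b3 = 0.50, -0.30

# Hypothetical entries of the coefficient variance-covariance matrix.
var_b1, var_b3, cov_b1_b3 = 0.04, 0.01, -0.01

def marginal_effect(z):
    """Effect of a one-unit increase in x when the moderator equals z."""
    return b1 + b3 * z

def marginal_effect_se(z):
    """Standard error of that effect: Var(b1) + z^2 Var(b3) + 2 z Cov(b1, b3)."""
    variance = var_b1 + z ** 2 * var_b3 + 2.0 * z * cov_b1_b3
    return math.sqrt(variance)

# With a dummy moderator, there are exactly two effects to report.
for z in (0.0, 1.0):
    effect = marginal_effect(z)
    se = marginal_effect_se(z)
    print(f"z = {z:.0f}: effect of x = {effect:.2f} (SE {se:.3f})")
```

Note that b1 on its own is only the effect of x when z equals 0, which is why the effect and its standard error are reported for each value of the moderator.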

Day Readings
1
  • Moore, D. S., McCabe, G. P., & Craig, B. A. (2009). Introduction to the Practice of Statistics. 6th edition. New York: W. H. Freeman and Co. Chapters 5, 6, 7 and 8.

(NB: There is no need to read each of the chapters carefully. Please only focus on the topics that you feel you might need a brush up on. The rest of the topics can be merely skimmed.)

2
  • Fox, J. (2008). Applied Regression Analysis and Generalized Linear Models. 2nd edition. Thousand Oaks, CA: Sage. Chapters 5 and 6.

Optional:

  • Gelman, A., Pasarica, C., & Dodhia, R. (2002). Let’s Practice What We Preach: Turning Tables into Graphs. The American Statistician, 56(2), 121–130.
3
  • Fox, J. (2008). Applied Regression Analysis and Generalized Linear Models. 2nd edition. Thousand Oaks, CA: Sage. Chapter 7.
  • Brambor, T., Clark, W. R., & Golder, M. (2006). Understanding Interaction Models: Improving Empirical Analyses. Political Analysis, 14(1), 63–82.

Optional:

  • Hardy, M. A. (1993). Regression with Dummy Variables. Quantitative Applications in the Social Sciences Series. London: Sage. Chapter 3.
  • Jaccard, J., & Turrisi, R. (2003). Interaction Effects in Multiple Regression. Quantitative Applications in the Social Sciences Series. London: Sage. Chapter 2.
4
  • Fox, J. (2008). Applied Regression Analysis and Generalized Linear Models. 2nd edition. Thousand Oaks, CA: Sage. Chapters 11 and 13.

Optional:

  • Berry, W. D. (1993). Understanding Regression Assumptions. Quantitative Applications in the Social Sciences Series. London: Sage. Chapter 5. [technical at times, but a thorough treatment]
  • King, G., & Roberts, M. E. (2015). How Robust Standard Errors Expose Methodological Problems They Do Not Fix, and What to Do About It. Political Analysis, 23(2), 159–179. [the paper is a bit technical in places, so consider skipping the more mathematical sections and focusing on sections 1, 6 and 7]
5
  • Fox, J. (2008). Applied Regression Analysis and Generalized Linear Models. 2nd edition. Thousand Oaks, CA: Sage. Chapters 14 and 17.

Optional:

  • Motulsky, H. J., & Ransnas, L. A. (1987). Fitting Curves to Data Using Nonlinear Regression: A Practical and Nonmathematical Review. The FASEB Journal, 1(5), 365–374.
  • Pampel, F. C. (2000). Logistic Regression: A Primer. Quantitative Applications in the Social Sciences Series. London: Sage. Chapters 1 and 2.

Software Requirements

R 3.3.2 or any newer version.

Stata 13.0 or any newer version.

RStudio 0.99.1266 or any newer version.

Hardware Requirements

Any computer or laptop bought within the last 4-5 years should be sufficient. 2 GB of RAM and 200-300 MB of free space on the hard drive are enough for running the tasks we will attempt.

Literature

I have tried, as much as possible, to assign chapters from the same textbook, so as to minimize disruptions in logic and in the way the topics are approached. However, if you encounter difficulties in tracking down the literature above, please try the sources below as well. Some are more advanced, though, and choose to present the topic in a more mathematical way.

 

  1. Achen, C. H. (1982). Interpreting and Using Regression. Quantitative Applications in the Social Sciences Series. London: Sage. [a classic text, and very intuitive]
  2. Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied Multiple Regression / Correlation Analysis for the Behavioral Sciences. 3rd edition. Mahwah, NJ: Lawrence Erlbaum Associates. Chapters 2, 3, 4, 5, 6, 7, 8, 9, and 10. [a more advanced treatment of regression analysis]
  3. Fox, J. (1991). Regression Diagnostics. Quantitative Applications in the Social Sciences Series. London: Sage.
  4. Kam, C. D., & Franzese, R. J. (2007). Modeling and Interpreting Interactive Hypotheses in Regression Analysis. Ann Arbor, MI: University of Michigan Press. Chapters 3 and 4.
  5. Kutner, M. H., Nachtsheim, C. J., Neter, J., & Li, W. (2005). Applied Linear Statistical Models. 5th edition. New York: McGraw-Hill.
  6. Lewis-Beck, M. S. (1980). Applied Regression: An Introduction. Quantitative Applications in the Social Sciences Series. London: Sage. [another classic]
  7. Montgomery, D. C., Peck, E. A., & Vining, G. G. (2012). Introduction to Linear Regression Analysis. 5th edition. New York: Wiley. Chapters 2, 3, 4, 6, 9, 12, and 13 [the book is targeted at a more advanced audience, and is fairly formula-heavy]

 

For assistance with running regressions in R / Stata, please try the following books:

  1. Fox, J., & Weisberg, S. (2011). An R Companion to Applied Regression. 2nd edition. London: Sage.
  2. Hamilton, L. C. (2013). Statistics with Stata: Version 12. 8th edition. Boston, MA: Cengage.
  3. Rabe-Hesketh, S., & Everitt, B. (2004). A Handbook of Statistical Analyses Using Stata. 3rd edition. Boca Raton, FL: Chapman & Hall.

Recommended Courses to Cover Before this One

Summer School:

  • Introduction to R
  • Introduction to Stata

Winter School:

  • Introduction to R
  • Introduction to Stata
  • Introduction to Statistics for Political and Social Scientists

Recommended Courses to Cover After this One

Summer School:

  • Intro to GLM: Binary, Ordered and Multinomial Logistic, and Count Regression
  • Applied Topics in Advanced Regression

Winter School:

  • Interpreting Binary Logistic Regression Models


Additional Information

Disclaimer

This course description may be subject to subsequent adaptations (e.g. taking into account new developments in the field, participant demands, group size, etc). Registered participants will be informed in due time.

Note from the Academic Conveners

By registering for this course, you confirm that you possess the knowledge required to follow it. The instructor will not teach these prerequisite items. If in doubt, contact the instructor before registering.