ECPR

Linear Regression with R/Stata: Estimation, Interpretation and Presentation

Constantin Manuel Bosancianu
manuel.bosancianu@outlook.com

Central European University

Constantin Manuel Bosancianu is a postdoctoral researcher in the Institutions and Political Inequality unit at Wissenschaftszentrum Berlin.

His work focuses on the intersection of political economy and electoral behaviour: how to measure political inequalities between citizens of developed and developing countries, and what the linkages between political and economic inequalities are.

He is interested in statistics, data visualisation, and the history of Leftist parties. Occasionally, he teaches methods workshops on regression, multilevel modelling, or R.

  @cmbosancianu

Course Dates and Times

Monday 5 to Friday 9 March 2018
09:00-12:30
15 hours over 5 days

Prerequisite Knowledge

In order to guarantee that we can progress at a steady and firm pace, participants will need to have a thorough understanding of basic statistical concepts such as mean, median, variance, standard deviation and standard error. Additionally, very basic statistical tests and analyses, such as t tests and ANOVA should be familiar to participants at least at a theoretical level. The class will be carried out primarily in R, but with Stata examples and scripts as well. Participants are expected to have basic knowledge of either one of these two software packages: reading in data, basic data recoding skills, and very basic plotting commands.

 


Short Outline

This course will expose participants to the rigorous application of linear regression models. Over five days we go through estimating these models, interpreting their results and judging how well the models fit the data. We will gradually explore more complex specifications, learning how to deal with dichotomous predictors and interactions. We also focus on the assumptions which OLS models are based on, how to check for these in the data at hand, and how to handle situations when they are not met. Throughout the class, we emphasize presenting results as intuitively as possible for our audience, either through graphs or predicted values. This format should serve participants who are interested in a thorough coverage of linear models, both for immediate use and as a stepping stone for more advanced statistical procedures. The class will be conducted primarily in R, but I will also share Stata code for all procedures and models.

 

Tasks for ECTS Credits

  • Participants attending the course: 2 credits (pass/fail grade). The workload for the calculation of ECTS credits is based on the assumption that students attend classes and carry out the necessary reading and/or other work prior to, and after, classes.
  • Participants attending the course and completing one task (see below): 3 credits (to be graded)
  • Participants attending the course, and completing two tasks (see below): 4 credits (to be graded)
     
  1. 3 credits: class attendance + readings + a take-home assignment during the course;
  2. 4 credits: class attendance + readings + take-home assignment + final paper.

The take-home assignment will be sent out on Tuesday, March 6th, with a Thursday afternoon deadline. Participants will be provided with some data by the instructor, as well as with a few model specifications which need to be estimated with this data. You will be expected to interpret a few coefficients from these models and their uncertainty, as well as to make a few qualitative decisions as to which model is the best and what recommendations you can make based on the results.

The final paper will resemble a conference paper, with the exception that the literature review section can be just 2-3 paragraphs where you present the puzzle. Participants should identify a few hypotheses they are interested in testing, and then proceed to test them based on data of their choosing. The main parts I am interested in are the variable description (not more than a couple of pages), the analyses, and the interpretation of the results. You will be assessed based on how well you have interpreted coefficients and the model fit, whether you have explored potential problems with these models (in terms of assumption violations), and how well you were able to correct said problems. The deadline for this assignment will be 15-20 days after the end of the Winter School.

For both assignments, more details about the tasks and requirements will be provided during the course itself.


Long Course Outline

It is frequently quipped that regression is the most used and abused method in the social sciences. My goal in this class is to expose participants to the abuse-free application of linear regression models to social science data. By the end of the sessions participants will have all the required theoretical and practical skills to responsibly run multiple linear regressions on a variety of data configurations. This includes estimating multiple model specifications in R or Stata, presenting results in tables, in a graphical format or as informative quantities, and interpreting the coefficients for the reader. Furthermore, it also implies assessing the appropriateness of OLS regression for certain kinds of data distributions, and learning to make suitable corrections and adjustments when there is a mismatch between model requirements and data characteristics.

The course should appeal most to participants who have had statistical training at an introductory level as part of their undergraduate studies, and now wish to deepen it with a rigorous coverage of linear models. Although we will not be computing quantities by hand, there will be a few simple formulas as part of the lectures. In this sense, the class is also suitable for those of you who have briefly encountered OLS regression as part of a statistics class, but now wish to better understand how it works, where it breaks down, and how it can be applied in a thorough way. Due to the need to constantly focus on the application of linear models, the class is unsuitable for those who want an introductory course in general statistics. During one of the sessions we briefly cover some basic statistical concepts and tests, but only so that we can all approach the topic of linear models on an equal footing. This cannot be considered a substitute for a good coverage of introductory statistics.

We start off the class with a condensed review of some fundamental concepts in basic statistics: the z and t distributions, hypothesis testing, confidence intervals and correlation. This short overview is intended to provide a solid foundation from which to advance in the following days. We also begin to discuss a few basics of regression, e.g. how it goes beyond correlation, and for what type of questions it is helpful. In the lab session, we will go through a few of the basic data manipulation procedures that are commonly required before running any regression: data cleaning and recoding, transformations of data etc. This is also a good opportunity for participants to get familiar, if need be, with working with syntax files in R and Stata, and with the Stata or RStudio interfaces.
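As a rough illustration of how regression relates to correlation, the sketch below uses base R and the built-in mtcars data (an assumption made for illustration, not course material). It computes a correlation with its confidence interval, then shows that the slope of a simple regression is the same correlation rescaled into the units of the two variables.

```r
# Illustration only: mtcars ships with base R and is not course data.
data(mtcars)
x <- mtcars$wt   # car weight, in 1000 lbs
y <- mtcars$mpg  # fuel efficiency, in miles per gallon

# Pearson correlation, with a t test and a 95% confidence interval.
ct <- cor.test(x, y)
r <- unname(ct$estimate)
ct$conf.int

# Simple regression of y on x.
m <- lm(y ~ x)
slope <- unname(coef(m)["x"])

# The slope equals the correlation multiplied by sd(y) / sd(x),
# so it is expressed in units of y per one unit of x.
slope_from_r <- r * sd(y) / sd(x)
all.equal(slope, slope_from_r)
```

Unlike the correlation, the slope answers a directional question (how much y is expected to change per unit of x), and the model extends naturally to several predictors.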

The second session delves fully into the fundamentals of Ordinary Least Squares (OLS) regression: how the estimation is carried out, and how we interpret the coefficients both for simple (one predictor) and multiple regression (two or more predictors). Some basic formulas will be presented, but the goal will be to gain an intuitive understanding of how the estimation process functions, and what the results mean. In the lab session we put this newly-gained knowledge to the test, by running a few examples of linear models in R or Stata. We will interpret the output and the model fit, and generate predictions based on the model using the functionality provided by the Zelig package in R, and the Clarify add-on in Stata.
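A minimal sketch of the kind of workflow described here might look as follows. It relies on base R's lm() and predict() rather than Zelig, and on the built-in mtcars data; both choices are assumptions made to keep the example self-contained, and the hypothetical car in new_car is invented for illustration.

```r
# Illustration only: mtcars ships with base R and is not course data.
data(mtcars)

# Simple regression: fuel efficiency (mpg) on weight (wt, in 1000 lbs).
m_simple <- lm(mpg ~ wt, data = mtcars)

# Multiple regression: add horsepower (hp) as a second predictor.
m_multi <- lm(mpg ~ wt + hp, data = mtcars)

# Each slope is the expected change in mpg for a one-unit increase in
# that predictor, holding the other predictor constant.
coef(m_multi)

# Model fit: share of the variance in mpg accounted for by the model.
summary(m_multi)$r.squared

# Predicted mpg, with a 95% confidence interval, for a hypothetical car
# weighing 3,000 lbs with 110 horsepower.
new_car <- data.frame(wt = 3, hp = 110)
pred <- predict(m_multi, newdata = new_car, interval = "confidence")
pred
```

The predicted value and its interval are often easier to communicate to a reader than the raw coefficients, which is the motivation for simulation-based tools of this kind.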

In the third session we advance in our coverage of OLS by tackling the issue of uncertainty of estimates, as well as some model specifications that almost always appear in empirical research. In the latter case I refer to dummy indicators (categorical predictors). We will discuss where estimated uncertainty comes from, how it impacts your results, what influences uncertainty, and how you ought to communicate it to your audience. In the lab component we learn how to run these model specifications with the use of R/Stata.

The fourth session is devoted to preventing abuse in the estimation of linear models. As with the vast majority of statistical procedures, a series of assumptions underpin OLS regression. If these are not met we have little reason to put our faith in the results we obtain. In this session we go over these assumptions, how they influence the results when they are not met, and what strategies we have available in the toolbox to overcome this situation. In the lab we turn to these issues from a practical perspective. We use R to run a test regression, assess whether the assumptions are met, and correct for the assumption violations that exist. Through a step-by-step process, participants have the opportunity to see how their estimates and model fit change when engaging in such a process.
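The step-by-step process of checking assumptions and applying a remedy could be sketched roughly as below. The data (built-in mtcars), the two diagnostic plots, and the choice of a log transformation of horsepower are all assumptions made for illustration; they are not the course's own test regression.

```r
# Illustration only: mtcars ships with base R and is not course data.
data(mtcars)

# Baseline model: mpg as a linear function of weight and horsepower.
m_raw <- lm(mpg ~ wt + hp, data = mtcars)

# Residuals against fitted values: curvature hints at non-linearity,
# a funnel shape hints at non-constant error variance.
plot(fitted(m_raw), resid(m_raw),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normal quantile plot of the residuals.
qqnorm(resid(m_raw))
qqline(resid(m_raw))

# One possible remedy (an assumption here): log-transform horsepower.
m_tr <- lm(mpg ~ wt + log(hp), data = mtcars)

# Compare estimates and fit before and after the transformation. The
# outcome is unchanged, so the two R-squared values are comparable.
coef(m_raw)
coef(m_tr)
c(raw = summary(m_raw)$r.squared, transformed = summary(m_tr)$r.squared)
```

After a transformation the diagnostic plots should be redrawn for the new model, since a remedy for one violation can leave others untouched.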

Finally, we devote the last session mainly to a recap of the most important elements in the regression framework, based on the participants’ needs and requests. We also delve into the various ways in which regression results can be presented, depending on the audience of your work: quantitative scholars, policymakers, or a general audience. As a slightly more advanced topic, I conclude with a presentation of how interactions can expand considerably the range of hypotheses you can test with regression. In the lab I intend to present code for producing tables and graphs with regression estimates, as well as demonstrate how an interaction effect between predictors can be set up, estimated, and evaluated.
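To make the idea of an interaction concrete, here is a small sketch in base R with the built-in mtcars data (an assumption for illustration, as is the variable name manual). It sets up an interaction between a continuous predictor and a dummy, recovers the slope within each group, and produces predicted values with confidence intervals that could feed a graph.

```r
# Illustration only: mtcars ships with base R and is not course data.
data(mtcars)

# Dummy indicator: 1 = manual transmission, 0 = automatic.
mtcars$manual <- as.integer(mtcars$am == 1)

# wt * manual expands to wt + manual + wt:manual.
m_int <- lm(mpg ~ wt * manual, data = mtcars)
b <- coef(m_int)

# Slope of weight for automatic cars (manual = 0).
slope_auto <- b[["wt"]]

# Slope of weight for manual cars: main effect plus interaction term.
slope_manual <- b[["wt"]] + b[["wt:manual"]]

# Uncertainty around each coefficient.
confint(m_int)

# Predicted mpg with 95% confidence intervals over a grid of weights,
# separately for each transmission type.
grid <- expand.grid(wt = seq(2, 4, by = 0.5), manual = c(0, 1))
grid <- cbind(grid, predict(m_int, newdata = grid, interval = "confidence"))
grid
```

The coefficient on wt alone describes the effect of weight only when manual equals 0, which is why the effect of a predictor in an interaction model is best presented across values of the moderating variable.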

Throughout the class, we will focus on graphical methods and intuitive quantities when presenting results from linear models. This means graphs rather than tables, and predicted values with the uncertainty around them rather than coefficients and standard errors. While tables of coefficients are still the dominant way of presenting results in academic journals, graphs and predicted values tend to be preferred in reports and analyses done for larger, non-technical audiences. I strongly believe that participants should be exposed to both types, and should tailor the delivery of their results to the audience at a particular time.

Day Topic Details
1 From correlation to regression: revisiting the basics

We cover a few foundational concepts in statistics: correlation, standard error, t test, t and z distributions. We also make our first forays into the regression setup.

In the lab part, we get familiar with R or Stata, and try a few basic data manipulation and transformation tasks. All of these tasks habitually have to be performed before running a regression.

2 OLS fundamentals, coefficients, and graphical displays: coefficients and model fit.

We go through the estimation of OLS models and the interpretation of coefficients for simple and multiple regression.

In the lab session, we run a few regressions in R or Stata, based on the code supplied by the instructor, and go through interpreting coefficients and measures of model fit once more. We also introduce a way of presenting effect sizes based on the Zelig package in R, and the Clarify add-on in Stata.

3 Dummy variables, interactions, and graphical displays and uncertainty of estimates.

We discuss slightly more complex model specifications which include dummy variables. The bulk of the class, though, is devoted to understanding and interpreting uncertainty in our regression estimates.

In the lab, we go through additional regression models, which involve dummies. Most of our empirical efforts, though, will be allocated to understanding where uncertainty in estimates comes from, how we can minimize it, and how we can responsibly present it to the audience.

4 Regression assumptions: violations and remedies.

This session covers the assumptions underpinning OLS regression, what the implications of assumption violations are, and how to correct for them.

The lab session will offer practical strategies for identifying assumption violations, and overcoming some of them through data transformations. We also see how estimates and model fit statistics change when correcting for some of these violations.

5 Beyond linear models: non-linear specifications and GLMs; recap, multiplicative interactions in regression, and graphical presentations.

In this last session, we review a few of the most important ideas covered in the past four days, based on participants’ requests. I also introduce a way to test more sophisticated hypotheses, about how effects of a predictor vary, through the use of interactions. Finally, I show a few of the ways in which regression results can be presented to the audience, and discuss the strengths and weaknesses of each.

In the lab I show code for interactions, graphical presentations of results, and also allow for a recap of any topics participants feel we should cover again.

Day Readings
1
  • Moore, D. S., McCabe, G. P., & Craig, B. A. (2009). Introduction to the Practice of Statistics. 6th edition. New York: W. H. Freeman and Co. Chapters 5, 6, 7 and 8.
  • Fox, J. (2008). Applied Regression Analysis and Generalized Linear Models. 2nd edition. Thousand Oaks, CA: Sage. Chapter 2.

(NB: For Moore et al (2009), there is no need to read each of the chapters carefully. Please only focus on the topics that you feel you might need a brush up on. The rest of the topics can be merely skimmed. For Fox (2008), focus more on sections 1 and 3 of the chapter – even there, not the sophisticated terms, just the general ideas and logic of the procedure.)

2
  • Fox, J. (2008). Applied Regression Analysis and Generalized Linear Models. 2nd edition. Thousand Oaks, CA: Sage. Chapter 5.

Optional:

  • Lewis-Beck, Michael S. 1980. Applied Regression – An Introduction. Quantitative Applications in the Social Sciences Series, Vol. 22. London: Sage. Chapter 1 and 3 (only pp. 47-51).
  • Kutner, Michael H., Christopher J. Nachtsheim, John Neter, and William Li. 2005. Applied Linear Statistical Models, 5th edition. Boston: McGraw-Hill. Chapter 1.

3
  • Fox, J. (2008). Applied Regression Analysis and Generalized Linear Models. 2nd edition. Thousand Oaks, CA: Sage. Chapter 6.
  • Hardy, M. A. (1993). Regression with Dummy Variables. Quantitative Applications in the Social Sciences Series. London: Sage. Chapter 3.

Optional:

  • Lewis-Beck, Michael S. 1980. Applied Regression – An Introduction. Quantitative Applications in the Social Sciences Series, Vol. 22. London: Sage. Chapter 2 and 3 (only pp. 51-52 and 66-71).
  • Kutner, Michael H., Christopher J. Nachtsheim, John Neter, and William Li. 2005. Applied Linear Statistical Models, 5th edition. Boston: McGraw-Hill. Chapter 2.

4
  • Fox, J. (2008). Applied Regression Analysis and Generalized Linear Models. 2nd edition. Thousand Oaks, CA: Sage. Chapters 11 and 12.

Optional:

  • Berry, W. D. (1993). Understanding Regression Assumptions. Quantitative Applications in the Social Sciences Series. London: Sage. Chapter 5. [technical at times, but a thorough treatment]
  • Kutner, Michael H., Christopher J. Nachtsheim, John Neter, and William Li. 2005. Applied Linear Statistical Models, 5th edition. Boston: McGraw-Hill. Chapter 3.
  • King, G., & Roberts, M. E. (2015). “How Robust Standard Errors Expose Methodological Problems They Do Not Fix, and What to Do About It.” Political Analysis, 23(2), 159–179. [the paper is a bit technical in places, so consider skipping the more mathematical sections and focusing on sections 1, 6 and 7.]

5
  • Brambor, T., Clark, W. R., & Golder, M. (2006). Understanding Interaction Models: Improving Empirical Analyses. Political Analysis, 14(1), 63–82.
  • Gelman, A., Pasarica, C., & Dodhia, R. (2002). Let’s Practice What We Preach: Turning Tables into Graphs. The American Statistician, 56(2), 121–130.

Optional:

  • Jaccard, J., & Turrisi, R. (2003). Interaction Effects in Multiple Regression. Quantitative Applications in the Social Sciences Series. London: Sage. Chapter 2.

Software Requirements

R 3.4.2 or any newer version.

Stata 13.1 or any newer version.

RStudio 1.0.143 or any newer version.

Hardware Requirements

Any computer or laptop bought within the last 4-5 years should be sufficient. 2 GB of RAM and 200-300 MB of free space on the hard drive are enough for running the tasks we will attempt.

Literature

  1. Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied Multiple Regression / Correlation Analysis for the Behavioral Sciences. 3rd edition. Mahwah, NJ: Lawrence Erlbaum Associates. Chapters 2, 3, 4, 5, 6, 7, 8, 9, and 10. [a more advanced treatment of regression analysis]
  2. Fox, J. (1991). Regression Diagnostics. Quantitative Applications in the Social Sciences Series. London: Sage.
  3. Kam, C. D., & Franzese, R. J. (2007). Modeling and Interpreting Interactive Hypotheses in Regression Analysis. Ann Arbor, MI: University of Michigan Press. Chapters 3 and 4.
  4. Kutner, M. H., Nachtsheim, C. J., Neter, J., & Li, W. (2005). Applied Linear Statistical Models. 5th edition. New York: McGraw-Hill.
  5. Lewis-Beck, M. S. (1980). Applied Regression: An Introduction. Quantitative Applications in the Social Sciences Series. London: Sage. [another classic]
  6. Montgomery, D. C., Peck, E. A., & Vining, G. G. (2012). Introduction to Linear Regression Analysis. 5th edition. New York: Wiley. Chapters 2, 3, 4, 6, 9, 12, and 13 [the book is targeted at a more advanced audience, and is fairly formula-heavy]

For assistance with running regressions in R / Stata, please try the following books:

  1. Fox, J., & Weisberg, S. (2011). An R Companion to Applied Regression. 2nd edition. London: Sage.
  2. Hamilton, L. C. (2013). Statistics with Stata: Version 12. 8th edition. Boston, MA: Cengage.
  3. Rabe-Hesketh, S., & Everitt, B. (2004). A Handbook of Statistical Analyses Using Stata. 3rd edition. Boca Raton, FL: Chapman & Hall.

Recommended Courses to Cover Before this One

Summer School

  • Introduction to R
  • Introduction to Stata
  • Introduction to Inferential Statistics: What you need to know before you take regression

Winter School

  • Introduction to R
  • Introduction to Stata
  • Introduction to Statistics for Political and Social Scientists

Recommended Courses to Cover After this One

Summer School

  • Intro to GLM: Binary, Ordered and Multinomial Logistic, and Count Regression

Winter School

  • Interpreting Binary Logistic Regression Models


Additional Information

Disclaimer

This course description may be subject to subsequent adaptations (e.g. taking into account new developments in the field, participant demands, group size, etc). Registered participants will be informed at the time of change.

By registering for this course, you confirm that you possess the knowledge required to follow it. The instructor will not teach these prerequisite items. If in doubt, please contact us before registering.