ECPR


Linear Regression with R/Stata: Estimation, Interpretation and Presentation

Constantin Manuel Bosancianu
manuel.bosancianu@outlook.com

Central European University

Constantin Manuel Bosancianu is a postdoctoral researcher in the Institutions and Political Inequality unit at Wissenschaftszentrum Berlin.

His work focuses on the intersection of political economy and electoral behaviour: how to measure political inequalities between citizens of developed and developing countries, and what the linkages between political and economic inequalities are.

He is interested in statistics, data visualisation, and the history of Leftist parties. Occasionally, he teaches methods workshops on regression, multilevel modelling, or R.

  @cmbosancianu

Course Dates and Times

Monday 25 February – Friday 1 March, 14:00–17:30 (finishing slightly earlier on Friday)
15 hours over 5 days

Prerequisite Knowledge

You should have a thorough understanding of basic statistical concepts such as mean, median, variance, standard deviation and standard error.

You should also be familiar with very basic statistical tests and analyses, such as t tests and ANOVA, at least at a theoretical level.

The class will be carried out primarily in R, but with Stata examples and scripts. You should have basic knowledge of at least one of these software packages: reading in data, basic data recoding, and very basic plotting commands, as well as basic familiarity with working with syntax files.


Short Outline

This course will teach you the rigorous application of linear regression models. We will estimate these models, interpret their results and judge how well the models fit the data.

We will gradually explore more complex specifications, learning how to deal with dichotomous predictors and interactions. We will also focus on the assumptions on which OLS models are based, how to check for them in the data at hand, and how to handle situations when they are not met.

Throughout the course, we emphasise presenting results as intuitively as possible, either through graphs or predicted values. This format should serve those interested in a thorough coverage of linear models, for immediate use and as a stepping stone to more advanced statistical procedures.

The class will be conducted primarily in R, but I will also share Stata code for all procedures and models.

Tasks for ECTS Credits

2 credits (pass/fail grade) Attend at least 90% of course hours, participate fully in in-class activities, and carry out the necessary reading and/or other work prior to, and after, class.

3 credits (to be graded) As above, plus complete a take-home assignment, sent out on Tuesday 26 February, with a Thursday afternoon deadline. I will provide you with some data, along with a few model specifications which need to be estimated with this data. I expect you to interpret a few coefficients from these models, and their uncertainty, and to make a few qualitative decisions as to which model is best and what recommendations you can make based on the results.

4 credits (to be graded) As above, plus complete a final paper resembling a conference paper, with the exception that the literature review section can be just 2–3 paragraphs where you present the puzzle. Identify a few hypotheses you are interested in testing, and test them based on data of your choosing.

The main parts I am interested in are the variable description (not more than a couple of pages), the analyses, and the interpretation of the results. You will be assessed on:

  • how well you have interpreted coefficients and the model fit
  • whether you have explored potential problems with these models (in terms of assumption violations)
  • how well you were able to correct said problems.

The deadline for this assignment will be 15–20 days after the end of the Winter School. For both assignments, I will provide more details about the tasks and requirements during the course itself.


Long Course Outline

It is frequently quipped that regression is the most used and abused method in the social sciences. My goal in this class is to expose you to the abuse-free application of linear regression models to social science data.

By the end of the course, you will have all the theoretical and practical skills required to responsibly apply multivariate linear regression to a variety of data configurations. This includes estimating multiple model specifications in R or Stata, presenting results in tables or in graphical format, and interpreting the coefficients for the reader. It also implies assessing the appropriateness of OLS regression for certain kinds of data distributions, and learning to make suitable corrections and adjustments when there is a mismatch between model requirements and data characteristics.

The course should appeal most to those who have had statistical training at an introductory level as part of their undergraduate studies, and now wish to deepen it with a rigorous coverage of linear models. Although we will not be computing quantities by hand, there will be a few simple formulas as part of the lectures. In this sense, the class is also suitable for those of you who have briefly encountered OLS regression as part of a statistics class, but now wish to better understand how it works, where it breaks down, and how it can be applied in a thorough way. Because of the constant focus on the application of linear models, the class is unsuitable for those who want an introductory course in general statistics. During one of the sessions we briefly cover some basic statistical concepts and tests, but this is only so that we can all delve into the topic of linear models on an equal footing. This cannot be considered a substitute for a good coverage of introductory statistics.

Day 1
We start with a condensed review of some fundamental concepts in basic statistics: the z and t distributions, hypothesis testing, confidence intervals, and correlation. This overview is intended to provide a solid foundation from which to advance in the following days. We begin to discuss a few basics of regression, such as how it goes beyond correlation, and for what types of questions it is helpful. In the lab session, we will go through a few of the basic data manipulation procedures commonly required before running any regression: data cleaning and recoding, transformations of data, etc. This is a good opportunity for you to get familiar, if need be, with working with syntax files in R and Stata, and with the Stata or RStudio interfaces.
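As a flavour of the Day 1 lab, the base-R sketch below shows the kind of cleaning and recoding steps meant here. The data frame and variable names are made up for illustration; the course materials will use their own data.

```r
# Hypothetical data frame with a missing value and a character variable
df <- data.frame(age  = c(24, 31, NA, 45),
                 educ = c("primary", "secondary", "tertiary", "secondary"))

# Drop rows with missing values
df_clean <- na.omit(df)

# Recode the character variable into a factor (useful for dummy coding later)
df_clean$educ <- factor(df_clean$educ,
                        levels = c("primary", "secondary", "tertiary"))

# Center a continuous predictor around its mean
df_clean$age_c <- df_clean$age - mean(df_clean$age)
```

The equivalent Stata steps (`drop if missing()`, `encode`, `generate`) will be shown in the course scripts.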

Day 2
We delve fully into the fundamentals of Ordinary Least Squares (OLS) regression: how the estimation is carried out, and how we interpret the coefficients for simple (one predictor) and multiple regression (two or more predictors). I will present some basic formulas, but the goal will be to gain an intuitive understanding of how the estimation process functions, and what the results mean. In the lab session we put this newly-gained knowledge to the test, by running a few examples of linear models in R or Stata. We will interpret the output and the model fit, and generate predictions based on the model, to present effects in an intuitive way.
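A minimal sketch of the kind of model run in the Day 2 lab, using R's built-in `mtcars` data; the variable choice is illustrative, not taken from the course materials.

```r
# Multiple regression: miles per gallon on weight and horsepower
fit <- lm(mpg ~ wt + hp, data = mtcars)

summary(fit)   # coefficients, standard errors, R-squared

# Predicted values for chosen predictor profiles: an intuitive way
# to present effect sizes, rather than reading raw coefficients
new_cars <- data.frame(wt = c(2.5, 3.5), hp = mean(mtcars$hp))
predict(fit, newdata = new_cars, interval = "confidence")
```

Holding `hp` at its mean while varying `wt` isolates the contribution of one predictor, which is the logic behind the prediction-based presentations used throughout the course.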

Day 3
We advance in our understanding of OLS by tackling the uncertainty of estimates, as well as a model specification that appears in almost all empirical research: dummy indicators (categorical predictors). We will discuss where estimated uncertainty comes from, what influences it, how it affects your results, and how you ought to communicate it to your audience. In the lab component we learn how to run these model specifications with R/Stata.
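The Day 3 ideas can be previewed in a short base-R sketch; again, the `mtcars` variables are illustrative.

```r
# Treat number of cylinders as a categorical (dummy) predictor;
# the first level (4 cylinders) becomes the reference category
mtcars$cyl_f <- factor(mtcars$cyl)
fit <- lm(mpg ~ cyl_f + wt, data = mtcars)

summary(fit)   # cyl_f6, cyl_f8: differences from the reference group
confint(fit)   # 95% confidence intervals convey estimate uncertainty
</br>
```

Each dummy coefficient is read as the expected difference from the omitted reference category, holding the other predictors constant.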

Day 4
This day is devoted to preventing abuse in the estimation of linear models. As with the vast majority of statistical procedures, a series of assumptions underpins OLS regression. If these are not met, we have little reason to put our faith in the results we obtain. In this session we go over these assumptions, how they influence the results when they are violated, and what strategies we have for overcoming such situations. In the lab we turn to these issues from a practical perspective. We run a test regression in R, assess whether the assumptions are met, and correct for the assumption violations that exist. Through a step-by-step process, you will see how your estimates and model fit change when engaging in such a process.
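Base R already ships with the standard diagnostic displays used in the Day 4 lab; the sketch below shows them on illustrative data, together with one common remedy.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Four standard diagnostic plots: residuals vs fitted values,
# normal Q-Q, scale-location, and residuals vs leverage
par(mfrow = c(2, 2))
plot(fit)

# A log transformation of the outcome is one common remedy when the
# diagnostics suggest nonlinearity or non-constant error variance
fit_log <- lm(log(mpg) ~ wt + hp, data = mtcars)
summary(fit_log)
```

Comparing `fit` and `fit_log` side by side is exactly the kind of step-by-step before/after exercise described above.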

Day 5
We recap the most important elements in the regression framework, based on participants’ needs and requests. We also delve into the various ways regression results can be presented, depending on the audience: quantitative scholars, policymakers, or a general readership. I conclude with a presentation on how interactions can considerably expand the range of hypotheses you can test with regression. In the lab I will present code for producing tables and graphs with regression estimates, and will demonstrate how an interaction effect between predictors can be set up, estimated, and evaluated.
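A sketch of the interaction setup covered on Day 5, with illustrative `mtcars` variables: the effect of one predictor is allowed to vary across the levels of another.

```r
# Continuous-by-categorical interaction: weight x transmission type
mtcars$am_f <- factor(mtcars$am, labels = c("automatic", "manual"))
fit <- lm(mpg ~ wt * am_f, data = mtcars)   # main effects + interaction

# The slope of wt now differs by transmission type; present this
# with predictions over a grid rather than raw coefficients
grid <- expand.grid(wt = seq(2, 4, by = 0.5),
                    am_f = levels(mtcars$am_f))
grid$pred <- predict(fit, newdata = grid)
grid
```

Plotting `pred` against `wt`, with one line per group, is the graphical presentation of interactions discussed in the session.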

Throughout the course, we focus on graphical methods and intuitive quantities when presenting results from linear models. I emphasise graphs over tables, predicted values and uncertainty around them, rather than coefficients and standard errors. While tables of coefficients are still the dominant way of presenting results in academic journals, graphs and predicted values tend to be preferred in reports and analyses for larger, non-technical audiences. I believe strongly that you should be familiar with both types, and should tailor the delivery of your results to the audience.

Day Topic Details
Day 1 From correlation to regression: revisiting the basics

We cover a few foundational concepts in statistics: correlation, standard error, t test, t and z distributions. We also make our first forays into the regression setup.

In the lab part, we get familiar with R or Stata, and try a few basic data manipulation and transformation tasks. All of these tasks habitually have to be performed before running a regression.

Day 2 OLS fundamentals and graphical displays: coefficients and model fit.

We go through the estimation of OLS models and the interpretation of coefficients for simple and multiple regression.

In the lab session, we run a few regressions in R or Stata, based on the code supplied by the instructor, and go through interpreting coefficients and measures of model fit once more. We also introduce a way of presenting effect sizes based on predictions, using the model estimates.

Day 3 Dummy variables and uncertainty of estimates.

We discuss slightly more complex model specifications which include dummy variables. The bulk of the class, though, is devoted to understanding and interpreting uncertainty in our regression estimates.

In the lab, we go through additional regression models, which involve dummies. Most of our empirical efforts, though, will be allocated to understanding where uncertainty in estimates comes from, how we can minimize it, and how we can responsibly present it to the audience.

Day 4 Regression assumptions: violations and remedies.

This session covers the assumptions underpinning OLS regression, what the implications of assumption violations are, and how to correct for them.

The lab session will offer practical strategies of identifying assumption violations, and overcoming some of them through data transformations. We also see how estimates and model fit statistics change when correcting for some of these violations.

Day 5 Recap, multiplicative interactions in regression, and graphical presentations

In this last session, we review a few of the most important ideas covered in the past four days, based on participants’ requests. I also introduce a way to test more sophisticated hypotheses, about how effects of a predictor vary, through the use of interactions. Finally, I show a few of the ways in which regression results can be presented to the audience, and discuss the strengths and weaknesses of each.

In the lab I show code for interactions, graphical presentations of results, and also allow for a recap of any topics participants feel we should cover again.

Day Readings
Day 1
  • Moore, D. S., McCabe, G. P., & Craig, B. A. (2009). Introduction to the Practice of Statistics. 6th edition. New York: W. H. Freeman and Co. Chapters 5, 6, 7 and 8.
  • Fox, J. (2008). Applied Regression Analysis and Generalized Linear Models. 2nd edition. Thousand Oaks, CA: Sage. Chapter 2.

(NB: For Moore et al (2009), there is no need to read each of the chapters carefully. Please only focus on the topics that you feel you might need a brush up on. The rest of the topics can be merely skimmed. If the 4 chapters from Moore et al. (2009) seem too intimidating, please at least check the 2 chapters from Field et al. (2012) below. For Fox (2008), focus more on sections 1 and 3 of the chapter – even there, not the sophisticated terms, just the general ideas and logic of the procedure.)

Optional:

  • Field, Andy, Jeremy Miles, and Zoë Field. 2012. Discovering Statistics Using R. London: Sage Publications. Chapters 2 and 3.
Day 2
  • Fox, J. (2008). Applied Regression Analysis and Generalized Linear Models. 2nd edition. Thousand Oaks, CA: Sage. Chapter 5.
  • [if Fox seems too intricate] Field, Andy, Jeremy Miles, and Zoë Field. 2012. Discovering Statistics Using R. London: Sage Publications. Chapter 7 (sections 1 and 2, but without 7.2.4; section 7, but without 7.6.3 and 7.6.4).

Optional:

  • Lewis-Beck, Michael S. 1980. Applied Regression – An Introduction. Quantitative Applications in the Social Sciences Series, Vol. 22. London: Sage. Chapters 1 and 3 (only pp. 47-51).
  • Kutner, Michael H., Christopher J. Nachtsheim, John Neter, and William Li. 2005. Applied Linear Statistical Models, 5th edition. Boston: McGraw-Hill. Chapter 1.
Day 3
  • Fox, J. (2008). Applied Regression Analysis and Generalized Linear Models. 2nd edition. Thousand Oaks, CA: Sage. Chapter 6.
  • Hardy, M. A. (1993). Regression with Dummy Variables. Quantitative Applications in the Social Sciences Series. London: Sage. Chapter 3.
  • [if Fox seems too intricate] Field, Andy, Jeremy Miles, and Zoë Field. 2012. Discovering Statistics Using R. London: Sage Publications. Chapter 7 (section 7.2.4; sections 7.4, 7.5, 7.8, 7.11, 7.12).

Optional: 

  • Lewis-Beck, Michael S. 1980. Applied Regression – An Introduction. Quantitative Applications in the Social Sciences Series, Vol. 22. London: Sage. Chapters 2 and 3 (only pp. 51-52 and 66-71).
  • Kutner, Michael H., Christopher J. Nachtsheim, John Neter, and William Li. 2005. Applied Linear Statistical Models, 5th edition. Boston: McGraw-Hill. Chapter 2.
Day 4
  • Fox, J. (2008). Applied Regression Analysis and Generalized Linear Models. 2nd edition. Thousand Oaks, CA: Sage. Chapters 11 and 12.
  • [if Fox seems too intricate] Field, Andy, Jeremy Miles, and Zoë Field. 2012. Discovering Statistics Using R. London: Sage Publications. Chapter 7 (sections 7.7 and 7.9).

Optional:

  • Berry, W. D. (1993). Understanding Regression Assumptions. Quantitative Applications in the Social Sciences Series. London: Sage. Chapter 5. [technical at times, but a thorough treatment]
  • Kutner, Michael H., Christopher J. Nachtsheim, John Neter, and William Li. 2005. Applied Linear Statistical Models, 5th edition. Boston: McGraw-Hill. Chapter 3.
  • King, G., & Roberts, M. E. (2015). “How Robust Standard Errors Expose Methodological Problems They Do Not Fix, and What to Do About It.” Political Analysis, 23(2), 159–179. [the paper is a bit technical in some of the parts, so maybe skip those more mathematical sections, and try going for sections 1, 6 and 7.]
Day 5
  • Brambor, T., Clark, W. R., & Golder, M. (2006). Understanding Interaction Models: Improving Empirical Analyses. Political Analysis, 14(1), 63–82.
  • Gelman, A., Pasarica, C., & Dodhia, R. (2002). Let’s Practice What We Preach: Turning Tables into Graphs. The American Statistician, 56(2), 121–130.

Optional:

  • Jaccard, J., & Turrisi, R. (2003). Interaction Effects in Multiple Regression. Quantitative Applications in the Social Sciences Series. London: Sage. Chapter 2.

Two primary textbooks are assigned for this course: John Fox’s book, as a more rigorous yet demanding text, and Andy Field and co-authors’ book, as a backup for the situations when Fox’s discussion seems too advanced. I would recommend that you proceed by using Fox’s book, and then if the discussion in certain sections seems too complex, to use Field et al.’s book for clarifications.

Software Requirements

R 3.5.2 or any newer version.

Stata 13.1 or any newer version.

RStudio 1.2.747 or any newer version.

Hardware Requirements

Any computer or laptop bought within the last 4–5 years should be sufficient. 2 GB of RAM and 200–300 MB of free space on the hard drive are enough for the tasks we will attempt.

Literature

I have tried, as much as possible, to assign chapters from the same textbook, so as to minimize disruptions in logic and in the way the topics are approached. However, if you encounter difficulties in tracking down the literature above, please try the sources below as well. Some are more advanced, though, and present the topics in a more mathematical way.

  1. Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied Multiple Regression / Correlation Analysis for the Behavioral Sciences. 3rd edition. Mahwah, NJ: Lawrence Erlbaum Associates. Chapters 2, 3, 4, 5, 6, 7, 8, 9, and 10. [a more advanced treatment of regression analysis]
  2. Fox, J. (1991). Regression Diagnostics. Quantitative Applications in the Social Sciences Series. London: Sage.
  3. Kam, C. D., & Franzese, R. J. (2007). Modeling and Interpreting Interactive Hypotheses in Regression Analysis. Ann Arbor, MI: University of Michigan Press. Chapters 3 and 4.
  4. Kutner, M. H., Nachtsheim, C. J., Neter, J., & Li, W. (2005). Applied Linear Statistical Models. 5th edition. New York: McGraw-Hill.
  5. Lewis-Beck, M. S. (1980). Applied Regression: An Introduction. Quantitative Applications in the Social Sciences Series. London: Sage. [another classic]
  6. Montgomery, D. C., Peck, E. A., & Vining, G. G. (2012). Introduction to Linear Regression Analysis. 5th edition. New York: Wiley. Chapters 2, 3, 4, 6, 9, 12, and 13 [the book is targeted at a more advanced audience, and is fairly formula-heavy]

For assistance with running regressions in R/Stata, please try the following books:

  1. Fox, J., & Weisberg, S. (2011). An R Companion to Applied Regression. 2nd edition. London: Sage.
  2. Hamilton, L. C. (2013). Statistics with Stata: Version 12. 8th edition. Boston, MA: Cengage.
  3. Rabe-Hesketh, S., & Everitt, B. (2004). A Handbook of Statistical Analyses Using Stata. 3rd edition. Boca Raton, FL: Chapman & Hall.

Recommended Courses to Cover Before this One

Summer School

Introduction to R
Introduction to Stata
Introduction to Inferential Statistics: What you need to know before you take regression

Winter School

Introduction to R
Introduction to Stata
Introduction to Statistics for Political and Social Scientists

Recommended Courses to Cover After this One

Summer School

Intro to GLM: Binary, Ordered and Multinomial Logistic, and Count Regression

Winter School

Interpreting Binary Logistic Regression Models


Additional Information

Disclaimer

This course description may be subject to subsequent adaptations (e.g. taking into account new developments in the field, participant demands, group size, etc). Registered participants will be informed in due time.

Note from the Academic Conveners

By registering for this course, you confirm that you possess the knowledge required to follow it. The instructor will not teach these prerequisite items. If in doubt, contact the instructor before registering.