Introduction to Quantitative Text Analysis

Kostas Gemenis

Max Planck Institute for the Study of Societies – MPIfG

Kostas Gemenis is Senior Researcher in Quantitative Methods at the Max Planck Institute for the Study of Societies.

His research interests include measurement in the social sciences, and content analysis with applications to estimating the policy positions of political actors.

He is currently involved in Preference Matcher, a consortium of researchers who collaborate in developing e-literacy tools designed to enhance voter education.


Course Dates and Times

Monday 25 February – Friday 1 March, 09:00–12:30
15 hours over five days

Prerequisite Knowledge

You should be familiar with basic statistical concepts such as measures of central tendency (mean, median), dispersion (standard deviation), tests of association (Pearson’s r) and inference (χ², t-test).

These materials are covered in the first few chapters of introductory statistics or data analysis textbooks. A useful example is Pollock P.H. III, The Essentials of Political Analysis, fourth edition (Washington, DC: CQ Press, 2012), Chapters 2, 3, 6, and 7.

Some familiarity with the R statistical software is also desirable but not necessary. In most of the classes we will use RStudio.

Short Outline

This course introduces you to the family of quantitative text analysis methods in the ‘content analysis’ tradition, using a variety of examples from political science and related disciplines.

We will cover basic aspects of content analysis, starting with manual content analysis and continuing with an introduction to some of the most popular approaches to computer-assisted text analysis.

You will learn practical aspects of text analysis, such as creating coding schemes, selecting documents, assessing inter-coder reliability, scaling, and validating the text analysis output.

The course comprises a mix of lectures and seminars, and will involve hands-on exercises, the majority of which will be completed by following, step-by-step, code provided in the R statistical software, so prior knowledge of R is not necessary.

You will get the opportunity to present your own project in class and receive constructive feedback.

Tasks for ECTS Credits

2 credits (pass/fail grade). Attend 90% of course hours and participate fully in in-class activities. Carry out the necessary reading and/or other work prior to, and after, class.

3 credits (to be graded). As above, plus complete assignments on Tuesday, Wednesday, and Thursday evenings, based on the methods illustrated during the seminars of the same days.

4 credits (to be graded). As above, plus complete a seated multiple-choice exam.

Long Course Outline

Most social science concepts are not directly observable, so text analysis can provide a useful method for measuring quantities of interest that would otherwise be difficult to estimate. By analysing the speeches of legislators, for example, we can classify them as charismatic, populist, authoritarian, liberal, and so on. Similarly, by analysing the content of newspaper editorials, we can infer whether the media in question were biased in favour of a particular candidate during an election campaign.

Text analysis is a specific type of content analysis, typically defined as a method whose goal is to summarise a body of information in the form of text, in order to make inferences about the actor behind this body of information. This implies that text analysis can be seen as a data reduction method since its goal is to reduce the text material to more manageable bits of information.

Text analysis can also be seen as a method for descriptive inference. Weber (1990, p. 9), for instance, defines content analysis as ‘a method that uses a set of procedures to make valid inferences from text’. The idea is that, by analysing the textual output of an actor, we can infer something about this actor. This conceptualisation of content and text analysis implies that we can use it as a tool for measurement in the social sciences. In this view of content analysis we are concerned with replicability and objectivity (Neuendorf 2002, pp. 10–15), and therefore we should distinguish text analysis from other approaches and methods such as discourse analysis, rhetorical analysis, constructivism, ethnography and so on.

This course will familiarise you with manual and computer-assisted text analysis. Following Krippendorff (2004) and Neuendorf (2002), it will introduce you to the basic concepts and building blocks in content analysis designs. We will address and discuss the following questions:

  • Coding scheme: What are the theoretical underpinnings of the coding scheme? How are the categories selected and operationalised? What are the coding units? How is coding performed? Is our coding scheme valid?
  • Selection of documents: What guides the selection of texts? Are texts sufficiently comparable? Are our documents valid and reliable indicators of the quantities of interest? How can we acquire and process text for computer-assisted text analysis?
  • Aggregation: Are texts coded by different coders? If so, how are their results aggregated? If not, how can we ensure inter-coder reliability? What statistical measures can be used to estimate inter-coder reliability?
  • Scaling: Are we estimating the quantities of interest directly? If not, how do we scale data in order to estimate the quantities of interest? Is our scaling valid and reliable?
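
As a rough illustration of the inter-coder reliability question above, one widely used measure is Krippendorff's alpha (Hayes and Krippendorff 2007). The sketch below is not course material: the course exercises use R, whereas this is plain Python, and the function name krippendorff_alpha_nominal and the toy data are hypothetical. It assumes nominal categories and computes alpha = 1 - D_o / D_e from the coincidence matrix, where D_o is the observed and D_e the expected disagreement.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal codes (illustrative sketch).

    `units` is a list of units; each unit is a list with one code per
    coder, using None where a coder did not code that unit.
    """
    # Drop missing codes; units with fewer than two codes cannot be paired.
    coded = [[v for v in unit if v is not None] for unit in units]
    coded = [unit for unit in coded if len(unit) >= 2]

    # Coincidence matrix: every ordered pair of codes within a unit,
    # weighted by 1 / (number of codes in that unit - 1).
    coincidences = Counter()
    for unit in coded:
        weight = 1.0 / (len(unit) - 1)
        for i, j in permutations(range(len(unit)), 2):
            coincidences[(unit[i], unit[j])] += weight

    # Marginal totals per category and the overall number of pairable codes.
    totals = Counter()
    for (c, _), w in coincidences.items():
        totals[c] += w
    if len(totals) < 2:
        raise ValueError("alpha is undefined with fewer than two observed categories")
    n = sum(totals.values())

    # Nominal metric: any pair of different categories counts as disagreement.
    observed = sum(w for (c, k), w in coincidences.items() if c != k)
    expected = sum(
        totals[c] * totals[k] for c in totals for k in totals if c != k
    ) / (n - 1)
    return 1.0 - observed / expected


if __name__ == "__main__":
    # Hypothetical toy data: four speeches coded by three coders.
    toy = [
        ["populist", "populist", "populist"],
        ["liberal", "liberal", None],
        ["populist", "liberal", "populist"],
        ["liberal", "liberal", "liberal"],
    ]
    print(krippendorff_alpha_nominal(toy))
```

A value of 1 indicates perfect agreement, while values near 0 indicate agreement no better than chance. Ordinal or interval codes would need a different distance function in place of the simple "different category" rule used here.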

For manual text analysis, the course will also look at the often-overlooked distinction between the analysis of manifest content and judgemental coding. For computer-assisted text analysis, it will introduce you to a variety of popular methods, such as the use of content analysis dictionaries (including sentiment analysis), scaling methods (Wordscores, Wordfish), and supervised and unsupervised learning approaches (including topic models). We will discuss the relationship between reliability and validity, illustrate methods for estimating inter-coder reliability, and explore the links between manual and computer-assisted text analysis in terms of validation and training of supervised classification methods.
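
To give a flavour of the dictionary approach mentioned above, here is a minimal sketch of dictionary-based sentiment scoring. It is an illustration only and not the course's own code (the exercises use R). The word lists POSITIVE and NEGATIVE are hypothetical toy entries rather than an established dictionary, and the scoring rule (positive minus negative matches, divided by the number of tokens) is one simple choice among several.

```python
import re
from collections import Counter

# Hypothetical toy dictionary; real projects would use a validated one.
POSITIVE = {"good", "growth", "success", "support"}
NEGATIVE = {"bad", "crisis", "failure", "oppose"}


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def sentiment_score(text):
    """Return (positive - negative matches) / total tokens, or 0.0 if empty."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    positive = sum(counts[word] for word in POSITIVE)
    negative = sum(counts[word] for word in NEGATIVE)
    return (positive - negative) / len(tokens)


if __name__ == "__main__":
    print(sentiment_score("The budget is a success despite the crisis"))
```

Scores produced this way still need validation, for instance against manually coded texts, before they are treated as measures of tone.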

The course will be taught via a mix of lectures, seminars and hands-on exercises. The examples, which illustrate the promises as well as the pitfalls of content analysis, cover various applications across the social sciences (e.g. sentiment analysis of the press, frame analysis of social movements, estimating the positions of political actors, agenda-setting in the EU). Most of the exercises involve following, step by step, code provided in the R statistical software, so previous knowledge of R is not necessary. In most of the seminars we will use RStudio.

You will also have the opportunity to present your own project in class and receive constructive feedback.

Day 1
Topic: Introduction and manual coding of text
Details: Inter-coder reliability

Lecture (90 mins)

  • Brief presentation of participants and their research projects
  • Key concepts in content analysis
  • Best practices for defining a coding scheme, selecting the appropriate documents, coding the documents, and scaling the coded data
  • Designing a manual content analysis project
  • Latent coding and crowdsourcing

Seminar (90 mins)

  • Reliability and validity and their relationship to measurement error
  • Estimating inter-coder reliability using Krippendorff's alpha

Day 2
Topic: Document pre-processing and dictionary methods
Details: Sentiment analysis in R

Lecture (90 mins)

  • The promises of computer-assisted content analysis (and four rules for good practice)
  • Selecting, cleaning and formatting documents
  • Computer-assisted content analysis and dictionary construction

Seminar (90 mins)

  • Illustration of dictionary use in sentiment analysis
  • Validating sentiment analysis
  • Effective data visualisation and inference in content analysis

Day 3
Topic: Scaling methods in text analysis
Details: Wordscores and Wordfish in R

Lecture (90 mins)

  • Scaling models in text analysis and their assumptions
  • Supervised method: Wordscores
  • Unsupervised method: Wordfish

Seminar (90 mins)

  • Illustration of Wordscores and Wordfish
  • Validating scaling methods

Day 4
Topic: Supervised classification methods
Details: Machine/statistical learning in R

Lecture (90 mins)

  • Classification models in text analysis
  • Supervised methods
  • Evaluation metrics

Seminar (90 mins)

  • Illustration of supervised machine/statistical learning methods

Day 5
Topic: Unsupervised classification methods
Details: Topic models in R

Lecture (90 mins)

  • Topic models in text analysis
  • Comparisons and trade-offs in content analysis

Seminar (90 mins)

  • Illustration of LDA
  • Validating topic models

Readings by Day

Day 1: Hayes and Krippendorff (2007), Krippendorff (2004), Neuendorf (2002), optional: Benoit et al. (2015), Gemenis (2015)

Day 2: Grimmer and Stewart (2013), Laver and Garry (2000), Young and Soroka (2012)

Day 3: Grimmer and Stewart (2013), Laver et al. (2003), Slapin and Proksch (2008), Bruinsma and Gemenis (2017)

Day 4: Grimmer and Stewart (2013)

Day 5: Grimmer and Stewart (2013), Hopkins and King (2010), Van der Zwaan et al. (2016)

Software Requirements

R and RStudio

Yoshikoder, Lexicoder, and JFreq (free software downloads)

Hardware Requirements



References

Benoit, Kenneth, Drew Conway, Benjamin E. Lauderdale, Michael Laver, and Slava Mikhaylov (2015)
Crowd-sourced text analysis: reproducible and agile production of political data
American Political Science Review 110: 278–295

Bruinsma, Bastiaan, and Kostas Gemenis (2017)
Validating Wordscores

Gemenis, Kostas (2015)
An iterative expert survey approach for estimating parties’ policy positions
Quality & Quantity 49: 2291–2306

Grimmer, Justin, and Brandon M. Stewart (2013)
Text as data: The promise and pitfalls of automatic content analysis methods for political texts
Political Analysis 21: 267–297

Hayes, Andrew F., and Klaus Krippendorff (2007)
Answering the call for a standard reliability measure for coding data
Communication Methods and Measures 1: 77–89

Hopkins, Daniel J., and Gary King (2010)
A method of automated nonparametric content analysis for social science
American Journal of Political Science 54: 229–247

Krippendorff, Klaus (2004)
Content analysis: An introduction to its methodology, second edition
Thousand Oaks, CA: Sage, Chapters 5 (unitizing) and 7 (coding)

Laver, Michael, Kenneth Benoit, and John Garry (2003)
Extracting policy positions from political texts using words as data
American Political Science Review 97: 311–331

Laver, Michael, and John Garry (2000)
Estimating policy positions from political texts
American Journal of Political Science 44: 619–634

Neuendorf, Kimberly A. (2002)
The Content Analysis Guidebook
Thousand Oaks, CA: Sage, Chapter 1 (defining content analysis)

Slapin, Jonathan B., and Sven-Oliver Proksch (2008)
A scaling model for estimating time-series party positions from texts
American Journal of Political Science 52: 705–722

van der Zwaan, J. M., M. Marx, and J. Kamps (2016)
Validating Cross-Perspective Topic Modeling for Extracting Political Parties' Positions from Parliamentary Proceedings
ECAI (pp. 28–36)

Young, Lori, and Stuart Soroka (2012)
Affective news: The automated coding of sentiment in political texts
Political Communication 29: 205–231

Recommended Courses to Cover Before this One

Summer School
R Basics
Introduction to Inferential Statistics: What you need to know before you take regression

Winter School
Automated Web Data Collection with R
Introduction to R (entry level)

Recommended Courses to Cover After this One

Summer School
Python Programming for Social Scientists: Big Data, Web Scraping and Other Useful Programming Tricks
Automated Collection of Web and Social Data
Big Data Analysis in the Social Sciences

Winter School
Python Programming for Social Sciences: Collecting, Analysing and Presenting Social Media Data

Additional Information


This course description may be subject to subsequent adaptations (e.g. taking into account new developments in the field, participant demands, group size, etc). Registered participants will be informed in due time.

Note from the Academic Conveners

By registering for this course, you confirm that you possess the knowledge required to follow it. The instructor will not teach these prerequisite items. If in doubt, contact the instructor before registering.