Monday 5 – Friday 9 August
09:00–10:30 / 11:00–12:30
The topics of this advanced course on quantitative text analysis range from machine learning algorithms such as Random Forest, through topic models, seeded topic models, LSS, and Newsmap, to word embeddings.
You will also learn how to automatically derive extra information from syntactic structures in the texts.
The course will end with an interactive discussion of participants’ research projects and of how to develop your own text analysis tools in R.
2 credits (pass/fail grade). Attend 90% of course hours and participate fully in in-class activities. Carry out the necessary reading and/or other work prior to, and after, class.
3 credits (to be graded). As above, plus complete daily assignments based on the methods illustrated during the seminars.
4 credits (to be graded). As above, plus write a text analysis function for R.
Lisa Lechner is Assistant Professor for methods and methodology in political science at the University of Innsbruck.
In her research, Lisa studies international treaties such as trade agreements, bilateral tax treaties, and environmental agreements, as well as national and international jurisdictions, using inferential network analysis and quantitative text analysis.
Kohei Watanabe is an assistant professor at the Department of Political Science / Center for Digital Science at the University of Innsbruck.
He holds an MA from CEU, and studied for his PhD at the London School of Economics and Political Science.
Kohei develops quanteda, an R package for quantitative text analysis, and uses it to research international and political communication.
This course revisits topics from Introduction to Quantitative Text Analysis but goes much deeper into the theoretical and technological foundations of quantitative text analysis, so that you can develop complex analytic pipelines in your research projects.
Day 1
The lecture will advance your knowledge of supervised and unsupervised methods, focusing on their strengths and weaknesses. We will cover Wordscores and naïve Bayes classifiers, Random Forest, latent Dirichlet allocation (LDA), and the Structural Topic Model (STM). Wordscores and naïve Bayes classifiers are simple supervised algorithms for document scaling and document classification, respectively. Random Forest can be used for both purposes, but its algorithm is more sophisticated. LDA and STM are both unsupervised algorithms for topic classification, but STM can also take document-level variables into account. We will learn how to apply these models in the seminar.
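To make the idea of a simple supervised classifier concrete, here is a minimal sketch of a multinomial naïve Bayes classifier with add-one smoothing. It is written in plain Python for illustration only and is not the course's R material; the function names `train_naive_bayes` and `predict` and the toy documents are assumptions made up for this example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels, alpha=1.0):
    """Fit a multinomial naive Bayes model on tokenised documents.

    docs: list of token lists; labels: one class label per document.
    alpha: additive (Laplace) smoothing constant.
    """
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)

    model = {}
    for label, n_class_docs in class_counts.items():
        total = sum(word_counts[label].values())
        denom = total + alpha * len(vocab)
        model[label] = {
            # log P(class)
            "log_prior": math.log(n_class_docs / len(docs)),
            # log P(word | class), smoothed so unseen words are not zero
            "log_lik": {
                w: math.log((word_counts[label][w] + alpha) / denom)
                for w in vocab
            },
        }
    return model


def predict(model, tokens):
    """Return the class with the highest posterior log score."""
    scores = {}
    for label, params in model.items():
        score = params["log_prior"]
        for w in tokens:
            # Words outside the training vocabulary are ignored.
            if w in params["log_lik"]:
                score += params["log_lik"][w]
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy training data.
train_docs = [
    "tax cut budget deficit".split(),
    "budget spending tax reform".split(),
    "climate emissions carbon treaty".split(),
    "carbon emissions renewable energy".split(),
]
train_labels = ["economy", "economy", "environment", "environment"]

model = train_naive_bayes(train_docs, train_labels)
print(predict(model, "carbon emissions treaty".split()))
```

The classifier only needs word counts per class, which is why it is considered a simple baseline; the price is the assumption that words occur independently of each other given the class.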
Day 2
We will explain a seeded LDA model as well as LSS and Newsmap, which offer a compromise between the strengths and weaknesses of supervised and unsupervised methods. These models rely on exemplary words (seed words) as supervision to perform document scaling or classification tasks. Semi-supervised models can be used for purposes similar to those of both supervised and unsupervised models, but training them demands special attention to the semantics of the seed words. We will learn how to use these models in the seminar.
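The role of seed words as weak supervision can be sketched as follows. This is a deliberately simplified illustration in plain Python, not the seeded LDA, LSS, or Newsmap algorithms themselves: seed words assign provisional labels to some documents, word scores are learned from those documents, and the scores are then used to classify documents that contain no seed word at all. All names (`seed_label`, `learn_word_scores`, `classify`) and the toy data are assumptions made up for this example.

```python
import math
from collections import Counter


def seed_label(tokens, seeds):
    """Return the class whose seed words occur most often, or None
    if no seed word occurs or two classes tie."""
    hits = {c: sum(t in words for t in tokens) for c, words in seeds.items()}
    best = max(hits, key=hits.get)
    top = hits[best]
    if top == 0 or list(hits.values()).count(top) > 1:
        return None
    return best


def learn_word_scores(docs, seeds, alpha=1.0):
    """Learn a smoothed log-likelihood ratio for every word and class,
    using only the documents that the seed words could label."""
    counts = {c: Counter() for c in seeds}
    for tokens in docs:
        label = seed_label(tokens, seeds)
        if label is not None:
            counts[label].update(tokens)

    vocab = set().union(*counts.values())
    scores = {}
    for c in counts:
        in_total = sum(counts[c].values())
        out_total = sum(sum(counts[o].values()) for o in counts if o != c)
        scores[c] = {}
        for w in vocab:
            in_count = counts[c][w]
            out_count = sum(counts[o][w] for o in counts if o != c)
            p_in = (in_count + alpha) / (in_total + alpha * len(vocab))
            p_out = (out_count + alpha) / (out_total + alpha * len(vocab))
            scores[c][w] = math.log(p_in / p_out)
    return scores


def classify(tokens, scores):
    """Sum the learned word scores per class and return the best class."""
    totals = {c: sum(ws.get(t, 0.0) for t in tokens) for c, ws in scores.items()}
    return max(totals, key=totals.get)


# Hypothetical seed dictionary and toy corpus.
seeds = {
    "economy": {"tax", "budget"},
    "environment": {"climate", "emissions"},
}
docs = [
    "tax reform raises inflation".split(),
    "budget deficit and inflation".split(),
    "climate summit targets pollution".split(),
    "emissions and pollution rise".split(),
]

scores = learn_word_scores(docs, seeds)
# This document contains no seed word, but "inflation" was learned
# from the seed-labelled economy documents.
print(classify("inflation worries investors".split(), scores))
```

The sketch also shows why the semantics of seed words matter: every learned word score is inherited from the provisional labels, so an ambiguous seed word would propagate its ambiguity to the whole vocabulary.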
Day 3
We discuss the word-embedding technique, which helps us accurately estimate the semantic proximity of words in a large corpus. Although there are few applications of this technique in political science research, the recently developed models Word2vec and GloVe have drawn the attention of many quantitative text analysts to it. We explore its potential in the seminar.
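Semantic proximity between words is usually measured as the cosine similarity of their vectors. The sketch below, in plain Python, uses raw co-occurrence counts within a small context window as word vectors. This is a much simpler stand-in for the dense vectors that Word2vec or GloVe learn, chosen only to show the underlying distributional idea; the function names and the toy sentences are assumptions made up for this example.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """Represent each word by counts of the words that appear
    within `window` positions of it."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus.
sentences = [
    "the minister signed the treaty".split(),
    "the president signed the agreement".split(),
    "the minister ratified the agreement".split(),
    "the president ratified the treaty".split(),
    "farmers harvest the wheat".split(),
]

vectors = cooccurrence_vectors(sentences)
sim_related = cosine(vectors["treaty"], vectors["agreement"])
sim_unrelated = cosine(vectors["treaty"], vectors["wheat"])
print(sim_related > sim_unrelated)
```

In this toy corpus "treaty" and "agreement" appear in the same contexts, so their vectors point in nearly the same direction, while "wheat" shares only the function word "the" with them.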
Day 4
In the lecture, you will learn how to derive extra information from syntactic structures in texts and how to use that information to perform fine-tuned analysis. In the seminar, we apply a syntactic parser, which recognises the parts of speech of words and the dependencies between them, to improve text pre-processing. We also apply a geographical parser (geoparser.io), which combines a syntactic parser with a geographic database, to identify places mentioned in texts.
Day 5
You should come to the lecture with concrete research ideas involving quantitative text analysis. Some of you will be asked to present your ideas to initiate a class-wide discussion on how to choose analytic methods in actual research projects. In the seminar, you will learn how to develop your own text analysis tools by combining NLP functions in R.
You should have experience in quantitative text analysis in R, in particular textual data management and preprocessing.
Basic knowledge of programming (object types, control flow, loops, etc.) is desirable.
Each course includes pre-course assignments, including readings and pre-recorded videos, as well as daily live lectures totalling at least two hours. The instructor will conduct live Q&A sessions and offer designated office hours for one-to-one consultations.
Please check your course format before registering.
Live classes will be held daily for two hours on a video meeting platform, allowing you to interact with both the instructor and other participants in real-time. To avoid online fatigue, the course employs a pedagogy that includes small-group work, short and focused tasks, as well as troubleshooting exercises that utilise a variety of online applications to facilitate collaboration and engagement with the course content.
In-person courses will consist of daily three-hour classroom sessions, featuring a range of interactive in-class activities including short lectures, peer feedback, group exercises, and presentations.
This course description may be subject to subsequent adaptations (e.g. taking into account new developments in the field, participant demands, group size, etc.). Registered participants will be informed at the time of change.
By registering for this course, you confirm that you possess the knowledge required to follow it. The instructor will not teach these prerequisite items. If in doubt, please contact us before registering.
Day | Topic | Details |
---|---|---|
1 | Supervised and unsupervised models | Lecture, Lab |
2 | Semi-supervised models | Lecture, Lab |
3 | Word embeddings | Lecture, Lab |
4 | Syntactic parsing | Lecture, Lab |
5 | Research strategies and programming | Lecture, Lab |
Day | Readings |
---|---|
1 | Benoit, Laver, and Mikhaylov (2009); Chang et al. (2009) |
2 | Lu et al. (2011); Watanabe (2017); Watanabe (2018) |
3 | Turney and Pantel (2010); Spirling and Rodriguez (2019) |
4 | Atteveldt et al. (2017); Buscaldi (2011) |
R (3.4 or later) and RStudio
Please bring your own laptop that meets the minimum system requirements for the quanteda package.
Harris, Z. S. (1954). Distributional structure. Word, 10(2–3), 146–162.
Hastie, T. J., Tibshirani, R. J., & Friedman, J. H. (2013). The elements of statistical learning: data mining, inference, and prediction. New York, NY: Springer.
Jurka, T. P., Collingwood, L., Boydstun, A. E., Grossman, E., & Van Atteveldt, W. (2013). RTextTools: A Supervised Learning Package for Text Classification. The R Journal, 5, 6–12.
Jurafsky, D., & Martin, J. H. (2009). Speech and language processing: an introduction to natural language processing, computational linguistics, and speech recognition. Upper Saddle River, N.J.: Pearson Prentice Hall.
Lucas, C., Nielsen, R. A., Roberts, M. E., Stewart, B. M., Storer, A., & Tingley, D. (2015). Computer-Assisted Text Analysis for Comparative Politics. Political Analysis, 23(2), 254–277.
Manning, C. D., & Schütze, H. (2001). Foundations of statistical natural language processing. Cambridge, MA: MIT Press.
Wilkerson, J., & Casas, A. (2017). Large-Scale Computerized Text Analysis in Political Science: Opportunities and Challenges. Annual Review of Political Science, 20(1), 529–544. https://doi.org/10.1146/annurev-polisci-052615-025542
Introduction to Quantitative Text Analysis