ECPR

Distributed Asymmetric Allocation: A Topic Model for Large Imbalanced Corpora

Political Methodology
UN
Quantitative
Big Data
Kohei Watanabe
Waseda University

Abstract

Identifying topics is one of the most common types of content analysis in political science because topics in legislative speeches, policy documents and news articles reflect important issues and agendas. Researchers can use topic models that automatically identify clusters of words as topics based on their co-occurrences in documents, but they often find that the topics only loosely match the theoretical concepts. Such disagreements between algorithmic topics and theoretical topics have been a long-standing methodological issue. One solution is ex-post mapping, where many algorithmic topics are manually classified into fewer theoretical topics. This approach is very common but has been criticized for the risk of constructing arbitrary topics. An alternative approach is ex-ante mapping, where theoretical topics are defined using seed words before applying topic models. Several algorithms, such as keyword-assisted topic models and seeded latent Dirichlet allocation (LDA), have been developed for this theory-driven approach. Nevertheless, the theory-driven approach has remained less common, presumably because of the challenges posed by the imbalance in the frequencies of topics. If LDA is applied in a conventional manner, its algorithm often estimates the frequencies of generic topics too low and specific topics too high, leading to inaccurate classification of theoretically important topics. This problem is even more serious in a corpus of short documents, where topic models cannot easily infer the overall frequencies of topics from individual documents. Aiming to make topic modeling more theoretically grounded and content analysis more granular, I have developed a new topic model called distributed asymmetric allocation (DAA) by extending LDA and implemented it as an open-source software package.
DAA can identify topics in a very large corpus of short documents, enabling researchers to correlate topics with other properties in each sentence. If seed words are provided, it can discover infrequent topics in an imbalanced corpus, allowing researchers to focus on theoretically important concepts in their analysis. I evaluate the ability of DAA to identify theoretically important topics by fitting it to the transcripts of speeches at the United Nations General Assembly. The results show that DAA can be fitted several times faster than LDA thanks to its distributed computing and convergence detection algorithms. It can also classify sentences significantly more accurately than LDA owing to its sequential sampling and Dirichlet prior optimization algorithms. When the results of the two models are compared, the frequencies of topics vary more widely and change more strongly in DAA than in LDA, corresponding to the occurrences of key political events between 1991 and 2017. The results suggest that it is important for users of LDA-based topic models to optimize the sizes of the Dirichlet priors as well as the number of topics. If the corpus is large and imbalanced, enhanced algorithms such as DAA are required for accurate content analysis. More generally, the successful development of DAA demonstrates that it is still possible to improve the classic topic model for greater efficiency and transparency.
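
To illustrate why the sizes of the Dirichlet priors matter for short documents, the following minimal sketch uses the textbook Dirichlet-multinomial posterior mean, (n_k + alpha_k) / (N + sum(alpha)). It is not the DAA algorithm and does not reproduce any result from the paper; the function name, topic labels, token counts and prior values are all hypothetical and chosen only for illustration.

```python
def posterior_mean(counts, alpha):
    """Posterior mean of topic proportions for one document under a
    Dirichlet prior `alpha` and observed topic counts `counts`
    (standard Dirichlet-multinomial conjugacy)."""
    total = sum(counts) + sum(alpha)
    return [(n + a) / total for n, a in zip(counts, alpha)]


# Hypothetical short document: 3 tokens assigned to
# [generic topic, specific topic A, specific topic B].
counts = [2, 1, 0]

# Two hypothetical priors with the same total mass (1.5):
symmetric = [0.5, 0.5, 0.5]    # treats every topic as equally frequent
asymmetric = [1.2, 0.2, 0.1]   # assumes the generic topic is more frequent

sym = posterior_mean(counts, symmetric)
asym = posterior_mean(counts, asymmetric)

print("symmetric prior: ", [round(p, 3) for p in sym])
print("asymmetric prior:", [round(p, 3) for p in asym])
```

Because the document has only three tokens, the prior contributes a large share of each estimate. The symmetric prior pulls the proportions toward uniform, which lowers the generic topic and raises the rare one, while the asymmetric prior keeps the assumed imbalance. This is one way to see the imbalance problem the abstract describes, under the stated assumptions.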