A validation framework for multilingual computational text analysis in the social sciences

Political Methodology
Methods
Comparative Perspective
Big Data
Fabienne Lind
University of Vienna

Abstract

Fueled by the pioneering work of Lucas et al. (2015), Computational Text Analysis (CTA) is increasingly used to compare texts across different languages. Research teams have used multilingual text analysis to study, for example, the tone of presidential candidates (Maurer & Diehl, 2020) or civility on social media (Theocharis et al., 2016). New methodological approaches are being proposed as well, including multilingual dictionaries (Lind et al., 2019; Proksch et al., 2019), multilingual supervised machine learning (Courtney et al., 2020; Lind et al., 2021), multilingual topic modeling (Chan et al., 2020; Lind et al., 2022; Maier et al., 2022), and multilingual word embeddings (De Vries, 2021; Licht, 2023).

Yet despite these exciting developments, there is still little agreement on appropriate strategies and criteria for meaningfully validating multilingual text analysis tasks and methods for social science research purposes. In the methodological literature and in reviews of the state of computational text analysis (e.g., van Atteveldt & Peng, 2018; Barberá et al., 2021; Gentzkow et al., 2019; Grimmer et al., 2022; Song et al., 2020), discussions of validation strategies are generally limited to single-language applications, with little reference to multilingualism. Baden et al. (2022) show that validation concerns are higher among researchers working with multiple languages, but that this concern is not reflected in a greater focus on validation in published work.

We argue that a framework for validating multilingual uses of computational text analysis methods is instrumental to stimulating the growing but still hesitant adoption of these methods in the social sciences (Baden et al., 2022). Whereas a researcher working in one language 'only' needs to ascertain that the textual data at hand meaningfully captures their concept of interest, a researcher working in more than one language must also provide evidence that this link between the conceptual and empirical realms is comparable across cases. Validation in multilingual settings may be even more urgent given the increasing popularity of analytical approaches that rely on third-party pretrained materials such as large language models. Research findings that have not been properly validated may have detrimental unintended consequences, ranging from erroneous and artifactual findings, to stereotypical information baked into available language models, to disastrous real-world implications such as algorithmic bias in hiring decisions (see, e.g., Zhao et al., 2019; Bender et al., 2021).

This contribution proposes a practical framework for the validation of multilingual computational text analysis. In what follows, we first define key concepts and discuss the core challenges of validation in a multilingual setting. We then systematize validation strategies by differentiating between data, input, throughput, and output validation (the output step is illustrated in the sketch below). We end with a discussion of the capacities and limitations of existing validation techniques and outline paths to further advance validation practices. While the specific challenges and needs for validation remain, of course, sensitive to the setting and purpose of each research application, the framework is designed to be applicable to the various text types and forms of computational text analysis relevant to social scientists, enabling its wide adoption and use.
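To make the comparability requirement concrete, the following is a minimal sketch, not taken from the paper, of what per-language output validation might look like for a binary multilingual classifier. All language codes, labels, and predictions are hypothetical placeholders; the sketch assumes that gold-standard human annotations and model predictions are available for each language, and uses scikit-learn's F1 and Cohen's kappa metrics to check whether measurement quality is comparable across languages rather than assuming it.

```python
# Hypothetical sketch of per-language output validation (not the paper's code).
# Validates a multilingual classifier against human-coded gold standards
# separately per language, so cross-language comparability of the
# measurement can be assessed rather than assumed.
from sklearn.metrics import cohen_kappa_score, f1_score

# Assumed inputs: gold labels and model predictions per language, as could
# be produced by a multilingual dictionary, classifier, or LLM pipeline.
gold = {
    "en": [1, 0, 1, 1, 0, 1],
    "de": [0, 0, 1, 1, 1, 0],
    "es": [1, 1, 0, 0, 1, 1],
}
pred = {
    "en": [1, 0, 1, 0, 0, 1],
    "de": [0, 1, 1, 1, 1, 0],
    "es": [1, 1, 0, 1, 1, 0],
}

for lang in gold:
    f1 = f1_score(gold[lang], pred[lang])
    kappa = cohen_kappa_score(gold[lang], pred[lang])
    print(f"{lang}: F1 = {f1:.2f}, Cohen's kappa = {kappa:.2f}")

# Large gaps between languages would flag non-comparable measurement and
# prompt a look upstream, e.g. at input validation (translation quality,
# per-language training data) or at the throughput of the pipeline.
```

In practice the choice of metric and the size of the per-language gold standard would of course depend on the concept, text type, and method being validated.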