Resource-efficient Sequential Coding of the Most Important Problem with a Large Language Model

Keywords: Methods; Quantitative; Public Opinion; Survey Research; Big Data
Jan Marquardt
GESIS Leibniz-Institute for the Social Sciences
Julia Weiß
GESIS Leibniz-Institute for the Social Sciences

Abstract

Large survey programs repeatedly ask the same open-ended questions over time, for example, about the most important political problems or issues. On the one hand, this means that information on the correct coding of responses is already available and may facilitate the automatic coding of subsequent responses. On the other hand, in the context of multiple crises (e.g., climate change, the crisis of the international system, the COVID-19 pandemic), the salience of specific issues is highly volatile, as are the responses given to the Most Important Problem (MIP) question. Given the rapid development of Large Language Models such as BERT and GPT, research on the applicability of previously collected training data to the automatic coding of later survey rounds is desirable. This paper investigates whether previously coded responses can be used for the resource-efficient coding of newly collected, as yet uncoded responses when BERT is used for classification. Three strategies, each requiring a different amount of additional manually coded responses from the target domain, are compared using responses to the MIP question in the German Longitudinal Election Study (GLES). The results show that combining previously acquired training data with newly coded training cases is indeed a viable strategy that significantly reduces the need for manual coding. In particular, a strategy that successively replaces older training cases with new ones proves very efficient. Furthermore, a class-specific analysis reveals relevant variation across topics. As expected, rare topics benefit from previous training data. Conversely, for some topics older training data can be a distraction because responses differ considerably between the source and target domains, for example because they refer to different events or people. Overall, the paper demonstrates the potential for efficient longitudinal analysis of textual data with political content when building on innovations from the fields of machine learning and AI.
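
To illustrate the general setup described above, the following minimal sketch fine-tunes a BERT classifier on a pool that combines previously coded responses from earlier survey rounds with a small batch of newly coded responses from the target round. It assumes the Hugging Face transformers and datasets libraries; the model checkpoint, example responses, and category ids are hypothetical, and it does not reproduce the authors' pipeline or the GLES coding scheme.

# Minimal illustrative sketch, not the authors' implementation.
# Assumes the Hugging Face "transformers" and "datasets" libraries;
# checkpoint, example responses, and category ids are hypothetical.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL = "bert-base-german-cased"  # assumed checkpoint; any German BERT works

# Hypothetical coded MIP responses as (text, category id) pairs.
source_rounds = [("Klimawandel", 0), ("Arbeitslosigkeit", 1),
                 ("Rentenpolitik", 2), ("Fluechtlingspolitik", 3)]
target_round = [("Corona-Pandemie", 4), ("Klimakrise", 0)]

# "Combine" strategy: pool old and new training cases. A "replace" strategy
# would instead drop the oldest cases as newly coded ones become available.
pool = source_rounds + target_round
num_labels = len({label for _, label in pool})

tok = AutoTokenizer.from_pretrained(MODEL)
train_ds = Dataset.from_dict({"text": [t for t, _ in pool],
                              "label": [l for _, l in pool]})
train_ds = train_ds.map(
    lambda batch: tok(batch["text"], truncation=True,
                      padding="max_length", max_length=64),
    batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    MODEL, num_labels=num_labels)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mip-bert",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=train_ds,
)
trainer.train()  # fine-tune on the pooled training cases

In the replace variant, the pool would keep a fixed budget of training cases, discarding the oldest source-domain cases as newly coded target-domain cases arrive, rather than growing without bound.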