Can We Automatically Measure the Quality of Online Political Discussion? How to (Not) Measure Interactivity, Diversity, Rationality, and Incivility in Online Comments to the News
Democracy
Political Methodology
Methods
Quantitative
Social Media
Abstract
Deliberative perspectives on democracy require citizens to cooperate to solve common issues by sharing insights and being open to learning from others. In the digital age, much of that communication takes place online and in vast quantities. The many-to-many nature of online social media discussion means such exchanges are spread across many hashtags and sub-debates. Surprisingly, measuring deliberation in online social media comments is still mostly done manually (Goddard & Gillespie, 2023). We present an inventory of automatic measures for annotating the deliberative quality of online user comments along the standards set out by Habermas (interactivity, diversity, rationality, and (in)civility), based on four model groups ranging from dictionaries to modern generative AI models. Altogether, we present results for over 50 metrics. The performance of all these methods and models is evaluated against a novel hand-coded dataset built with an extensive codebook of 14 individual items tapping into these concepts. The dataset contains 3,862 carefully selected, manually coded comments responding to news videos on YouTube and Twitter.
The results reveal that the choice of method has a very strong effect: different methods lead to vastly different results. Overall, the expectation that more modern methods (transformers and generative AI) outperform the older, simpler ones is confirmed. However, the absolute differences between these model groups depend heavily on the concept measured, ranging from only 0.08 between rule-based methods and Llama3.1 for incivility to 0.28 for conservatism. The differences within model groups are also considerable. For example, the best rule-based metric for incivility, Ksiazek’s hostility dictionary, attains an F1 of 0.67, while the worst, the Hatebase.org wordlist, stalls at an F1 of 0.45.
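To make the evaluation set-up concrete, the minimal sketch below shows how a dictionary-based incivility metric can be scored against manually coded labels. It is an illustration only, not our implementation: the wordlist is a tiny placeholder (real dictionaries such as Ksiazek’s are far larger) and the function names are invented for this example.

```python
# Illustrative sketch: score a hypothetical dictionary-based incivility metric
# against hand-coded labels (1 = uncivil, 0 = civil).
from sklearn.metrics import f1_score, precision_score

# Placeholder wordlist; stands in for a full hostility dictionary.
HOSTILITY_WORDS = {"idiot", "moron", "shut up", "disgusting"}

def dictionary_incivility(comment: str) -> int:
    """Label a comment as uncivil (1) if it contains any dictionary term."""
    text = comment.lower()
    return int(any(term in text for term in HOSTILITY_WORDS))

def evaluate(comments, gold):
    """Compare the metric's labels with the manual codes."""
    predicted = [dictionary_incivility(c) for c in comments]
    return {
        "f1": f1_score(gold, predicted),
        "precision": precision_score(gold, predicted),
    }
```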
Somewhat counterintuitively, simple Llama3.1 prompts (rather than sophisticated prompts or trained transformers tailored to the specifics of the dataset) perform best overall (F1 ranging from 0.64 to 0.81), coming close to or surpassing the performance of the other techniques. Still, precision in identifying the presence of a concept in a comment remains a challenge across models, with best scores of 0.62 for interactivity, 0.56 for rationality, 0.60 for conservatism, and 0.54 for liberalism, illustrating the remaining difficulty of classifying these concepts automatically.
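For illustration, a simple zero-shot prompt for one concept could look like the sketch below. The wording and the `query_llm` helper are placeholders (assumed stand-ins for whatever client serves Llama3.1), not the prompts used in the study.

```python
# Illustrative sketch of a simple zero-shot prompt for incivility.
# `query_llm` is a hypothetical function that sends a prompt to the model
# and returns its text answer; it is not part of the paper's code.

PROMPT = (
    "You will read a user comment posted below a news video.\n"
    "Does the comment contain incivility (insults, name-calling, profanity)?\n"
    "Answer with a single word: yes or no.\n\n"
    "Comment: {comment}"
)

def classify_incivility(comment: str, query_llm) -> int:
    """Return 1 if the model answers 'yes', 0 otherwise."""
    answer = query_llm(PROMPT.format(comment=comment))
    return int(answer.strip().lower().startswith("yes"))
```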
We make recommendations for future research that balance ease of use and the specifics of the use case against the performance of the metrics. Specifically, we propose that a manually coded dataset for validation remains important, even for rule-based or generative AI models that don’t need training data. Given the performance differences between dictionaries and between prompt versions, the selection of the best set-up should be substantiated by performance against manually coded data. Especially for generative AI, changing a prompt is simple, but tuning the results against a single validation set can lead to overfitting. We therefore propose following the standard machine-learning practice of selecting the best model (configuration) on a training set and validating its performance on a held-out test set. Altogether, we provide clear guidance on how to measure deliberation and which traps to avoid.
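The sketch below illustrates this recommended workflow under stated assumptions: the split proportion, variable names, and the `classify` callable are illustrative, not our actual pipeline. The best configuration (dictionary, prompt version, etc.) is chosen on a development split of the manually coded data and its performance is reported once on a held-out test split.

```python
# Illustrative sketch: select a configuration on a development split,
# then validate it once on a held-out test split of the hand-coded data.
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

def select_and_validate(comments, gold, configurations, classify):
    """`classify(config, comment) -> 0/1`; `configurations` lists the candidate set-ups."""
    dev_x, test_x, dev_y, test_y = train_test_split(
        comments, gold, test_size=0.3, random_state=42, stratify=gold
    )
    # Select the best configuration using the development set only.
    best = max(
        configurations,
        key=lambda cfg: f1_score(dev_y, [classify(cfg, c) for c in dev_x]),
    )
    # Report the selected configuration's performance on unseen data.
    test_f1 = f1_score(test_y, [classify(best, c) for c in test_x])
    return best, test_f1
```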