Abstract
Governments around the world are increasingly deploying generative artificial intelligence (AI), particularly large language model (LLM)–based chatbots, to provide administrative information and frontline public services. While these systems promise efficiency gains and improved accessibility, they also raise fundamental concerns about factual accuracy, unequal treatment, and the reproduction of social biases embedded in training data. Despite growing scholarly and public attention to these risks, systematic empirical evidence on how government chatbots perform across different citizen groups remains limited. This study addresses this gap by conducting the first large-scale experimental audit of local government chatbots in Germany, examining both discriminatory bias and a less-studied but normatively important trade-off between fairness and adaptive responsiveness.
The study advances three core arguments. First, drawing on competing perspectives in the literature on algorithmic bias, we argue that government chatbots may either reproduce demographic disparities in service quality or, alternatively, exhibit near-parity as a result of heightened technical awareness and institutional pressures—including legal, reputational, and political constraints—to avoid discrimination. Second, we theorize a bias–helpfulness trade-off: design choices intended to suppress bias may also suppress legitimate personalization and thereby undermine usability, producing standardized responses even when users differ in language proficiency. Third, we contend that chatbot configurations are shaped by divergent institutional logics. Public officials, operating under mandates of equal treatment and accountability, are likely to exhibit lower tolerance for factual errors and biased outputs, whereas AI company representatives emphasize innovation, adaptivity, and iterative improvement. For public agencies, equal treatment thus functions as a de-biasing logic that may outweigh capacity- or helpfulness-oriented considerations.
To test these claims, we employ a mixed-methods research design. The quantitative component consists of a large-scale audit experiment of twelve LLM-based chatbots deployed by German municipalities and job centers. Using a factorial design, we systematically vary user personas along gender, migration background, and language proficiency while holding substantive requests constant across 8,640 chatbot interactions. Responses are evaluated along multiple informational and relational dimensions, including response volume, relevance, emotional tone, linguistic complexity, and consistency. This design enables the detection of subtle forms of unequal treatment that extend beyond overt factual errors, while equivalence testing allows for a rigorous assessment of near-parity claims. The qualitative component comprises semi-structured interviews with public officials responsible for chatbot deployment and AI company representatives involved in system design and maintenance. These interviews probe respondents’ tolerance for errors and bias, their views on acceptable trade-offs between correctness and personalization, and their strategies for managing algorithmic risk.
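The arithmetic of the factorial persona design can be sketched as follows. The factor levels and the number of requests per cell shown here are illustrative assumptions for exposition, not the study's actual coding scheme; only the counts of twelve chatbots and 8,640 total interactions come from the abstract itself.

```python
from itertools import product

# Illustrative persona factors (hypothetical levels; the study's actual
# operationalizations of gender, migration background, and language
# proficiency may differ).
genders = ["female", "male"]
migration_backgrounds = ["no migration background", "migration background"]
language_levels = ["fluent German", "limited German"]

# Hypothetical cell counts: with 12 chatbots and 90 substantive requests
# per persona profile, the reported total of 8,640 interactions results
# (12 * 8 * 90 = 8,640).
n_chatbots = 12
n_requests = 90

# Fully crossed persona profiles: 2 x 2 x 2 = 8 combinations.
personas = list(product(genders, migration_backgrounds, language_levels))
total_interactions = n_chatbots * len(personas) * n_requests
print(len(personas))        # 8 persona profiles
print(total_interactions)   # 8640
```

Holding the substantive request constant within each cell while crossing the persona factors is what lets any systematic difference in response quality be attributed to the varied demographic cues rather than to the content of the query.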
Taken together, this study aims to make three contributions. Empirically, it provides the most comprehensive audit to date of government chatbot performance and bias in a real-world administrative setting. Theoretically, it reframes debates on algorithmic fairness by highlighting the risk of “algorithmic rigidity,” whereby equality is achieved through uniform but insufficiently adaptive service provision. Substantively, it offers policy-relevant insights into how governments can balance fairness, usability, and innovation when integrating generative AI into public administration.