Imputation of User Generated Data

Political Methodology
Public Opinion
Garret Binding
University of Zurich
Thomas Willi
University of Zurich

The digitalisation of society has created new avenues for research in political science. One of these is the availability of ever larger datasets containing political data on individuals. Voting Advice Applications (VAAs) are one example of such user generated datasets: their sample sizes are much larger than those of conventional, comparable representative surveys, and they may cover a broader spectrum of the population. These attributes are very attractive to social scientists. However, questions of representativeness loom: the datasets are user generated, and access is limited to those with adequate technical means. Assessing non-representativeness is often hindered by missingness within the large datasets themselves, as users choose not to answer many questions, leading to a large share of missing data. We address this problem within the missing data framework proposed by Little and Rubin (2002). We compare the results of established imputation techniques with those of machine learning techniques in a simulation study. We believe that establishing the validity of machine learning techniques is important for future research as the number of large datasets available to social scientists increases. Finally, we apply both imputation approaches to VAA data collected prior to the 2013 German election. Our contribution is twofold. First, we evaluate whether machine learning techniques can be used for missing data imputation in large user generated datasets. Second, by imputing missing data we can reassess the problem of non-representativeness in VAA data.
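The kind of comparison described in the abstract can be illustrated with a toy sketch. The example below is an assumption on our part, not the authors' actual simulation design: it simulates a variable whose missingness depends on an observed covariate (a missing-at-random pattern in the sense of Little and Rubin), then contrasts a simple established baseline (unconditional mean imputation) with a model-based imputer fit on the observed rows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data standing in for survey responses: y depends on x.
n = 1000
x = rng.normal(size=n)
y = 0.8 * x + rng.normal(scale=0.6, size=n)

# Missing-at-random (MAR): the chance that y is missing depends on observed x,
# so respondents with larger x values are more likely to skip the question.
p_miss = 1.0 / (1.0 + np.exp(-x))
miss = rng.random(n) < 0.5 * p_miss
y_obs = y.copy()
y_obs[miss] = np.nan

# Baseline: fill every gap with the mean of the observed responses.
y_mean_imp = np.where(miss, np.nanmean(y_obs), y_obs)

# Model-based imputation: regress y on x using observed rows, predict the rest.
slope, intercept = np.polyfit(x[~miss], y_obs[~miss], 1)
y_model_imp = np.where(miss, slope * x + intercept, y_obs)

# Evaluate both imputations against the known true values of the masked cells.
rmse_mean = np.sqrt(np.mean((y_mean_imp[miss] - y[miss]) ** 2))
rmse_model = np.sqrt(np.mean((y_model_imp[miss] - y[miss]) ** 2))
print(f"mean imputation RMSE:  {rmse_mean:.3f}")
print(f"model imputation RMSE: {rmse_model:.3f}")
```

Because the missingness depends on `x`, the unconditional mean is biased for the missing cells, while the regression-based imputer recovers them more accurately; flexible machine learning imputers generalise this idea to many covariates and nonlinear relationships.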

