ECPR

Populism monitoring: web (text) data collection, cleaning and analysis.

Media
Populism
Social Media
Big Data
George Makris
Aristotle University of Thessaloniki
Ioannis Andreadis
Aristotle University of Thessaloniki

Abstract

Rising internet use has led to the massive production of internet-based big data. The prospect of this vast amount of unexploited information has motivated academia, companies and institutions to search for best practices for making the most of it. Data in big data environments (or data lakes) are often not structured, meaning they do not follow a relational structure (i.e. tables with columns and rows); most of the time they are unstructured or semi-structured. The most common form of unstructured data on the Internet is text. This has led the scientific community to develop innovative methods and practices for the effective and comprehensive extraction, transformation, visualization and analysis of text data, a field described as “text analytics”. In this paper we study populism on social media (Twitter) and news websites, using various text analytics methods and tools. We chose six Greek newspapers and collected articles from their websites and tweets posted by their official Twitter accounts over the same period. This allows us to compare the findings of media monitoring across two different data sources. First, we collected tweets and news articles containing the word stems “people” and “popul”. Twitter data were collected through API requests and news website data through web scraping. We then transformed the collected data into a tidy format and stored them in a local database. Next, we applied text cleaning methods to remove “noise” such as punctuation and stop words and to prepare the data for analysis. Finally, we used a series of computer-based approaches to classify the articles as populist or not, namely sentiment analysis, word frequencies and a human-in-the-loop (HITL) machine learning algorithm.
Our aim is to identify the best available method that scholars could use for the automated detection of populism on the Internet, by highlighting the strengths and weaknesses of each approach. In addition, by comparing two different data sources (Twitter and news websites), we show how the populist content that can be extracted from each differs.
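The stem filtering, text cleaning and word-frequency steps described in the abstract can be illustrated with a minimal sketch. The abstract does not name the tools the authors used, so everything here is an assumption made for illustration: the function names (`tokenize`, `contains_stem`, `clean`, `word_frequencies`) are hypothetical, the stop-word list is a tiny English placeholder, and the stems are the English renderings quoted in the abstract rather than the Greek forms that would apply to the actual corpus.

```python
import re
from collections import Counter

# Stems quoted in the abstract; matching on token prefixes is an assumption.
STEMS = ("people", "popul")

# Tiny illustrative stop-word list (an assumption); a real study would use
# a fuller, language-specific list.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are", "for", "on"}


def tokenize(text):
    """Lowercase the text and keep only runs of letters (drops punctuation and digits)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def contains_stem(text, stems=STEMS):
    """Return True if any token starts with one of the target stems."""
    return any(token.startswith(stems) for token in tokenize(text))


def clean(text, stop_words=STOP_WORDS):
    """Tokenize the text and remove stop words."""
    return [token for token in tokenize(text) if token not in stop_words]


def word_frequencies(documents):
    """Count cleaned tokens across the documents that mention a target stem."""
    counts = Counter()
    for document in documents:
        if contains_stem(document):
            counts.update(clean(document))
    return counts


if __name__ == "__main__":
    docs = [
        "The people are angry!",
        "Weather is fine.",
        "Populist leaders speak to the people.",
    ]
    for word, count in word_frequencies(docs).most_common(3):
        print(word, count)
```

Word counts of this kind only surface candidate vocabulary. Deciding whether a text is populist would still need the classification steps the abstract lists, such as sentiment analysis or a model trained with human feedback.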