The (real) need for a human touch: Testing a human-machine hybrid topic classification workflow on a New York Times corpus

Media

Methods

Quantitative

Communication

Big Data

Presenter(s)

Akos Mate

HUN-REN Centre for Social Sciences

Author(s)

Akos Mate

HUN-REN Centre for Social Sciences

Miklos Sebok

HUN-REN Centre for Social Sciences

Panel: Advancing Text-as-Data Approaches

Abstract

Classifying the items of ever-growing textual databases has become an important goal for many research groups in comparative politics (see, for instance, the Comparative Agendas Project – CAP). Although trained human classifiers remain the gold standard for most policy topic labelling projects such as CAP, there is a growing number of use cases in which the initial effort of human classifiers has been successfully augmented by supervised machine learning (SML) classification. In this paper we investigate such a hybrid workflow by classifying the lead paragraphs of New York Times front-page articles from 1996 to 2006 according to the CAP policy categories. We find that combining human coding and validation with an ensemble SML approach can reduce the need for human coding while maintaining very high precision and offering a modest to good level of recall. The modularity of this hybrid workflow allows for various setups that address the idiosyncratic resource bottlenecks a large-scale text classification project might face.
