Workshop Proceedings of the 17th International AAAI Conference on Web and Social Media

Workshop: Novel Evaluation Approaches for Text Classification Systems (NEATCLasS)

DOI: 10.36190/2023.52

Published: 2023-06-01

Benchmark Evaluation for Tasks with Highly Subjective Crowdsourced Annotations: Case study in Argument Mining of Political Debates
Rafael Mestre, Matt Ryan, Stuart E. Middleton, Richard Gomer, Masood Gheasi, Jiatong Zhu, Timothy J. Norman

This paper assesses the feasibility of using crowdsourcing techniques for subjective tasks, such as the identification of argumentative relations in political debates, and analyses their inter-annotator agreement metrics, common sources of error, and disagreements. We aim to address how best to evaluate subjective crowdsourced annotations, which often exhibit significant annotator disagreement and contribute to a "quality crisis" in crowdsourcing. To do this, we compare two datasets of crowd annotations for argumentation mining: one produced by an open crowd with quality-control settings, and one by a small group of master annotators without these settings but with several rounds of feedback. Our results show high levels of disagreement between annotators, with a rather low Krippendorff's alpha, a commonly used inter-annotator agreement metric. This metric also fluctuates greatly and is highly sensitive to the amount of overlap between annotators, whereas other common metrics such as Cohen's and Fleiss' kappa are not suitable for this task because of their underlying assumptions. We evaluate the appropriateness of Krippendorff's alpha for this type of annotation and find that it may not be suitable when many annotators each code only small subsets of the data. This highlights the need for more robust evaluation metrics for subjective crowdsourcing tasks. Our datasets provide a benchmark for future research in this area and can be used to increase data quality, inform the design of further work, and mitigate common errors in subjective coding, particularly in argumentation mining.
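
For readers unfamiliar with the metric discussed in the abstract, the sketch below shows how nominal Krippendorff's alpha is typically computed from a coders-by-units matrix with missing labels, which is the sparse-overlap setting the paper examines. This is an illustrative example only, not code from the paper; the function name, toy labels, and data are hypothetical.

```python
import numpy as np

def krippendorff_alpha_nominal(ratings):
    """Krippendorff's alpha for nominal data.

    ratings: list of per-coder lists with shape (n_coders, n_units);
    use None where a coder did not label a unit.
    """
    n_units = len(ratings[0])

    # Keep only units labelled by at least two coders (pairable units).
    units = []
    for u in range(n_units):
        vals = [coder[u] for coder in ratings if coder[u] is not None]
        if len(vals) >= 2:
            units.append(vals)

    # Build the coincidence matrix over the observed categories.
    categories = sorted({v for vals in units for v in vals})
    index = {c: i for i, c in enumerate(categories)}
    o = np.zeros((len(categories), len(categories)))
    for vals in units:
        m = len(vals)
        for i, a in enumerate(vals):
            for j, b in enumerate(vals):
                if i != j:
                    o[index[a], index[b]] += 1.0 / (m - 1)

    n = o.sum()                        # total number of pairable values
    marginals = o.sum(axis=1)          # per-category totals
    d_observed = (n - np.trace(o)) / n
    d_expected = (n * n - (marginals ** 2).sum()) / (n * (n - 1))
    return 1.0 - d_observed / d_expected

# Hypothetical toy example: 4 annotators, 6 argumentative relations,
# sparse overlap (None marks items an annotator did not see).
ratings = [
    ["support", "support", None,      "attack",  None,      "support"],
    ["support", "attack",  "attack",  "attack",  "support", None],
    [None,      "support", "attack",  None,      "support", "support"],
    ["support", None,      "attack",  "attack",  None,      None],
]
print(round(krippendorff_alpha_nominal(ratings), 3))
```

Because the coincidence matrix is built only from units that at least two annotators both labelled, removing or adding a few overlapping items can shift the estimate noticeably when each annotator covers only a small subset of the data, which is consistent with the sensitivity to overlap reported in the abstract.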