Curating Cyberbullying Datasets: A Human-AI Collaborative Approach.

Christopher E. Gomez; Marcelo O. Sztainberg; Rachel E. Trana

Peer reviewed

ERIC Number: EJ1420453

Record Type: Journal

Publication Date: 2022

Pages: 12

Abstractor: As Provided

ISBN: N/A

ISSN: ISSN-2523-3653

EISSN: EISSN-2523-3661

Available Date: N/A

International Journal of Bullying Prevention, v4 n1 p35-46 2022

Cyberbullying is the use of digital communication tools and spaces to inflict physical, mental, or emotional distress. This serious form of aggression is frequently targeted at, but not limited to, vulnerable populations. A common problem when creating machine learning models to identify cyberbullying is the limited availability of accurately annotated, reliable, relevant, and diverse datasets. Datasets intended to train models for cyberbullying detection are typically annotated by human participants, which can introduce the following issues: (1) annotator bias, (2) incorrect annotation due to language and cultural barriers, and (3) the inherent subjectivity of the task, which can naturally create multiple valid labels for a given comment. The result can be a potentially inadequate dataset with one or more of these overlapping issues. We propose two machine learning approaches to identify and filter unambiguous comments in a cyberbullying dataset of roughly 19,000 comments collected from YouTube that was initially annotated using Amazon Mechanical Turk (AMT). Using consensus filtering methods, comments were classified as unambiguous when the AMT workers' majority label agreed with the unanimous algorithmic filtering label. Comments identified as unambiguous were extracted and used to curate new datasets. We then used an artificial neural network to test for performance on these datasets. Compared to the original dataset, the classifier exhibits a large improvement in performance on modified versions of the dataset and can yield insight into the type of data that is consistently classified as bullying or non-bullying. This annotation approach can be extended from cyberbullying datasets to any classification corpus that has a similar complexity in scope.
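The consensus rule described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the number of annotators, the filtering algorithms, or the label names, so the function names, the `"bullying"`/`"non-bullying"` labels, the strict-majority rule, and the sample records below are all assumptions made for illustration.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_label(labels: Sequence[str]) -> Optional[str]:
    """Return the label chosen by a strict majority of annotators, else None."""
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    # Assumption: a tie (no strict majority) counts as "no majority label".
    return label if count > len(labels) / 2 else None


def unanimous_label(labels: Sequence[str]) -> Optional[str]:
    """Return the shared label if every algorithmic filter agrees, else None."""
    return labels[0] if labels and len(set(labels)) == 1 else None


def is_unambiguous(human_labels: Sequence[str], filter_labels: Sequence[str]) -> bool:
    """A comment is unambiguous when the human majority label matches the
    unanimous label produced by the algorithmic filters."""
    human = majority_label(human_labels)
    algo = unanimous_label(filter_labels)
    return human is not None and human == algo


# Hypothetical records: "human" holds annotator labels, "filters" holds the
# labels assigned by the (unspecified) algorithmic filters.
comments = [
    {"text": "comment A", "human": ["bullying", "bullying", "non-bullying"],
     "filters": ["bullying", "bullying"]},
    {"text": "comment B", "human": ["bullying", "non-bullying", "non-bullying"],
     "filters": ["bullying", "non-bullying"]},
    {"text": "comment C", "human": ["non-bullying"] * 3,
     "filters": ["bullying", "bullying"]},
    {"text": "comment D", "human": ["bullying", "non-bullying"],
     "filters": ["bullying", "bullying"]},
]

# Keep only the comments on which humans and filters reach consensus.
curated = [c for c in comments if is_unambiguous(c["human"], c["filters"])]
print([c["text"] for c in curated])
```

In this sketch only "comment A" survives: B fails because the filters disagree with each other, C because the filters contradict the human majority, and D because the annotators are tied.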

Descriptors: Video Technology, Computer Software, Computer Mediated Communication, Bullying, Artificial Intelligence, Computational Linguistics, Identification, Accuracy, Reliability, Bias, Cultural Differences, Language Variation, Algorithms, Comparative Analysis, Classification, Language Usage

Springer. Available from: Springer Nature. One New York Plaza, Suite 4600, New York, NY 10004. Tel: 800-777-4643; Tel: 212-460-1500; Fax: 212-460-1700; e-mail: customerservice@springernature.com; Web site: https://link-springer-com.bibliotheek.ehb.be/

Publication Type: Journal Articles; Reports - Descriptive

Education Level: N/A

Audience: N/A

Language: English

Sponsor: Department of Education (ED)

Authoring Institution: N/A

Grant or Contract Numbers: P031C160209

Author Affiliations: N/A