Collecting a Large Scale Dataset for Classifying Fake News Tweets Using Weak Supervision

Helmstetter, Stefan; Paulheim, Heiko

Stefan Helmstetter and Heiko Paulheim
Additional contact information
Stefan Helmstetter: Data and Web Science Group, School of Business Informatics and Mathematics, University of Mannheim, B6 26, 68159 Mannheim, Germany
Heiko Paulheim: Data and Web Science Group, School of Business Informatics and Mathematics, University of Mannheim, B6 26, 68159 Mannheim, Germany

Future Internet, 2021, vol. 13, issue 5, 1-25

Abstract: The problem of automatic detection of fake news in social media, e.g., on Twitter, has recently drawn attention. Although, from a technical perspective, it can be regarded as a straightforward binary classification problem, the major challenge is the collection of large enough training corpora: manual annotation of tweets as fake or non-fake news is an expensive and tedious endeavor, and recent approaches utilizing distributional semantics require large training corpora. In this paper, we introduce an alternative approach for creating a large-scale dataset for tweet classification with minimal user intervention. The approach relies on weak supervision and automatically collects a large-scale, but very noisy, training dataset comprising hundreds of thousands of tweets. As a weak supervision signal, we label tweets by their source, i.e., trustworthy or untrustworthy source, and train a classifier on this dataset. We then use that classifier for a different classification target, i.e., the classification of fake and non-fake tweets. The labels are not accurate with respect to the new classification target (not all tweets by an untrustworthy source need to be fake news, and vice versa). Nevertheless, we show that, despite this unclean, inaccurate dataset, the results are comparable to those achieved using a manually labeled set of tweets. Moreover, we show that combining the large-scale noisy dataset with a human-labeled one yields better results than either of the two alone.
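The weak supervision idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the account names, the toy tweets, and the choice of a small multinomial naive Bayes classifier are all assumptions made for illustration, since the abstract does not specify the source lists, features, or learning algorithm. The sketch only shows the core step: each tweet receives its training label from its source, and the resulting classifier is then applied to individual tweets.

```python
import math
from collections import Counter

# Hypothetical source lists (assumption; the abstract does not name any accounts).
TRUSTED_SOURCES = {"trusted_daily"}
UNTRUSTED_SOURCES = {"clickbait_hub"}


def weak_label(account):
    """Return 1 for an untrustworthy source, 0 for a trustworthy one, None otherwise."""
    if account in UNTRUSTED_SOURCES:
        return 1
    if account in TRUSTED_SOURCES:
        return 0
    return None


def tokenize(text):
    return text.lower().split()


def train_naive_bayes(tweets):
    """Train on (account, text) pairs; the label comes from the source alone."""
    word_counts = {0: Counter(), 1: Counter()}
    doc_counts = Counter()
    for account, text in tweets:
        label = weak_label(account)
        if label is None:
            continue  # tweets from sources on neither list are skipped
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, doc_counts, vocab


def predict_fake(model, text):
    """Return 1 if the tweet looks more like the untrustworthy class, else 0.

    Assumes both classes were seen during training.
    """
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in (0, 1):
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)
        for tok in tokenize(text):
            if tok in vocab:  # words never seen in training are ignored
                # Laplace (add-one) smoothing over the shared vocabulary
                score += math.log(
                    (word_counts[label][tok] + 1) / (total_words + len(vocab))
                )
        scores[label] = score
    return 1 if scores[1] > scores[0] else 0


# Toy data (invented for illustration only).
TOY_TWEETS = [
    ("trusted_daily", "parliament passes budget after long debate"),
    ("trusted_daily", "central bank holds interest rates steady"),
    ("clickbait_hub", "shocking secret cure doctors hide from you"),
    ("clickbait_hub", "you will not believe this shocking celebrity secret"),
    ("unknown_user", "nice weather today"),
]

model = train_naive_bayes(TOY_TWEETS)
print(predict_fake(model, "shocking secret revealed"))
print(predict_fake(model, "central bank budget"))
```

Because the labels describe sources rather than individual tweets, a classifier trained this way inherits the label noise the abstract mentions; evaluating it against a manually labeled set of tweets, as the paper does, is what shows whether the weak signal transfers to the fake versus non-fake target.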

Keywords: fake news; Twitter; weak supervision; source trustworthiness; social media
JEL-codes: O3
Date: 2021

Downloads: (external link)
https://www.mdpi.com/1999-5903/13/5/114/pdf (application/pdf)
https://www.mdpi.com/1999-5903/13/5/114/ (text/html)

Persistent link: https://EconPapers.repec.org/RePEc:gam:jftint:v:13:y:2021:i:5:p:114-:d:546014

Future Internet is currently edited by Ms. Grace You
