Deduplication of the media-based event databases
Deepti Joshi,
Regina Werum,
Dalton Hazelwood,
Shawn Ratcliff,
Ashok Samal and
Leen-Kiat Soh
Additional contact information
Deepti Joshi: The Citadel
Regina Werum: University of Nebraska-Lincoln
Dalton Hazelwood: The Citadel
Shawn Ratcliff: University of Nebraska-Lincoln
Ashok Samal: University of Nebraska-Lincoln
Leen-Kiat Soh: University of Nebraska-Lincoln
Journal of Computational Social Science, 2025, vol. 8, issue 3, No 22, 33 pages
Abstract:
Event databases play a crucial role in documenting and examining spatiotemporally distributed events, ranging from protests and disasters to public health emergencies. The Global Database of Events, Language, and Tone (GDELT) is notable for its comprehensive automated cataloging of events from many news sources. However, its extensive scope increases the risk of counting unique events repeatedly because they are reported by multiple sources, which leads to an inaccurate assessment of event-level dynamics. This article presents a new, automated deduplication technique specifically designed to improve the event identification and counting accuracy of GDELT data. By consolidating redundant event entries into a unified, comprehensive record, our approach mitigates overcounting while enhancing the integrity and usefulness of the data. We assess the effectiveness of our method by conducting thorough algorithmic testing and by comparing it to other established datasets such as the Armed Conflict Location and Event Dataset (ACLED) and the Integrated Crisis Early Warning System (ICEWS). The comparative analysis employs the deduplicated GDELT data to predict local protest levels, demonstrating that our deduplication procedure not only decreases the number of overcounted events but also better aligns GDELT with other event databases, thereby validating the effectiveness of our methodology. Our findings inform empirical research dependent on media-reported event databases like GDELT, with broader implications for other fields that rely on data collection and sources affected by directional measurement error.
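The abstract does not spell out the deduplication algorithm, so the following is only a minimal sketch of the general idea of consolidating redundant entries into one record. It assumes, purely for illustration, that two records describe the same event when they share a date, a location, and an event type; the field names (`date`, `location`, `event_code`, `source`) and the function `deduplicate` are hypothetical and are not taken from the article or from the GDELT schema.

```python
from collections import defaultdict


def deduplicate(events):
    """Merge records that share the same (date, location, event_code) key.

    Assumption: an exact match on these three hypothetical fields marks a
    duplicate. The article's actual matching criteria may differ.
    """
    groups = defaultdict(list)
    for event in events:
        key = (event["date"], event["location"], event["event_code"])
        groups[key].append(event)

    merged = []
    for (date, location, event_code), records in groups.items():
        merged.append({
            "date": date,
            "location": location,
            "event_code": event_code,
            # Keep every distinct reporting source in the unified record.
            "sources": sorted({r["source"] for r in records}),
            # Number of raw entries that were collapsed into this record.
            "num_records": len(records),
        })
    return merged


# Made-up example data: the first two rows report the same event.
events = [
    {"date": "2020-06-01", "location": "City A", "event_code": "protest", "source": "outlet_1"},
    {"date": "2020-06-01", "location": "City A", "event_code": "protest", "source": "outlet_2"},
    {"date": "2020-06-02", "location": "City B", "event_code": "protest", "source": "outlet_1"},
]

unique_events = deduplicate(events)
print(len(events), "raw records ->", len(unique_events), "unique events")
```

Exact-key grouping is the simplest possible choice; a clustering approach, as suggested by the keyword "Clustering event records", would instead tolerate small differences in date or location between reports of the same event.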
Keywords: Deduplication; Clustering event records; Directional measurement error; Validity; Reliability; GDELT
Date: 2025
Downloads: (external link)
http://link.springer.com/10.1007/s42001-025-00409-4 Abstract (text/html)
Access to the full text of the articles in this series is restricted.
Persistent link: https://EconPapers.repec.org/RePEc:spr:jcsosc:v:8:y:2025:i:3:d:10.1007_s42001-025-00409-4
Ordering information: This journal article can be ordered from
http://www.springer. ... iences/journal/42001
DOI: 10.1007/s42001-025-00409-4
Journal of Computational Social Science is currently edited by Takashi Kamihigashi
Bibliographic data for series maintained by Sonal Shukla and Springer Nature Abstracting and Indexing.