Deduplication of the media-based event databases

Deepti Joshi, Regina Werum, Dalton Hazelwood, Shawn Ratcliff, Ashok Samal and Leen-Kiat Soh
Additional contact information
Deepti Joshi: The Citadel
Regina Werum: University of Nebraska-Lincoln
Dalton Hazelwood: The Citadel
Shawn Ratcliff: University of Nebraska-Lincoln
Ashok Samal: University of Nebraska-Lincoln
Leen-Kiat Soh: University of Nebraska-Lincoln

Journal of Computational Social Science, 2025, vol. 8, issue 3, No 22, 33 pages

Abstract: Event databases play a crucial role in documenting and examining spatiotemporally distributed events, ranging from protests and disasters to public health emergencies. The Global Database of Events, Language, and Tone (GDELT) is notable for its comprehensive automated cataloging of events from many news sources. At the same time, its extensive scope increases the risk of counting unique events repeatedly because they are reported by multiple sources, which leads to an inaccurate assessment of event-level dynamics. This article presents a new, automated deduplication technique specifically designed to improve the event identification and counting accuracy of GDELT data. By consolidating redundant event entries into a unified, comprehensive record, our approach mitigates overcounting while enhancing the integrity and usefulness of the data. We assess the effectiveness of our method by conducting thorough algorithmic testing and by comparing it to other established datasets such as the Armed Conflict Location and Event Dataset (ACLED) and the Integrated Crisis Early Warning System (ICEWS). The comparative analysis employs the deduplicated GDELT data to predict local protest levels, demonstrating that our deduplication procedure not only decreases the number of overcounted events but also better aligns GDELT with other event databases, thereby validating the effectiveness of our methodology. Our findings inform empirical research dependent on media-reported event databases like GDELT, with broader implications for other fields reliant on data collection and sources affected by directional measurement error.

Keywords: Deduplication; Clustering event records; Directional measurement error; Validity; Reliability; GDELT
Date: 2025

Downloads: (external link)
http://link.springer.com/10.1007/s42001-025-00409-4 Abstract (text/html)
Access to the full text of the articles in this series is restricted.

Persistent link: https://EconPapers.repec.org/RePEc:spr:jcsosc:v:8:y:2025:i:3:d:10.1007_s42001-025-00409-4

Ordering information: This journal article can be ordered from
http://www.springer. ... iences/journal/42001

DOI: 10.1007/s42001-025-00409-4

Journal of Computational Social Science is currently edited by Takashi Kamihigashi

Bibliographic data for series maintained by Sonal Shukla and Springer Nature Abstracting and Indexing.

 
Page updated 2025-07-16
Handle: RePEc:spr:jcsosc:v:8:y:2025:i:3:d:10.1007_s42001-025-00409-4