Automated Classification of Crime Narratives Using Machine Learning and Language Models in Official Statistics

Lehmann, Klaus; Villaseñor, Elio; Pimentel, Alejandro; Preuss, Javiera; Berhó, Nicolás; Diaz, Oswaldo; Agloni, Ignacio

Additional contact information
Klaus Lehmann: Instituto Nacional de Estadísticas (INE), Morandé 801, Santiago 8340148, Chile
Elio Villaseñor: Instituto Nacional de Estadística y Geografía (INEGI), Heroe de Nacozari 2301, Aguascalientes 20276, Mexico
Alejandro Pimentel: Instituto Nacional de Estadística y Geografía (INEGI), Heroe de Nacozari 2301, Aguascalientes 20276, Mexico
Javiera Preuss: Instituto Nacional de Estadísticas (INE), Morandé 801, Santiago 8340148, Chile
Nicolás Berhó: Instituto Nacional de Estadísticas (INE), Morandé 801, Santiago 8340148, Chile
Oswaldo Diaz: Instituto Nacional de Estadística y Geografía (INEGI), Heroe de Nacozari 2301, Aguascalientes 20276, Mexico
Ignacio Agloni: Instituto Nacional de Estadísticas (INE), Morandé 801, Santiago 8340148, Chile

Stats, 2025, vol. 8, issue 3, 1-22

Abstract: This paper presents a language-model-based strategy for automatically coding crime narratives in the production of official statistics. To address the high workload and inconsistencies of manual coding, we developed and evaluated three models: an XGBoost classifier with bag-of-words and word-embedding features, an LSTM network using pretrained Spanish word embeddings as a language model, and a fine-tuned BERT language model (BETO). The deep learning models outperformed the traditional baseline, with BETO achieving the highest accuracy. The new ENUSC (Encuesta Nacional Urbana de Seguridad Ciudadana) workflow integrates the selected model into an API for automated classification and applies a certainty threshold to separate cases suitable for automation from those requiring expert review. This hybrid strategy reduced the manual review workload by 68.4% while preserving high quality standards. This study is the first documented application of deep learning to the automated classification of victimization narratives in official statistics, and it shows that the approach is feasible and effective in a real-world production environment. The results indicate that deep learning can significantly improve the efficiency and consistency of crime statistics coding, offering a scalable solution for other national statistical offices.
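
The certainty-threshold routing mentioned in the abstract can be illustrated with a minimal sketch. It is not the ENUSC implementation: the function names, the class labels, the example scores, and the 0.9 threshold are all assumptions chosen for illustration, and the certainty measure used here (the top softmax probability) is only one plausible choice.

```python
import math

def softmax(logits):
    """Convert raw class scores to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route(logits, labels, threshold=0.9):
    """Return (label, certainty, decision) for one narrative.

    decision is "auto" when the top class probability reaches the
    threshold, otherwise "expert_review". The 0.9 default is an
    illustrative assumption, not the value used in production.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    decision = "auto" if probs[best] >= threshold else "expert_review"
    return labels[best], probs[best], decision

# Hypothetical labels and classifier scores, for illustration only.
LABELS = ["burglary", "robbery", "other"]
confident = route([6.0, 1.0, 0.5], LABELS)   # one class clearly dominates
uncertain = route([1.2, 1.0, 0.9], LABELS)   # scores are close together

print(confident[0], confident[2])
print(uncertain[0], uncertain[2])
```

Raising the threshold sends more narratives to expert review and lowers the risk of automated errors; lowering it does the opposite, so in practice the value would be tuned on held-out, manually coded data.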

Keywords: crime; language model; deep learning; machine learning; automated coding; NLP; national statistical office
JEL-codes: C1 C10 C11 C14 C15 C16
Date: 2025

Downloads: (external link)
https://www.mdpi.com/2571-905X/8/3/68/pdf (application/pdf)
https://www.mdpi.com/2571-905X/8/3/68/ (text/html)

Persistent link: https://EconPapers.repec.org/RePEc:gam:jstats:v:8:y:2025:i:3:p:68-:d:1713855

Stats is currently edited by Mrs. Minnie Li

Bibliographic data for series maintained by MDPI Indexing Manager.