Neural Architecture Comparison for Bibliographic Reference Segmentation: An Empirical Study

Rodrigo Cuéllar Hidalgo, Raúl Pinto Elías, Juan-Manuel Torres-Moreno, Osslan Osiris Vergara Villegas, Gerardo Reyes Salgado and Andrea Magadán Salazar
Additional contact information
Rodrigo Cuéllar Hidalgo: Biblioteca Daniel Cosío Villegas, El Colegio de México, Carretera Picacho Ajusco 20, Mexico City 14110, Mexico
Raúl Pinto Elías: Tecnológico Nacional de México/CENIDET, Cuernavaca 62490, Mexico
Juan-Manuel Torres-Moreno: Laboratoire Informatique d’Avignon, Université d’Avignon, 339 Chemin des Meinajariès, CEDEX 9, 84911 Avignon, France
Osslan Osiris Vergara Villegas: Industrial and Manufacturing Engineering Department, Universidad Autónoma de Ciudad Juárez, Ciudad Juárez 32310, Mexico
Gerardo Reyes Salgado: Departamento de Informática y Estadística, Universidad Rey Juan Carlos, Av. del Alcalde de Móstoles, 28933 Madrid, Spain
Andrea Magadán Salazar: Tecnológico Nacional de México/CENIDET, Cuernavaca 62490, Mexico

Data, 2024, vol. 9, issue 5, 1-24

Abstract: Digital libraries depend on automated bibliographic reference segmentation to manage and provide access to scientific publications efficiently. Accurate segmentation is difficult because references appear in widely varying formats and styles. This study empirically evaluates three architectures: Conditional Random Fields (CRF), Bidirectional Long Short-Term Memory with CRF (BiLSTM + CRF), and Transformer Encoder with CRF (Transformer + CRF), using Byte Pair Encoding and character embeddings for vector representation. The models were trained on the extensive Giant corpus and evaluated on the Cora corpus. To keep the comparison balanced and rigorous, embedding layers, normalization techniques, and Dropout strategies were kept uniform across models. Results indicate that the BiLSTM + CRF architecture outperforms the other two, achieving an F1-score of 0.96, a result attributed to its handling of the syntactic structures prevalent in bibliographic data. This outcome highlights the need to match model architecture to the specific syntactic demands of bibliographic reference segmentation. The study therefore positions BiLSTM + CRF as a superior approach within the current state of the art and as a robust solution for digital library management and scholarly communication.
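
All three architectures named in the abstract end in a CRF layer, which picks the best label sequence for a whole reference rather than labelling each token in isolation. The following is a minimal sketch of that decoding step (Viterbi decoding for a linear-chain CRF), not the authors' implementation. The label set, the example tokens, and every score are hypothetical values chosen for illustration; in a trained model the emission scores would come from the encoder (for example a BiLSTM) and the transition scores would be learned.

```python
# Illustrative sketch only: Viterbi decoding for a linear-chain CRF output layer.
# Label names and all scores below are made up for the example, not from the paper.

def viterbi_decode(emissions, transitions):
    """Return the highest-scoring label path as a list of label indices.

    emissions:   per-token score lists, shape [n_tokens][n_labels]
    transitions: transitions[i][j] is the score of moving from label i to label j
    """
    n_labels = len(transitions)
    # best[j] holds the best score of any path ending in label j at the current token
    best = list(emissions[0])
    backpointers = []
    for token_scores in emissions[1:]:
        step_best, step_back = [], []
        for j in range(n_labels):
            # choose the previous label that maximises the path score into label j
            prev = max(range(n_labels), key=lambda i: best[i] + transitions[i][j])
            step_back.append(prev)
            step_best.append(best[prev] + transitions[prev][j] + token_scores[j])
        best = step_best
        backpointers.append(step_back)
    # walk back from the best final label to recover the full path
    last = max(range(n_labels), key=lambda j: best[j])
    path = [last]
    for step_back in reversed(backpointers):
        path.append(step_back[path[-1]])
    path.reverse()
    return path


LABELS = ["AUTHOR", "YEAR", "TITLE"]  # hypothetical label set
tokens = ["Smith", "J.", "2020", "Deep", "parsing"]  # hypothetical reference tokens
# Hypothetical emission scores, one row per token, one column per label
emissions = [
    [2.0, 0.1, 0.5],
    [1.5, 0.2, 0.6],
    [0.1, 3.0, 0.2],
    [0.8, 0.1, 1.0],
    [0.3, 0.1, 2.0],
]
# Hypothetical transition scores that penalise returning to an earlier field
transitions = [
    [1.0, 0.5, 0.0],
    [-2.0, 0.0, 1.0],
    [-2.0, -2.0, 1.0],
]
decoded = [LABELS[i] for i in viterbi_decode(emissions, transitions)]
print(list(zip(tokens, decoded)))
```

With these made-up scores the decoded labels are AUTHOR, AUTHOR, YEAR, TITLE, TITLE. The fourth token scores almost equally as AUTHOR or TITLE on its own, and the transition penalties settle it in favour of TITLE, which illustrates why a sequence-level output layer suits the ordered fields of a reference.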

Keywords: reference mining; BiLSTM; transformers; byte-pair encoding; Conditional Random Fields
JEL-codes: C8 C80 C81 C82 C83
Date: 2024

Downloads:
https://www.mdpi.com/2306-5729/9/5/71/pdf (application/pdf)
https://www.mdpi.com/2306-5729/9/5/71/ (text/html)

Persistent link: https://EconPapers.repec.org/RePEc:gam:jdataj:v:9:y:2024:i:5:p:71-:d:1397326

Data is currently edited by Ms. Becky Zhang
