Introducing DeReKoGram: A Novel Frequency Dataset with Lemma and Part-of-Speech Information for German

Sascha Wolfer, Alexander Koplenig, Marc Kupietz and Carolin Müller-Spitzer
Additional contact information
Sascha Wolfer: Leibniz Institute for the German Language (IDS), 68161 Mannheim, Germany
Alexander Koplenig: Leibniz Institute for the German Language (IDS), 68161 Mannheim, Germany
Marc Kupietz: Leibniz Institute for the German Language (IDS), 68161 Mannheim, Germany
Carolin Müller-Spitzer: Leibniz Institute for the German Language (IDS), 68161 Mannheim, Germany

Data, 2023, vol. 8, issue 11, 1-10

Abstract: We introduce DeReKoGram, a novel frequency dataset containing lemma and part-of-speech (POS) information for 1-, 2-, and 3-grams from the German Reference Corpus. The dataset contains information based on a corpus of 43.2 billion tokens and is divided into 16 parts based on 16 corpus folds. We describe how the dataset was created and structured. By evaluating the distribution over the 16 folds, we show that it is possible to work with a subset of the folds in many use cases (e.g., to save computational resources). In a case study, we investigate the growth of vocabulary (as well as the number of hapax legomena) as an increasing number of folds are included in the analysis. We cross-combine this with the various cleaning stages of the dataset. We also give some guidance in the form of Python, R, and Stata markdown scripts on how to work with the resource.
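The vocabulary-growth case study mentioned in the abstract can be illustrated with a minimal sketch. It assumes each fold is available as a mapping from word form to frequency; the function name `vocabulary_growth` and the toy data are hypothetical and do not reflect the actual DeReKoGram file format or the authors' scripts.

```python
from collections import Counter
from typing import Iterable, List, Mapping, Tuple


def vocabulary_growth(folds: Iterable[Mapping[str, int]]) -> List[Tuple[int, int, int]]:
    """Return (folds included, vocabulary size, hapax count) after each fold."""
    cumulative: Counter = Counter()
    growth = []
    for n_folds, fold in enumerate(folds, start=1):
        # Add this fold's frequencies to the running totals.
        cumulative.update(fold)
        vocab_size = len(cumulative)
        # A hapax legomenon occurs exactly once in the data seen so far.
        hapaxes = sum(1 for freq in cumulative.values() if freq == 1)
        growth.append((n_folds, vocab_size, hapaxes))
    return growth


# Toy stand-ins for per-fold unigram frequency lists (not real DeReKoGram data).
toy_folds = [
    {"der": 5, "Hund": 1, "bellt": 1},
    {"der": 4, "Hund": 2, "Katze": 1},
    {"die": 3, "Katze": 1, "schläft": 1},
]

for n_folds, vocab_size, hapaxes in vocabulary_growth(toy_folds):
    print(n_folds, vocab_size, hapaxes)
```

Note that the hapax count is recomputed from the cumulative totals at each step, because a word that is a hapax in one fold stops being one as soon as it appears in another fold.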

Keywords: language; n-grams; corpus frequency; dataset; German; vocabulary growth
JEL-codes: C8 C80 C81 C82 C83
Date: 2023

Downloads:
https://www.mdpi.com/2306-5729/8/11/170/pdf (application/pdf)
https://www.mdpi.com/2306-5729/8/11/170/ (text/html)

Persistent link: https://EconPapers.repec.org/RePEc:gam:jdataj:v:8:y:2023:i:11:p:170-:d:1277877
