Introducing DeReKoGram: A Novel Frequency Dataset with Lemma and Part-of-Speech Information for German
Sascha Wolfer,
Alexander Koplenig,
Marc Kupietz and
Carolin Müller-Spitzer
Additional contact information
Sascha Wolfer: Leibniz Institute for the German Language (IDS), 68161 Mannheim, Germany
Alexander Koplenig: Leibniz Institute for the German Language (IDS), 68161 Mannheim, Germany
Marc Kupietz: Leibniz Institute for the German Language (IDS), 68161 Mannheim, Germany
Carolin Müller-Spitzer: Leibniz Institute for the German Language (IDS), 68161 Mannheim, Germany
Data, 2023, vol. 8, issue 11, 1-10
Abstract:
We introduce DeReKoGram, a novel frequency dataset containing lemma and part-of-speech (POS) information for 1-, 2-, and 3-grams from the German Reference Corpus. The dataset contains information based on a corpus of 43.2 billion tokens and is divided into 16 parts based on 16 corpus folds. We describe how the dataset was created and structured. By evaluating the distribution over the 16 folds, we show that it is possible to work with a subset of the folds in many use cases (e.g., to save computational resources). In a case study, we investigate the growth of vocabulary (as well as the number of hapax legomena) as an increasing number of folds are included in the analysis. We cross-combine this with the various cleaning stages of the dataset. We also give some guidance in the form of Python, R, and Stata markdown scripts on how to work with the resource.
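The vocabulary-growth case study mentioned in the abstract can be illustrated with a minimal sketch. It assumes each fold is available as a mapping from word form to frequency; the function name `vocabulary_growth` and the toy fold data are hypothetical and are not taken from the dataset or from the authors' scripts. The sketch adds folds cumulatively and records the vocabulary size and the number of hapax legomena (types with a total frequency of exactly 1) after each step.

```python
from collections import Counter


def vocabulary_growth(folds):
    """Return (n_folds, vocabulary_size, n_hapax) after each cumulative fold.

    `folds` is an iterable of dict-like objects mapping a type (e.g. a word
    form or lemma) to its frequency in that fold. This layout is an
    assumption made for illustration only.
    """
    total = Counter()
    growth = []
    for n_folds, fold in enumerate(folds, start=1):
        total.update(fold)  # add this fold's frequencies to the running totals
        vocabulary_size = len(total)
        # Hapax legomena: types whose cumulative frequency is exactly 1.
        n_hapax = sum(1 for freq in total.values() if freq == 1)
        growth.append((n_folds, vocabulary_size, n_hapax))
    return growth


# Invented toy data standing in for three folds (not real DeReKoGram counts).
toy_folds = [
    {"der": 5, "Hund": 1, "läuft": 1},
    {"der": 4, "Hund": 1, "Katze": 1},
    {"die": 3, "Katze": 2, "schläft": 1},
]

result = vocabulary_growth(toy_folds)
for n_folds, vocabulary_size, n_hapax in result:
    print(n_folds, vocabulary_size, n_hapax)
```

A type that is a hapax in one fold can stop being one once further folds are added, which is why the hapax count is recomputed from the cumulative totals rather than summed per fold.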
Keywords: language; n-grams; corpus frequency; dataset; German; vocabulary growth
JEL-codes: C8 C80 C81 C82 C83
Date: 2023
Downloads:
https://www.mdpi.com/2306-5729/8/11/170/pdf (application/pdf)
https://www.mdpi.com/2306-5729/8/11/170/ (text/html)
Persistent link: https://EconPapers.repec.org/RePEc:gam:jdataj:v:8:y:2023:i:11:p:170-:d:1277877
Data is currently edited by Ms. Cecilia Yang