Advancing AI-Driven Linguistic Analysis: Developing and Annotating Comprehensive Arabic Dialect Corpora for Gulf Countries and Saudi Arabia
Nouf Al-Shenaifi,
Aqil M. Azmi and
Manar Hosny
Additional contact information
Nouf Al-Shenaifi: Department of Computer Science, College of Computer and Information Sciences, King Saud University, Riyadh 11543, Saudi Arabia
Aqil M. Azmi: Department of Computer Science, College of Computer and Information Sciences, King Saud University, Riyadh 11543, Saudi Arabia
Manar Hosny: Department of Computer Science, College of Computer and Information Sciences, King Saud University, Riyadh 11543, Saudi Arabia
Mathematics, 2024, vol. 12, issue 19, 1-18
Abstract:
This study draws on the linguistic diversity of Arabic dialects to build two large corpora from X (formerly Twitter). The Gulf Arabic Corpus (GAC-6) contains about 1.7 million tweets from six Gulf countries (Saudi Arabia, UAE, Qatar, Oman, Kuwait, and Bahrain) and captures a wide range of linguistic variation. The Saudi Dialect Corpus (SDC-5) comprises 790,000 tweets covering five major regional dialects of Saudi Arabia: Hijazi, Najdi, Southern, Northern, and Eastern. Both corpora are annotated using dialect-specific seed words and geolocation data, with high annotation agreement, as indicated by Cohen’s Kappa scores of 0.78 for GAC-6 and 0.90 for SDC-5. The annotation process uses AI-driven techniques, including machine learning algorithms for automated dialect recognition and feature extraction, to improve the granularity and precision of the data. The corpora contribute to Arabic dialectology and support the development of AI algorithms for linguistic data analysis and the applications built on them.
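The Cohen's Kappa scores quoted in the abstract measure how far two sets of labels agree beyond what chance alone would produce: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The sketch below shows that calculation only. The function name `cohen_kappa` and the six example labels are illustrative assumptions, not data or code from the paper, and the abstract does not say how the authors computed their scores.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and the same length")

    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each label, the product of the two raters'
    # marginal frequencies, summed over all labels.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    # If both raters always use one identical label, agreement is trivially perfect.
    if expected == 1.0:
        return 1.0

    return (observed - expected) / (1.0 - expected)


# Made-up dialect labels for six tweets (assumption: not from GAC-6 or SDC-5).
annotator_1 = ["Najdi", "Hijazi", "Najdi", "Southern", "Najdi", "Hijazi"]
annotator_2 = ["Najdi", "Hijazi", "Hijazi", "Southern", "Najdi", "Hijazi"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

In this toy example the raters agree on five of six tweets, but part of that agreement is expected by chance, so kappa comes out lower than the raw agreement rate.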
Keywords: Arabic dialects; Arabic corpora; Twitter; dialect identification; lexicon
JEL-codes: C
Date: 2024
Downloads:
https://www.mdpi.com/2227-7390/12/19/3120/pdf (application/pdf)
https://www.mdpi.com/2227-7390/12/19/3120/ (text/html)
Persistent link: https://EconPapers.repec.org/RePEc:gam:jmathe:v:12:y:2024:i:19:p:3120-:d:1492721
Mathematics is currently edited by Ms. Emma He