KazNewsDataset: Single Country Overall Digital Mass Media Publication Corpus
Kirill Yakunin,
Maksat Kalimoldayev,
Ravil I. Mukhamediev,
Rustam Mussabayev,
Vladimir Barakhnin,
Yan Kuchin,
Sanzhar Murzakhmetov,
Timur Buldybayev,
Ulzhan Ospanova,
Marina Yelis,
Akylbek Zhumabayev,
Viktors Gopejenko,
Zhazirakhanym Meirambekkyzy and
Alibek Abdurazakov
Additional contact information
Kirill Yakunin: Institute of Information and Computational Technologies, Almaty 050010, Kazakhstan
Maksat Kalimoldayev: Institute of Information and Computational Technologies, Almaty 050010, Kazakhstan
Ravil I. Mukhamediev: Institute of Information and Computational Technologies, Almaty 050010, Kazakhstan
Rustam Mussabayev: Institute of Information and Computational Technologies, Almaty 050010, Kazakhstan
Vladimir Barakhnin: Federal Research Center for Information and Computational Technologies, 630090 Novosibirsk, Russia
Yan Kuchin: Institute of Information and Computational Technologies, Almaty 050010, Kazakhstan
Sanzhar Murzakhmetov: Institute of Information and Computational Technologies, Almaty 050010, Kazakhstan
Timur Buldybayev: Information-Analytical Center, Nur-Sultan 010000, Kazakhstan
Ulzhan Ospanova: Information-Analytical Center, Nur-Sultan 010000, Kazakhstan
Marina Yelis: Institute of Cybernetics and Information Technology, Satbayev University (KazNRTU), Almaty 050013, Kazakhstan
Akylbek Zhumabayev: Institute of Cybernetics and Information Technology, Satbayev University (KazNRTU), Almaty 050013, Kazakhstan
Viktors Gopejenko: Department of Natural Science and Computer Technologies, ISMA University, LV-1011 Riga, Latvia
Zhazirakhanym Meirambekkyzy: Institute of Information and Computational Technologies, Almaty 050010, Kazakhstan
Alibek Abdurazakov: Institute of Cybernetics and Information Technology, Satbayev University (KazNRTU), Almaty 050013, Kazakhstan
Data, 2021, vol. 6, issue 3, 1-12
Abstract:
Mass media is one of the most important elements shaping the information environment of society. It is not only a source of information about current events but often also the authority that sets the information agenda and the boundaries and forms of discussion on socially relevant topics. A multifaceted and, where possible, quantitative assessment of mass media performance is crucial for understanding its objectivity, tone, thematic focus, and quality. The paper presents a corpus of Kazakhstan media containing over 4 million publications from 36 primary sources (each with at least 500 publications). The corpus also includes more than 2 million texts from Russian media, for comparative analysis of publication activity in the two countries, and about 4000 sections of state policy documents. The paper briefly describes the natural language processing and multiple-criteria decision-making methods that form the algorithmic basis of the text and mass media evaluation method, and reports the results of several research cases: identification of propaganda, assessment of the tone of publications, calculation of the level of socially relevant negativity, and comparative analysis of publication activity in the field of renewable energy. Experiments confirm that it is generally possible to evaluate socially significant news, identify texts with propagandistic content, and evaluate the sentiment of publications using the topic model of the text corpus: area under the receiver operating characteristic curve (ROC AUC) values of 0.81, 0.73, and 0.93 were achieved on these tasks. The described cases do not exhaust the possibilities of thematic, tonal, dynamic, and other analyses of the corpus. The corpus will be of interest to researchers working on the analysis of multiple publications and of mass media, including comparative analysis and the identification of common patterns inherent in the media of different countries.
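For readers unfamiliar with the ROC AUC metric quoted in the abstract, the following is a minimal sketch of how such a value can be computed from binary labels and classifier scores. It is not the authors' evaluation code: the function name `roc_auc` and the toy labels and scores are hypothetical, chosen only for illustration, and the pairwise (Mann-Whitney) formulation is one of several equivalent ways to obtain the area under the curve.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the pairwise (Mann-Whitney) formulation.

    labels: iterable of 0/1 ground-truth classes
    scores: iterable of classifier scores (higher means "more positive")
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Fraction of (positive, negative) pairs in which the positive example
    # receives the higher score; tied pairs count as half.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy data, not taken from the corpus or the paper.
toy_labels = [1, 0, 1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.35, 0.8, 0.1, 0.7]
toy_auc = roc_auc(toy_labels, toy_scores)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes, which gives a scale for reading the reported values.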
Keywords: natural language processing; mass-media; topic modeling; LDA; ARTM; multiple-criteria decision-making (MCDM); computer modeling; sentiment analysis; significant social news; propaganda identification
JEL-codes: C8 C80 C81 C82 C83
Date: 2021
Downloads: (external link)
https://www.mdpi.com/2306-5729/6/3/31/pdf (application/pdf)
https://www.mdpi.com/2306-5729/6/3/31/ (text/html)
Persistent link: https://EconPapers.repec.org/RePEc:gam:jdataj:v:6:y:2021:i:3:p:31-:d:516749
Data is currently edited by Ms. Cecilia Yang
Bibliographic data for series maintained by MDPI Indexing Manager.