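L'analyse lexicale au service de la cliodynamique: traitement par intelligence artificielle de la base Google Ngram (Lexical analysis in the service of cliodynamics: artificial intelligence processing of the Google Ngram database)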
Jérôme Baray, Albert da Silva and Jean-Marc Leblanc
Additional contact information
Jérôme Baray: IRG - Institut de Recherche en Gestion - UPEM - Université Paris-Est Marne-la-Vallée - UPEC UP12 - Université Paris-Est Créteil Val-de-Marne - Paris 12
Jean-Marc Leblanc: CEDITEC - Centre d'Etudes des discours, Images, Textes, Ecrits, Communications - UPEC UP12 - Université Paris-Est Créteil Val-de-Marne - Paris 12
Post-Print from HAL
Abstract:
Cliodynamics is a fairly recent research field that treats history as an object of scientific study. Thanks to its transdisciplinary nature, cliodynamics tries to explain dynamic historical processes, such as the rise or collapse of empires or civilizations, economic cycles, population booms and fashions, through mathematical modeling, data mining, econometrics and cultural sociology. "Big data" aggregating historical, archaeological or economic information is the material that feeds these quantitative models. The field can also include empirical analysis that uses historical data to validate the assumptions and predictions of dynamic models. Cliodynamics is part of the cliometrics approach, or "new economic history", which studies history through econometrics.
Objectives: On the one hand, we designed a robust lexical analysis method able to deal with a very large series of dated corpora whose content evolves over time (big data), with the challenge of identifying societal evolutions and major historical periods from a cliodynamics perspective. On the other hand, the lexical analysis examined what can be learned from the Google Books Ngram database, which records the annual number of word occurrences in the scanned publications available through the Google Books search engine. This database is assumed to have compiled about 20% of all books ever published in major languages. We focused our study on English-language books published in the United States and Great Britain. The objective was to identify how word frequencies evolved from 1860 to 2008.
Method principles: The first step was to build a dictionary of the most commonly used English words, disregarding two-way terms, prepositions, articles and pronouns. This dictionary contains 1592 words covering many aspects of social and cultural life, with terms related to politics, religion, arts and sciences, industry, objects, family and sentiments.
In a second step, the percentage representation of each dictionary word was determined for each year, after loading the very large Google Books Ngram (1-gram) database into PostgreSQL. Some words, such as "king" or "queen", are very well represented in the 19th century, reflecting the reign and power of royalty in Europe, but the use of these words declined in the 20th century. Word frequencies in books evolve constantly over time. The third step was to perform a centered and standardized principal component analysis (PCA) on the table giving the representation of each word, in %, for each year from 1860 to 2008. The years were then clustered using a neural network (a Kohonen map, an artificial intelligence technique). The results show 8 distinct periods in history, organized along 3 major tendencies in discourse: Humanist versus Scientific; Chaos versus Organization; Individualist versus Collectivist.
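The second and third steps of the method can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract names only PostgreSQL, so the function names, the use of NumPy, the choice to cluster on the first 3 principal components, the 1-D map with 8 units and all training parameters are assumptions made for illustration, and the demo counts are synthetic rather than taken from Google Ngram.

```python
import numpy as np


def yearly_shares(counts):
    """Convert raw counts (n_years x n_words) into each word's share, in %,
    of the yearly total over the dictionary."""
    counts = np.asarray(counts, dtype=float)
    return 100.0 * counts / counts.sum(axis=1, keepdims=True)


def standardized_pca(X, n_components=3):
    """Centered, standardized PCA computed via SVD.
    Assumes every column has non-zero variance.
    Returns the year scores and the explained-variance ratios."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    U, s, _ = np.linalg.svd(Z, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    explained = s ** 2 / np.sum(s ** 2)
    return scores, explained[:n_components]


def train_som(data, n_units=8, n_iter=2000, lr0=0.5, sigma0=2.0, seed=0):
    """Train a minimal 1-D Kohonen map (hypothetical settings).
    Returns the unit weights, shape (n_units, n_features)."""
    rng = np.random.default_rng(seed)
    # Initialise the units on randomly chosen observations.
    weights = data[rng.choice(len(data), size=n_units, replace=False)].copy()
    positions = np.arange(n_units)
    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)                    # decaying learning rate
        sigma = max(sigma0 * (1.0 - frac), 0.5)    # shrinking neighbourhood
        x = data[rng.integers(len(data))]
        bmu = np.argmin(np.linalg.norm(weights - x, axis=1))  # best matching unit
        h = np.exp(-((positions - bmu) ** 2) / (2.0 * sigma ** 2))
        weights += lr * h[:, None] * (x - weights)
    return weights


def assign_clusters(data, weights):
    """Label each observation with the index of its nearest map unit."""
    dist = np.linalg.norm(data[:, None, :] - weights[None, :, :], axis=2)
    return dist.argmin(axis=1)


if __name__ == "__main__":
    # Synthetic stand-in for the year-by-word count table (NOT real Ngram data).
    rng = np.random.default_rng(42)
    years = np.arange(1860, 2009)
    counts = rng.poisson(lam=50, size=(len(years), 20)) + 1

    shares = yearly_shares(counts)
    scores, explained = standardized_pca(shares, n_components=3)
    weights = train_som(scores, n_units=8)
    labels = assign_clusters(scores, weights)
    for year, label in zip(years[:5], labels[:5]):
        print(year, label)
```

With real data, the count table would come from a query restricting the 1-gram table to the dictionary words and aggregating by year. Consecutive years assigned to the same map unit can then be read as candidate historical periods.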
Keywords: Google Ngram; artificial intelligence; big data; cliodynamics; lexical analysis
Date: 2017-11-24
New Economics Papers: this item is included in nep-big and nep-his
Note: View the original document on HAL open archive server: https://hal.science/hal-01648487v1
Published in Eclavit Workshop Analyse et représentation de données textuelles expériences d’interaction entre concepteurs et utilisateurs, Nov 2017, Marne la Vallée, France. 2017
Downloads: (external link)
https://hal.science/hal-01648487v1/document (application/pdf)
Persistent link: https://EconPapers.repec.org/RePEc:hal:journl:hal-01648487