Machine learning classification of entrepreneurs in British historical census data

Montebruno, Piero; Bennett, Robert; Smith, Harry; van Lieshout, Carry

MPRA Paper from University Library of Munich, Germany

Abstract: This paper presents a binary classification of entrepreneurs in British historical data based on the recent availability of big data from the I-CeM dataset. The main task of the paper is to attribute an employment status to individuals who did not fully report entrepreneur status in the earlier censuses (1851-1881). The paper assesses the accuracy of different classifiers and machine learning algorithms, including Deep Learning, for this classification problem. We first adopt a ground-truth dataset from the later censuses to train a Logistic Regression (which is standard in the literature for this kind of binary classification) to distinguish entrepreneurs from non-entrepreneurs (i.e. workers). The accuracy of this baseline method is 0.74. We compare the Logistic Regression with ten optimized machine learning algorithms: Nearest Neighbors, Linear and Radial Support Vector Machine, Gaussian Process, Decision Tree, Random Forest, Neural Network, AdaBoost, Naive Bayes, and Quadratic Discriminant Analysis. The best results come from boosting and ensemble methods: AdaBoost achieves an accuracy of 0.95. Deep Learning, as a standalone category of algorithms, further improves accuracy to 0.96 without using the rich text data of the OccString feature, a string of up to 500 characters containing the full occupational statement of each individual collected in the earlier censuses. Finally, using this OccString feature, we implement both shallow learning (a bag-of-words algorithm) and Deep Learning (a Recurrent Neural Network with a Long Short-Term Memory layer). These methods all achieve accuracies above 0.99, with the Deep Learning Recurrent Neural Network as the best model at an accuracy of 0.9978. The results show that standard algorithms for classification can be outperformed by machine learning algorithms. This confirms the value of extending the techniques traditionally used in the literature for this type of classification problem.
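
To make the bag-of-words idea in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes a multinomial Naive Bayes classifier on top of word counts (the abstract does not say which classifier sits on the bag-of-words features), and the occupational statements are invented for illustration; they are not drawn from I-CeM. The names `TRAIN`, `tokenize`, `train`, `predict` and `accuracy` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical, hand-written occupational statements (NOT I-CeM records).
# Label 1 = entrepreneur, label 0 = worker.
TRAIN = [
    ("farmer of 200 acres employing 6 men", 1),
    ("master tailor employing 3 men and 1 boy", 1),
    ("grocer employing 2 assistants", 1),
    ("builder master employing 10 men", 1),
    ("agricultural labourer", 0),
    ("tailor journeyman", 0),
    ("grocer's assistant", 0),
    ("bricklayer's labourer", 0),
]


def tokenize(text):
    """Lower-case, drop possessive 's, and split on whitespace."""
    return text.lower().replace("'s", "").split()


def train(examples):
    """Count words per class: the bag-of-words representation."""
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter()
    for text, label in examples:
        class_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, class_counts, vocab


def predict(model, text):
    """Return the class with the highest Laplace-smoothed log-probability."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        score = math.log(class_counts[label] / total)  # class prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)


def accuracy(model, examples):
    """Share of examples whose predicted label matches the true label."""
    hits = sum(predict(model, text) == label for text, label in examples)
    return hits / len(examples)


MODEL = train(TRAIN)
```

Calling `predict(MODEL, "draper employing 4 men")` classifies an unseen statement, and `accuracy(MODEL, TRAIN)` computes the same kind of accuracy figure the abstract reports, here only on the toy training set. Word order is discarded in this representation, which is the limitation a recurrent network with an LSTM layer is designed to address.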

Keywords: machine learning; deep learning; logistic regression; classification; big data; census
JEL-codes: M13 N83
Date: 2019-08-02
New Economics Papers: this item is included in nep-big, nep-cmp, nep-ent and nep-his

Published in Information Processing & Management 57(3) (2020): 102210

Downloads: (external link)
https://mpra.ub.uni-muenchen.de/100469/1/MPRA_paper_100469.pdf original version (application/pdf)
https://mpra.ub.uni-muenchen.de/106931/49/MPRA_paper_106931.pdf revised version (application/pdf)

Persistent link: https://EconPapers.repec.org/RePEc:pra:mprapa:100469

More papers in MPRA Paper from University Library of Munich, Germany Ludwigstraße 33, D-80539 Munich, Germany. Contact information at EDIRC.
Bibliographic data for series maintained by Joachim Winter.