NAICS Code Prediction Using Supervised Methods
Christine Oehlert, Evan Schulz and Anne Parker
Statistics and Public Policy, 2022, vol. 9, issue 1, 58-66
Abstract:
When compiling industry statistics or selecting businesses for further study, researchers often rely on North American Industry Classification System (NAICS) codes. However, because the codes are self-reported on tax forms, and reporting an incorrect code or leaving the field blank has no tax consequences, they are often unusable. The IRS's Statistics of Income (SOI) program validates NAICS codes for businesses in the statistical samples used to produce official tax statistics for various filing populations, including sole proprietorships (those filing Form 1040 Schedule C) and corporations (those filing Forms 1120). In this article, we leverage these samples to explore ways to improve NAICS code reporting for all filers in the relevant populations. For sole proprietorships, we overcame several record linkage complications to combine data from SOI samples with other administrative data. Using the SOI-validated NAICS code values as ground truth, we trained classification-tree-based models (randomForest) to predict NAICS industry sector from other tax return data, including text descriptions, for businesses that did or did not initially report a valid NAICS code. For both sole proprietorships and corporations, we were able to improve slightly on the accuracy of valid self-reported industry sector and to correctly identify the sector for over half of businesses with no informative reported NAICS code.
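The abstract's accuracy comparison rests on reducing each NAICS code to its industry sector and checking it against the validated value. A minimal sketch of that scoring step follows, assuming the sector is taken from the first two digits of the code, with the three multi-prefix sectors (31-33, 44-45, 48-49) collapsed to one label. The function names `naics_sector` and `sector_accuracy` and the toy records are hypothetical illustrations, not the authors' code or data, and the sketch does not check that a two-digit prefix is an actual NAICS sector.

```python
# Hypothetical helpers and toy records for illustration only;
# not the authors' implementation or SOI data.

# Prefixes that belong to sectors spanning more than one two-digit value.
SECTOR_RANGES = {
    "31": "31-33", "32": "31-33", "33": "31-33",  # Manufacturing
    "44": "44-45", "45": "44-45",                 # Retail Trade
    "48": "48-49", "49": "48-49",                 # Transportation and Warehousing
}


def naics_sector(code):
    """Return a sector label for a NAICS code, or None if the code is
    missing, blank, or does not start with two digits."""
    if code is None:
        return None
    digits = str(code).strip()
    if len(digits) < 2 or not digits[:2].isdigit():
        return None
    prefix = digits[:2]
    return SECTOR_RANGES.get(prefix, prefix)


def sector_accuracy(candidate_codes, validated_codes):
    """Share of records whose candidate sector equals the validated sector.

    Records with no usable candidate or validated sector are left out of
    the denominator. Returns None when nothing can be scored.
    """
    pairs = [
        (naics_sector(c), naics_sector(v))
        for c, v in zip(candidate_codes, validated_codes)
    ]
    scored = [(c, v) for c, v in pairs if c is not None and v is not None]
    if not scored:
        return None
    return sum(c == v for c, v in scored) / len(scored)


# Toy example: self-reported codes versus validated codes.
reported = ["541110", "", "445110", "236115", None]
validated = ["541211", "722511", "448140", "311811", "812112"]

accuracy = sector_accuracy(reported, validated)
```

The same function could score model predictions in place of self-reported codes, which would allow the two accuracies to be compared on the same validated sample.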
Date: 2022
Downloads: (external link)
http://hdl.handle.net/10.1080/2330443X.2022.2033654 (text/html)
Access to full text is restricted to subscribers.
Persistent link: https://EconPapers.repec.org/RePEc:taf:usppxx:v:9:y:2022:i:1:p:58-66
Ordering information: This journal article can be ordered from
http://www.tandfonline.com/pricing/journal/uspp20
DOI: 10.1080/2330443X.2022.2033654
Statistics and Public Policy is currently edited by Eric Sampson