Making Interpretable Discoveries from Unstructured Data: A High-Dimensional Multiple Hypothesis Testing Approach
Jacob Carlson
Papers from arXiv.org
Abstract:
Social scientists are increasingly turning to unstructured datasets to unlock new empirical insights, e.g., estimating descriptive statistics of or causal effects on quantitative measures derived from text, audio, or video data. In many such settings, unsupervised analysis is of primary interest, in that the researcher does not want to (or cannot) pre-specify all important aspects of the unstructured data to measure; they are interested in "discovery." This paper proposes a general and flexible framework for pursuing discovery from unstructured data in a statistically principled way. The framework leverages recent methods from the literature on machine learning interpretability to map unstructured data points to high-dimensional, sparse, and interpretable "dictionaries" of concepts; computes statistics of dictionary entries for testing relevant concept-level hypotheses; performs selective inference on these hypotheses using algorithms validated by new results in high-dimensional central limit theory, producing a selected set ("discoveries"); and both generates and evaluates human-interpretable natural language descriptions of these discoveries. The proposed framework has few researcher degrees of freedom, is fully replicable, and is cheap to implement -- both in terms of financial cost and researcher time. Applications to recent descriptive and causal analyses of unstructured data in empirical economics are explored. An open source Jupyter notebook is provided for researchers to implement the framework in their own projects.
Date: 2025-11, Revised 2026-01
New Economics Papers: this item is included in nep-cmp and nep-ecm
References: View references in EconPapers View complete reference list from CitEc
Citations:
Downloads: (external link)
http://arxiv.org/pdf/2511.01680 Latest version (application/pdf)
Related works:
This item may be available elsewhere in EconPapers: Search for items with the same title.
Export reference: BibTeX
RIS (EndNote, ProCite, RefMan)
HTML/Text
Persistent link: https://EconPapers.repec.org/RePEc:arx:papers:2511.01680
Access Statistics for this paper
More papers in Papers from arXiv.org
Bibliographic data for series maintained by arXiv administrators ().