Explosion of formulaic research articles, including inappropriate study designs and false discoveries, based on the NHANES US national health database
Tulsi Suchak,
Anietie E Aliu,
Charlie Harrison,
Reyer Zwiggelaar,
Nophar Geifman and
Matt Spick
PLOS Biology, 2025, vol. 23, issue 5, 1-15
Abstract:
With the growth of artificial intelligence (AI)-ready datasets such as the National Health and Nutrition Examination Survey (NHANES), new opportunities for data-driven research are being created, but so are risks of data exploitation by paper mills. In this work, we focus on two areas of potential concern for AI-supported research efforts. First, we describe the production of large numbers of formulaic single-factor analyses, relating single predictors to specific health conditions, where multifactorial approaches would be more appropriate. Employing AI-supported single-factor approaches removes context from research, fails to capture interactions, avoids false discovery correction, and is an approach that can easily be adopted by paper mills. Second, we identify risks of selective data usage, such as analyzing limited date ranges or cohort subsets without clear justification, suggestive of data dredging and post-hoc hypothesis formation. Using a systematic literature search for single-factor analyses, we identified 341 NHANES-derived research papers published over the past decade, each proposing an association between a predictor and a health condition from the wide range contained within NHANES. We found evidence that research failed to take account of multifactorial relationships, that manuscripts did not account for the risks of false discoveries, and that researchers selectively extracted data from NHANES rather than utilizing the full range of data available.
Given the explosion of AI-assisted productivity in published manuscripts (the systematic search strategy used here identified an average of 4 papers per annum from 2014 to 2021, but 190 in 2024 up to 9 October alone), we highlight a set of best practices to address these concerns, aimed at researchers, data controllers, publishers, and peer reviewers, to encourage improved statistical practices and mitigate the risks of paper mills using AI-assisted workflows to introduce low-quality manuscripts to the scientific literature.
The combination of AI and national health databases offers opportunities, but may also be exploited by unethical agents. This study shows that there has been an explosion of formulaic research articles, including inappropriate study designs and false discoveries, that use data from the US NHANES resource, providing a case study in new unethical research practices, including paper mills.
Date: 2025
Downloads: (external link)
https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3003152 (text/html)
https://journals.plos.org/plosbiology/article/file ... 03152&type=printable (application/pdf)
Persistent link: https://EconPapers.repec.org/RePEc:plo:pbio00:3003152
DOI: 10.1371/journal.pbio.3003152