Equiflow: An open-source software package for evaluating changes in cohort composition
Jacob Gould Ellen,
Chrystinne Fernandes,
Martin Viola,
Keagan Yap,
Arinda Jordan,
Mutesi Flavia Kirabo,
João Matos,
Pedro Moreira and
Leo Anthony Celi
PLOS Digital Health, 2026, vol. 5, issue 4, 1-15
Abstract:
Clinical research studies routinely apply exclusion criteria and data preprocessing steps that can substantially alter dataset composition, potentially introducing hidden biases that affect validity and generalizability. This is particularly important in artificial intelligence/machine learning (AI/ML) studies where models learn patterns directly from training data. We developed Equiflow, an open-source Python package that automates creation of enhanced participant flow diagrams tracking both sample size and composition changes throughout studies. Equiflow quantifies distributional shifts at each exclusion step and generates visualizations showing how key clinical and demographic variables evolve during participant selection. In a case study of sepsis patients from the eICU database, sequential exclusions reduced the sample from 126,750 to 1,094 patients. Requiring non-missing troponin measurements in the final step of data processing caused substantial demographic shifts that would typically remain invisible in traditional reporting. By making compositional biases visible during cohort construction before modeling begins, Equiflow enables researchers to make informed decisions about analyses and acknowledge limitations in generalizability to their readers. This standardized, open-source approach promotes transparency in clinical research and supports development of more equitable clinical AI systems, addressing a critical need as healthcare increasingly relies on data-driven decision making.
Author summary: Medical research studies filter participants through multiple steps, often removing those with missing data, applying clinical criteria, or excluding based on demographic factors. While each step may seem routine, the cumulative effect can dramatically reshape who remains in the final dataset, introducing hidden biases that undermine study validity and generalizability.
This problem is particularly concerning in AI applications, where algorithms learn directly from training data and can perpetuate healthcare disparities. We developed Equiflow, a free, open-source Python tool that automatically generates visual diagrams tracking how a study population changes at each filtering step. Unlike traditional reporting methods that show only participant counts, Equiflow reveals compositional shifts, such as whether excluding patients with missing lab values disproportionately removes certain demographic groups. We describe two case studies using real-world ICU data showing how routine exclusion criteria can alter fundamental characteristics of a cohort. These shifts, invisible in standard reporting, could affect which patients benefit from resulting clinical tools. By making such biases visible early in the research process, Equiflow enables researchers to make informed decisions and transparently acknowledge limitations in their findings.
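The idea of tracking composition alongside sample size can be illustrated with a minimal sketch. It does not use Equiflow's API, which this record does not describe; the function names `composition` and `track_flow`, the field names, and the toy records are all hypothetical and chosen only for illustration. The sketch uses the Python standard library and reports, after each exclusion step, how many records remain and what share each category holds.

```python
from collections import Counter


def composition(rows, key):
    """Return the share of each category of `key` among `rows`."""
    counts = Counter(row[key] for row in rows)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()} if total else {}


def track_flow(rows, steps, key):
    """Apply exclusion steps in order, recording size and composition after each."""
    report = [("Initial cohort", len(rows), composition(rows, key))]
    for label, keep in steps:
        rows = [row for row in rows if keep(row)]
        report.append((label, len(rows), composition(rows, key)))
    return report


# Synthetic toy records (hypothetical; not data from the paper).
cohort = [
    {"age": 70, "sex": "F", "troponin": 0.04},
    {"age": 55, "sex": "F", "troponin": None},
    {"age": 16, "sex": "M", "troponin": 0.10},
    {"age": 62, "sex": "M", "troponin": 0.02},
    {"age": 48, "sex": "F", "troponin": None},
    {"age": 81, "sex": "M", "troponin": 0.30},
]

# Each step is a label plus a predicate that returns True for records to keep.
steps = [
    ("Adults only", lambda r: r["age"] >= 18),
    ("Non-missing troponin", lambda r: r["troponin"] is not None),
]

report = track_flow(cohort, steps, key="sex")
for label, n, shares in report:
    summary = ", ".join(f"{cat}: {share:.0%}" for cat, share in sorted(shares.items()))
    print(f"{label:<22} n={n}  {summary}")
```

In this toy data, a count-only flow diagram would show 6, then 5, then 3 records. The composition column additionally shows the share of women falling from 60% to 33% at the missing-troponin step, which is the kind of shift the abstract describes as invisible in traditional reporting.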
Date: 2026
Downloads:
https://journals.plos.org/digitalhealth/article?id=10.1371/journal.pdig.0001342 (text/html)
https://journals.plos.org/digitalhealth/article/fi ... 01342&type=printable (application/pdf)
Persistent link: https://EconPapers.repec.org/RePEc:plo:pdig00:0001342
DOI: 10.1371/journal.pdig.0001342