EconPapers

Learning and diSentangling patient static information from time-series Electronic hEalth Records (STEER)

Wei Liao and Joel Voldman

PLOS Digital Health, 2024, vol. 3, issue 10, 1-18

Abstract: Recent work in machine learning for healthcare has raised concerns about patient privacy and algorithmic fairness. Previous work has shown that self-reported race can be predicted from medical data that does not explicitly contain racial information. However, the extent of such data identification is unknown, and we lack ways to develop models whose outcomes are minimally affected by this information. Here we systematically investigated the ability of time-series electronic health record data to predict patient static information. We found that not only the raw time-series data but also representations learned by machine learning models can be used to predict a variety of static information, with area under the receiver operating characteristic curve as high as 0.851 for biological sex, 0.869 for binarized age and 0.810 for self-reported race. Such high predictive performance extends to various comorbidity factors and persists even when the model is trained for different tasks, on different cohorts, and with different model architectures and databases. Given the privacy and fairness concerns these findings pose, we developed a variational autoencoder-based approach that learns a structured latent space to disentangle patient-sensitive attributes from the time-series data. Our work thoroughly investigates the ability of machine learning models to encode patient static information from time-series electronic health records and introduces a general approach to protect patient-sensitive information for downstream tasks.

Author summary: It is increasingly apparent that machine learning models for healthcare can predict sensitive information from data that does not explicitly encode it. Well-known examples include self-reported race from various medical imaging modalities, and age and biological sex from retinal fundus images. These findings in turn raise concerns about introducing biases into models or exacerbating health disparities. However, we lack a clear understanding of the extent of the problem, namely what types of sensitive information can be predicted and how this generalizes to different models or datasets, and, critically, approaches to develop models that can make clinical inferences without inferring sensitive information. Here we go beyond these prior studies and thoroughly investigate the ability of machine learning (ML) models to encode a wide range of patient-sensitive information from time-series EHR data, and then, critically, provide a strategy to mitigate such inferences.
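The abstract describes the disentanglement approach only at a high level. As a concrete illustration, below is a minimal, hypothetical PyTorch sketch of one standard way to build such a structured latent space: the VAE latent code is split into a block that a supervised head pushes to carry the sensitive attribute and a residual block that a gradient-reversal adversary discourages from carrying it, so that downstream models can use the residual block alone. All names, dimensions, and loss weights here are illustrative assumptions, not the authors' STEER implementation.

# Illustrative sketch (not the paper's released code): a VAE over time-series
# EHR features whose latent space is split into a "sensitive" block z_s and a
# "residual" block z_r. A supervised head pulls the sensitive attribute into
# z_s, while a gradient-reversal adversary discourages z_r from encoding it.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, negated gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return x

    @staticmethod
    def backward(ctx, grad):
        return -grad


class DisentanglingVAE(nn.Module):
    def __init__(self, n_features, hidden=64, z_sens=8, z_res=24, n_classes=2):
        super().__init__()
        self.encoder = nn.GRU(n_features, hidden, batch_first=True)
        self.to_mu = nn.Linear(hidden, z_sens + z_res)
        self.to_logvar = nn.Linear(hidden, z_sens + z_res)
        self.decoder_rnn = nn.GRU(z_sens + z_res, hidden, batch_first=True)
        self.decoder_out = nn.Linear(hidden, n_features)
        self.sens_head = nn.Linear(z_sens, n_classes)   # attribute read from z_s
        self.adv_head = nn.Linear(z_res, n_classes)     # adversary probing z_r
        self.z_sens = z_sens

    def forward(self, x):
        # x: (batch, time, features)
        _, h = self.encoder(x)                           # h: (1, batch, hidden)
        h = h.squeeze(0)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        z_s, z_r = z[:, :self.z_sens], z[:, self.z_sens:]
        # Broadcast the latent code over time and decode the sequence.
        z_seq = z.unsqueeze(1).repeat(1, x.size(1), 1)
        dec, _ = self.decoder_rnn(z_seq)
        x_hat = self.decoder_out(dec)
        return x_hat, mu, logvar, self.sens_head(z_s), self.adv_head(GradReverse.apply(z_r))


def loss_fn(x, x_hat, mu, logvar, sens_logits, adv_logits, attr, beta=1.0, lam=1.0):
    recon = F.mse_loss(x_hat, x)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    sens = F.cross_entropy(sens_logits, attr)            # pull the attribute into z_s
    adv = F.cross_entropy(adv_logits, attr)              # reversed gradient scrubs z_r
    return recon + beta * kl + sens + lam * adv


if __name__ == "__main__":
    model = DisentanglingVAE(n_features=10)
    x = torch.randn(4, 48, 10)                           # e.g. 4 stays, 48 hourly steps
    attr = torch.randint(0, 2, (4,))                     # e.g. binarized age
    out = model(x)
    print(loss_fn(x, *out, attr).item())

In this sketch, downstream clinical predictors would be trained on z_r alone. Gradient reversal is only one way to realize the adversarial term; mutual-information penalties or a separately trained adversary are common alternatives with the same goal.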

Date: 2024

Downloads: (external link)
https://journals.plos.org/digitalhealth/article?id=10.1371/journal.pdig.0000640 (text/html)
https://journals.plos.org/digitalhealth/article/fi ... 00640&type=printable (application/pdf)

Persistent link: https://EconPapers.repec.org/RePEc:plo:pdig00:0000640

DOI: 10.1371/journal.pdig.0000640

More articles in PLOS Digital Health from Public Library of Science
Bibliographic data for series maintained by digitalhealth.

 
Handle: RePEc:plo:pdig00:0000640