A taxonomy for detecting and preventing temporal data leakage in machine learning-based build prediction: A dual-platform empirical validation

Mishra, Lalit Narayan; Rangari, Amit; Nagrare, Sandesh; Nayak, Saroj Kumar

PLOS ONE, 2026, vol. 21, issue 5, 1-27

Abstract: Modern software development relies on automated build systems that compile and test code whenever developers make changes. Predicting whether these builds will succeed or fail before execution could save computational resources and developer time. However, many machine learning models for build prediction suffer from temporal data leakage, a methodological flaw where the model inadvertently uses information that would only be available after the build completes, producing artificially inflated accuracy that fails in real-world deployment. This study develops a three-type taxonomy to systematically identify and prevent such leakage: (1) Direct Outcome Encoding (using the build result itself as a feature), (2) Execution-Dependent Metrics (information generated during build execution), and (3) Future Information Leakage (using data from chronologically later builds). Applying this taxonomy reveals that prior studies reporting 95–99% accuracy likely used contaminated features, while realistic accuracy is substantially lower. The methodology is validated on 175,706 builds from two open-source CI/CD platforms spanning 10 years: TravisTorrent (100,000 builds, 2013–2017) and GHALogs (75,706 workflows, 2023). Removing leaky features reduces accuracy by 15.07 percentage points on TravisTorrent (97.8% to 82.73%) but only 0.48 points on GHALogs (83.77% to 83.30%), revealing that modern GitHub Actions’ tight integration with repositories enables accurate prediction from static project metadata alone. Using only legitimately available pre-build features, Random Forest classifiers achieve 82.73% (TravisTorrent) and 83.30% (GHALogs) accuracy, sufficient for practical deployment. Surprisingly, project maturity and build history prove more predictive than code complexity metrics, suggesting organizational factors outweigh code quality. The models generalize across programming languages (Java, Ruby, Python, JavaScript) with minimal performance variation. Open-source tools for detecting temporal leakage in any software prediction task are provided.
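
The three leakage types named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' tool: the feature names, the `FEATURE_LEAK_TYPE` catalogue, the `started_at` field, and both helper functions are hypothetical and chosen only for illustration. The sketch assumes each candidate feature has been labeled by hand with one leakage type, and that builds carry a sortable start time so the train/test cut can be made chronologically.

```python
from enum import Enum


class Leak(Enum):
    NONE = "pre-build, legitimately available"
    DIRECT_OUTCOME = "type 1: direct outcome encoding"
    EXECUTION_DEPENDENT = "type 2: execution-dependent metric"
    FUTURE_INFO = "type 3: future information"


# Hypothetical feature catalogue; names are illustrative, not taken from the paper.
FEATURE_LEAK_TYPE = {
    "build_status": Leak.DIRECT_OUTCOME,         # the label itself
    "build_duration_s": Leak.EXECUTION_DEPENDENT,  # known only after the build runs
    "tests_failed": Leak.EXECUTION_DEPENDENT,
    "next_build_status": Leak.FUTURE_INFO,       # comes from a later build
    "team_size": Leak.NONE,
    "src_churn": Leak.NONE,
    "prev_build_failed": Leak.NONE,              # history that precedes this build
}


def audit_features(columns):
    """Return (safe, flagged). Columns missing from the catalogue are flagged for review."""
    safe, flagged = [], {}
    for col in columns:
        leak = FEATURE_LEAK_TYPE.get(col)
        if leak is Leak.NONE:
            safe.append(col)
        else:
            flagged[col] = leak.value if leak else "unclassified: review manually"
    return safe, flagged


def chronological_split(builds, train_fraction=0.8):
    """Sort builds by start time and cut once, so training never sees later builds."""
    ordered = sorted(builds, key=lambda b: b["started_at"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]
```

Calling `audit_features(["team_size", "build_duration_s", "next_build_status"])` would keep only `team_size` and report the other two columns with their leakage type. Flagging unknown columns rather than silently accepting them is a conservative design choice made for this sketch.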

Date: 2026

Downloads: (external link)
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0340167 (text/html)
https://journals.plos.org/plosone/article/file?id= ... 40167&type=printable (application/pdf)

Persistent link: https://EconPapers.repec.org/RePEc:plo:pone00:0340167

DOI: 10.1371/journal.pone.0340167
