Evaluating the utility of amino acid similarity-aware kmers to represent TCR repertoires for classification
Hannah Kockelbergh,
Shelley C Evans,
Liam Brierley,
Peter L Green,
Andrea L Jorgensen,
Elizabeth J Soilleux and
Anna Fowler
PLOS Computational Biology, 2026, vol. 22, issue 4, 1-28
Abstract:
Insights gained through interpretation of models trained on the T-cell receptor (TCR) repertoire contribute to advances in understanding of immune-mediated disease. This has the potential to improve diagnostic tests and treatments, particularly for autoimmune diseases. However, TCR repertoire datasets with samples from donors of known autoimmune disease status generally include orders of magnitude fewer samples than TCR sequences. Promising TCR repertoire classification approaches consider relationships between non-identical TCR sequences. In particular, kmer methods demonstrate strong and stable performance for small datasets. We propose a TCR repertoire representation that considers the relationships between amino acids within kmers flexibly and efficiently. XGBoost and logistic regression models are trained and tested on kmer representations of TCR repertoire datasets including samples from patients with coeliac disease as well as donors with previous cytomegalovirus infection. XGBoost models outperform logistic regression, indicating that interactions may be crucial for discriminative ability. We find that a reduced alphabet based on BLOSUM62 can lead to a model with slightly stronger XGBoost testing performance than other kmer features. Though it remains unclear whether there is an amino acid encoding that can substantially improve TCR repertoire classification with reduced alphabet kmers, evidence that this representation enables faster training of XGBoost models in comparison to kmer clusters suggests that our reduced alphabet approach permits wider exploration of amino acid similarity in practice. Finally, we detail motifs which are important in each top-performing XGBoost model and compare them to TCR sequences previously associated with each immune status. 
We highlight the challenge of interpreting non-linear TCR repertoire classification models trained on kmers which, if overcome, could lead to biomarker discovery for autoimmune diseases.
Author summary:
TCR repertoire classification models can provide valuable understanding of autoimmune diseases if they can accurately infer autoimmune disease status and are biologically interpretable. Based on a kmer representation of the TCR repertoire, which, of three popular approaches, has been shown to be the most appropriate for training classification models on smaller datasets, we develop a computationally efficient method of grouping amino acid sequences to add knowledge to immune status classification model inputs. We find that most of the 4mer-based feature types we tested perform well in combination with an XGBoost model, and that applying a halved alphabet of amino acids based on BLOSUM62 may be beneficial or neutral for immune status classification performance. We also consider the effect of models and features on interpretability, and conclude that although some insights may be gained from inspecting feature importance, dedicated explanatory methods are required to truly understand the complex relationships between kmers that are captured by our best-performing XGBoost models. While standard kmer XGBoost models have the shortest training time, our proposed reduced alphabet methodology presents a more efficient alternative to kmer clustering. Future exploration of amino acid similarity with encodings other than those based on Atchley factors or BLOSUM62, as well as of kmer length k, would benefit from our reduced alphabet representation over clustering of kmers.
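As a rough illustration of what a reduced alphabet kmer representation might look like, the sketch below maps each amino acid to a group label, splits each sequence into overlapping 4mers, and computes relative kmer frequencies for one repertoire. The ten groups in `GROUPS` are a hypothetical halved alphabet chosen only for illustration; they are not the BLOSUM62-derived alphabet used in the article, and the function names `reduced_kmers` and `repertoire_features` are likewise assumptions rather than the authors' code.

```python
from collections import Counter

# Hypothetical 10-group alphabet covering the 20 standard amino acids.
# Illustration only: NOT the BLOSUM62-based grouping from the article.
GROUPS = ["AG", "ST", "C", "P", "DE", "NQ", "KR", "H", "ILMV", "FWY"]

# Map every amino acid to the first letter of its group.
REDUCE = {aa: group[0] for group in GROUPS for aa in group}


def reduced_kmers(sequence, k=4):
    """Return overlapping kmers of a sequence after alphabet reduction.

    Raises KeyError for characters outside the 20 standard amino acids.
    """
    reduced = "".join(REDUCE[aa] for aa in sequence)
    return [reduced[i:i + k] for i in range(len(reduced) - k + 1)]


def repertoire_features(sequences, k=4):
    """Relative frequency of each reduced kmer across one repertoire."""
    counts = Counter()
    for sequence in sequences:
        counts.update(reduced_kmers(sequence, k))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


# Sequences that differ only by a within-group substitution (L vs V)
# contribute to the same features.
features = repertoire_features(["CASSL", "CASSV"])
```

Because similar amino acids collapse to one symbol before counting, the feature space shrinks without any pairwise comparison of kmers, which is one plausible reason such a representation could be cheaper than clustering kmers. A feature dictionary like this would still need to be aligned across samples into a matrix before being passed to a classifier.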
Date: 2026
Downloads: (external link)
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1014211 (text/html)
https://journals.plos.org/ploscompbiol/article/fil ... 14211&type=printable (application/pdf)
Persistent link: https://EconPapers.repec.org/RePEc:plo:pcbi00:1014211
DOI: 10.1371/journal.pcbi.1014211
More articles in PLOS Computational Biology from Public Library of Science