Spec2Vec: Improved mass spectral similarity scoring through learning of structural relationships

Huber, Florian; Ridder, Lars; Verhoeven, Stefan; Spaaks, Jurriaan H; Diblen, Faruk; Rogers, Simon; van der Hooft, Justin J J

PLOS Computational Biology, 2021, vol. 17, issue 2, 1-18

Abstract: Spectral similarity is used as a proxy for structural similarity in many tandem mass spectrometry (MS/MS) based metabolomics analyses such as library matching and molecular networking. Although weaknesses in the relationship between spectral similarity scores and the true structural similarities have been described, little development of alternative scores has been undertaken. Here, we introduce Spec2Vec, a novel spectral similarity score inspired by a natural language processing algorithm, Word2Vec. Spec2Vec learns fragmental relationships within a large set of spectral data to derive abstract spectral embeddings that can be used to assess spectral similarities. Using data derived from GNPS MS/MS libraries including spectra for nearly 13,000 unique molecules, we show how Spec2Vec scores correlate better with structural similarity than cosine-based scores. We demonstrate the advantages of Spec2Vec in library matching and molecular networking. Spec2Vec is also computationally more scalable, allowing structural analogue searches in large databases within seconds.

Author summary: Most metabolomics analyses rely on matching observed fragmentation mass spectra to library spectra for structural annotation, or on comparing spectra with each other through network analysis. As a key part of such processes, scoring functions are used to assess the similarity between pairs of fragment spectra. No studies have so far proposed scores fundamentally different from the popular cosine-based similarity score, despite the fact that its limitations are well understood. We propose a novel spectral similarity score known as Spec2Vec, which adapts algorithms from natural language processing to learn relationships between peaks from co-occurrences across large spectral datasets. We find that similarities computed with Spec2Vec i) correlate better with structural similarity than cosine-based scores, ii) subsequently give better performance in library matching tasks, and iii) are computationally more scalable than cosine-based scores. Given the central place of similarity scoring in key metabolomics analysis tasks such as library matching and spectral networking, we expect Spec2Vec to make a broad impact in all fields that rely upon untargeted metabolomics.
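
The core idea of learning relationships between peaks from their co-occurrence across many spectra can be illustrated with a toy sketch in plain Python. This is not the authors' implementation: plain co-occurrence counts stand in for Word2Vec training, and the `peak@...` word format, the one-decimal binning, the intensity weighting, and all m/z values are assumptions made for illustration only.

```python
import math
from itertools import combinations

def peaks_to_words(peaks, decimals=1):
    """Turn (m/z, intensity) pairs into (word, intensity) pairs, one word per peak."""
    return [(f"peak@{mz:.{decimals}f}", intensity) for mz, intensity in peaks]

def cooccurrence_vectors(documents):
    """Crude stand-in for Word2Vec: a word's vector counts the words it appears with."""
    vocab = sorted({word for doc in documents for word, _ in doc})
    index = {word: i for i, word in enumerate(vocab)}
    vectors = {word: [0.0] * len(vocab) for word in vocab}
    for doc in documents:
        words = sorted({word for word, _ in doc})
        for word in words:
            vectors[word][index[word]] += 1.0  # count the word's own occurrence
        for a, b in combinations(words, 2):
            vectors[a][index[b]] += 1.0
            vectors[b][index[a]] += 1.0
    return vectors

def spectrum_embedding(doc, vectors):
    """Intensity-weighted sum of the word vectors of all known peaks in a spectrum."""
    dim = len(next(iter(vectors.values())))
    embedding = [0.0] * dim
    for word, intensity in doc:
        for i, value in enumerate(vectors.get(word, [])):
            embedding[i] += intensity * value
    return embedding

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical training spectra: peaks at 100 and 150 tend to occur together,
# as do peaks at 300 and 350.
library = [
    [(100.0, 1.0), (150.0, 0.5)],
    [(100.0, 0.8), (150.0, 0.6), (200.0, 0.3)],
    [(300.0, 1.0), (350.0, 0.4)],
    [(300.0, 0.7), (350.0, 0.9)],
]
vectors = cooccurrence_vectors([peaks_to_words(s) for s in library])

# Three hypothetical single-peak query spectra that share no peaks with each other.
doc_a = peaks_to_words([(100.0, 1.0)])
doc_b = peaks_to_words([(150.0, 1.0)])
doc_c = peaks_to_words([(300.0, 1.0)])
shared_ab = {w for w, _ in doc_a} & {w for w, _ in doc_b}

sim_ab = cosine(spectrum_embedding(doc_a, vectors), spectrum_embedding(doc_b, vectors))
sim_ac = cosine(spectrum_embedding(doc_a, vectors), spectrum_embedding(doc_c, vectors))
print(f"shared peaks A/B: {len(shared_ab)}, sim(A, B) = {sim_ab:.2f}, sim(A, C) = {sim_ac:.2f}")
```

Spectra A and B have no peak in common, so a score based purely on matching peaks would rate them as unrelated. The embedding-based score still rates them as similar because their peaks co-occur in the training spectra, while A and C, whose peaks never co-occur, score zero.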

Date: 2021

Downloads: (external link)
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008724 (text/html)
https://journals.plos.org/ploscompbiol/article/fil ... 08724&type=printable (application/pdf)

Persistent link: https://EconPapers.repec.org/RePEc:plo:pcbi00:1008724

DOI: 10.1371/journal.pcbi.1008724
