EconPapers    
Economics at your fingertips  
 

Computing generalized cophenetic distances under all Lp norms: A near-linear time algorithmic framework

Paweł Górecki, Alexey Markin, Sriram Vijendran and Oliver Eulenstein

PLOS Computational Biology, 2025, vol. 21, issue 6, 1-21

Abstract: The cophenetic distance is a well-established metric in biology used to compare pairs of trees represented in a vector format. This distance was introduced by Cardona and his co-authors, building on the foundational work of Sokal and Rohlf, which dates back over 60 years. It is widely recognized for its versatility since it can analyze trees with edge weights using various vector norms. However, when comparing large-scale trees, the quadratic runtime of the current best-known (i.e., naïve) algorithm for computing the cophenetic distance can become prohibitive. Recently, a new algorithmic framework with near-linear time complexity has been developed to calculate the distances of a generalized class of cophenetic distances, which are derived from the work of Sokal and Rohlf. This improvement not only allows the cophenetic distance to be utilized in large-scale studies but also enhances the versatility of these studies by incorporating generalized variants of the cophenetic distance. However, the framework is limited to applying only the L1 and L2 vector norms, which significantly restricts the versatility of generalized cophenetic distances in large-scale applications. To address this limitation, we present a near-linear time algorithmic framework for computing the generalized cophenetic distances across all Lp vector norms. In our scalability study, we showcase the practical performance of our unrestricted algorithmic framework. Furthermore, we investigate the applicability of the generalized cophenetic distances by analyzing the distributions of key components of these distances under various vector norms.Author summary: Biological research often relies on large-scale comparisons of evolutionary trees to extract valuable insights across different subfields of biology. To effectively compare trees on a large scale, sensitive metrics are needed to assess differences in topology and branch lengths alongside efficient computational algorithms. In this study, we focus on the classic cophenetic distance, a metric that compares pairs of trees represented in vector format. The cophenetic distance can analyze trees with edge weights using various vector norms, making it highly versatile. Recently, a fast algorithm was developed to calculate generalized cophenetic distances, including the cophenetic distance itself, which has enabled large-scale studies and expanded the types of cophenetic distances applicable for tree pair comparisons. However, this algorithm is limited to L1 and L2 norms, which greatly reduces the versatility of the cophenetic distance. To overcome this limitation, we introduce a fast algorithm that computes the generalized cophenetic distance for all vector norms, making it suitable for large-scale studies. We also conduct a scalability study to demonstrate the effectiveness of our algorithm in practice, and we analyze the distributions of key representatives of cophenetic distances across various vector norms.

Date: 2025
References: View complete reference list from CitEc
Citations:

Downloads: (external link)
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1013069 (text/html)
https://journals.plos.org/ploscompbiol/article/fil ... 13069&type=printable (application/pdf)

Related works:
This item may be available elsewhere in EconPapers: Search for items with the same title.

Export reference: BibTeX RIS (EndNote, ProCite, RefMan) HTML/Text

Persistent link: https://EconPapers.repec.org/RePEc:plo:pcbi00:1013069

DOI: 10.1371/journal.pcbi.1013069

Access Statistics for this article

More articles in PLOS Computational Biology from Public Library of Science
Bibliographic data for series maintained by ploscompbiol ().

 
Page updated 2025-06-21
Handle: RePEc:plo:pcbi00:1013069