Silhouette-Based Evaluation of PCA, Isomap, and t-SNE on Linear and Nonlinear Data Structures
Mostafa Zahed and
Maryam Skafyan
Additional contact information
Mostafa Zahed: Department of Mathematics & Statistics, East Tennessee State University (ETSU), Johnson City, TN 37614, USA
Maryam Skafyan: Department of Mathematics & Statistics, East Tennessee State University (ETSU), Johnson City, TN 37614, USA
Stats, 2025, vol. 8, issue 4, 1-52
Abstract:
Dimensionality reduction is fundamental for analyzing high-dimensional data, supporting visualization, denoising, and structure discovery. We present a systematic, large-scale benchmark of three widely used methods, namely Principal Component Analysis (PCA), Isometric Mapping (Isomap), and t-Distributed Stochastic Neighbor Embedding (t-SNE), evaluated by average silhouette scores to quantify cluster preservation after embedding. Our full factorial simulation varies sample size n ∈ {100, 200, 300, 400, 500}, noise variance σ² ∈ {0.25, 0.5, 0.75, 1, 1.5, 2}, and feature count p ∈ {20, 50, 100, 200, 300, 400} under four generative regimes: (1) a linear Gaussian mixture, (2) a linear Student-t mixture with heavy tails, (3) a nonlinear Swiss-roll manifold, and (4) a nonlinear concentric-spheres manifold, each replicated 1000 times per condition. Beyond empirical comparisons, we provide mathematical results that explain the observed rankings: under standard separation and sampling assumptions, PCA maximizes silhouettes for linear, low-rank structure, whereas Isomap dominates on smooth curved manifolds; t-SNE prioritizes local neighborhoods, yielding strong local separation but less reliable global geometry. Empirically, PCA consistently achieves the highest silhouettes for linear structure (Isomap second, t-SNE third); on manifolds the ordering reverses (Isomap > t-SNE > PCA). Increasing σ² and adding uninformative dimensions (larger p) degrade all methods, while larger n improves levels and stability. To our knowledge, this is the first integrated study combining a comprehensive factorial simulation across linear and nonlinear regimes with distribution-based summaries (density and violin plots) and supporting theory that predicts method orderings.
The results offer clear, practice-oriented guidance: prefer PCA when structure is approximately linear; favor manifold learning—especially Isomap—when curvature is present; and use t-SNE for the exploratory visualization of local neighborhoods. Complete tables and replication materials are provided to facilitate method selection and reproducibility.
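As a rough illustration of the evaluation described in the abstract, the sketch below enumerates the full factorial design and computes an average silhouette score from labeled points. It is not the authors' code: the regime labels, function names, and the use of Euclidean distance are assumptions made for illustration, and the silhouette formula is the standard definition s(i) = (b(i) - a(i)) / max(a(i), b(i)), with singleton clusters scored as 0 by convention. The embedding step (PCA, Isomap, or t-SNE) is deliberately left out; the function would be applied to whatever low-dimensional coordinates an embedding produces.

```python
import itertools
import math

# Factor levels as listed in the abstract.
SAMPLE_SIZES = [100, 200, 300, 400, 500]
NOISE_VARIANCES = [0.25, 0.5, 0.75, 1, 1.5, 2]
FEATURE_COUNTS = [20, 50, 100, 200, 300, 400]
# Hypothetical labels for the four generative regimes (names are assumptions).
REGIMES = ["gaussian_mixture", "student_t_mixture", "swiss_roll", "concentric_spheres"]
REPLICATIONS = 1000


def design_grid():
    """Return every (regime, n, sigma2, p) cell of the full factorial design."""
    return list(itertools.product(REGIMES, SAMPLE_SIZES, NOISE_VARIANCES, FEATURE_COUNTS))


def average_silhouette(points, labels):
    """Mean of s(i) = (b - a) / max(a, b) over all points, using Euclidean distance.

    a: mean distance from point i to the other members of its own cluster.
    b: smallest mean distance from point i to the members of any other cluster.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # singleton cluster: s(i) = 0 by convention
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        if denom > 0:  # guard against all-identical points
            total += (b - a) / denom
    return total / len(points)


if __name__ == "__main__":
    grid = design_grid()
    print(len(grid), "conditions,", len(grid) * REPLICATIONS, "simulated datasets")

    # Toy 2-D "embedding": two tight, well-separated groups.
    embedded = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
    print(average_silhouette(embedded, [0, 0, 1, 1]))  # labels match the groups
    print(average_silhouette(embedded, [0, 1, 0, 1]))  # labels cut across the groups
```

With the levels listed in the abstract the grid has 4 × 5 × 6 × 6 = 720 conditions, so 1000 replications per condition correspond to 720,000 simulated datasets. In the toy example, labels that match the two groups give a silhouette close to 1, while labels that cut across them give a negative value, which is the sense in which the score measures cluster preservation after embedding.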
Keywords: dimension reduction techniques; linear and nonlinear data structures; Principal Component Analysis (PCA); Isomap; t-Distributed Stochastic Neighbor Embedding (t-SNE)
JEL-codes: C1 C10 C11 C14 C15 C16
Date: 2025
Downloads: (external link)
https://www.mdpi.com/2571-905X/8/4/105/pdf (application/pdf)
https://www.mdpi.com/2571-905X/8/4/105/ (text/html)
Persistent link: https://EconPapers.repec.org/RePEc:gam:jstats:v:8:y:2025:i:4:p:105-:d:1786641
Stats is currently edited by Mrs. Minnie Li
Bibliographic data for series maintained by MDPI Indexing Manager.