Silhouette-Based Evaluation of PCA, Isomap, and t-SNE on Linear and Nonlinear Data Structures

Mostafa Zahed and Maryam Skafyan
Additional contact information
Mostafa Zahed: Department of Mathematics & Statistics, East Tennessee State University (ETSU), Johnson City, TN 37614, USA
Maryam Skafyan: Department of Mathematics & Statistics, East Tennessee State University (ETSU), Johnson City, TN 37614, USA

Stats, 2025, vol. 8, issue 4, 1-52

Abstract: Dimensionality reduction is fundamental for analyzing high-dimensional data, supporting visualization, denoising, and structure discovery. We present a systematic, large-scale benchmark of three widely used methods: Principal Component Analysis (PCA), Isometric Mapping (Isomap), and t-Distributed Stochastic Neighbor Embedding (t-SNE). Each method is evaluated by its average silhouette score, which quantifies cluster preservation after embedding. Our full factorial simulation varies sample size n ∈ {100, 200, 300, 400, 500}, noise variance σ² ∈ {0.25, 0.5, 0.75, 1, 1.5, 2}, and feature count p ∈ {20, 50, 100, 200, 300, 400} under four generative regimes: (1) a linear Gaussian mixture, (2) a linear Student-t mixture with heavy tails, (3) a nonlinear Swiss-roll manifold, and (4) a nonlinear concentric-spheres manifold, each replicated 1000 times per condition. Beyond empirical comparisons, we provide mathematical results that explain the observed rankings: under standard separation and sampling assumptions, PCA maximizes silhouettes for linear, low-rank structure, whereas Isomap dominates on smooth curved manifolds; t-SNE prioritizes local neighborhoods, yielding strong local separation but less reliable global geometry. Empirically, PCA consistently achieves the highest silhouettes for linear structure (Isomap second, t-SNE third); on manifolds the ordering reverses (Isomap > t-SNE > PCA). Increasing σ² and adding uninformative dimensions (larger p) degrade all methods, while larger n improves both the level and the stability of the silhouette scores. To our knowledge, this is the first integrated study combining a comprehensive factorial simulation across linear and nonlinear regimes with distribution-based summaries (density and violin plots) and supporting theory that predicts method orderings.
The results offer clear, practice-oriented guidance: prefer PCA when structure is approximately linear; favor manifold learning—especially Isomap—when curvature is present; and use t-SNE for the exploratory visualization of local neighborhoods. Complete tables and replication materials are provided to facilitate method selection and reproducibility.
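
The evaluation described in the abstract rests on two ingredients: a full factorial grid of simulation conditions and the average silhouette score of each embedding. The following is a minimal, standard-library-only sketch of both, not the authors' code. The regime labels, the variable names, and the toy 2-D "embedding" are illustrative assumptions; the silhouette formula used is the usual definition s(i) = (b - a) / max(a, b) with Euclidean distances, and the dimension-reduction step itself (PCA, Isomap, or t-SNE) is deliberately left out.

```python
import itertools
import math

# Design levels as listed in the abstract.
SAMPLE_SIZES = [100, 200, 300, 400, 500]
NOISE_VARIANCES = [0.25, 0.5, 0.75, 1, 1.5, 2]
FEATURE_COUNTS = [20, 50, 100, 200, 300, 400]
# Regime labels are hypothetical names for the four generative regimes.
REGIMES = ["gaussian_mixture", "student_t_mixture",
           "swiss_roll", "concentric_spheres"]

# Full factorial grid: every regime crossed with every (n, sigma^2, p) level.
design = list(itertools.product(REGIMES, SAMPLE_SIZES,
                                NOISE_VARIANCES, FEATURE_COUNTS))


def average_silhouette(points, labels):
    """Mean of s(i) = (b - a) / max(a, b) over all points (Euclidean).

    a: mean distance from point i to the other members of its own cluster.
    b: smallest mean distance from point i to the members of another cluster.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # singleton cluster: s(i) = 0 by convention
        a = sum(math.dist(points[i], points[j])
                for j in own if j != i) / (len(own) - 1)
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        denom = max(a, b)
        if denom > 0:  # guard against all-identical points
            total += (b - a) / denom
    return total / len(points)


# Toy 2-D "embedding" (made-up coordinates): two tight, well-separated groups.
embedding = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good_labels = [0, 0, 1, 1]  # labels agree with the geometry
bad_labels = [0, 1, 0, 1]   # labels cut across the geometry

print(len(design))
print(average_silhouette(embedding, good_labels))
print(average_silhouette(embedding, bad_labels))
```

The grid has 4 × 5 × 6 × 6 = 720 conditions. In the toy example, labels that match the geometry give an average silhouette close to 1, while labels that cut across it give a negative value, which is the sense in which the score measures cluster preservation after embedding.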

Keywords: dimension reduction techniques; linear and nonlinear data structures; Principal Component Analysis (PCA); Isomap; t-Distributed Stochastic Neighbor Embedding (t-SNE)
JEL-codes: C1 C10 C11 C14 C15 C16
Date: 2025

Downloads:
https://www.mdpi.com/2571-905X/8/4/105/pdf (application/pdf)
https://www.mdpi.com/2571-905X/8/4/105/ (text/html)

Persistent link: https://EconPapers.repec.org/RePEc:gam:jstats:v:8:y:2025:i:4:p:105-:d:1786641

Stats is currently edited by Mrs. Minnie Li

Page updated 2025-11-08
Handle: RePEc:gam:jstats:v:8:y:2025:i:4:p:105-:d:1786641