Clustering and Representative Selection for High-Dimensional Data with Human-in-the-Loop
Sheng-Tao Yang,
Jye-Chyi Lu and
Yu-Chung Tsao
Additional contact information
Sheng-Tao Yang: Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, Georgia 30339
Jye-Chyi Lu: Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, Georgia 30339
Yu-Chung Tsao: Department of Industrial Management, National Taiwan University of Science and Technology, Taipei City 106, Taiwan
INFORMS Journal on Data Science, 2025, vol. 4, issue 2, 154-172
Abstract:
This article proposes a novel decision-making procedure called human-in-the-loop clustering and representative selection (HITL-CARS) that incorporates users’ domain knowledge into the analysis of high-dimensional data sets. The proposed method simultaneously clusters strongly correlated variables and estimates a linear regression model using only a few variables selected from the cluster representatives and the independent variables. We formulate the CARS procedure as a mixed-integer programming problem based on penalized likelihood and partition around medoids clustering. After users review the CARS results and provide advice based on their domain knowledge, HITL-CARS refines the analysis to account for their inputs. Simulation studies show that the one-stage CARS performs better than the two-stage group Lasso and clustering representative Lasso on metrics such as true positives, false positives, and exchangeable representative selection. Additionally, sensitivity and parameter misspecification studies demonstrate the robustness of CARS to different preset parameters and provide guidance on how to start and adjust the HITL-CARS procedure. A real-life example using brain mapping data shows that HITL-CARS can aid in discovering important brain regions associated with depression symptoms and provide predictive analytics on cluster representatives.
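To make the ideas in the abstract concrete, the following is a minimal Python sketch of a two-stage "cluster, pick representatives, then Lasso" pipeline, which is the kind of baseline the abstract compares against. It is not the authors' one-stage mixed-integer formulation and it does not implement partition around medoids in full. Everything in it is an assumption made for illustration: the greedy correlation threshold, the medoid rule, the penalty value `lam`, the function names, and the synthetic data.

```python
import numpy as np

def correlation_clusters(X, threshold=0.8):
    """Greedily group columns whose |correlation| with a seed column >= threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    unassigned = list(range(corr.shape[0]))
    clusters = []
    while unassigned:
        seed = unassigned[0]
        members = [j for j in unassigned if corr[seed, j] >= threshold]
        clusters.append(members)
        unassigned = [j for j in unassigned if j not in members]
    return clusters, corr

def medoid(members, corr):
    """Pick the member with the largest total |correlation| to its own cluster."""
    sub = corr[np.ix_(members, members)]
    return members[int(np.argmax(sub.sum(axis=1)))]

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.copy()
    for _ in range(n_iter):
        for j in range(p):
            resid += X[:, j] * beta[j]          # remove column j's contribution
            rho = X[:, j] @ resid / n
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            resid -= X[:, j] * beta[j]          # add the updated contribution back
    return beta

# Synthetic data (hypothetical): two groups of strongly correlated variables
# plus four independent variables; only z1 and the first independent one matter.
rng = np.random.default_rng(0)
n = 60
z1, z2 = rng.normal(size=n), rng.normal(size=n)
group1 = z1[:, None] + 0.1 * rng.normal(size=(n, 3))
group2 = z2[:, None] + 0.1 * rng.normal(size=(n, 3))
indep = rng.normal(size=(n, 4))
X = np.hstack([group1, group2, indep])
y = 2.0 * z1 - 1.5 * indep[:, 0] + 0.1 * rng.normal(size=n)

# Stage 1: cluster correlated columns and keep one representative per cluster.
clusters, corr = correlation_clusters(X, threshold=0.8)
reps = [medoid(members, corr) for members in clusters]

# Stage 2: Lasso on the standardized representatives only.
Xr = X[:, reps]
Xr = (Xr - Xr.mean(axis=0)) / Xr.std(axis=0)
beta = lasso_cd(Xr, y - y.mean(), lam=0.2)

for members, rep, b in zip(clusters, reps, beta):
    print(f"cluster {members} -> representative {rep}, coefficient {b:.3f}")
```

In this sketch any member of a tight cluster would serve almost equally well as its representative, which is one way to read the abstract's notion of exchangeable representatives. A human-in-the-loop step could, for instance, let a user override an entry of `reps` or merge two entries of `clusters` before the second stage is rerun; that interaction is likewise only an assumption about how such feedback might be wired in.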
Keywords: interactive machine learning; Lasso; mixed-integer programming; partition around medoids; large p small n variable selection
Date: 2025
Downloads:
http://dx.doi.org/10.1287/ijds.2022.9014 (application/pdf)
Persistent link: https://EconPapers.repec.org/RePEc:inm:orijds:v:4:y:2025:i:2:p:154-172