Classification of Cotton Genotypes with Mixed Continuous and Categorical Variables: Application of Machine Learning Models

Sudha Bishnoi, Nadhir Al-Ansari, Mujahid Khan, Salim Heddam and Anurag Malik
Additional contact information
Sudha Bishnoi: Department of Mathematics and Statistics, Chaudhary Charan Singh Haryana Agricultural University, Hisar 125004, Haryana, India
Nadhir Al-Ansari: Department of Civil, Environmental and Natural Resources Engineering, Lulea University of Technology, 97187 Lulea, Sweden
Mujahid Khan: Agricultural Research Station, Sri Karan Narendra Agriculture University, Jobner 332301, Rajasthan, India
Salim Heddam: Agronomy Department, Faculty of Science, Hydraulics Division University, 20 Août 1955, Route El Hadaik, BP 26, Skikda 21024, Algeria
Anurag Malik: Regional Research Station, Punjab Agricultural University, Bathinda 151001, Punjab, India

Sustainability, 2022, vol. 14, issue 20, 1-17

Abstract: Mixed data, a combination of continuous and categorical variables, occur frequently in fields such as agriculture, remote sensing, biology, medical science, and marketing, but only limited work has been done with this type of data. In this study, data on continuous and categorical characters of 452 genotypes of cotton (Gossypium hirsutum) were obtained from an experiment conducted by the Central Institute of Cotton Research (CICR), Sirsa, Haryana (India) during the Kharif season of 2018–2019. The machine learning (ML) classifiers considered for classifying the cotton genotypes were k-nearest neighbor (KNN), Classification and Regression Tree (CART), C4.5, Naïve Bayes, random forest (RF), bagging, and boosting. Their performance was compared with one another and with linear discriminant analysis (LDA) and logistic regression. The models were validated with the holdout method, using an 80:20 split of training and testing data. On the holdout test set, RF and AdaBoost performed best, each with only two misclassifications, an accuracy of 97.26%, and an error rate of 2.74%. The LDA classifier had the lowest accuracy, with nine misclassifications. The other performance measures, namely sensitivity, specificity, precision, F1 score, and G-mean, were considered together to identify the best ML classifier. RF and AdaBoost had the highest values of all the performance measures, with 96.97% sensitivity and 97.50% specificity. Thus, these models were found to be the best at classifying low- and high-yielding cotton genotypes.
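The performance measures named in the abstract can all be derived from a two-class confusion matrix. The sketch below is illustrative only and is not the authors' code. The four counts are an assumption: they are not stated in the abstract, but they are one set of values consistent with the reported figures (two misclassifications, 97.26% accuracy, 96.97% sensitivity, 97.50% specificity). Which of the two yield classes is treated as "positive" is also an assumption, and the function name binary_metrics is hypothetical.

```python
from math import sqrt

# ASSUMPTION: these counts are inferred from the percentages reported in the
# abstract; the abstract itself does not give the confusion matrix.
TP, FN = 32, 1   # 32 / 33 = 96.97% sensitivity
TN, FP = 39, 1   # 39 / 40 = 97.50% specificity


def binary_metrics(tp, fn, tn, fp):
    """Return the standard two-class performance measures as fractions."""
    total = tp + fn + tn + fp
    sensitivity = tp / (tp + fn)          # true positive rate (recall)
    specificity = tn / (tn + fp)          # true negative rate
    precision = tp / (tp + fp)            # positive predictive value
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        # F1 is the harmonic mean of precision and sensitivity.
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
        # G-mean is the geometric mean of sensitivity and specificity.
        "g_mean": sqrt(sensitivity * specificity),
    }


metrics = binary_metrics(TP, FN, TN, FP)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

With these assumed counts, the accuracy, error rate, sensitivity, and specificity round to the values quoted in the abstract; the remaining measures follow from the same four counts.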

Keywords: machine learning classifiers; supervised classification; mixed data; heterogeneous data; cotton genotypes
JEL-codes: O13 Q Q0 Q2 Q3 Q5 Q56
Date: 2022

Downloads: (external link)
https://www.mdpi.com/2071-1050/14/20/13685/pdf (application/pdf)
https://www.mdpi.com/2071-1050/14/20/13685/ (text/html)

Persistent link: https://EconPapers.repec.org/RePEc:gam:jsusta:v:14:y:2022:i:20:p:13685-:d:950039

Sustainability is currently edited by Ms. Alexandra Wu

Bibliographic data for series maintained by MDPI Indexing Manager.

 
Page updated 2025-03-19