A Scalable Classification Algorithm for Very Large Datasets

Delen, Dursun; Kletke, Marilyn G.; Kim, Jin-Hwa

Dursun Delen, Marilyn G. Kletke and Jin-Hwa Kim
Additional contact information
Dursun Delen: Department of Management Science and Information Systems, Spears School of Business, Oklahoma State University, Tulsa, OK, USA
Marilyn G. Kletke: Department of Management Science and Information Systems, Spears School of Business, Oklahoma State University, Tulsa, OK, USA
Jin-Hwa Kim: School of Business, Sogang University, Seoul, Korea

Journal of Information & Knowledge Management (JIKM), 2005, vol. 04, issue 02, 83-94

Abstract: Today's organisations are collecting and storing massive amounts of data from their customer transactions and e-commerce/e-business applications. Many classification algorithms do not scale well enough to work effectively and efficiently with these very large datasets. This study constructs a new scalable classification algorithm (referred to in this manuscript as the Iterative Refinement Algorithm, or IRA for short) that builds domain knowledge from very large datasets using an iterative inductive learning mechanism. Unlike existing algorithms that build the complete domain knowledge from a dataset all at once, IRA builds the initial domain knowledge from a subset of the available data and then iteratively improves, sharpens and polishes it using chunks from the remaining data. Performance testing of IRA on two datasets (one with approximately five million records for a binary classification problem and another with approximately 600K records for a seven-class classification problem) resulted in more accurate domain knowledge than other prediction methods, including logistic regression, discriminant analysis, neural networks, C5, CART and CHAID. Unlike other classification algorithms whose performance and accuracy deteriorate as data size increases, the efficacy of IRA improves as datasets become significantly larger.
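
Illustration: the chunk-by-chunk idea described in the abstract can be sketched with a toy example. This is not the authors' IRA, whose rule induction and refinement steps are not given in this record. It is a minimal sketch, assuming a simple majority-class rule table stands in for the "domain knowledge". All names here (ChunkedRuleTable, refine, chunks, train_in_chunks) are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def chunks(records, size):
    """Yield consecutive slices of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


class ChunkedRuleTable:
    """Toy rule base (an assumption, not IRA): one rule per feature tuple,
    predicting the majority class seen so far for that tuple."""

    def __init__(self):
        self.counts = defaultdict(Counter)  # feature tuple -> class counts
        self.overall = Counter()            # class counts over all records

    def predict(self, features):
        seen = self.counts.get(features)    # .get avoids creating empty rules
        if seen:
            return seen.most_common(1)[0][0]
        if self.overall:
            # Unseen feature tuple: fall back to the overall majority class.
            return self.overall.most_common(1)[0][0]
        return None                         # nothing learned yet

    def refine(self, chunk):
        """Fold one chunk into the rule base.

        Returns the number of records in the chunk that the rule base
        misclassified before it was updated with that chunk.
        """
        errors = sum(1 for features, label in chunk
                     if self.predict(features) != label)
        for features, label in chunk:
            self.counts[features][label] += 1
            self.overall[label] += 1
        return errors


def train_in_chunks(records, chunk_size):
    """Build the rule base from the first chunk, then refine it with the rest."""
    model = ChunkedRuleTable()
    errors_per_chunk = [model.refine(chunk)
                        for chunk in chunks(records, chunk_size)]
    return model, errors_per_chunk


if __name__ == "__main__":
    # Twelve records: the same four (features, label) pairs repeated three times.
    data = [((0, 0), "a"), ((0, 1), "b"), ((1, 0), "b"), ((1, 1), "a")] * 3
    model, errors = train_in_chunks(data, chunk_size=4)
    print(errors)  # [4, 0, 0]
```

Only one chunk is held in memory at a time, and the per-chunk error count gives a rough view of whether the rule base is still changing as more data arrives. How IRA itself decides what to refine is described in the full paper, not here.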

Keywords: Massive datasets; data mining; rule induction; classification; knowledge bases; refinement techniques
Date: 2005

Downloads: (external link)
http://www.worldscientific.com/doi/abs/10.1142/S0219649205001092
Access to full text is restricted to subscribers

Persistent link: https://EconPapers.repec.org/RePEc:wsi:jikmxx:v:04:y:2005:i:02:n:s0219649205001092

DOI: 10.1142/S0219649205001092

Journal of Information & Knowledge Management (JIKM) is currently edited by Professor Suliman Hawamdeh

More articles in Journal of Information & Knowledge Management (JIKM) from World Scientific Publishing Co. Pte. Ltd.
Bibliographic data for series maintained by Tai Tone Lim.