Binned Term Count: An Alternative to Term Frequency for Text Categorization

Farhan Shehzad, Abdur Rehman, Kashif Javed, Khalid A. Alnowibet, Haroon A. Babri and Hafiz Tayyab Rauf
Additional contact information
Farhan Shehzad: Department of Computer Science, University of Gujrat, Gujrat 50700, Pakistan
Abdur Rehman: Department of Computer Science, University of Gujrat, Gujrat 50700, Pakistan
Kashif Javed: Department of Electrical Engineering, University of Engineering and Technology, Lahore 54890, Pakistan
Khalid A. Alnowibet: Statistics and Operations Research Department, College of Science, King Saud University, Riyadh 11451, Saudi Arabia
Haroon A. Babri: Department of Electrical Engineering, University of Engineering and Technology, Lahore 54890, Pakistan
Hafiz Tayyab Rauf: Independent Researcher, Bradford BD8 0HS, UK

Mathematics, 2022, vol. 10, issue 21, 1-25

Abstract: In text categorization, a well-known problem related to document length is that larger term counts in longer documents cause classification algorithms to become biased. The effect of document length can be eliminated by normalizing term counts, thus reducing the bias towards longer documents. This gives us term frequency (TF), which in conjunction with inverse document frequency (IDF) became the most commonly used term weighting scheme to capture the importance of a term in a document and corpus. However, normalization may cause the term frequency of a term in a related document to become equal to or smaller than its term frequency in an unrelated document, thus perturbing a term's strength from its true worth. In this paper, we solve this problem by introducing a non-linear mapping of term frequency. This alternative to TF is called binned term count (BTC). The newly proposed term frequency factor trims large term counts before normalization, thus moderating the normalization effect on large documents. To investigate the effectiveness of BTC, we compare it against the original TF and its more recently proposed alternative named modified term frequency (MTF). In our experiments, each of these term frequency factors (BTC, TF, and MTF) is combined with four well-known collection frequency factors (IDF, RF, IGM, and MONO), and the performance of each of the resulting term weighting schemes is evaluated on three standard datasets (Reuters (R8-21578), 20-Newsgroups, and WebKB) using support vector machines and K-nearest neighbor classifiers. To determine whether BTC is statistically better than TF and MTF, we have applied the paired two-sided t-test on the macro F1 results. Overall, BTC is found to be 52% statistically significant than TF and MTF. Furthermore, the highest macro F1 value on the three datasets was achieved by BTC-based term weighting schemes.
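The normalization problem described in the abstract can be illustrated with a small sketch. The abstract does not give the BTC binning function, so the code below does not reproduce it: `trim_counts` and its `cap` parameter are hypothetical stand-ins that simply clip each raw count before normalization, and the two toy documents are made up for illustration only.

```python
from collections import Counter


def term_frequency(counts):
    """Length-normalized term frequency: each count divided by the document total."""
    total = sum(counts.values())
    return {term: c / total for term, c in counts.items()}


def trim_counts(counts, cap=3):
    """Hypothetical trimming step: clip each raw count at `cap`.

    This is an assumption for illustration, not the paper's binning function.
    """
    return {term: min(c, cap) for term, c in counts.items()}


# Toy documents (invented): a long document about oil, a short unrelated one.
related = Counter({"oil": 10, "price": 40, "the": 150})
unrelated = Counter({"oil": 1, "cat": 4, "the": 5})

# Plain TF: the long related document ends up with the smaller value for "oil".
tf_related = term_frequency(related)["oil"]      # 10 / 200
tf_unrelated = term_frequency(unrelated)["oil"]  # 1 / 10

# Trimming before normalization: the related document now scores higher.
trimmed_related = term_frequency(trim_counts(related))["oil"]      # 3 / 9
trimmed_unrelated = term_frequency(trim_counts(unrelated))["oil"]  # 1 / 7

print(f"TF:      related={tf_related:.3f}  unrelated={tf_unrelated:.3f}")
print(f"Trimmed: related={trimmed_related:.3f}  unrelated={trimmed_unrelated:.3f}")
```

In this toy case plain TF ranks the unrelated document above the related one for the term "oil", while clipping counts before normalization reverses the order. The choice of cap is arbitrary here; the paper's actual mapping may behave differently.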

Keywords: term frequency; term weighting schemes; bag-of-words model; feature representation; text classification
JEL-codes: C
Date: 2022

Downloads: (external link)
https://www.mdpi.com/2227-7390/10/21/4124/pdf (application/pdf)
https://www.mdpi.com/2227-7390/10/21/4124/ (text/html)

Persistent link: https://EconPapers.repec.org/RePEc:gam:jmathe:v:10:y:2022:i:21:p:4124-:d:963760

Mathematics is currently edited by Ms. Emma He

More articles in Mathematics from MDPI
Bibliographic data for series maintained by MDPI Indexing Manager.

 
Page updated 2025-03-27
Handle: RePEc:gam:jmathe:v:10:y:2022:i:21:p:4124-:d:963760