Compression-based distance between string data and its application to literary work classification based on authorship

Masaki Ishikawa and Hajime Kawakami

Computational Statistics, 2013, vol. 28, issue 2, pages 851-873

Abstract: There are many well-known document classification/clustering algorithms. This paper focuses on compression-based distances between documents, in particular the normalized compression distance (NCD). The NCD is a popular and powerful metric between strings. A new distance $D_\alpha$ with one parameter $\alpha$ between strings is designed on the basis of the NCD, and several properties of $D_\alpha$ are studied. It is also proved that every pair of strings $(x,y)$ can be plotted on the contour graphs of NCD and $D_\alpha$ (and some other compression-based distances) in a 2-dimensional plane. The distance $D_\alpha(x,y)$ is defined to take a relatively small value if a string $x$ is a portion of a string $y$. Literary works $x$ and $y$ are usually assumed to be written by the same author(s) if $x$ is a portion of $y$. Therefore, it may be appropriate to consider the performance of $D_\alpha$ for literary work classification based on authorship, as a benchmark. An algorithm to determine an appropriate value of $\alpha$ is presented using the contour graphs, and this algorithm does not require knowledge of the names of the authors of each work. According to experimental results of the area under receiver operating characteristic curves and clustering, $D_\alpha$ with such an appropriate value of $\alpha$ performs somewhat better in literary work classification based on authorship. Copyright Springer-Verlag 2013
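
For readers unfamiliar with the NCD mentioned in the abstract, the following is a minimal sketch of its commonly used form, NCD(x, y) = (C(xy) - min{C(x), C(y)}) / max{C(x), C(y)}, where C(.) is the compressed length of a string. The choice of zlib as the compressor, the function names, and the demo strings are assumptions made for illustration only. The abstract does not say which compressor the authors used, and it does not define $D_\alpha$, so that distance is not reproduced here.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input; an assumed stand-in for C(.)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)  # compressed length of the concatenation xy
    return (cxy - min(cx, cy)) / max(cx, cy)


# Made-up demo inputs, not data from the paper.
text = b"the quick brown fox jumps over the lazy dog " * 50
portion = text[: len(text) // 2]  # a string that is a portion of `text`
rng = random.Random(0)
noise = bytes(rng.getrandbits(8) for _ in range(len(text)))  # unrelated bytes

print("NCD(text, portion):", ncd(text, portion))
print("NCD(text, noise):  ", ncd(text, noise))
```

With a real compressor the value is only an approximation of the idealized distance, so it can fall slightly outside [0, 1]. Related inputs should still score clearly lower than unrelated ones.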

Keywords: Compression-based distance; Normalized compression distance (NCD); Literary work classification; Contour graph; Data analysis
Date: 2013

Downloads:
http://hdl.handle.net/10.1007/s00180-012-0332-2
Access to full text is restricted to subscribers.

Persistent link: https://EconPapers.repec.org/RePEc:spr:compst:v:28:y:2013:i:2:p:851-873

Ordering information: This journal article can be ordered from
http://www.springer.com/statistics/journal/180/PS2

DOI: 10.1007/s00180-012-0332-2

Computational Statistics is currently edited by Wataru Sakamoto, Ricardo Cao and Jürgen Symanzik

Bibliographic data for series maintained by Sonal Shukla and Springer Nature Abstracting and Indexing.

 
Page updated 2025-03-20
Handle: RePEc:spr:compst:v:28:y:2013:i:2:p:851-873