Explainable Identification of Similarities Between Entities for Discovery in Large Text

Akhil Joshi, Sai Teja Erukude and Lior Shamir
Additional contact information
Akhil Joshi: Department of Computer Science, Kansas State University, Manhattan, KS 66502, USA
Sai Teja Erukude: Department of Computer Science, Kansas State University, Manhattan, KS 66502, USA
Lior Shamir: Department of Computer Science, Kansas State University, Manhattan, KS 66502, USA

Future Internet, 2025, vol. 17, issue 4, 1-20

Abstract: With a virtually unlimited number of text documents available in digital format, automatic comparison of textual data is essential for extracting meaningful insights that are difficult to identify manually. Many existing tools, including AI and large language models, struggle to provide precise and explainable insights into textual similarities. In many cases, they determine the similarity between documents as reflected by the text, rather than the similarities between the subjects discussed in those documents. This study addresses these limitations by developing an n-gram analysis framework designed to compare documents automatically and uncover explainable similarities. A scoring formula assigns each n-gram a weight: the weight is higher when the n-gram is more frequent in both documents, and is penalized when the n-gram is more frequent in the English language. Visualization tools such as word clouds enhance the representation of these patterns, providing clearer insights. The findings demonstrate that this framework effectively uncovers similarities between text documents, offering explainable insights that are often difficult to identify manually. This non-parametric approach provides a deterministic solution for identifying similarities across various fields, including biographies, scientific literature, and historical texts. Code for the method is publicly available.
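
The weighting idea described in the abstract can be illustrated with a minimal Python sketch. The abstract does not state the actual scoring formula, so everything here is an assumption made for illustration: the function names `ngrams` and `shared_ngram_scores`, the product of the two document counts as the "frequent in both" term, the division by a background frequency as the "common in English" penalty, the `smoothing` constant, and the toy `background` table. It is not the authors' published implementation.

```python
import re
from collections import Counter


def ngrams(text, n):
    """Count word n-grams in a text (lowercased, letters and apostrophes only)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def shared_ngram_scores(doc_a, doc_b, background, n=2, smoothing=1.0):
    """Rank n-grams that occur in both documents.

    Assumed scoring (hypothetical, not taken from the paper):
        score = count_a * count_b / (background_frequency + smoothing)
    so the score grows when the n-gram is frequent in both documents and
    shrinks when the n-gram is common in general English.
    """
    counts_a = ngrams(doc_a, n)
    counts_b = ngrams(doc_b, n)
    scores = {}
    for gram in counts_a.keys() & counts_b.keys():
        shared = counts_a[gram] * counts_b[gram]
        scores[gram] = shared / (background.get(gram, 0.0) + smoothing)
    # Highest score first; ties broken alphabetically so the output is deterministic.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy inputs. The background frequencies are invented for the example.
doc_a = "Marie Curie studied radioactive decay. She won the Nobel Prize in physics."
doc_b = "Ernest Rutherford studied radioactive decay and won the Nobel Prize in chemistry."
background = {"won the": 50.0, "the nobel": 5.0, "prize in": 20.0}

ranked = shared_ngram_scores(doc_a, doc_b, background)
for gram, score in ranked:
    print(f"{score:.3f}  {gram}")
```

In this toy run, subject-specific bigrams such as "radioactive decay" rank above generic ones such as "won the", because the latter carry a large background frequency. Each score traces back to plain counts, which is what makes this kind of ranking explainable and deterministic.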

Keywords: text analysis; text similarity; text content retrieval; explainable AI
JEL-codes: O3
Date: 2025

Downloads: (external link)
https://www.mdpi.com/1999-5903/17/4/135/pdf (application/pdf)
https://www.mdpi.com/1999-5903/17/4/135/ (text/html)

Persistent link: https://EconPapers.repec.org/RePEc:gam:jftint:v:17:y:2025:i:4:p:135-:d:1618120

Future Internet is currently edited by Ms. Grace You

Bibliographic data for series maintained by MDPI Indexing Manager.

 
Page updated 2025-04-05
Handle: RePEc:gam:jftint:v:17:y:2025:i:4:p:135-:d:1618120