Annotated suffix tree as a way of text representation for information retrieval in text collections

D.S. Frolov

Additional contact information
Frolov D.S.: National Research University Higher School of Economics

Бизнес-информатика, 2015, issue 4 (34), 63-70

Abstract: A method for information retrieval based on annotated suffix trees (AST) is presented. The method relies on a string-to-document relevance score calculated using the AST, together with reverse indexing of text fragments to improve performance. We developed a search engine based on the method and compared it with two other popular aggregate text representation techniques: probabilistic latent semantic indexing (PLSI) and latent Dirichlet allocation (LDA). The computational experiments used real data: an online store's XML catalogs and collections of web pages (both in Russian), and real user queries from the Yandex.Wordstat service. We assessed quality using both point estimates and graphical representations. The AST-based method generally produces results similar to those of the other methods, and for inaccurate queries its results are superior. The AST-based method is slightly slower than the PLSI/LDA-based methods. Because the average query execution time correlates with the lengths of the strings used at the AST construction phase, the algorithm can be sped up by dividing the texts into smaller fragments at the preprocessing stage. However, search quality may suffer if the fragments are too short. These results demonstrate the applicability of annotated suffix tree techniques to text retrieval problems and show that the AST-based method has significant advantages for fuzzy search.
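
The abstract does not give the scoring formula, so the following is only a minimal sketch of one common way to build an annotated suffix tree and score a query string against it. It makes several assumptions that are not taken from the article: the tree is an uncompressed, character-level trie; each node is annotated with the number of suffixes passing through it; a suffix is scored by the mean conditional frequency along its matched path; and the query score is the mean over all query suffixes. The names `build_ast`, `relevance`, and `split_fragments` are hypothetical, and the fragment reverse index mentioned in the abstract is not modelled here.

```python
class Node:
    """Trie node annotated with the number of suffixes that pass through it."""
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}
        self.count = 0


def split_fragments(text, words_per_fragment=3):
    """Split a text into short word-based fragments (assumed preprocessing step)."""
    words = text.lower().split()
    return [" ".join(words[i:i + words_per_fragment])
            for i in range(0, len(words), words_per_fragment)]


def build_ast(fragments):
    """Insert every suffix of every fragment and count node visits."""
    root = Node()
    for frag in fragments:
        for start in range(len(frag)):
            root.count += 1  # root frequency = total number of inserted suffixes
            node = root
            for ch in frag[start:]:
                node = node.children.setdefault(ch, Node())
                node.count += 1
    return root


def _suffix_score(root, s):
    """Mean of count(child) / count(parent) along the longest matched prefix of s."""
    node, total, depth = root, 0.0, 0
    for ch in s:
        child = node.children.get(ch)
        if child is None:
            break
        total += child.count / node.count
        node = child
        depth += 1
    return total / depth if depth else 0.0


def relevance(root, query):
    """Average the suffix scores over all suffixes of the query (assumed scoring)."""
    if not query:
        return 0.0
    return sum(_suffix_score(root, query[i:]) for i in range(len(query))) / len(query)


if __name__ == "__main__":
    # Hypothetical toy documents; one tree is built per document.
    docs = {
        "d1": "annotated suffix tree for text retrieval",
        "d2": "latent dirichlet allocation topic model",
    }
    trees = {name: build_ast(split_fragments(text)) for name, text in docs.items()}
    query = "sufix tree"  # deliberately misspelled to mimic an inaccurate query
    ranking = sorted(trees, key=lambda name: relevance(trees[name], query), reverse=True)
    print(ranking)
```

Under these assumptions, a misspelled query still matches many partial suffixes, so its score stays positive, which is one plausible reading of why the abstract reports an advantage for inaccurate queries. The `words_per_fragment` parameter illustrates the trade-off the abstract describes: shorter fragments give shallower trees but also shorter matched contexts.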

Keywords: TEXT DOCUMENT RETRIEVAL; AGGREGATE TEXT REPRESENTATION; ANNOTATED SUFFIX TREE (AST); PROBABILISTIC LATENT SEMANTIC INDEXING (PLSI); LATENT DIRICHLET ALLOCATION (LDA); FUZZY TEXT SEARCH
Date: 2015

Downloads: (external link)
http://cyberleninka.ru/article/n/annotated-suffix- ... -in-text-collections

Persistent link: https://EconPapers.repec.org/RePEc:scn:025686:16374355

More articles in Бизнес-информатика from CyberLeninka, Federal State Autonomous Educational Institution of Higher Education "National Research University Higher School of Economics"
Bibliographic data for series maintained by CyberLeninka.