Automatic Information Extraction from the Web: An HMM-Based Approach
M. S. Tran-Le,
T. T. Vo-Dang,
Quan Ho-Van and
T. K. Dang
Additional contact information
M. S. Tran-Le: University of Technology, Faculty of CSE
T. T. Vo-Dang: University of Technology, Faculty of CSE
Quan Ho-Van: University of Technology, Faculty of CSE
T. K. Dang: University of Technology, Faculty of CSE
A chapter in Modeling, Simulation and Optimization of Complex Processes, 2008, pp 575-585 from Springer
Abstract:
With the continued growth of the Internet and the huge amount of data available on it, extracting meaningful information from the Web has attracted wide interest in both the research community and business organizations. Although a number of previous research works exist, to the best of our knowledge none of them is flexible enough to fulfill users' requirements across a variety of application domains. In this paper, we discuss and propose a general, extensible and dynamic approach based on the hidden Markov model (HMM) to facilitate efficient information extraction from HTML pages. The proposed approach helps experts build an HMM from the necessary specifications, train the system search engine, and extract meaningful information from HTML pages with high precision and at a reasonable cost. More importantly, the approach can be employed to support building knowledge bases for the next generation of Web applications, i.e. the Semantic Web. We developed and evaluated this model on a prototype, called PriceSearch, that extracts price information for goods such as Nokia mobile phones, computer mice and digital cameras. Experimental results confirm the efficiency of our theoretical analyses and approach.
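As a rough illustration of the kind of HMM-based extraction the abstract describes, the sketch below labels the tokens of a page fragment as PRICE or OTHER with Viterbi decoding and keeps the PRICE tokens. It is not the chapter's model: the two states, the `observe` token classifier, and every probability in `START`, `TRANS` and `EMIT` are hypothetical values chosen by hand for the example, whereas the abstract indicates that the real system builds its HMM from expert specifications and trains it.

```python
import math

# Hypothetical two-state model: a token is either part of a price or not.
STATES = ["OTHER", "PRICE"]

# All probabilities below are invented for illustration, not learned from data.
START = {"OTHER": 0.9, "PRICE": 0.1}
TRANS = {
    "OTHER": {"OTHER": 0.8, "PRICE": 0.2},
    "PRICE": {"OTHER": 0.6, "PRICE": 0.4},
}
EMIT = {
    "OTHER": {"word": 0.85, "number": 0.10, "currency": 0.05},
    "PRICE": {"word": 0.05, "number": 0.55, "currency": 0.40},
}


def observe(token):
    """Map a raw token to a coarse observation symbol (assumed scheme)."""
    if token.startswith("$") or token.upper() in {"USD", "EUR", "VND"}:
        return "currency"
    if token.replace(".", "", 1).replace(",", "").isdigit():
        return "number"
    return "word"


def viterbi(tokens):
    """Return the most likely state sequence for the tokens."""
    if not tokens:
        return []
    obs = [observe(t) for t in tokens]
    # Work in log space so long inputs do not underflow.
    best = [{s: math.log(START[s]) + math.log(EMIT[s][obs[0]]) for s in STATES}]
    back = []
    for o in obs[1:]:
        scores, pointers = {}, {}
        for s in STATES:
            # Best predecessor state for landing in state s at this step.
            prev = max(STATES, key=lambda p: best[-1][p] + math.log(TRANS[p][s]))
            scores[s] = best[-1][prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][o])
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final state.
    state = max(STATES, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def extract_prices(tokens):
    """Keep only the tokens that the decoder labels as PRICE."""
    return [t for t, s in zip(tokens, viterbi(tokens)) if s == "PRICE"]
```

With these hand-picked parameters, `extract_prices("Nokia phone price 150 USD free shipping".split())` returns `['150', 'USD']`. A real extractor would work on tokens taken from parsed HTML and would estimate the probabilities from labeled pages rather than fixing them by hand.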
Keywords: Hidden Markov Model; Information Extraction; Viterbi Algorithm; Computer Mouse; Input String
Date: 2008
Persistent link: https://EconPapers.repec.org/RePEc:spr:sprchp:978-3-540-79409-7_43
Ordering information: This item can be ordered from
http://www.springer.com/9783540794097
DOI: 10.1007/978-3-540-79409-7_43