A Web Content Extraction Method Base on Punctuation Distribution and HTML Tag Similarity
Nan Gong,
Chunxiao Fan,
Yuexin Wu and
Yue Ming
Additional contact information
Nan Gong: Beijing University of Posts and Telecommunications
Chunxiao Fan: Beijing University of Posts and Telecommunications
Yuexin Wu: Beijing University of Posts and Telecommunications
Yue Ming: Beijing University of Posts and Telecommunications
A chapter in LISS 2013, 2015, pp 803-810 from Springer
Abstract:
Currently, web content extraction methods mostly focus on single-theme pages and adapt poorly to multi-theme pages. To overcome this issue, this paper proposes a web content extraction method based on punctuation distribution and HTML tag similarity. Because most punctuation appears in the main text areas of web pages and rarely in noise areas, an algorithm for obtaining the minimum text area is presented. Furthermore, for multi-theme pages, the paper proposes an approach that extracts the title and content of each theme by further dividing the minimum text area into sub-theme areas based on tag similarity. Experimental results show that the proposed method can effectively and accurately extract web content across different themes.
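The abstract does not give the algorithm's details, so the following is only a minimal sketch of the two ideas it names, under stated assumptions. It treats the "minimum text area" as the shortest block element that still holds most of the page's punctuation, and it scores tag similarity by comparing the tag sequences of two blocks. The punctuation set, the list of block tags, the 0.8 coverage threshold, and the names count_punctuation, BlockCollector, minimum_text_area and tag_similarity are all hypothetical choices, not taken from the paper.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Assumption: the punctuation set and the tags treated as candidate areas
# are illustrative choices, not values from the paper.
PUNCTUATION = set(",.;:!?\u3001\u3002\uff0c\uff1b\uff1a\uff01\uff1f")
BLOCK_TAGS = {"div", "td", "section", "article"}


def count_punctuation(text):
    """Count the characters of `text` that belong to PUNCTUATION."""
    return sum(1 for ch in text if ch in PUNCTUATION)


class BlockCollector(HTMLParser):
    """Record the text and descendant tag sequence of every block element."""

    def __init__(self):
        super().__init__()
        self.open_blocks = []  # blocks whose end tag has not been seen yet
        self.blocks = []       # finished blocks, in closing order
        self.page_text = []    # all text nodes of the page

    def handle_starttag(self, tag, attrs):
        # Every currently open block gains this tag as a descendant.
        for block in self.open_blocks:
            block["tags"].append(tag)
        if tag in BLOCK_TAGS:
            self.open_blocks.append(
                {"tag": tag, "attrs": dict(attrs), "text": [], "tags": []}
            )

    def handle_endtag(self, tag):
        # Assumes well-formed nesting of block tags.
        if self.open_blocks and self.open_blocks[-1]["tag"] == tag:
            block = self.open_blocks.pop()
            block["text"] = "".join(block["text"])
            self.blocks.append(block)

    def handle_data(self, data):
        self.page_text.append(data)
        for block in self.open_blocks:
            block["text"].append(data)


def minimum_text_area(html, coverage=0.8):
    """Return the shortest block holding at least `coverage` of the page's
    punctuation, or None if the page has no punctuation or no such block."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    total = count_punctuation("".join(parser.page_text))
    if total == 0:
        return None
    candidates = [
        b for b in parser.blocks
        if count_punctuation(b["text"]) >= coverage * total
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda b: len(b["text"]))


def tag_similarity(tags_a, tags_b):
    """Similarity in [0, 1] between two tag sequences (difflib ratio)."""
    return SequenceMatcher(None, tags_a, tags_b).ratio()
```

In this sketch, sibling blocks inside the returned area whose tag sequences score close to 1.0 under tag_similarity would be candidates for separate sub-theme areas; how the paper actually splits them and picks titles is not stated in the abstract.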
Keywords: Web content extraction; Punctuation distribution; Tag similarity; Tag tree
Date: 2015
Persistent link: https://EconPapers.repec.org/RePEc:spr:sprchp:978-3-642-40660-7_120
Ordering information: This item can be ordered from
http://www.springer.com/9783642406607
DOI: 10.1007/978-3-642-40660-7_120