A Web Content Extraction Method Base on Punctuation Distribution and HTML Tag Similarity

Gong, Nan; Fan, Chunxiao; Wu, Yuexin; Ming, Yue


Nan Gong, Chunxiao Fan, Yuexin Wu and Yue Ming
Additional contact information
Nan Gong: Beijing University of Posts and Telecommunications
Chunxiao Fan: Beijing University of Posts and Telecommunications
Yuexin Wu: Beijing University of Posts and Telecommunications
Yue Ming: Beijing University of Posts and Telecommunications

A chapter in LISS 2013, 2015, pp 803-810 from Springer

Abstract: Current web content extraction methods mostly focus on single-theme pages and adapt poorly to multi-theme pages. To overcome this issue, this paper proposes a web content extraction method based on punctuation distribution and HTML tag similarity. Because most punctuation appears in the main text areas of a web page and rarely in its noise areas, an algorithm for obtaining the minimum text area is presented. Furthermore, for multi-theme pages, the paper proposes an approach that extracts the title and content of each theme by further dividing the minimum text area into sub-theme areas based on tag similarity. Experimental results show that the proposed method can effectively and accurately extract web content for different themes.
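
The two ideas named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' algorithm, which the abstract does not specify in detail. It assumes that the "minimum text area" can be approximated by the deepest element that still holds a chosen share of the page's punctuation, and that tag similarity can be approximated by comparing the tag sequences of two subtrees. The names `Node`, `TreeBuilder`, `minimum_text_area`, `tag_similarity`, the punctuation set, and the 0.8 threshold are all hypothetical choices made for illustration.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Assumed punctuation set: ASCII marks plus common full-width CJK marks.
PUNCTUATION = set(",.;:!?\u3002\uff0c\uff1b\uff1a\uff01\uff1f")
# Elements that never have a closing tag, so they must not open a subtree.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """A very small DOM-like tree node (hypothetical helper)."""

    def __init__(self, tag, attrs=(), parent=None):
        self.tag = tag
        self.attrs = dict(attrs)
        self.parent = parent
        self.children = []
        self.text = []

    def punct_count(self):
        # Punctuation in this node's own text plus all descendants.
        own = sum(ch in PUNCTUATION for ch in "".join(self.text))
        return own + sum(child.punct_count() for child in self.children)


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML using only the standard library."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        # Script and style bodies are not visible text, so skip them.
        if self.current.tag not in {"script", "style"}:
            self.current.text.append(data)


def minimum_text_area(html, threshold=0.8):
    """Return the deepest node holding at least `threshold` of all punctuation."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    node = builder.root
    total = node.punct_count()
    if total == 0:
        return None
    while True:
        best = max(node.children, key=Node.punct_count, default=None)
        if best is None or best.punct_count() < threshold * total:
            return node
        node = best


def tag_sequence(node):
    """Pre-order list of tag names in the subtree rooted at `node`."""
    seq = [node.tag]
    for child in node.children:
        seq.extend(tag_sequence(child))
    return seq


def tag_similarity(a, b):
    """Similarity in [0, 1] between the tag sequences of two subtrees."""
    return SequenceMatcher(None, tag_sequence(a), tag_sequence(b)).ratio()
```

Under these assumptions, calling `minimum_text_area(page_html)` returns the node that approximates the main text block, and `tag_similarity` can then be applied to pairs of its children to group structurally similar siblings into candidate sub-theme areas.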

Keywords: Web content extraction; Punctuation distribution; Tag similarity; Tag tree
Date: 2015

Persistent link: https://EconPapers.repec.org/RePEc:spr:sprchp:978-3-642-40660-7_120

Ordering information: This item can be ordered from
http://www.springer.com/9783642406607

DOI: 10.1007/978-3-642-40660-7_120
