Journal of Atmospheric and Environmental Optics, Vol. 12, Issue 3, 230 (2017)

Method of Web Page Text Extraction Based on Text Feature and Page Structure

Lulu HU1,*, Xiaoqin LIU1, and Kai SUN2
Author Affiliations
  • 1[in Chinese]
  • 2[in Chinese]

    Web information extraction has long been an active research topic in information technology, and in recent years the DIV + CSS layout method has become common in web design. On this basis, a simple and practical method for extracting the main text of news web pages, based on text features and page structure, is presented. The content block of the page is first identified and extracted; regular expressions are then used to filter the HTML tags out of that block, leaving the main text of the web page. Experimental results show that the method generalizes well across pages and achieves a high accuracy rate in text extraction.
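The two steps named in the abstract (locate the content block, then strip its HTML tags with regular expressions) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not say which text features are used, so the scoring rule below (sentence punctuation count multiplied by text length per tag) is an assumption, as are the names DivCollector, score, strip_tags, extract_main_text and the sample page.

```python
import re
from html import unescape
from html.parser import HTMLParser

TAG_RE = re.compile(r"<[^>]+>")
NOISE_RE = re.compile(r"<(script|style)\b.*?</\1\s*>", re.I | re.S)
BREAK_RE = re.compile(r"</p\s*>|<br\s*/?>", re.I)
PUNCT_RE = re.compile(r"[,.;:!?，。；：！？]")


class DivCollector(HTMLParser):
    """Collects the inner HTML of every <div> on the page."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.open_divs = []  # one list of raw HTML pieces per open <div>
        self.blocks = []     # inner HTML of every closed <div>

    def _emit(self, raw):
        # Every currently open <div> contains this piece of markup.
        for pieces in self.open_divs:
            pieces.append(raw)

    def handle_starttag(self, tag, attrs):
        self._emit(self.get_starttag_text())
        if tag == "div":
            self.open_divs.append([])

    def handle_startendtag(self, tag, attrs):
        self._emit(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "div" and self.open_divs:
            self.blocks.append("".join(self.open_divs.pop()))
        self._emit(f"</{tag}>")

    def handle_data(self, data):
        self._emit(data)

    def handle_entityref(self, name):
        self._emit(f"&{name};")

    def handle_charref(self, name):
        self._emit(f"&#{name};")


def strip_tags(fragment):
    """Step 2: filter HTML tags with regular expressions."""
    fragment = NOISE_RE.sub("", fragment)     # drop script/style bodies
    fragment = BREAK_RE.sub("\n", fragment)   # keep paragraph boundaries
    text = unescape(TAG_RE.sub("", fragment))
    lines = (" ".join(line.split()) for line in text.splitlines())
    return "\n".join(line for line in lines if line)


def score(fragment):
    """Assumed text feature: punctuation count times text length per tag."""
    text = strip_tags(fragment)
    tags = len(TAG_RE.findall(fragment))
    punct = len(PUNCT_RE.findall(text))
    return punct * len(text) / (tags + 1)


def extract_main_text(html):
    """Step 1: pick the best-scoring <div>; step 2: strip its tags."""
    collector = DivCollector()
    collector.feed(html)
    collector.close()
    if not collector.blocks:
        return strip_tags(html)
    return strip_tags(max(collector.blocks, key=score))


# Hypothetical sample page, for illustration only.
SAMPLE = """<html><body><div id="page">
  <div id="nav"><a href="/">Home</a><a href="/news">News</a></div>
  <div id="content"><p>The council approved the budget on Monday, ending weeks of debate.</p><p>The plan takes effect in July.</p></div>
  <div id="footer">Copyright 2017</div>
</div></body></html>"""

if __name__ == "__main__":
    print(extract_main_text(SAMPLE))
```

Dividing by the tag count is what keeps the outer wrapper DIV from winning merely because it contains everything; navigation and footer blocks score zero here because they carry no sentence punctuation. Real pages would likely need additional features, such as link density.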

    HU Lulu, LIU Xiaoqin, SUN Kai. Method of Web Page Text Extraction Based on Text Feature and Page Structure[J]. Journal of Atmospheric and Environmental Optics, 2017, 12(3): 230

    Paper Information

    Received: Jan. 12, 2016

    Published Online: Jun. 9, 2017

    The Author Email: Lulu HU (hllyyy@mail.ustc.edu.cn)

    DOI: 10.3969/j.issn.1673-6141.2017.03.009
