Journal of Atmospheric and Environmental Optics, Vol. 12, Issue 3, 230 (2017)

Method of Web Page Text Extraction Based on Text Feature and Page Structure

Lulu HU1,*, Xiaoqin LIU1 and Kai SUN2
Author Affiliations
  • 1[in Chinese]
  • 2[in Chinese]

    HU Lulu, LIU Xiaoqin, SUN Kai. Method of Web Page Text Extraction Based on Text Feature and Page Structure[J]. Journal of Atmospheric and Environmental Optics, 2017, 12(3): 230

    Paper Information

    Received: Jan. 12, 2016

    Published Online: Jun. 9, 2017

    The Author Email: Lulu HU (hllyyy@mail.ustc.edu.cn)

    DOI:10.3969/j.issn.1673-6141.2017.03.009
