Journal of Atmospheric and Environmental Optics, Vol. 12, Issue 3, 230 (2017)
Method of Web Page Text Extraction Based on Text Feature and Page Structure
[1] Liu L, Pu C. XWRAP: an XML-enabled wrapper construction system for Web information sources [C]//Proceedings of the 16th IEEE International Conference on Data Engineering, 2000: 611-620.
[2] Ma Ling, Goharian N, Chowdhury A, et al. Extracting unstructured data from template generated Web documents [C]//Proceedings of the 12th International Conference on Information and Knowledge Management, 2003: 512-515.
[3] Mei Xue, Cheng Xueqi, Guo Yan, et al. Fully automatic wrapper generation for web information extraction [J]. Journal of Chinese Information Processing, 2008, 22(1): 22-29 (in Chinese).
[4] Sun Chengjie, Guan Yi. A statistical approach for content extraction from web page [J]. Journal of Chinese Information Processing, 2004, 18(5): 17-22 (in Chinese).
[5] Sun Hao, Dong Shoubin. Adaptive approach for content extraction based on tag density [J]. Journal of Zhengzhou University, 2009, 41(1): 44-47 (in Chinese).
[6] An Zengwen, Wang Chao, Xu Jiefeng. An approach based on machine learning for information extraction method [J]. Microcomputer & Its Applications, 2010(12): 4-6 (in Chinese).
[7] You Guirong, Lu Yuchang. Extraction of topical information from Chinese web page based on the statistic and machine learning [J]. Journal of Fujian Commercial College, 2009, 4(2): 68-72 (in Chinese).
HU Lulu, LIU Xiaoqin, SUN Kai. Method of Web Page Text Extraction Based on Text Feature and Page Structure[J]. Journal of Atmospheric and Environmental Optics, 2017, 12(3): 230
Received: Jan. 12, 2016
Published Online: Jun. 9, 2017
The Author Email: Lulu HU (hllyyy@mail.ustc.edu.cn)