Original Articles

N-grams based feature selection and text representation for Chinese Text Classification

Zhihua Wei (1, 2, 3), Duoqian Miao (1, 2), Jean-Hugues Chauchat (3), Rui Zhao (1) & Wen Li (1, 2)

(1) Department of Computer Science and Engineering, Tongji University, Cao'an Road, 4800, Shanghai, 201804, China. E-mail: zhihua.wei@hotmail.com, miaoduoqian@163.com, zhaorui1@126.com
(2) Key laboratory “Embedded System and Service Computing”, Ministry of Education, Tongji University, Cao'an Road, 4800, Shanghai, 201804, China
(3) Université de Lyon, Laboratoire ERIC-Lyon2, avenue Pierre Mendès-France, 5, Bron Cedex, 69676, France. E-mail: jean-hugues.chauchat@univ-lyon2.fr
Pages 365-374 | Received 30 Dec 2008, Accepted 28 May 2009, Published online: 12 Mar 2012
 

Abstract

This paper discusses text representation and feature selection strategies for Chinese text classification based on n-grams. A two-step feature selection strategy is proposed that combines preprocessing within classes with feature selection among classes. Four feature selection methods and three text representation weights are compared by exhaustive experiments. Both a C-SVC classifier and a Naive Bayes classifier are used to assess the results. All experiments are performed on the Chinese corpus TanCorpV1.0, which includes more than 14,000 texts divided into 12 classes. The experiments concern: (1) the performance comparison among different feature selection strategies: absolute text frequency, relative text frequency, absolute n-gram frequency and relative n-gram frequency; (2) the comparison of sparseness and feature correlation in the “text by feature” matrices produced by the four feature selection methods; (3) the performance comparison among three term weights: 0/1 logical value, n-gram frequency numeric value (TF) and TF*IDF value.
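To make the terms in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes character-level n-grams, reads "absolute text frequency" as the number of texts containing an n-gram, and uses the common idf variant log(N / df); none of these definitions is stated in the abstract. The function names, the `min_df` threshold and the toy corpus in the usage note are hypothetical. The two-step strategy (preprocessing within classes) is not reproduced here.

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    """Return the overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def select_by_text_frequency(docs, n=2, min_df=2):
    """Keep n-grams found in at least min_df texts.

    Assumption: 'absolute text frequency' means the number of texts
    that contain the n-gram at least once.
    """
    df = Counter()
    for doc in docs:
        df.update(set(char_ngrams(doc, n)))  # count each text once
    return {gram: count for gram, count in df.items() if count >= min_df}


def weight_vectors(docs, df, n=2):
    """Build 0/1, TF and TF*IDF rows of the text-by-feature matrix."""
    features = sorted(df)
    n_docs = len(docs)
    rows = []
    for doc in docs:
        tf = Counter(g for g in char_ngrams(doc, n) if g in df)
        binary = [1 if tf[g] else 0 for g in features]
        counts = [tf[g] for g in features]
        # Assumed idf variant: log(N / df); other variants exist.
        tfidf = [tf[g] * math.log(n_docs / df[g]) for g in features]
        rows.append((binary, counts, tfidf))
    return features, rows
```

For example, with the toy corpus `["我爱北京", "我爱上海", "北京欢迎你"]` and the defaults above, only the bigrams 我爱 and 北京 occur in two texts, so they are the only features kept; each text is then represented by three alternative two-element vectors.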
