- Xin Chen ORCID: orcid.org/0000-0001-5931-5072 1,3,
- Yun Xue1,2,
- Hongya Zhao3,
- Xin Lu1,
- Xiaohui Hu1 &
- …
- Zhihao Ma1
1237 Accesses
25 Citations
Abstract
Feature extraction is one of the key steps in text sentiment analysis (SA), and the choice of extraction algorithm has an important effect on the results. In this paper, a novel methodology is proposed for extracting features for SA of product reviews. First, to account for the diverse forms of expression in product reviews, generalized TF–IDF feature vectors are obtained by introducing the semantic similarity of synonyms. Then, because product reviews differ in length, the local patterns of the feature vectors are identified with the OPSM biclustering algorithm. Finally, we improve the PrefixSpan algorithm to detect frequent, pseudo-consecutive phrases with high discriminative ability (FPCD phrases), which carry word-order information. In addition, important factors such as the separation and discriminative ability of words are employed to improve the discrimination of sentiment polarity. The text feature vectors are extracted on the basis of these steps. A series of experiments and comparisons indicates that the performance of SA on product reviews is greatly improved.
Acknowledgements
The authors gratefully thank the colleagues who participated in this work and provided technical support. This work is supported by the Guangdong Provincial Engineering Technology Research Center for Data Science (Nos. 2016KF09, 2016KF10) and the National Statistical Science Research Project of China (Nos. 2015LY81, 2016LY98). This work was also supported by the Science and Technology Department of Guangdong Province in China (Grant Nos. 2016A010101020, 2016A010101021, 2016A010101022), the Guangdong Province Science and Technology Planning Project (No. 2013B040404009), the Foundation of Guangdong Polytechnic of Science and Technology (No. XJSC2016206), the Natural Science Funds of Shenzhen Science and Technology Innovation Commission (No. JCYJ20160527172144272) and the Innovation Project of the Graduate School of South China Normal University (No. 2015lkxm37).
Author information
Authors and Affiliations
School of Physics and Telecommunication Engineering, South China Normal University, Guangzhou, 510006, China
Xin Chen, Yun Xue, Xin Lu, Xiaohui Hu & Zhihao Ma
Guangdong Provincial Key Laboratory of Quantum Engineering and Quantum Materials, Guangdong Provincial Engineering Technology Research Center for Data Science, Guangzhou, 510006, China
Yun Xue
Shenzhen PolyTechnic, Shenzhen, 518055, China
Xin Chen & Hongya Zhao
Corresponding author
Correspondence to Yun Xue.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
About this article
Cite this article
Chen, X., Xue, Y., Zhao, H. et al. A novel feature extraction methodology for sentiment analysis of product reviews. Neural Comput & Applic 31, 6625–6642 (2019). https://doi.org/10.1007/s00521-018-3477-2