Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12682)
Abstract
Many collaboratively built resources, such as Wikipedia, Weibo and Quora, exist in the form of semi-structured data, and classifying such data plays an important role in many data analysis applications. In addition to content information, semi-structured data also contain structural information, so combining structural and content features is a crucial issue in semi-structured data classification. In this paper, we propose a supervised semi-structured data classification approach that utilizes both kinds of information. In this approach, generalized tag sequences are extracted from the structural information, and nGrams are extracted from the content information. The tag sequences and nGrams are then combined, according to their link relation, into features called TSGram, and each semi-structured document is represented as a vector of TSGram features. A classification model is built on these features. Because TSGram features retain the association between the structural and content information, they help improve classification performance. Our experimental results on two real datasets show that the proposed approach is effective.
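The feature construction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that a tag sequence is simply the root-to-element path of tag names, that nGrams are word n-grams of an element's direct text, and that the "link relation" pairs each n-gram with the path of the element that contains it. The function names `word_ngrams` and `tsgram_features` are hypothetical, and the paper's generalization of tag sequences is not reproduced here.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def word_ngrams(text, n):
    """Return the word n-grams of a text, lowercased and whitespace-tokenized."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tsgram_features(xml_string, n=2):
    """Count (tag path, n-gram) pairs for one document.

    Assumption: the tag sequence is the plain root-to-element path, and only
    an element's leading text (node.text) is used; tail text is ignored.
    """
    root = ET.fromstring(xml_string)
    features = Counter()

    def visit(node, path):
        path = path + (node.tag,)
        text = (node.text or "").strip()
        for gram in word_ngrams(text, n):
            # Pair the structural context with the content n-gram.
            features[("/".join(path), gram)] += 1
        for child in node:
            visit(child, path)

    visit(root, ())
    return features


doc = (
    "<article><title>semi structured data</title>"
    "<body><p>structured data classification</p></body></article>"
)
features = tsgram_features(doc, n=2)
for (path, gram), count in sorted(features.items()):
    print(path, "|", gram, "|", count)
```

In this toy document the bigram "structured data" occurs under both `article/title` and `article/body/p`, and the sketch keeps the two occurrences as separate features. That is the sense in which a combined feature retains the association between structure and content; a plain bag of n-grams would merge them. The resulting counts could be fed to any standard vector-based classifier.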
This work was supported in part by the National Natural Science Foundation of China under Grant 61972317, Grant 61672432 and Grant 61732014, and in part by the Fundamental Research Funds for the Central Universities of China under Grant 3102015JSJ0004.
Author information
Authors and Affiliations
School of Computer Science, Northwestern Polytechnical University, Xi’an, 710072, China
Lijun Zhang, Ning Li, Wei Pan & Zhanhuai Li
Key Laboratory of Big Data Storage and Management, Northwestern Polytechnical University, Ministry of Industry and Information Technology, Xi’an, 710072, China
Lijun Zhang, Ning Li, Wei Pan & Zhanhuai Li
Corresponding author
Correspondence to Lijun Zhang.
Editor information
Editors and Affiliations
Aalborg University, Aalborg, Denmark
Christian S. Jensen
Singapore Management University, Singapore, Singapore
Ee-Peng Lim
Academia Sinica, Taipei, Taiwan
De-Nian Yang
The Pennsylvania State University, University Park, PA, USA
Wang-Chien Lee
National Chiao Tung University, Hsinchu, Taiwan
Vincent S. Tseng
Athens University of Economics and Business, Athens, Greece
Vana Kalogeraki
National Cheng Kung University, Tainan City, Taiwan
Jen-Wei Huang
National Tsing Hua University, Hsinchu, Taiwan
Chih-Ya Shen
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Zhang, L., Li, N., Pan, W., Li, Z. (2021). A Semi-structured Data Classification Model with Integrating Tag Sequence and Ngram. In: Jensen, C.S., et al. Database Systems for Advanced Applications. DASFAA 2021. Lecture Notes in Computer Science, vol 12682. Springer, Cham. https://doi.org/10.1007/978-3-030-73197-7_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-73196-0
Online ISBN: 978-3-030-73197-7
eBook Packages: Computer Science, Computer Science (R0)