Part of the book series: Lecture Notes in Computer Science (TCCI, volume 6910)
Abstract
Language identification is widely used in machine translation and information retrieval. This paper proposes an improved N-grams (ING) approach for web page language identification. The ING approach combines the original N-grams (ONG) approach with a modified N-grams (MNG) approach that has previously been used for language identification of web documents. The features selected by the ING approach are based on N-gram frequency and N-gram position. The features selected by the ONG approach are based on a distance measurement, and those selected by the MNG approach are based on a Boolean matching rate, for language identification of web pages in Roman and Arabic scripts. A large real-world document collection from the British Broadcasting Corporation (BBC) website, comprising 1000 documents in each of the languages studied (Azeri, English, Indonesian, Serbian, Somali, Spanish, Turkish, Vietnamese, Arabic, Persian, Urdu, and Pashto), was used for evaluation. Precision, recall, and F1 measures were used to determine the effectiveness of the proposed ING approach. The experiments show that the ING approach improves language identification of Roman and Arabic script web page documents in the available datasets.
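The abstract names a distance measurement (ONG), a Boolean matching rate (MNG), and precision, recall, and F1 without giving formulas. The following is a minimal Python sketch of how such quantities are commonly computed, not the chapter's implementation. It assumes the distance is a rank-based "out-of-place" measure in the style of Cavnar and Trenkle, and that the Boolean matching rate is the fraction of a document's N-grams found in a language profile. All function names, the `_` padding, `n_max=3`, and `top_k=300` are illustrative assumptions, and the way ING combines frequency and position features is not shown because the abstract does not specify it.

```python
from collections import Counter


def char_ngrams(text, n_max=3):
    """Yield character N-grams (length 1..n_max) from each padded token."""
    for token in text.lower().split():
        padded = f"_{token}_"  # assumption: '_' marks word boundaries
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                yield padded[i:i + n]


def rank_profile(text, top_k=300, n_max=3):
    """Map the top_k most frequent N-grams to their rank (0 = most frequent)."""
    counts = Counter(char_ngrams(text, n_max))
    # Descending frequency, then alphabetical, so ties are deterministic.
    ordered = sorted(counts, key=lambda g: (-counts[g], g))[:top_k]
    return {gram: rank for rank, gram in enumerate(ordered)}


def out_of_place_distance(doc_profile, lang_profile):
    """Sum of rank differences; unseen N-grams cost len(lang_profile)."""
    max_penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else max_penalty
        for gram, rank in doc_profile.items()
    )


def boolean_matching_rate(doc_profile, lang_profile):
    """Fraction of document N-grams that also occur in the language profile."""
    if not doc_profile:
        return 0.0
    hits = sum(1 for gram in doc_profile if gram in lang_profile)
    return hits / len(doc_profile)


def identify(text, training_texts):
    """Return (language with lowest distance, language with highest match rate)."""
    profiles = {lang: rank_profile(s) for lang, s in training_texts.items()}
    doc = rank_profile(text)
    by_distance = min(
        profiles, key=lambda lang: out_of_place_distance(doc, profiles[lang])
    )
    by_matching = max(
        profiles, key=lambda lang: boolean_matching_rate(doc, profiles[lang])
    )
    return by_distance, by_matching


def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall, and F1 from true/false positive counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Toy training samples (hypothetical, far smaller than the BBC collection).
    training = {
        "en": "the quick brown fox jumps over the lazy dog",
        "es": "el rápido zorro marrón salta sobre el perro perezoso",
    }
    print(identify("the dog and the fox", training))
```

The two criteria pull in opposite directions by design: the distance is minimised and the matching rate is maximised, so a combined scheme would need to normalise them before mixing. The sketch keeps them separate for that reason.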
Author information
Authors and Affiliations
Software Engineering Research Group, Faculty of Computer Science & Information Systems, Universiti Teknologi Malaysia, UTM Johor Baharu Campus, 81310, Johor, Malaysia
Ali Selamat
Editor information
Editors and Affiliations
Wroclaw University of Technology, Wyb. Wyspianskiego 27, 50-370, Wroclaw, Poland
Ngoc Thanh Nguyen
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Selamat, A. (2011). Improved N-grams Approach for Web Page Language Identification. In: Nguyen, N.T. (eds) Transactions on Computational Collective Intelligence V. Lecture Notes in Computer Science, vol 6910. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24016-4_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-24015-7
Online ISBN: 978-3-642-24016-4
eBook Packages: Computer Science, Computer Science (R0)