Movatterモバイル変換


[0]ホーム

URL:


Skip to main content

Advertisement

Springer Nature Link
Log in

Improving Language Identification of Web Page Using Optimum Profile

  • Conference paper

Part of the book series:Communications in Computer and Information Science ((CCIS,volume 180))

Abstract

Language is an indispensable tool for human communication, and presently, the language that dominates the Internet is English. Language identification is the process of determining a predetermined language automatically from a given content (e.g., English, Malay, Danish, Estonian, Czech, Slovak, etc.). The ability to identify other languages in relation to English is highly desirable. It is the goal of this research to improve the method used to achieve this end. Three methods have been studied in this research are distance measurement, Boolean method, and the proposed method, namely, optimum profile. From the initial experiments, we have found that, distance measurement and Boolean method is not reliable in the European web page identification. Therefore, we propose optimum profile which is usingN-grams frequency andN-grams position to do web page language identification. The result show that the proposed method gives the highest performance with accuracy 91.52%.

This is a preview of subscription content,log in via an institution to check access.

Access this chapter

Subscribe and save

Springer+ Basic
¥17,985 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Chapter
JPY 3498
Price includes VAT (Japan)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
JPY 11439
Price includes VAT (Japan)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
JPY 14299
Price includes VAT (Japan)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide -see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

Similar content being viewed by others

References

  1. Cavnar, W., Trenkle, J.: N-gram-based text categorization. In: Proceedings of the 3rd Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, Nevada, USA, pp. 161–175 (1994)

    Google Scholar 

  2. Xafopoulos, A., Kotropoulos, C., Almpanidis, G., Pitas, I.: Language identification in web documents using discrete hmms. Pattern Recognition 37(3), 583–594 (2004)

    Article  Google Scholar 

  3. Selamat, A., Ng, C.: Arabic script web page language identifications using decision tree neural networks. Pattern Recognition 44(1), 133–144 (2011)

    Article MATH  Google Scholar 

  4. Muthusamy, Y., Spitz, A.: Automatic language identification. In: Cole, R., Mariani, J., Uszkoreit, H., Varile, G., Zaenen, A., Zampolli, A. (eds.) Survey of the State of the Art in Human Language Technology, pp. 255–258. Cambridge University Press, Cambridge (1997)

    Google Scholar 

  5. Constable, P., Simons, G.: Language identification and it: Addressing problems of linguistic diversity on a global scale. In: Proceedings of the 17th International Unicode Conference, SIL Electronic Working Papers, San José, California, pp. 1–22 (2000)

    Google Scholar 

  6. Abd Rozan, M.Z., Mikami, Y., Abu Bakar, A.Z., Vikas, O.: Multilingual ict education: Language observatory as a monitoring instrument. In: Proceedings of the South East Asia Regional Computer Confederation 2005: ICT Building Bridges Conference, Sydney, Australia, vol. 46 (2005)

    Google Scholar 

  7. McNamee, P., Mayfield, J.: Character n-gram tokenization for european language text retrieval. Information Retrieval 7(1), 73–97 (2004)

    Article  Google Scholar 

  8. Martins, B., Silva, M.J.: Language identification in web pages. In: Proceedings of the 2005 ACM Symposium on Applied Computing, pp. 764–768 (2005)

    Google Scholar 

  9. Simons, G.F.: Language identification in metadata descriptions of language archive holdings. In: Workshop on Web-Based Language Documentation and Description, Philadelphia, USA (2000)

    Google Scholar 

  10. Hakkinen, J., Tian, J.: N-gram and decision tree based language identification for written words. In: Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding, pp. 335–338 (2001)

    Google Scholar 

  11. Takcı, H., Soğukpınar, İ.: Letter Based Text Scoring Method for Language Identification. In: Yakhno, T. (ed.) ADVIS 2004. LNCS, vol. 3261, pp. 283–290. Springer, Heidelberg (2004)

    Chapter  Google Scholar 

  12. Biemann, C., Teresniak, S.: Disentangling from babylonian confusion – unsupervised language identification. In: Gelbukh, A. (ed.) CICLing 2005. LNCS, vol. 3406, pp. 773–784. Springer, Heidelberg (2005)

    Chapter  Google Scholar 

  13. Hammarstrom, H.: A fine-grained model for language identification. In: Workshop of Improving Non English Web Searching, Amsterdam, The Netherlands, pp. 14–20 (2007)

    Google Scholar 

  14. da Silva, J.F., Lopes, G.P.: Identification of document language is not yet a completely solved problem. In: Proceedings of the International Conference on Computational Inteligence for Modelling Control and Automation and International Conference on Intelligent Agents Web Technologies and International Commerce, pp. 212–219. IEEE Computer Society, Washington, DC, USA (2006)

    Google Scholar 

  15. Ng, C., Selamat, A.: Improve feature selection method of web page language identification using fuzzy artmap. International Journal of Intelligent Information and Database Systems 4(6), 629–642 (2010)

    Article  Google Scholar 

  16. Selamat, A., Subroto, I., Ng, C.: Arabic script web page language identification using hybrid-knn method. International Journal of Computational Intelligence and Applications 8(3), 315–343 (2009)

    Article MATH  Google Scholar 

  17. Selamat, A., Ng, C.: Arabic script language identification using letter frequency neural networks. International Journal of Web Information Systems 4(4), 484–500 (2008)

    Article  Google Scholar 

  18. Choong, C.Y., Mikami, Y., Marasinghe, C.A., Nandasara, S.T.: Optimizing n-gram order of an n-gram based language identification algorithm for 68 written languages. International Journal on Advances in ICT for Emerging Regions 2(2), 21–28 (2009)

    Google Scholar 

Download references

Author information

Authors and Affiliations

  1. Faculty of Computer Systems & Software Engineering, Universiti Malaysia Pahang, Lebuhraya Tun Razak, 26300, Gambang, Kuantan Pahang, Malaysia

    Choon-Ching Ng

  2. Faculty of Computer Science & Information Systems, Universiti Teknologi Malaysia, 81310, UTM Skudai, Johor, Malaysia

    Ali Selamat

Authors
  1. Choon-Ching Ng

    You can also search for this author inPubMed Google Scholar

  2. Ali Selamat

    You can also search for this author inPubMed Google Scholar

Editor information

Editors and Affiliations

  1. Faculty of Computer Systems and Software Engineering, Universiti Malaysia Pahang, Lebuhraya Tun Razak, 26300 Gambang, Kuantan, Pahang, Malaysia

    Jasni Mohamad Zain

  2. Faculty of Computer Systems and Software Engineering, Universiti Malaysia Pahang, Lebuhraya Tun Razak, 26300, Gambang, Kuantan, Pahang, Malaysia

    Wan Maseri bt Wan Mohd

  3. Information Systems Department, King Saud University, 11543, Riyadh, Saudi Arabia

    Eyas El-Qawasmeh

Rights and permissions

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ng, CC., Selamat, A. (2011). Improving Language Identification of Web Page Using Optimum Profile. In: Zain, J.M., Wan Mohd, W.M.b., El-Qawasmeh, E. (eds) Software Engineering and Computer Systems. ICSECS 2011. Communications in Computer and Information Science, vol 180. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-22191-0_14

Download citation

Publish with us

Access this chapter

Subscribe and save

Springer+ Basic
¥17,985 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Chapter
JPY 3498
Price includes VAT (Japan)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
JPY 11439
Price includes VAT (Japan)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
JPY 14299
Price includes VAT (Japan)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide -see info

Tax calculation will be finalised at checkout

Purchases are for personal use only


[8]ページ先頭

©2009-2025 Movatter.jp