A survey on Urdu and Urdu like language stemmers and stemming techniques

Abstract

Stemming is a basic step in natural language processing applications such as information retrieval, part-of-speech tagging, syntactic parsing and machine translation. It is a morphological process that converts the inflected forms of a word into its root form. Urdu is a morphologically rich language that emerged from several different languages. Its inflected and multi-gram words carry prefixes, suffixes, infixes, co-suffixes and circumfixes, which must be edited to reduce the words to their stems. This editing (insertion, deletion and substitution) is difficult because of the morphological richness of the language and its inclusion of words from foreign languages such as Persian and Arabic. In this paper, we present a comprehensive review of the algorithms and techniques for stemming Urdu text. Considering syntax, morphological similarity and other common features, we also analyze the stemming approaches used in Urdu-like languages, i.e. Arabic and Persian, and extract the main features, merits and shortcomings of these approaches. We also discuss stemming errors and the basic difference between stemming and lemmatization, and coin a metric for the classification of stemming algorithms. Finally, we present directions for future work.
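
As a rough illustration of the affix-removal idea described above, the following minimal sketch strips the longest matching suffix from a word. It is not taken from any of the surveyed stemmers: the suffix list, the `MIN_STEM_LEN` threshold and the function name `strip_suffix` are all hypothetical, and English words are used only to keep the example readable.

```python
# Hypothetical toy suffix list; NOT taken from any surveyed Urdu, Arabic or Persian stemmer.
SUFFIXES = ("ing", "ed", "es", "s")

# Assumed minimum stem length, used to avoid stripping very short words.
MIN_STEM_LEN = 3


def strip_suffix(word: str) -> str:
    """Remove the longest matching suffix, keeping at least MIN_STEM_LEN characters."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            return word[: -len(suffix)]
    return word  # no rule applied


if __name__ == "__main__":
    for w in ("walking", "walked", "boxes", "running", "news"):
        print(w, "->", strip_suffix(w))
```

With these assumed rules, "walking" and "walked" both reduce to "walk", but "running" becomes "runn" and "news" becomes "new". The first case shows why deletion alone is not enough and substitution or insertion may be needed. The second is an over-stemming error, in which unrelated words are conflated.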



Author information

Authors and Affiliations

Department of Computer Science, Institute of Southern Punjab, Multan, Pakistan
Abdul Jabbar
Department of Computer Science, Bahauddin Zakariya University, Multan, Pakistan
Sajid Iqbal
Al-Khwarzmi Institute of Computer Science, University of Engineering and Technology, Lahore, Pakistan
Muhammad Usman Ghani Khan
Department of Computer Science, Bahauddin Zakariya University (Sahiwal Sub-campus), Multan, Pakistan
Shafiq Hussain

Corresponding author

Correspondence to Sajid Iqbal.

About this article

Cite this article

Jabbar, A., Iqbal, S., Khan, M.U.G. et al. A survey on Urdu and Urdu like language stemmers and stemming techniques. Artif Intell Rev 49, 339–373 (2018). https://doi.org/10.1007/s10462-016-9527-1

Published: 28 November 2016
Issue Date: March 2018
DOI: https://doi.org/10.1007/s10462-016-9527-1
