Abstract
The paper presents the MULTEXT-East language resources, a multilingual dataset for language engineering research, focused on the morphosyntactic level of linguistic description. The MULTEXT-East dataset includes the morphosyntactic specifications, morphosyntactic lexica, and a parallel corpus, the novel “1984” by George Orwell, which is sentence aligned and contains hand-validated morphosyntactic descriptions and lemmas. The resources are uniformly encoded in XML, using the Text Encoding Initiative Guidelines, TEI P5, and cover 16 languages, mainly from Central and Eastern Europe: Bulgarian, Croatian, Czech, English, Estonian, Hungarian, Macedonian, Persian, Polish, Resian, Romanian, Russian, Serbian, Slovak, Slovene, and Ukrainian. This dataset, unique in terms of languages covered and the wealth of encoding, is extensively documented, and freely available for research purposes. The paper overviews the MULTEXT-East resources by type and language and gives some conclusions and directions for further work.
Notes
EAGLES-based harmonized tagsets have also been used for various other language resources, such as those of the LE-PAROLE project, which produced a multilingual corpus and associated lexica for 14 European languages (Zampolli 1997).
References
Alexin, Z., Gyimóthy, T., Hatvani, C., Tihanyi, L., Csirik, J., Bibok, K., et al. (2003). Manually annotated Hungarian corpus. In Proceedings of the tenth conference on European chapter of the association for computational linguistics (EACL’03) (pp. 53–56).
Arhar, Š., & Gorjanc, V. (2007). Korpus FidaPLUS: Nova generacija slovenskega referenčnega korpusa (the FidaPLUS corpus: A new generation of the Slovene reference corpus). Jezik in slovstvo, 52(2), 95–110.
Buchholz, S., & Marsi, E. (2006). CoNLL-X shared task on multilingual dependency parsing. In Proceedings of the tenth conference on computational natural language learning (CoNLL-X) (pp. 149–164). Morristown, NJ, USA: ACL.
Chiarcos, C., & Erjavec, T. (2011). OWL/DL formalization of the MULTEXT-East morphosyntactic specifications. In Proceedings of the 5th linguistic annotation workshop (LAW-V), ACL.
Derzhanski, I. A., & Kotsyba, N. (2009). Towards a consistent morphological tagset for Slavic languages: Extending MULTEXT-East for Polish, Ukrainian and Belarusian. In Proceedings of the Mondilex third open workshop: Metalanguage and encoding scheme design for digital lexicography (pp. 9–26). Bratislava, Slovakia: Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences.
Dimitrova, L., & Rashkov, P. (2009). A new version for Bulgarian MTE morphosyntactic specifications for some verbal forms. In Proceedings of the Mondilex second open workshop: Organization and development of digital lexical resources (pp. 30–37). Kyiv, Ukraine: Dovira Publishing House.
Dimitrova, L., Erjavec, T., Ide, N., Kaalep, H. J., Petkevič, V., & Tufiş, D. (1998). MULTEXT-East: Parallel and comparable corpora and lexicons for six Central and Eastern European languages. In Proceedings of the COLING-ACL’98 (pp. 315–319). Montréal, QC, Canada: ACL.
Džeroski, S., Erjavec, T., Ledinek, N., Pajas, P., Žabokrtsky, Z., & Žele, A. (2006). Towards a Slovene dependency treebank. In Proceedings of the fifth international conference on language resources and evaluation (LREC’06), Genoa.
EAGLES. (1996). Expert advisory group on language engineering standards. http://www.ilc.pi.cnr.it/EAGLES/home.html.
Erjavec, T. (2004). MULTEXT-East version 3: Multilingual morphosyntactic specifications, lexicons and corpora. In Proceedings of the fourth international conference on language resources and evaluation (LREC’04), Lisbon.
Erjavec, T. (2010). MULTEXT-East version 4: Multilingual morphosyntactic specifications, lexicons and corpora. In Proceedings of the seventh international conference on language resources and evaluation (LREC’10), Valletta.
Erjavec, T., & Džeroski, S. (2004). Machine learning of language structure: Lemmatising unknown Slovene words. Applied Artificial Intelligence, 18(1), 17–41.
Erjavec, T., Fišer, D., Krek, S., & Ledinek, N. (2010). The JOS linguistically tagged corpus of Slovene. In Proceedings of the seventh international conference on language resources and evaluation (LREC’10), Valletta.
Farrar, S., & Langendoen, D. T. (2003). A linguistic ontology for the semantic web. GLOT International, 7(3), 97–100.
Feldman, A., & Hana, J. (2010). A resource-light approach to morpho-syntactic tagging. Language and computers: Studies in practical linguistics (Vol. 70). Amsterdam: Rodopi.
Garabík, R., & Gianitsová-Ološtiaková, L. (2005). Manual morphological annotation of the Slovak translation of Orwell’s novel 1984: Methods and findings. In Proceedings of the Slovko conference “computer treatment of Slavic and East European languages”. Bratislava: Veda.
Garabík, R., Majchráková, D., & Dimitrova, L. (2009). Comparing Bulgarian and Slovak MULTEXT-East morphology tagset. In Proceedings of the Mondilex second open workshop: Organization and development of digital lexical resources (pp. 38–46). Kyiv, Ukraine: Dovira Publishing House.
Hajič, J. (2000). Morphological tagging: Data versus dictionaries. In Proceedings of the ANLP/NAACL 2000 (pp. 94–101). Seattle.
Hajič, J. (2002). Disambiguation of rich inflection (computational morphology of Czech) (Vol. 1). Prague: Karolinum Charles University Press.
Horák, A., Gianitsová, L., Šimková, M., Šmotlák, M., & Garabík, R. (2004). Slovak national corpus. In Proceedings of the text speech and dialogue conference (TSD’04), Brno.
Ide, N. (1998). Corpus encoding standard: SGML guidelines for encoding linguistic corpora. In Proceedings of the first international conference on language resources and evaluation (LREC’98) (pp. 463–470). Granada.
Ide, N. (2000). Cross-lingual sense determination: Can it work? Computers and the Humanities, 34, 223–234.
Ide, N., & Véronis, J. (1994). Multext (multilingual tools and corpora). In Proceedings of the 15th international conference on computational linguistics (CoLing’94) (pp. 90–96). Kyoto.
Ivanovska, A., Zdravkova, K., Džeroski, S., & Erjavec, T. (2005). Learning rules for morphological analysis and synthesis of Macedonian nouns. In Proceedings of the 8th international conference information society, IS 2005. Ljubljana: Jožef Stefan Institute.
Kemps-Snijders, M., Windhouwer, M., Wittenburg, P., & Wright, S. E. (2008). ISOcat: Corralling data categories in the wild. In Proceedings of the sixth international conference on language resources and evaluation (LREC’08), Marrakech.
Kopotev, M., & Mustajoki, A. (2003). Principy sozdanija Hel’sinkskogo annotirovannogo korpusa russkih tekstov (HANCO) v seti internet. Naučno-tehničeskaja informacija (Ser. 2, pp. 33–37) (in Russian).
Kotsyba, N., Radziszewski, A., & Derzhanski, I. (2009). Integrating the Polish language into the MULTEXT-East family. In Proceedings of the Mondilex fifth open workshop: Research infrastructure for digital lexicography. Ljubljana, Slovenia: Jožef Stefan Institute.
Krek, S., Stabej, M., Gorjanc, V., Erjavec, T., Romih, M., & Holozan, P. (1998). FIDA: A corpus of the Slovene language. http://www.fida.net/.
Krstev, C., Vitas, D., & Erjavec, T. (2004). MULTEXT-East resources for Serbian. In Proceedings B of the 7th international multiconference information society: Language technologies (pp. 108–114). Ljubljana: Jožef Stefan Institute.
Martin, J., Mihalcea, R., & Pedersen, T. (2005). Word alignment for languages with scarce resources. In Proceedings of the ACL workshop on building and using parallel texts (pp. 65–74). Ann Arbor.
Petrovski, A. (2004). Morphological processing of nouns in Macedonian language. In Proceedings of the 7th intex/nooj workshop, Tours.
Piasecki, M. (2007). Polish tagger TaKIPI: Rule based construction and optimisation. Task Quarterly, 11, 151–167.
Prószéky, G. (1995). Humor: A morphological system for corpus analysis. In Proceedings of the first European TELRI seminar: Language resources for language technology (pp. 149–158). Tihany, Hungary.
Prószéky, G., & Kis, B. (1999). A unification-based approach to morpho-syntactic parsing of agglutinative and other (highly) inflectional languages. In Proceedings of the 37th ACL, association for computational linguistics (pp. 261–268).
Przepiórkowski, A. (2006). The potential of the IPI PAN corpus. Poznań Studies in Contemporary Linguistics, 41, 31–48.
Przepiórkowski, A., & Woliński, M. (2003). A flexemic tagset for Polish. In Proceedings of the EACL workshop on morphological processing of Slavic languages. ACL.
QasemiZadeh, B., & Rahimi, S. (2006). Persian in MULTEXT-East framework. In Proceedings of the 5th international conference on natural language processing (FinTAL’06) (pp. 541–551). Turku, Finland.
Rosen, A. (2010). Morphological tags in parallel corpora. In F. Čermák, A. Klégr, & P. Corness (Eds.), InterCorp: Exploring a multilingual corpus. Praha: Nakladatelství Lidové noviny.
Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees. In Proceedings of the international conference on new methods in language processing (pp. 44–49).
Sharoff, S. (2005). Methods and tools for development of the Russian reference corpus. In D. Archer, A. Wilson, & P. Rayson (Eds.), Corpus linguistics around the world (pp. 167–180). Amsterdam: Rodopi.
Sharoff, S., Kopotev, M., Erjavec, T., Feldman, A., & Divjak, D. (2008). Designing and evaluating a Russian tagset. In Proceedings of the sixth international conference on language resources and evaluation (LREC’08). Marrakech.
Silberztein, M. (1999). Text Indexing with INTEX. In: Computers and the humanities (vol. 33(3)). Kluwer Academic Publishers.
Simov, K., Popova, G., & Osenova, P. (2002). HPSG-based syntactic treebank of Bulgarian (BulTreeBank). In A. Wilson, P. Rayson, & T. McEnery (Eds.), A rainbow of corpora: Corpus linguistics and the languages of the world (pp. 135–142). Munich: Lincom-Europa.
Slavcheva, M. (1997). A comparative representation of two Bulgarian morphosyntactic tagsets and the EAGLES encoding standard. Technical Report TELRI (Trans European Language Resources Infrastructure).
Sperberg-McQueen, C. M., & Burnard, L. (Eds.). (1994). Guidelines for electronic text encoding and interchange P3. Chicago and Oxford: Association for Computers and the Humanities/Association for Computational Linguistics/Association for Literary and Linguistic Computing.
Steenwijk, H. (1992). The Slovene Dialect of Resia San Giorgio. Amsterdam-Atlanta: Rodopi.
Stolić, M., & Zdravkova, K. (2010). Resources for machine translation of the Macedonian language. In Proceedings of the ICT innovations conference, Ohrid.
Tadić, M. (2002). Building the Croatian national corpus. In Proceedings of the third international conference on language resources and evaluation (LREC’02) (pp. 441–446). Las Palmas.
Tadić, M. (2003). Building the Croatian morphological lexicon. In Proceedings of the EACL workshop on morphological processing of Slavic languages, ACL.
TEI Consortium. (2007). TEI P5: Guidelines for electronic text encoding and interchange. TEI Consortium, URL: http://www.tei-c.org/Guidelines/P5/.
Toutanova, K., & Cherry, C. (2009). A global model for joint lemmatization and part-of-speech prediction. In Proceedings of the 47th annual meeting of the ACL (ACL’09) (pp. 486–494). Singapore.
Tufiş, D. (1999). Tiered tagging and combined language model classifiers. In F. Jelinek & E. Noth (Eds.), Text, speech and dialogue no. 1692 in lecture notes in artificial intelligence (pp. 28–33). Berlin: Springer.
Tufiş, D. (2002). A cheap and fast way to build useful translation lexicons. In Proceedings of the 19th annual meeting of the ACL (ACL’02). Association for Computational Linguistics.
Tufiş, D., Cristea, D., & Stamou, S. (2004). BalkaNet: Aims, methods, results and perspectives: A general overview. Romanian Journal of Information Science and Technology, 7(1–2), 9–43.
Vitas, D., & Krstev, C. (2001). Intex and Slavonic morphology. In 4es Journées INTEX. Bordeaux.
Vojnovski, V., Džeroski, S., & Erjavec, T. (2005). Learning PoS tagging from a tagged Macedonian text corpus. In Proceedings of the 8th international conference information society, IS 2005. Ljubljana: Jožef Stefan Institute.
Zampolli, A. (1997). The PAROLE project. In Proceedings of the second European TELRI seminar: Language applications for multilingual Europe (pp. 185–210). Kaunas, Lithuania.
Zdravkova, K., & Petrovski, A. (2007). Derivation of Macedonian verbal adjectives. In Proceedings of international conference “recent advances in natural language processing” (RANLP’07) (pp. 661–665).
Acknowledgments
The author would like to thank Radovan Garabik, Natalia Kotsyba, Katerina Zdravkova, and Darja Fišer for their helpful comments and suggestions. Work on the MULTEXT-East resources was initially supported by the EU project MULTEXT-East “Multilingual Text Tools and Corpora for Central and Eastern European Languages”, the US NSF grant IRI-9413451, and the EU Concerted Action TELRI “Trans-European Language Resources Infrastructure”. Work on the second release was supported by the EU project CONCEDE “Consortium for Central European Dictionary Encoding”, while work on the third release was partially funded by an NEH grant to the TEI Task Force “SGML–XML migration”. Work on the fourth release was supported by the EU project MONDILEX “Conceptual Modeling of Networking of Centres for High-Quality Research in Slavic Lexicography and their Digital Resources”. The work on the resources has been additionally supported by bilateral projects between Slovenia and Serbia and between Slovenia and Macedonia, as well as by individual partners’ grants and contracts.
Author information
Authors and Affiliations
Department of Knowledge Technologies, Jožef Stefan Institute, Jamova cesta 39, 1000, Ljubljana, Slovenia
Tomaž Erjavec
Corresponding author
Correspondence to Tomaž Erjavec.
About this article
Cite this article
Erjavec, T. MULTEXT-East: morphosyntactic resources for Central and Eastern European languages. Lang Resources & Evaluation 46, 131–142 (2012). https://doi.org/10.1007/s10579-011-9174-8