Methods for Collection and Evaluation of Comparable Documents

  • Chapter

Abstract

Considerable attention is being paid to methods for gathering and evaluating comparable corpora, not only to improve Statistical Machine Translation (SMT) but also for other applications, such as paraphrase extraction. Realising the potential value of such corpora requires efficient and effective methods for gathering and evaluating them. Most existing methods have been tested on retrieving document pairs for well-resourced languages; little work addresses less popular (under-resourced) languages or domains. This chapter describes work on developing methods for automatically gathering comparable corpora from the Web, specifically for under-resourced languages. Different online sources are investigated, and an evaluation method is developed to assess the quality of the retrieved documents.

Acknowledgments

This work received funding from the ACCURAT project under the European Community’s Seventh Framework Programme (FP7/2007-2013), Grant Agreement no. 248347.

Author information

Authors and Affiliations

  1. University of Sheffield, Regent Court, 211 Portobello Street, Sheffield, S1 4DP, UK

    Monica Lestari Paramita, David Guthrie, Evangelos Kanoulas, Rob Gaizauskas, Paul Clough & Mark Sanderson

Corresponding author

Correspondence to Monica Lestari Paramita.

Editor information

Editors and Affiliations

  1. Centre for Translation Studies, University of Leeds, Leeds, United Kingdom

    Serge Sharoff

  2. University of Mainz, Mainz, Germany

    Reinhard Rapp

  3. Université de Paris-Sud LIMSI-CNRS, Orsay, France

    Pierre Zweigenbaum

  4. Electronic & Computer Engineering, The Hong Kong University of Science and Technology, Hong Kong, People's Republic of China

    Pascale Fung

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Paramita, M.L., Guthrie, D., Kanoulas, E., Gaizauskas, R., Clough, P., Sanderson, M. (2013). Methods for Collection and Evaluation of Comparable Documents. In: Sharoff, S., Rapp, R., Zweigenbaum, P., Fung, P. (eds) Building and Using Comparable Corpora. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20128-8_5
