- Monica Lestari Paramita
- Ahmet Aker
- Paul Clough
- Robert Gaizauskas
- Nikos Glaros
- Nikos Mastropavlos
- Olga Yannoutsou
- Radu Ion
- Dan Ștefănescu
- Alexandru Ceauşu
- Dan Tufiș
- Judita Preiss
Part of the book series: Theory and Applications of Natural Language Processing (NLP)
Abstract
The availability of parallel corpora is limited, especially for under-resourced languages and narrow domains. On the other hand, the number of comparable documents in these areas that are freely available on the Web is continuously increasing. Algorithmic approaches to identifying these documents on the Web are needed to build comparable corpora automatically for these under-resourced languages and domains. How do we identify these comparable documents? What approaches should be used to collect them from different Web sources? In this chapter, we first review previous techniques that have been developed for collecting comparable documents from the Web. We then describe in detail three new techniques for gathering comparable documents from three different types of Web sources: Wikipedia, news articles, and narrow domains.
Notes
- 1.
For example, “letter of credit” in English may be translated into Dutch as “accreditief” or “kredietbrief” (based on EuroWordNet).
- 2.
Wikipedia inter-language links connect documents from different languages that describe the same topic.
- 3.
Good-quality articles are those that senior Wikipedia moderators and the Romanian Wikipedia community think to be complete, well written, with good references, etc.
- 4.
We used a clean dictionary containing more than 1.5 million entries for RO-EN; for other language pairs, dictionaries were built in the ACCURAT project using GIZA++.
- 5.
The Wikipedia Extractor tool is available for download from the ACCURAT project website (Paramita et al. 2012).
- 6.
For named entity parsing, we use OpenNLP tools: http://incubator.apache.org/opennlp/
- 7.
Boilerpipe (http://code.google.com/p/boilerpipe/) is used to extract the textual content from the URL.
- 8.
Titles that have fewer than five content words are not taken into consideration.
- 9.
- 10.
- 11.
Have all topic-core-terms been included? Have other terms effectively pointing to the topic also been included? Does the topic definition file contain multi-word strong topic indicators? Have all terms been ranked consistently?
- 12.
Do seed URLs in the source language and seed URLs in the target language address highly comparable Web documents? Have multilingual sites, if any, been included?
- 13.
The longer the better, especially in cases where there are not too many Web documents relevant to the topic selected.
- 14.
For example, increase “Minimum unique terms that must exist in clean content” from the default value of 3 to 5.
- 15.
LDA modeling can abstract a model from a relatively small corpus, and a tenth of the original Reuters corpus is much more manageable in terms of memory and requirements.
References
ACCURAT Deliverable: D3.3, D3.4, D3.5.
Adafre, S. F., & de Rijke, M. (2006). Finding similar sentences across multiple languages in Wikipedia. Proceedings of the EACL Workshop on New Text, Trento, Italy.
Aker, A., Kanoulas, E., & Gaizauskas, R. (2012). A light way to collect comparable corpora from the Web. Proceedings of LREC 2012, 21–27 May, Istanbul, Turkey.
Ardö, A., & Golub, K. (2007). Documentation for the Combine (Focused) Crawling System. http://combine.it.lth.se/documentation/DocMain/
Argaw, A. A., & Asker, L. (2005). Web mining for an Amharic-English bilingual corpus. Proceedings of the 1st International Conference on Web Information Systems and Technologies, WEBIST ’05 (pp. 239–246). INSTICC Press.
Baroni, M., & Bernardini, S. (2004). BootCaT: Bootstrapping corpora and terms from the Web. Proceedings of LREC 2004 (pp. 1313–1316).
Barzilay, R., & McKeown, K. R. (2001). Extracting paraphrases from a parallel corpus. ACL ’01: Proceedings of the 39th Annual Meeting on Association for Computational Linguistics (pp. 50–57). Association for Computational Linguistics, Morristown, NJ.
Bharadwaj, R. G., & Varma, V. (2011). Language independent identification of parallel sentences using Wikipedia. Proceedings of the 20th International Conference Companion on World Wide Web, WWW ’11 (pp. 11–12), ACM, New York, NY.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. The Journal of Machine Learning Research, 3, 993–1022.
Braschler, P. S. (1998). Multilingual information retrieval based on document alignment techniques. Research and Advanced Technology for Digital Libraries: Second European Conference, ECDL’98, Heraklion, Crete, Cyprus, September 21–23, 1998: Proceedings, 183. Springer.
Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1–7), 107–117.
Callison-Burch, C., Koehn, P., & Osborne, M. (2006). Improved statistical machine translation using paraphrases. Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics (pp. 17–24). Association for Computational Linguistics, Morristown, NJ.
Cavnar, W. B., & Trenkle, J. M. (1994). N-gram-based text categorization. Ann Arbor MI, 48113(2), 161–175.
Chakrabarti, S., Punera, K., & Subramanyam, M. (2002, May). Accelerated focused crawling through online relevance feedback. Proceedings of the 11th International Conference on World Wide Web (pp. 148–159). ACM.
Cho, J., Garcia-Molina, H., & Page, L. (1998). Efficient crawling through URL ordering. Computer Networks and ISDN Systems, 30(1–7), 161–172.
De Bra, P. M. E., & Post, R. D. J. (1994). Information retrieval in the World-Wide Web: Making client-based searching feasible. Computer Networks and ISDN Systems, 27(2), 183–192.
Dimalen, D. M. D., & Roxas, R. (2007). AutoCor: A query based automatic acquisition of corpora of closely-related languages. Proceedings of the 21st PACLIC (pp. 146–154).
Esplà-Gomis, M., & Forcada, M. L. (2010). Combining content-based and URL-based heuristics to harvest aligned bitexts from multilingual sites with bitextor. The Prague Bulletin of Mathematical Linguistics, 93, 77–86.
Filatova, E. (2009). Directions for exploiting asymmetries in multilingual Wikipedia. Proceedings of the Third International Workshop on Cross Lingual Information Access: Addressing the Information Need of Multilingual Societies (CLIAWS3 ’09).
Fung, P., & Cheung, P. (2004). Mining very-non-parallel corpora: Parallel sentence and lexicon extraction via bootstrapping and EM. Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, EMNLP ’04 (pp. 57–63), Citeseer.
Gamallo, P., & Garcia, M. (2012). Extraction of bilingual cognates from Wikipedia. Computational Processing of the Portuguese Language (pp. 63–72). Springer.
Ghani, R., Jones, R., & Mladenic, D. (2005). Building minority language corpora by learning to generate web search queries. Knowledge and Information Systems, 7(1), 56–83.
Hassan, A., Fahmy, H., & Hassan, H. (2007). Improving named entity translation by exploiting comparable and parallel corpora. Proceedings of the 2007 Conference on Recent Advances in Natural Language Processing (RANLP), AMML Workshop.
Hersovici, M., Jacovi, M., Maarek, Y. S., Pelleg, D., Shtalhaim, M., & Ur, S. (1998). The sharksearch algorithm: An application: Tailored Web site mapping. Computer Networks and ISDN Systems, 30(1–7), 317–326.
Huang, D., Zhao, L., Li, L., & Yu, H. (2010). Mining large-scale comparable corpora from Chinese-English news collections. Proceedings of the 23rd International Conference on Computational Linguistics: Posters (pp. 472–480). Association for Computational Linguistics.
Ion, R., Tufiş, D., Boroş, T., Ceauşu, A., & Ştefănescu, D. (2010). On-line compilation of comparable corpora and their evaluation. Proceedings of the 7th International Conference Formal Approaches to South Slavic and Balkan Languages (FASSBL7) (pp. 29–34). Croatian Language Technologies Society – Faculty of Humanities and Social Sciences, University of Zagreb, Dubrovnik, Croatia, October 2010.
Kauchak, D., & Barzilay, R. (2006). Paraphrasing for automatic evaluation. Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics (pp. 455–462). Association for Computational Linguistics, Morristown, NJ.
Koehn, P. (2009). Statistical machine translation. Cambridge University Press.
Kohlschütter, C., Fankhauser, P., & Nejdl, W. (2010). Boilerplate detection using shallow text features. The Third ACM International Conference on Web Search and Data Mining.
Kumano, T., Tanaka, H., & Tokunaga, T. (2007). Extracting phrasal alignments from comparable corpora by using joint probability SMT model. Proceedings of the 11th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-07) (pp. 95–103).
Lü, Y., Huang, J., & Liu, Q. (2007, June). Improving statistical machine translation performance by training data selection and optimization. EMNLP-CoNLL (Vol. 34, pp. 3–350).
Marton, Y., Callison-Burch, C., & Resnik, P. (2009). Improved statistical machine translation using monolingually-derived paraphrases. Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (pp. 381–390). Association for Computational Linguistics.
Mastropavlos, N., & Papavassiliou, V. (2011). Automatic acquisition of bilingual language resources. Proceedings of the 10th International Conference on Greek Linguistics, Komotini, Greece.
Menczer, F., & Belew, R. (2000). Adaptive retrieval agents: Internalizing local context and scaling up to the Web. Machine Learning, 39(2–3), 203–242.
Munteanu, D. S., & Marcu, D. (2002). Processing comparable corpora with bilingual suffix trees. EMNLP ’02: Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing (pp. 289–295). Association for Computational Linguistics, Morristown, NJ.
Munteanu, D. S., & Marcu, D. (2005). Improving machine translation performance by exploiting non-parallel corpora. Computational Linguistics, 31(4), 477–504.
Munteanu, D. S., & Marcu, D. (2006). Extracting parallel sub-sentential fragments from non-parallel corpora. ACL-44: Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics (pp. 81–88). Association for Computational Linguistics, Morristown, NJ.
Nakov, P. (2008). Paraphrasing verbs for noun compound interpretation. Proceedings of the Workshop on Multiword Expressions, LREC-2008.
Paramita, M., Clough, P., Aker, A., & Gaizauskas, R. (2012). Correlation between similarity measures for inter-language linked Wikipedia articles. Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012) (pp. 790–797), Istanbul, Turkey.
Passerini, A., Frasconi, P., & Soda, G. (2001). Evaluation methods for focused crawling, Lecture Notes in Computer Science 2175, pp. 33–45.
Phan, X. H., Nguyen, L. M., & Horiguchi, S. (2008, April). Learning to classify short and sparse text and web with hidden topics from large-scale data collections. Proceedings of the 17th International Conference on World Wide Web (pp. 91–100). ACM.
Pinkerton, B. (1994). Finding what people want: Experiences with the Web Crawler. Proceedings of the 2nd International World Wide Web Conference.
Preiss, J. (2012). Identifying comparable corpora using LDA. Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT ’12) (pp. 558–562). Association for Computational Linguistics, Stroudsburg, PA.
Rapp, R. (1999). Automatic identification of word translations from unrelated English and German corpora. Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics (pp. 519–526). Association for Computational Linguistics.
Resnik, P. (1998). Parallel strands: A preliminary investigation into mining the web for bilingual text. In D. Farwell, L. Gerber, & E. Hovy (Eds.), Machine Translation and the Information Soup: Third Conference of the Association for Machine Translation in the Americas (AMTA-98), Langhorne, PA, Lecture Notes in Artificial Intelligence 1529, Springer, October, 1998.
Resnik, P. (1999). Mining the web for bilingual text. Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics (pp. 527–534). Association for Computational Linguistics.
Rose, T. G., Stevenson, M., & Whitehead, M. (2002). The Reuters corpus volume 1 – from yesterday’s news to tomorrow’s language resources. Proceedings of the Third International Conference on Language Resources and Evaluation (pp. 827–832).
Sharoff, S., Babych, B., & Hartley, A. (2006). Using comparable corpora to solve problems difficult for human translators. Proceedings of the COLING/ACL on Main Conference Poster Sessions (pp. 739–746). Association for Computational Linguistics, Morristown, NJ.
Simard, M., Foster, G. F., & Isabelle, P. (1993). Using cognates to align sentences in bilingual corpora. In A. Gawman, E. Kidd, & P-Å. Larson (Eds.), Proceedings of the 1993 Conference of the Centre for Advanced Studies on Collaborative Research: Distributed Computing (CASCON ’93) (Vol. 2, pp. 1071–1082). IBM Press.
Smith, J. R., Quirk, C., & Toutanova, K. (2010). Extracting parallel sentences from comparable corpora using document level alignment. In NAACL-HLT (pp. 403–411).
Steinberger, R., Pouliquen, B., & Ignat, C. (2005). Navigating multilingual news collections using automatically extracted information. Journal of Computing and Information Technology, 13(4), 257–264.
Talvensaari, T., Pirkola, A., Järvelin, K., Juhola, M., & Laurikkala, J. (2008). Focused web crawling in the acquisition of comparable corpora. Information Retrieval, 11(5), 427–445.
Theobald, M., Siddharth, J., & Paepcke, A. (2008). SpotSigs: Robust and efficient near duplicate detection in large web collections. 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2008).
Tomás, J., Bataller, J., Casacuberta, F., & Lloret, J. (2001). Mining Wikipedia as a parallel and comparable corpus. Language Forum (Vol. 34, No. 1, pp. 123–137). Bahri Publications.
Uszkoreit, J., Ponte, J. M., Popat, A. C., & Dubiner, M. (2010, August). Large scale parallel document mining for machine translation. Proceedings of the 23rd International Conference on Computational Linguistics (pp. 1101–1109). Association for Computational Linguistics.
Yu, K., & Tsujii, J. (2009). Extracting bilingual dictionary from comparable corpora with dependency heterogeneity. Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers (pp. 121–124). Association for Computational Linguistics, Stroudsburg, PA.
Zhao, S., Niu, C., Zhou, M., Liu, T., & Li, S. (2008, June). Combining multiple resources to improve SMT-based paraphrasing model. Proceedings of ACL-08: HLT (pp. 1021–1029). Association for Computational Linguistics, Columbus, OH.
Zhang, Y., Wu, K., Gao, J., & Vines, P. (2006). Automatic acquisition of Chinese-English parallel corpus from the web. Proceedings of 28th European Conference on Information Retrieval ECIR 2006, April 10–12, 2006, London.
Author information
Authors and Affiliations
University of Sheffield, Sheffield, UK
Monica Lestari Paramita, Ahmet Aker, Paul Clough, Robert Gaizauskas & Judita Preiss
Institute for Language and Speech Processing (ILSP), Athens, Greece
Nikos Glaros, Nikos Mastropavlos & Olga Yannoutsou
Research Institute for Artificial Intelligence, Romanian Academy Center for Artificial Intelligence (RACAI), Bucharest, Romania
Radu Ion, Dan Ștefănescu, Alexandru Ceauşu & Dan Tufiș
Corresponding author
Correspondence to Robert Gaizauskas.
Editor information
Editors and Affiliations
Tilde, Riga, Latvia
Inguna Skadiņa
Department of Computer Science, University of Sheffield, Sheffield, UK
Robert Gaizauskas
School of Modern Languages & Cultures, University of Leeds, Leeds, UK
Bogdan Babych
Faculty of Humanities & Social Sciences, University of Zagreb, Zagreb, Croatia
Nikola Ljubešić
Institute for Artificial Intelligence, Romanian Academy, Bucharest, Romania
Dan Tufiş
Tilde, Riga, Latvia
Andrejs Vasiļjevs
Additional information
Chapter editors: Robert Gaizauskas and Monica Lestari Paramita
Copyright information
© 2019 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Paramita, M.L. et al. (2019). Collecting Comparable Corpora. In: Skadiņa, I., Gaizauskas, R., Babych, B., Ljubešić, N., Tufiş, D., Vasiļjevs, A. (eds) Using Comparable Corpora for Under-Resourced Areas of Machine Translation. Theory and Applications of Natural Language Processing. Springer, Cham. https://doi.org/10.1007/978-3-319-99004-0_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99003-3
Online ISBN: 978-3-319-99004-0
eBook Packages: Computer Science, Computer Science (R0)