Jojajovai Guarani-Spanish Parallel Corpus


pln-fing-udelar/jojajovai


Jojajovai is a Guarani-Spanish parallel corpus of about 30,000 sentence pairs, structured as a set of different sources. This corpus is the result of a collaboration between Guarani MT researchers from Universidad de la República, Uruguay; Universidad Nacional de Itapúa, Paraguay; Universidade Tecnológica Federal do Paraná, Brazil; Universidad de Granada, Spain; and Universitat Oberta de Catalunya, Spain.

Characteristics

The corpus is structured as a collection of subsets from different sources, further split into training, development and test sets. A sample of sentences from the test set was manually annotated by native speakers in order to incorporate meta-linguistic annotations about the Guarani dialects present in the corpus and also the correctness of the alignment and translation.

We hope this data can be used not only to train machine translation systems, but also to test them and to analyze the results at different levels of granularity according to the different subsets.

Source        Pairs    Train     Dev    Test
abc          16,492   11,550   2,470   2,472
anlp          2,000        -     996   1,004
blogs         2,444    1,712     361     371
hackaton        513      359      77      77
libro_gn      1,423      992     215     216
libro_td      1,016      711     153     152
seminario     2,179    1,535     322     322
spl           4,788    3,348     720     720
Total        30,855   20,207   5,314   5,334

The file jojajovai_all.csv contains the data of the corpus.
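As a starting point for exploring the file, the following minimal sketch counts sentence pairs per source and split using only the Python standard library. This README does not document the CSV header, so the column names `source` and `split` (and the `gn`/`es` columns in the sample) are assumptions made for illustration; check the first line of jojajovai_all.csv and adjust them before use.

```python
import csv
import io
from collections import Counter

# Hypothetical column names: the README does not document the CSV header,
# so inspect the first line of jojajovai_all.csv and adjust these.
SOURCE_COL = "source"
SPLIT_COL = "split"


def count_pairs(csv_file, source_col=SOURCE_COL, split_col=SPLIT_COL):
    """Count sentence pairs per (source, split) from an open CSV file."""
    reader = csv.DictReader(csv_file)
    counts = Counter()
    for row in reader:
        counts[(row[source_col], row[split_col])] += 1
    return counts


# Tiny in-memory stand-in for the real file; the rows are invented placeholders.
sample = io.StringIO(
    "source,split,gn,es\n"
    "abc,train,<guarani 1>,<spanish 1>\n"
    "abc,test,<guarani 2>,<spanish 2>\n"
    "anlp,dev,<guarani 3>,<spanish 3>\n"
    "abc,train,<guarani 4>,<spanish 4>\n"
)
sample_counts = count_pairs(sample)
print(sample_counts)

# With the real file (assuming the column names above match its header):
# with open("jojajovai_all.csv", newline="", encoding="utf-8") as f:
#     counts = count_pairs(f)
```

If the column names match, the resulting counts should agree with the table above.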

Annotations

Three native annotators were given a sample of sentence pairs from each set, and were asked to indicate the dialect of the Guarani sentences (standard Guarani, Jopara, Jehe'a, or other possibilities), and to categorize the correctness of the translation pair, with the following options:

  • A: The sentences in the pair correspond completely.
  • B: The Spanish sentence has more information.
  • C: The Guarani sentence has more information.
  • D: The sentences do not match.

The file jojajovai_sample_annotations.csv contains the annotations of the sample.
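When several annotators label the same pair, one simple way to obtain a single correctness label is a majority vote over the A to D options. The sketch below illustrates that idea on invented votes. It is not necessarily how the authors aggregated the annotations: this README does not describe an aggregation method, the layout of the annotation file, or whether every pair was labeled by all three annotators, and the function name `majority_label` is hypothetical.

```python
from collections import Counter

# The four correctness options listed above.
CORRECTNESS_LABELS = {"A", "B", "C", "D"}


def majority_label(labels):
    """Return the label chosen by more than half of the annotators, or None."""
    if not labels:
        return None
    unknown = set(labels) - CORRECTNESS_LABELS
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes > len(labels) / 2 else None


# Invented votes for three sentence pairs, one inner list per pair.
example_votes = [
    ["A", "A", "B"],  # two of three annotators agree on A
    ["D", "D", "D"],  # unanimous
    ["A", "B", "C"],  # no majority
]
consensus = [majority_label(votes) for votes in example_votes]
print(consensus)  # ['A', 'D', None]
```

Pairs without a majority are returned as `None` rather than resolved arbitrarily, so they can be inspected separately.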

Using the Data

If you use this dataset, please cite:

Luis Chiruzzo, Santiago Góngora, Aldo Alvarez, Gustavo Giménez-Lugo, Marvin Agüero-Torales, Yliana Rodríguez. (2022). Jojajovai: A Parallel Guarani-Spanish Corpus for MT Benchmarking. Proceedings of the 13th Language Resources and Evaluation Conference, LREC 2022.

You can contact us by email at pln@fing.edu.uy.
