llm-jp/text2dataset
Easily turn large English text datasets into Japanese text datasets using open LLMs.
Fig: Japanese translation of the Abirate/english_quotes dataset using the llm-jp/llm-jp-3-3.7b-instruct model.

text2dataset is a tool for converting datasets by translating the data in the "txt" column using an open LLM (such as gemma2 with vLLM) and adding a new column, "txt_ja", that contains the translated Japanese text.

By utilizing the fast LLM inference library vLLM, this tool enables fast translation of large English datasets into Japanese. You can also use text2dataset for other translation-style tasks (e.g. paraphrasing) by modifying the prompt template accordingly.
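Conceptually, the transformation adds one translated column per row. A minimal pure-Python sketch of that idea (the `translate` function here is only a placeholder, not the tool's actual vLLM-backed generation):

```python
def translate(text: str) -> str:
    """Stand-in for the LLM-backed translation step (a real run uses vLLM)."""
    return f"[ja] {text}"

def add_translated_column(rows, source_column="txt", target_column="txt_ja"):
    """Return a copy of each row with a new column holding the translation."""
    return [{**row, target_column: translate(row[source_column])} for row in rows]

rows = [{"txt": "To be, or not to be."}]
print(add_translated_column(rows)[0]["txt_ja"])
```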
This tool is inspired by img2dataset.
- Save the intermediate results in shards: by setting the `number_sample_per_shard` parameter, the dataset is saved in shards with the specified number of samples per shard.
- Resume from checkpoint: by setting the `resume_from_checkpoint` parameter, the translation can be resumed from where it left off.
- Logging with wandb: by setting the `use_wandb` parameter, metrics such as examples_per_sec and count are logged to wandb.
- Push to Hugging Face Hub: by setting the `push_to_hub` parameter, the translated dataset is pushed to the Hugging Face Hub.
- Custom prompt template: by specifying the `prompt_template_path` parameter, you can customize the prompt template for any translation task (e.g., paraphrasing, summarization, etc.).
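The shard-saving behaviour described above can be sketched as follows (a simplified illustration of splitting by `number_sample_per_shard`, not the tool's actual implementation):

```python
def split_into_shards(samples, number_sample_per_shard):
    """Split a list of samples into shards of at most the given size."""
    return [
        samples[i : i + number_sample_per_shard]
        for i in range(0, len(samples), number_sample_per_shard)
    ]

shards = split_into_shards(list(range(10)), number_sample_per_shard=4)
print([len(s) for s in shards])  # → [4, 4, 2]
```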
```bash
$ git clone https://github.com/llm-jp/text2dataset.git
$ cd text2dataset
$ rye sync
```
```bash
$ python src/text2dataset/main.py \
    --model_id llm-jp/llm-jp-3-3.7b-instruct \
    --batch_size 16384 \
    --input_path data/english_quotes.json \
    --source_column text \
    --target_column text_ja \
    --push_to_hub True \
    --push_to_hub_path speed/english_quotes_ja \
    --output_dir data/english_quotes_ja \
    --output_format json
```
Using the llm-jp/llm-jp-3-3.7b-instruct model on an A100 GPU, 2508 English quotes were translated into Japanese in just 21 seconds. The resulting dataset is available at speed/english_quotes_ja.
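For reference, that works out to roughly 119 quotes per second:

```python
quotes, seconds = 2508, 21
throughput = quotes / seconds
print(f"{throughput:.1f} examples/sec")  # → 119.4 examples/sec
```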
You can also use text2dataset to paraphrase texts by changing the prompt template via the `prompt_template_path` parameter.
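The repository ships the paraphrase template as config/paraphrase.yaml. The snippet below is only a hypothetical illustration of what such a template file might contain; the field name and placeholder syntax are assumptions, so check the actual file for the real schema:

```yaml
# Hypothetical prompt template (field name and {text} placeholder are illustrative)
prompt_template: |
  Paraphrase the following text while keeping its meaning:
  {text}
```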
```bash
$ python src/text2dataset/main.py \
    --model_id google/gemma-2-2b-it \
    --batch_size 16384 \
    --input_path data/english_quotes.json \
    --source_column text \
    --target_column text_paraphrase \
    --push_to_hub True \
    --push_to_hub_path speed/english_quotes_paraphrase \
    --output_dir data/english_quotes_paraphrase \
    --output_format json \
    --prompt_template_path config/paraphrase.yaml
```
The resulting dataset is available at speed/english_quotes_paraphrase.
Translation of the neuralwork/arxiver dataset
You can directly translate datasets hosted on the Hugging Face Hub by specifying the dataset name in `input_path`. In this example, the `abstract` column of the `neuralwork/arxiver` dataset is translated by setting `input_path` to `neuralwork/arxiver` and `source_column` to `abstract`.
```bash
$ python src/text2dataset/main.py \
    --model_id google/gemma-2-2b-it \
    --batch_size 16384 \
    --input_path neuralwork/arxiver \
    --source_column abstract \
    --target_column abstract_ja \
    --push_to_hub True \
    --push_to_hub_path speed/arxiver_ja \
    --output_dir data/arxiver_ja \
    --output_format json \
    --use_wandb True \
    --wandb_run_name arxiver
```
The neuralwork/arxiver dataset contains 138k rows of abstracts, and it took 2.5 hours to translate them into Japanese using the google/gemma-2-2b-it model on an A100 GPU. The resulting dataset is available at speed/arxiver_ja.
- Translation on Multiple GPUs in Parallel
To run translations on multiple GPUs concurrently, split the input dataset into several shards (directories) and execute the translation for each shard in parallel. Remember to set the `gpu_id` parameter to the corresponding GPU ID for each shard.
Currently, you need to manually split the input dataset into shards and run the translation for each shard in parallel to utilize multiple GPUs. A built-in feature that automatically splits the input dataset into shards and runs the translation on multiple GPUs in parallel would be a welcome addition. If you have any ideas or suggestions, please feel free to open an issue or pull request.
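A minimal sketch of the manual approach, assuming the input has already been split into per-shard directories (the shard paths below are illustrative; only the `gpu_id` parameter comes from the tool itself, and each command would be launched in its own process):

```python
def build_commands(shard_dirs, gpu_ids):
    """Pair each shard with a GPU and build the corresponding launch command."""
    commands = []
    for shard_dir, gpu_id in zip(shard_dirs, gpu_ids):
        commands.append(
            [
                "python", "src/text2dataset/main.py",
                "--input_path", shard_dir,
                "--gpu_id", str(gpu_id),
            ]
        )
    return commands

# Illustrative shard directories; launch each printed command in parallel.
for cmd in build_commands(["data/shard_0", "data/shard_1"], [0, 1]):
    print(" ".join(cmd))
```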
When using this tool, please pay attention to the license of both the dataset being translated and the LLM you use.
Contributions are welcome! If you have any questions or suggestions, please feel free to open an issue or pull request.
```bash
git tag -a v0.x.x -m "version 0.x.x"
git push origin --tags
```
```bash
$ rye lint
$ rye format
```