A library to synthesize text datasets using Large Language Models (LLM)

🦠 Mutate

A library to synthesize text datasets using Large Language Models (LLMs). Mutate reads through the examples in a dataset and generates similar examples using automatically generated few-shot prompts.
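To make the idea of an automatically generated few-shot prompt concrete, here is a minimal sketch of how such a prompt could be assembled for one class. The function `build_few_shot_prompt` and its template are hypothetical and written only for illustration; they are not Mutate's internal implementation. The parameter names loosely mirror the `task_desc`, `shot_count`, and alias arguments used in the usage section.

```python
import random


def build_few_shot_prompt(examples, label, task_desc, shot_count=5,
                          text_alias="Comment", label_alias="sentiment", seed=0):
    """Assemble a few-shot prompt for one class (hypothetical template, not Mutate's own)."""
    # Keep only the examples that belong to the requested class.
    pool = [text for text, lab in examples if lab == label]
    rng = random.Random(seed)
    # Never ask for more shots than the class actually has.
    shots = rng.sample(pool, min(shot_count, len(pool)))

    lines = [task_desc, ""]
    for text in shots:
        lines.append(f"{label_alias}: {label}")
        lines.append(f"{text_alias}: {text}")
        lines.append("")
    # End with an open slot that a causal LM would be asked to complete.
    lines.append(f"{label_alias}: {label}")
    lines.append(f"{text_alias}:")
    return "\n".join(lines)


examples = [
    ("Loved every minute of it.", "pos"),
    ("A dull, lifeless script.", "neg"),
    ("Great cast and a clever plot.", "pos"),
]
prompt = build_few_shot_prompt(
    examples, "pos",
    task_desc="Each item contains a movie review and its sentiment.",
    shot_count=2,
)
print(prompt)
```

The prompt ends with an unfinished entry for the target class, so whatever a causal language model writes next can be read as a new example of that class.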

1. Installation

Install from PyPI:

    pip install mutate-nlp

or install directly from the GitHub repository:

    pip install git+https://github.com/infinitylogesh/mutate

2. Usage

2.1 Synthesize text data from local csv files

    from mutate import pipeline

    pipe = pipeline("text-classification-synthesis",
                    model="EleutherAI/gpt-neo-2.7B",
                    device=1)

    task_desc = "Each item in the following contains movie reviews and corresponding sentiments. Possible sentiments are neg and pos"

    # returns a python generator
    text_synth_gen = pipe("csv",
                          data_files=["local/path/sentiment_classfication.csv"],
                          task_desc=task_desc,
                          text_column="text",
                          label_column="label",
                          text_column_alias="Comment",
                          label_column_alias="sentiment",
                          shot_count=5,
                          class_names=["pos", "neg"])

    # Loop through the generator to synthesize examples by class
    for synthesized_examples in text_synth_gen:
        print(synthesized_examples)

Output:

    {"text": ["The story was very dull and was a waste of my time. This was not a film I would ever watch. The acting was bad. I was bored. There were no surprises. They showed one dinosaur,",
              "I did not like this film. It was a slow and boring film, it didn't seem to have any plot, there was nothing to it. The only good part was the ending, I just felt that the film should have ended more abruptly."],
     "label": ["neg", "neg"]}
    {"text": ["The Bell witch is one of the most interesting, yet disturbing films of recent years. It’s an odd and unique look at a very real, but very dark issue. With its mixture of horror, fantasy and fantasy adventure, this film is as much a horror film as a fantasy film. And it‘s worth your time. While the movie has its flaws, it is worth watching and if you are a fan of a good fantasy or horror story, you will not be disappointed."],
     "label": ["pos"]}
    # and so on .....
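Each item yielded by the generator is a dictionary holding parallel `text` and `label` lists, as the output above shows. The following sketch, assuming that shape, flattens such batches into rows and writes them as CSV. The helper `batches_to_rows` and the sample batches are hypothetical stand-ins, so the snippet runs without loading a model.

```python
import csv
import io


def batches_to_rows(batches):
    """Flatten {"text": [...], "label": [...]} batches into (text, label) rows."""
    for batch in batches:
        # zip pairs each generated text with the label at the same position.
        for text, label in zip(batch["text"], batch["label"]):
            yield text, label


# Stand-in for the generator returned by the pipeline (hypothetical data).
fake_batches = [
    {"text": ["Boring and slow.", "No plot at all."], "label": ["neg", "neg"]},
    {"text": ["Worth watching."], "label": ["pos"]},
]

# An in-memory buffer keeps the example self-contained; to write a real file,
# use open("synthetic.csv", "w", newline="") instead.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["text", "label"])
rows = list(batches_to_rows(fake_batches))
writer.writerows(rows)
print(buffer.getvalue())
```

In real use, `fake_batches` would be replaced by the generator itself, for example `text_synth_gen`.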

2.2 Synthesize text data from 🤗 datasets

Under the hood, Mutate uses the wonderful 🤗 datasets library for dataset processing, so it supports 🤗 datasets out of the box.

    from mutate import pipeline

    pipe = pipeline("text-classification-synthesis",
                    model="EleutherAI/gpt-neo-2.7B",
                    device=1)

    task_desc = "Each item in the following contains customer service queries expressing the mentioned intent"

    synthesizerGen = pipe("banking77",
                          task_desc=task_desc,
                          text_column="text",
                          label_column="label",
                          # if the `text_column` doesn't have a meaningful value
                          text_column_alias="Queries",
                          label_column_alias="Intent",  # if the `label_column` doesn't have a meaningful value
                          shot_count=5,
                          dataset_args=["en"])

    for exp in synthesizerGen:
        print(exp)

Output:

    {"text": ["How can i know if my account has been activated? (This is the one that I am confused about)",
              "Thanks! My card activated"],
     "label": ["activate_my_card", "activate_my_card"]}
    {"text": ["How do i activate this new one? Is it possible?",
              "what is the activation process for this card?"],
     "label": ["activate_my_card", "activate_my_card"]}
    # and so on .....

2.3 I am feeling lucky: Infinitely loop through the dataset to generate examples indefinitely

Caution: Infinitely looping through the dataset has a higher chance of generating duplicate examples.

    from mutate import pipeline

    pipe = pipeline("text-classification-synthesis",
                    model="EleutherAI/gpt-neo-2.7B",
                    device=1)

    task_desc = "Each item in the following contains movie reviews and corresponding sentiments. Possible sentiments are neg and pos"

    # returns a python generator
    text_synth_gen = pipe("csv",
                          data_files=["local/path/sentiment_classfication.csv"],
                          task_desc=task_desc,
                          text_column="text",
                          label_column="label",
                          text_column_alias="Comment",
                          label_column_alias="sentiment",
                          class_names=["pos", "neg"],
                          # Flag to generate indefinite examples
                          infinite_loop=True)

    # Infinite loop
    for exp in text_synth_gen:
        print(exp)
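Because an infinite generator never stops on its own and may repeat itself, it can help to cap the number of batches consumed and drop duplicates. The sketch below shows one way to do that with the standard library. `unique_examples` and `endless_batches` are hypothetical helpers; `endless_batches` only imitates the shape of the pipeline's output and does not call Mutate.

```python
from itertools import islice


def unique_examples(batches):
    """Yield each (text, label) pair once, ignoring case and surrounding whitespace."""
    seen = set()
    for batch in batches:
        for text, label in zip(batch["text"], batch["label"]):
            key = (text.strip().lower(), label)
            if key not in seen:
                seen.add(key)
                yield text, label


def endless_batches():
    """Stand-in for a generator created with infinite_loop=True (hypothetical data)."""
    while True:
        yield {"text": ["Great film.", "great film. "], "label": ["pos", "pos"]}
        yield {"text": ["Too long."], "label": ["neg"]}


# islice stops after 10 batches, so the loop ends even though the source never does.
limited = islice(endless_batches(), 10)
examples = list(unique_examples(limited))
print(examples)
```

The cap is applied to the source batches rather than to the deduplicated stream. Limiting the number of unique examples instead could run forever if the source stops producing anything new.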

3. Support

3.1 Currently supports

Text classification dataset synthesis: few-shot text data synthesis for text classification datasets using causal LLMs (GPT-like).

3.2 Roadmap:

Other types of text dataset synthesis: NER, sentence pairs, etc.
Finetuning support for better quality generation
Pseudo labelling

4. Credit

EleutherAI for democratizing Large LMs.
This library uses 🤗Datasets and 🤗Transformers for processing datasets and models.

5. References

The idea of generating examples from large language models is inspired by the works below:

A Few More Examples May Be Worth Billions of Parameters by Yuval Kirstain, Patrick Lewis, Sebastian Riedel, Omer Levy
GPT3Mix: Leveraging Large-scale Language Models for Text Augmentation by Kang Min Yoo, Dongju Park, Jaewook Kang, Sang-Woo Lee, Woomyeong Park
Data Augmentation using Pre-trained Transformer Models by Varun Kumar, Ashutosh Choudhary, Eunah Cho
