tznurmin/TEA


Taxonomic Entity Augmentation makes biomedical texts less repetitive

License

MIT license


Taxonomic Entity Augmentation (TEA)

TEA is a text augmentation tool that helps prevent machine learning models from overfitting to important but repetitive content in NLP examples drawn from biological texts. It targets taxonomic species names and strain names, either automatically switching species names to other valid taxonomic names or scrambling specified strain names in the text.
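The two ideas can be illustrated with a toy sketch. This is not TEA's implementation: the species pool, the function names, and the rule of replacing letters with random uppercase letters and digits with random digits are assumptions made for illustration, the last one loosely inferred from the sample outputs in the Quickstart.

```python
import random
import string

# Hypothetical stand-in list; TEA draws replacements from its own taxonomic data.
SPECIES_POOL = ["E. coli", "B. subtilis", "S. aureus", "D. cephalotes"]


def toy_switch(text, species, rng):
    """Replace a known species mention with a different name from the pool."""
    candidates = [name for name in SPECIES_POOL if name != species]
    return text.replace(species, rng.choice(candidates))


def toy_scramble(text, strain, rng):
    """Replace a strain name with a random one of the same letter/digit shape."""
    scrambled = "".join(
        rng.choice(string.ascii_uppercase) if ch.isalpha()
        else rng.choice(string.digits) if ch.isdigit()
        else ch
        for ch in strain
    )
    return text.replace(strain, scrambled)


rng = random.Random(0)  # seeded so repeated runs give the same result
switched = toy_switch("Hello E. coli!", "E. coli", rng)
scrambled = toy_scramble("E. coli strain HB101 grows fast.", "HB101", rng)
print(switched)
print(scrambled)
```

Unlike this sketch, TEA's switch method is called without being told which species the text mentions (see Quickstart).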

To see TEA in action, refer to the TEA_ft repository.

Installation

You will need a tokenizer that is compatible with the Hugging Face libraries. The Transformers package from Hugging Face includes the required dependency. Run the following to install it:

pip install transformers

Next, clone this repository and run the following to install TEA as a Python package:

cd TEA
pip install .
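To confirm that both installs succeeded, a quick check along these lines may help. It is a minimal sketch using only the standard library; the helper name is_installed is hypothetical, and the module names are the import names used in the Quickstart examples.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


# 'transformers' and 'tea' are the import names used in this README's examples.
for name in ("transformers", "tea"):
    status = "found" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```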

Quickstart

The package provides two general text augmentation strategies.

To switch species:

from transformers import AutoTokenizer
from tea import TEA

tokenizer = AutoTokenizer.from_pretrained('dmis-lab/biobert-base-cased-v1.2', do_lower_case=False, model_max_length=100000)
tea = TEA(tokenizer)

tea.switch('Hello E. coli!')
# => 'Hello D. cephalotes!'

To scramble strains:

from transformers import AutoTokenizer
from tea import TEA

tokenizer = AutoTokenizer.from_pretrained('dmis-lab/biobert-base-cased-v1.2', do_lower_case=False, model_max_length=100000)
tea = TEA(tokenizer)

tea.scramble('E. coli strain HB101 is a handy laboratory strain for molecular biology laboratory work.', ['HB101'])
# => 'E. coli strain FQ414 is a handy laboratory strain for molecular biology.'

# this also works
tea.scramble('E. coli strain HB101 is a handy laboratory strain for molecular biology laboratory work.', ['strain HB101'])
# => 'E. coli strain SW565 is a handy laboratory strain for molecular biology.'
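When augmenting many examples, the two calls above can be wrapped in a small helper. The sketch below is one possible arrangement, not part of TEA: the helper name augment_examples, the (text, strain_names) pair layout, and the choice to chain switching and scrambling on the same text are all assumptions. The helper takes the two strategies as plain callables, so the demo runs with stand-in functions. With TEA installed, tea.switch and tea.scramble could be passed instead.

```python
def augment_examples(examples, switch_fn, scramble_fn, copies=2):
    """Yield each original text followed by `copies` augmented variants.

    `examples` is a list of (text, strain_names) pairs; this layout is an
    assumption for illustration, not a TEA data format.
    """
    for text, strains in examples:
        yield text
        for _ in range(copies):
            variant = switch_fn(text)
            if strains:  # only scramble when strain names were annotated
                variant = scramble_fn(variant, strains)
            yield variant


# Stand-in callables so the sketch runs without TEA installed.
def fake_switch(text):
    return text.replace("E. coli", "B. subtilis")


def fake_scramble(text, strains):
    for strain in strains:
        text = text.replace(strain, "XX000")
    return text


data = [("E. coli strain HB101 is handy.", ["HB101"]), ("Hello E. coli!", [])]
for line in augment_examples(data, fake_switch, fake_scramble, copies=1):
    print(line)
```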

Dataset generation

A script (gen_strategy.py) is provided as an example of using TEA as part of a more advanced dataset generation pipeline. The example script assumes that TEA_curated_data is located in the directory where the script is run. Run the following command to download the curated data:

wget https://github.com/tznurmin/TEA_curated_data/archive/refs/tags/v1.0.tar.gz -qO - | tar -xz && mv TEA_curated_data-1.0 TEA_curated_data
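Where wget or tar is not available, the same download, extract, and rename steps can be approximated with the Python standard library. This is a minimal sketch rather than part of TEA; the function names unpack_and_rename and fetch_curated_data are hypothetical, and the URL and folder names are taken from the command above.

```python
import io
import shutil
import tarfile
import urllib.request
from pathlib import Path

URL = "https://github.com/tznurmin/TEA_curated_data/archive/refs/tags/v1.0.tar.gz"


def unpack_and_rename(archive_bytes, extracted_name, target_name, dest="."):
    """Extract a gzipped tarball held in memory and rename its top-level folder."""
    dest = Path(dest)
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tar:
        tar.extractall(dest)
    shutil.move(str(dest / extracted_name), str(dest / target_name))
    return dest / target_name


def fetch_curated_data(dest="."):
    """Download the v1.0 archive and unpack it as TEA_curated_data."""
    with urllib.request.urlopen(URL) as response:
        data = response.read()
    return unpack_and_rename(data, "TEA_curated_data-1.0", "TEA_curated_data", dest)
```

Calling fetch_curated_data() from the directory that holds gen_strategy.py should leave a TEA_curated_data folder next to the script, matching what the shell command produces.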
