# Tokenlearn

Pre-train Static Word Embeddings
Tokenlearn is a method to pre-train Model2Vec. The method is described in detail in our Tokenlearn blog post.
Install the package with:

```bash
pip install tokenlearn
```
The basic usage of Tokenlearn consists of two CLI scripts: `featurize` and `train`.
Tokenlearn is trained using means from a sentence transformer. To create means, the `tokenlearn-featurize` CLI can be used:

```bash
python3 -m tokenlearn.featurize --model-name "baai/bge-base-en-v1.5" --output-dir "data/c4_features"
```
NOTE: the default model is trained on the C4 dataset. If you want to use a different dataset, the following command can be used:

```bash
python3 -m tokenlearn.featurize \
    --model-name "baai/bge-base-en-v1.5" \
    --output-dir "data/c4_features" \
    --dataset-path "allenai/c4" \
    --dataset-name "en" \
    --dataset-split "train"
```
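Conceptually, featurization stores one mean vector per passage: the average of the sentence transformer's per-token output embeddings. The toy sketch below illustrates this with random vectors standing in for real model outputs (the shapes and NumPy usage are illustrative assumptions, not Tokenlearn's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the per-token output embeddings of one passage
# (seq_len x hidden_dim); a real run would take these from the
# sentence transformer's forward pass.
token_embeddings = rng.normal(size=(12, 768))

# The "mean" feature for the passage is the average over tokens.
passage_mean = token_embeddings.mean(axis=0)
print(passage_mean.shape)  # (768,)
```

The featurize CLI computes such means for every passage in the dataset and writes them to the output directory for the training step.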
To train a model on the featurized data, the `tokenlearn-train` CLI can be used:

```bash
python3 -m tokenlearn.train --model-name "baai/bge-base-en-v1.5" --data-path "data/c4_features" --save-path "<path-to-save-model>"
```
Training will create two models:
- The base trained model.
- The base model with weighting applied. This is the model that should be used for downstream tasks.
NOTE: the code assumes that the padding token ID in your tokenizer is 0. If this is not the case, you will need to modify the code.
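One way to verify this assumption before training is to check where the padding token sits in your tokenizer's vocabulary. The sketch below uses a toy vocab dict; with a real Hugging Face tokenizer you would inspect `tokenizer.pad_token_id` instead (the vocab contents here are made up for illustration):

```python
# Toy stand-in for a tokenizer vocabulary; real tokenizers expose a
# similar mapping (e.g. via get_vocab() in Hugging Face libraries).
vocab = {"[PAD]": 0, "[UNK]": 1, "hello": 2, "world": 3}

pad_id = vocab["[PAD]"]
if pad_id != 0:
    raise ValueError(f"pad token ID is {pad_id}; Tokenlearn expects 0")
print("pad token ID is 0, no changes needed")
```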
To evaluate a model, you can use the following command after installing the optional evaluation dependencies:

```bash
pip install evaluation@git+https://github.com/MinishLab/evaluation@main
```
```python
from model2vec import StaticModel
from evaluation import CustomMTEB, get_tasks, parse_mteb_results, make_leaderboard, summarize_results
from mteb import ModelMeta

# Get all available tasks
tasks = get_tasks()
# Define the CustomMTEB object with the specified tasks
evaluation = CustomMTEB(tasks=tasks)

# Load a trained model
model_name = "tokenlearn_model"
model = StaticModel.from_pretrained(model_name)

# Optionally, add model metadata in MTEB format
model.mteb_model_meta = ModelMeta(
    name=model_name, revision="no_revision_available", release_date=None, languages=None
)

# Run the evaluation
results = evaluation.run(model, eval_splits=["test"], output_folder="results")

# Parse the results and summarize them
parsed_results = parse_mteb_results(mteb_results=results, model_name=model_name)
task_scores = summarize_results(parsed_results)

# Print the results in a leaderboard format
print(make_leaderboard(task_scores))
```
## License

MIT