Pre-train Static Word Embeddings

MinishLab/tokenlearn


Tokenlearn is a method to pre-train Model2Vec.

The method is described in detail in our Tokenlearn blog post.

Quickstart

Install the package with:

pip install tokenlearn

The basic usage of Tokenlearn consists of two CLI scripts: featurize and train.

Tokenlearn is trained on mean embeddings ("means") produced by a sentence transformer. To create these means, use the tokenlearn-featurize CLI:

python3 -m tokenlearn.featurize --model-name "baai/bge-base-en-v1.5" --output-dir "data/c4_features"

NOTE: the default model is trained on the C4 dataset. If you want to use a different dataset, the following code can be used:

python3 -m tokenlearn.featurize \
    --model-name "baai/bge-base-en-v1.5" \
    --output-dir "data/c4_features" \
    --dataset-path "allenai/c4" \
    --dataset-name "en" \
    --dataset-split "train"
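
To make "means" concrete: one common way to get a single vector per text from a sentence transformer is to average its token embeddings while skipping padding positions. The sketch below shows that idea in plain Python on toy numbers. It is an assumption about what "means" refers to here, not the code that tokenlearn.featurize actually runs, and `masked_mean` is a hypothetical helper name.

```python
def masked_mean(token_vectors, attention_mask):
    """Average token vectors, skipping positions where the mask is 0 (padding)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    # Component-wise average over the kept (non-padding) token vectors.
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy example: three token positions, the last one is padding.
toy_vectors = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
toy_mask = [1, 1, 0]
mean_vector = masked_mean(toy_vectors, toy_mask)
print(mean_vector)
```

Only the first two vectors contribute to the average; the padded position is ignored rather than pulling the mean toward zero.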

To train a model on the featurized data, the tokenlearn-train CLI can be used:

python3 -m tokenlearn.train --model-name "baai/bge-base-en-v1.5" --data-path "data/c4_features" --save-path "<path-to-save-model>"

Training will create two models:

  • The base trained model.
  • The base model with weighting applied. This is the model that should be used for downstream tasks.
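
The featurize and train steps share two values: the model name, and the directory that featurize writes to and train reads from. As a minimal sketch, the commands can be assembled in one place so those values stay in sync. `build_commands` and the save path used here are hypothetical; the flags are the ones shown in the commands above.

```python
import shlex


def build_commands(model_name, features_dir, save_path):
    """Return the featurize and train commands as argument lists."""
    featurize = [
        "python3", "-m", "tokenlearn.featurize",
        "--model-name", model_name,
        "--output-dir", features_dir,
    ]
    train = [
        "python3", "-m", "tokenlearn.train",
        "--model-name", model_name,
        "--data-path", features_dir,  # same directory featurize wrote to
        "--save-path", save_path,
    ]
    return [featurize, train]


commands = build_commands(
    model_name="baai/bge-base-en-v1.5",
    features_dir="data/c4_features",
    save_path="models/tokenlearn_model",  # hypothetical output path
)
for command in commands:
    # Print only; pass each list to subprocess.run(command, check=True) to execute it.
    print(shlex.join(command))
```

The sketch only prints the commands, so it can be run safely before starting a long featurization job.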

NOTE: the code assumes that the padding token ID in your tokenizer is 0. If this is not the case, you will need to modify the code.
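
A quick way to check this assumption before training is to inspect the tokenizer's padding token ID. The sketch below uses stand-in objects so that it runs without downloading anything; `pad_token_is_zero` is a hypothetical helper, and it assumes the tokenizer exposes a `pad_token_id` attribute, as Hugging Face tokenizers do.

```python
from types import SimpleNamespace


def pad_token_is_zero(tokenizer):
    """Return True if the tokenizer's padding token ID is 0."""
    pad_id = getattr(tokenizer, "pad_token_id", None)
    return pad_id == 0


# Stand-in objects so the sketch runs without downloading a tokenizer.
ok_tokenizer = SimpleNamespace(pad_token_id=0)
other_tokenizer = SimpleNamespace(pad_token_id=1)

for name, tok in [("ok_tokenizer", ok_tokenizer), ("other_tokenizer", other_tokenizer)]:
    if pad_token_is_zero(tok):
        print(f"{name}: padding ID is 0, no changes needed")
    else:
        print(f"{name}: padding ID is {tok.pad_token_id}, the training code needs adjusting")
```

In practice, the stand-ins would be replaced by the tokenizer of the model passed to --model-name, for example one loaded with `AutoTokenizer.from_pretrained` from the transformers library.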

Evaluation

To evaluate a model, first install the optional evaluation dependencies:

pip install evaluation@git+https://github.com/MinishLab/evaluation@main

Then run the following code:

from model2vec import StaticModel

from evaluation import CustomMTEB, get_tasks, parse_mteb_results, make_leaderboard, summarize_results
from mteb import ModelMeta

# Get all available tasks
tasks = get_tasks()

# Define the CustomMTEB object with the specified tasks
evaluation = CustomMTEB(tasks=tasks)

# Load a trained model
model_name = "tokenlearn_model"
model = StaticModel.from_pretrained(model_name)

# Optionally, add model metadata in MTEB format
model.mteb_model_meta = ModelMeta(
    name=model_name,
    revision="no_revision_available",
    release_date=None,
    languages=None,
)

# Run the evaluation
results = evaluation.run(model, eval_splits=["test"], output_folder="results")

# Parse the results and summarize them
parsed_results = parse_mteb_results(mteb_results=results, model_name=model_name)
task_scores = summarize_results(parsed_results)

# Print the results in a leaderboard format
print(make_leaderboard(task_scores))

License

MIT

