Finetuning the Embedding Model

Try it yourself: Open In Colab

Another way to improve retriever performance is to fine-tune the embedding model itself. Fine-tuning helps the model learn better representations for the documents and queries in your dataset. This is particularly useful when your dataset differs substantially from the data the embedding model was pre-trained on.

We'll use the same dataset as in the previous sections. Start off by splitting the dataset into training and validation sets:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the QA dataset and split it into training and validation sets
df = pd.read_csv("data_qa.csv")
train_df, validation_df = train_test_split(df, test_size=0.2, random_state=42)
train_df.to_csv("data_train.csv", index=False)
validation_df.to_csv("data_val.csv", index=False)
```

You can use any tuning API to fine-tune embedding models. In this example, we'll use llama-index, as it also comes with utilities for synthetic data generation and model training.

We parse the dataset as llama-index text nodes and generate synthetic QA pairs from each node:

```python
from pathlib import Path

from llama_index.core.node_parser import SentenceSplitter
from llama_index.readers.file import PagedCSVReader
from llama_index.finetuning import generate_qa_embedding_pairs
from llama_index.core.evaluation import EmbeddingQAFinetuneDataset
from llama_index.llms.openai import OpenAI

def load_corpus(file):
    # Read each CSV row as a separate document, then split into text nodes
    loader = PagedCSVReader(encoding="utf-8")
    docs = loader.load_data(file=Path(file))
    parser = SentenceSplitter()
    nodes = parser.get_nodes_from_documents(docs)
    return nodes

train_nodes = load_corpus("data_train.csv")
val_nodes = load_corpus("data_val.csv")

# Generate synthetic question-answer pairs for each node using an LLM
train_dataset = generate_qa_embedding_pairs(
    llm=OpenAI(model="gpt-3.5-turbo"), nodes=train_nodes, verbose=False
)
val_dataset = generate_qa_embedding_pairs(
    llm=OpenAI(model="gpt-3.5-turbo"), nodes=val_nodes, verbose=False
)
```

Now we'll use the SentenceTransformersFinetuneEngine to fine-tune the model. You can also use the sentence-transformers or transformers libraries directly:

```python
from llama_index.finetuning import SentenceTransformersFinetuneEngine

finetune_engine = SentenceTransformersFinetuneEngine(
    train_dataset,
    model_id="BAAI/bge-small-en-v1.5",
    model_output_path="tuned_model",
    val_dataset=val_dataset,
)
finetune_engine.finetune()
embed_model = finetune_engine.get_finetuned_model()
```
This saves the fine-tuned embedding model in the tuned_model folder.
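
If you want to work with the saved weights outside llama-index, the output directory can be loaded directly with sentence-transformers. A minimal sketch (the query string is illustrative):

```python
from sentence_transformers import SentenceTransformer

# Load the fine-tuned weights saved by the finetune engine above
model = SentenceTransformer("tuned_model")
embeddings = model.encode(["How do I load the fine-tuned model?"])
print(embeddings.shape)  # (1, 384) for bge-small-en-v1.5
```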

Evaluation results

To evaluate the retriever, you can either use this model to ingest the data into LanceDB directly, or use llama-index's LanceDB integration to create a VectorStoreIndex and use it as a retriever. Running the same hit-rate evaluation as before shows a significant improvement across the vector-based query types (full-text search is unaffected, since it does not use embeddings).
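
As a rough illustration, here's a minimal sketch of the second option; the database path, table name, and top-k value are assumptions, and the hit-rate helper assumes the index was built from the same nodes used to generate val_dataset, so that node IDs match:

```python
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.vector_stores.lancedb import LanceDBVectorStore

# Build an index over the validation corpus with the fine-tuned embeddings
vector_store = LanceDBVectorStore(uri="./lancedb", table_name="qa_corpus")  # assumed path/table
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex(
    val_nodes, storage_context=storage_context, embed_model=embed_model
)
retriever = index.as_retriever(similarity_top_k=5)

def hit_rate_at_k(retriever, dataset):
    # Fraction of queries whose source node appears in the top-k results
    hits = 0
    for query_id, query in dataset.queries.items():
        expected_ids = set(dataset.relevant_docs[query_id])
        retrieved_ids = {r.node.node_id for r in retriever.retrieve(query)}
        hits += bool(expected_ids & retrieved_ids)
    return hits / len(dataset.queries)

print(f"Hit-rate@5: {hit_rate_at_k(retriever, val_dataset):.3f}")
```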

Baseline

| Query Type                        | Hit-rate@5 |
|-----------------------------------|------------|
| Vector Search                     | 0.640      |
| Full-text Search                  | 0.595      |
| Reranked Vector Search            | 0.677      |
| Reranked Full-text Search         | 0.672      |
| Hybrid Search (w/ CohereReranker) | 0.759      |

Fine-tuned model (2 iterations)

| Query Type                        | Hit-rate@5 |
|-----------------------------------|------------|
| Vector Search                     | 0.672      |
| Full-text Search                  | 0.595      |
| Reranked Vector Search            | 0.754      |
| Reranked Full-text Search         | 0.672      |
| Hybrid Search (w/ CohereReranker) | 0.768      |
