# llama-datasets

GitHub repo for storing LlamaDatasets.


This repo is a companion to the [llama-hub repo](https://github.com/run-llama/llama-hub) and serves as the actual storage of the data files associated with a llama-dataset. Like tools, loaders, and llama-packs, llama-datasets are offered through llama-hub. You can conveniently view all of the available llama-hub artifacts on the llama-hub website.

The primary use of a llama-dataset is evaluating the performance of a RAG system. In particular, it serves as a fresh test set (in traditional machine-learning terms) over which to build a RAG system, generate predictions, and then perform evaluation comparing the predicted responses against the reference responses.
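To make the predicted-versus-reference comparison concrete, here is a minimal, self-contained sketch of that evaluation loop using a naive token-overlap F1 score. This is purely illustrative: the actual workflow uses LLM-based evaluators via the `RagEvaluatorPack` (shown later), and the example queries and responses below are made up.

```python
def token_f1(predicted: str, reference: str) -> float:
    """Naive token-overlap F1 between a predicted and a reference response."""
    pred_tokens = set(predicted.lower().split())
    ref_tokens = set(reference.lower().split())
    if not pred_tokens or not ref_tokens:
        return 0.0
    overlap = len(pred_tokens & ref_tokens)
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical (query, predicted response, reference response) triples.
examples = [
    ("Where did the author grow up?",
     "The author grew up in England.",
     "He grew up in England."),
    ("What did the author study?",
     "Painting and philosophy.",
     "He studied philosophy and painting."),
]

scores = [token_f1(pred, ref) for _, pred, ref in examples]
benchmark = sum(scores) / len(scores)
print(f"mean token-F1: {benchmark:.2f}")
```

The real evaluators score semantic correctness, faithfulness, and relevancy rather than token overlap, but the shape of the loop is the same: one score per example, aggregated into a benchmark.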

## How to add a llama-dataset

Similar to the process of adding a tool / loader / llama-pack, adding a llama-dataset also requires forking the llama-hub repo and making a Pull Request. However, for a llama-dataset, only its metadata is checked into the llama-hub repo. The actual dataset and its source files are instead checked into this repo. You will need to fork and clone that repo in addition to forking and cloning this one.

### Forking and cloning this repository

After forking this repo to your own GitHub account, the next step is to clone your fork. This repository is configured with Git LFS, so without special care you may end up downloading large files to your local machine. As such, when the time comes to clone your fork, please ensure that you set the environment variable `GIT_LFS_SKIP_SMUDGE` prior to calling the `git clone` command:

```sh
# for bash
GIT_LFS_SKIP_SMUDGE=1 git clone git@github.com:<your-github-user-name>/llama-datasets.git  # for ssh
GIT_LFS_SKIP_SMUDGE=1 git clone https://github.com/<your-github-user-name>/llama-datasets.git  # for https

# for windows, it's done in two commands
set GIT_LFS_SKIP_SMUDGE=1
git clone git@github.com:<your-github-user-name>/llama-datasets.git  # for ssh

set GIT_LFS_SKIP_SMUDGE=1
git clone https://github.com/<your-github-user-name>/llama-datasets.git  # for https
```

To submit a llama-dataset, follow the submission template notebook.

The high-level steps are:

1. Create a `LabelledRagDataset` (the initial class of llama-dataset made available on llama-hub)
2. Generate a baseline result with a RAG system of your own choosing on the `LabelledRagDataset`
3. Prepare the dataset's metadata (`card.json` and `README.md`)
4. Submit a Pull Request to the llama-hub repo to check in the metadata
5. Submit a Pull Request to this llama-datasets repository to check in the `LabelledRagDataset` and the source files

(NOTE: you can use the above process to submit any of our other supported types of llama-datasets, such as the `LabelledEvaluatorDataset`.)
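As a loose illustration of step 3, the snippet below sketches what a dataset's `card.json` metadata might contain. The exact schema is defined by llama-hub, so treat every field name here as an assumption for illustration, not the spec; consult the submission template notebook for the authoritative format.

```python
import json

# Hypothetical card.json contents. The real schema is defined by llama-hub,
# so these field names are illustrative assumptions only.
card = {
    "name": "Paul Graham Essay Dataset",
    "description": "A labelled RAG dataset based on a Paul Graham essay.",
    "numberObservations": 44,
    "containsExamplesByHumans": False,
    "containsExamplesByAi": True,
}

# Serialize it the way the file would be checked in.
card_json = json.dumps(card, indent=4)
print(card_json)
```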

## Usage Pattern

(NOTE: in what follows we present the pattern for producing a RAG benchmark with the `RagEvaluatorPack` over a `LabelledRagDataset`. However, there are also other types of llama-datasets, such as the `LabelledEvaluatorDataset`, with corresponding llama-packs for producing benchmarks on their respective tasks. They all follow a similar usage pattern. Please refer to their READMEs to learn more about each type of llama-dataset.)

As mentioned earlier, llama-datasets are mainly used for evaluating RAG systems. To perform the evaluation, the recommended usage pattern involves applying the `RagEvaluatorPack`. We recommend reading the docs for the "Evaluation" module for more information.

```python
from llama_index import VectorStoreIndex
from llama_index.llama_dataset import download_llama_dataset
from llama_index.llama_pack import download_llama_pack

# download and install dependencies for benchmark dataset
rag_dataset, documents = download_llama_dataset("PaulGrahamEssayDataset", "./data")

# build basic RAG system
index = VectorStoreIndex.from_documents(documents=documents)
query_engine = index.as_query_engine()

# evaluate using the RagEvaluatorPack
RagEvaluatorPack = download_llama_pack("RagEvaluatorPack", "./rag_evaluator_pack")
rag_evaluator_pack = RagEvaluatorPack(
    rag_dataset=rag_dataset, query_engine=query_engine
)
benchmark_df = rag_evaluator_pack.run()  # async arun() supported as well
```

Llama-datasets can also be downloaded directly using `llamaindex-cli`, which comes installed with the `llama-index` Python package:

```sh
llamaindex-cli download-llamadataset PaulGrahamEssayDataset --download-dir ./data
```

After downloading with `llamaindex-cli`, you can inspect the dataset and its source files (stored in a `/source_files` directory) and then load them into Python:

```python
from llama_index import SimpleDirectoryReader
from llama_index.llama_dataset import LabelledRagDataset

rag_dataset = LabelledRagDataset.from_json("./data/rag_dataset.json")
documents = SimpleDirectoryReader(input_dir="./data/source_files").load_data()
```
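If you just want a quick look at a downloaded dataset without loading llama-index, `rag_dataset.json` is plain JSON and can be inspected with the standard library. The tiny file written below is a made-up stand-in for a real download; its field names (`examples`, `query`, `reference_answer`) are assumptions about the serialization, so check an actual file before relying on them.

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory

# Made-up, minimal stand-in for a downloaded rag_dataset.json. The field
# names here are assumptions about the serialization, not a schema reference.
sample = {
    "examples": [
        {
            "query": "Where did the author grow up?",
            "reference_contexts": ["Before college the author worked on writing and programming."],
            "reference_answer": "The author grew up in England.",
        }
    ]
}

with TemporaryDirectory() as tmp:
    path = Path(tmp) / "rag_dataset.json"
    path.write_text(json.dumps(sample, indent=2))

    # Quick inspection without llama-index installed
    data = json.loads(path.read_text())
    for example in data["examples"]:
        print(example["query"], "->", example["reference_answer"])
```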
