# llama-datasets
This repo is a companion to the llama-hub repo and is meant to be the actual storage of the data files associated with a llama-dataset. Like tools, loaders, and llama-packs, llama-datasets are offered through llama-hub. You can view all of the available llama-hub artifacts conveniently on the llama-hub website.
The primary use of a llama-dataset is for evaluating the performance of a RAG system. In particular, it serves as a new test set (in traditional machine learning speak) for one to build a RAG system over, predict on, and subsequently perform evaluations comparing the predicted responses versus the reference responses.
## Contributing a llama-dataset

Similar to the process of adding a tool / loader / llama-pack, adding a llama-dataset also requires forking the llama-hub repo and making a Pull Request. However, for a llama-dataset, only its metadata is checked into the llama-hub repo. The actual dataset and its source files are instead checked into this particular repo. You will need to fork and clone that repo in addition to forking and cloning this one.
After forking this repo to your own GitHub account, the next step is to clone from your own fork. This repository is an LFS-configured repo, so without special care you may end up downloading large files to your local machine. As such, when the time comes to clone your fork, please ensure that you set the environment variable GIT_LFS_SKIP_SMUDGE prior to calling the git clone command:

```
# for bash
GIT_LFS_SKIP_SMUDGE=1 git clone git@github.com:<your-github-user-name>/llama-datasets.git    # for ssh
GIT_LFS_SKIP_SMUDGE=1 git clone https://github.com/<your-github-user-name>/llama-datasets.git  # for https

# for windows, it's done in two commands
set GIT_LFS_SKIP_SMUDGE=1
git clone git@github.com:<your-github-user-name>/llama-datasets.git    # for ssh

set GIT_LFS_SKIP_SMUDGE=1
git clone https://github.com/<your-github-user-name>/llama-datasets.git  # for https
```
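With the smudge filter skipped, any LFS-tracked data files are checked out as small pointer files rather than their full contents. If you later want the real contents of a particular dataset's files on your machine, you can fetch them explicitly with `git lfs pull` (optionally scoped to a single dataset's directory via its `--include` flag).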
To submit a llama-dataset, follow the submission template notebook.
The high-level steps are:
1. Create a `LabelledRagDataset` (the initial class of llama-dataset made available on llama-hub); a generation sketch follows this list
2. Generate a baseline result with a RAG system of your own choosing on the `LabelledRagDataset`
3. Prepare the dataset's metadata (`card.json` and `README.md`)
4. Submit a Pull Request to the llama-hub repo to check in the metadata
5. Submit a Pull Request to this llama-datasets repository to check in the `LabelledRagDataset` and the source files
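For step 1, if you are starting from raw documents, one way to draft a `LabelledRagDataset` is with the `RagDatasetGenerator` that ships with llama-index. Below is a minimal sketch, assuming the v0.9-style imports used elsewhere in this README and an LLM API key configured in your environment; the paths and `num_questions_per_chunk` value are illustrative, not prescribed:

```python
from llama_index import SimpleDirectoryReader
from llama_index.llama_dataset.generator import RagDatasetGenerator

# load the source files you plan to submit alongside the dataset
documents = SimpleDirectoryReader(input_dir="./source_files").load_data()

# draft query / reference-answer pairs over the documents
dataset_generator = RagDatasetGenerator.from_documents(
    documents=documents,
    num_questions_per_chunk=2,  # illustrative; tune for your corpus
)
rag_dataset = dataset_generator.generate_dataset_from_nodes()  # -> LabelledRagDataset

# serialize for submission; this JSON is what gets checked into this repo
rag_dataset.save_json("rag_dataset.json")
```

You may also curate or hand-edit the generated examples before submitting; the generator is simply a starting point.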
(NOTE: you can use the above process for submitting any of our other supported types of llama-datasets, such as the `LabelledEvaluatorDataset`.)
## Using a llama-dataset

(NOTE: in what follows we present the pattern for producing a RAG benchmark with the `RagEvaluatorPack` over a `LabelledRagDataset`. However, there are also other types of llama-datasets, such as the `LabelledEvaluatorDataset`, with corresponding llama-packs for producing benchmarks on their respective tasks. They all follow a similar usage pattern. Please refer to the READMEs to learn more about each type of llama-dataset.)
As mentioned earlier, llama-datasets are mainly used for evaluating RAG systems. To perform the evaluation, the recommended usage pattern involves the application of the `RagEvaluatorPack`. We recommend reading the docs for the "Evaluation" module for more information.
```python
from llama_index.llama_dataset import download_llama_dataset
from llama_index.llama_pack import download_llama_pack
from llama_index import VectorStoreIndex

# download and install dependencies for benchmark dataset
rag_dataset, documents = download_llama_dataset("PaulGrahamEssayDataset", "./data")

# build basic RAG system
index = VectorStoreIndex.from_documents(documents=documents)
query_engine = index.as_query_engine()

# evaluate using the RagEvaluatorPack
RagEvaluatorPack = download_llama_pack("RagEvaluatorPack", "./rag_evaluator_pack")
rag_evaluator_pack = RagEvaluatorPack(rag_dataset=rag_dataset, query_engine=query_engine)
benchmark_df = rag_evaluator_pack.run()  # async arun() supported as well
```
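If all goes well, `benchmark_df` is a pandas DataFrame summarizing the mean evaluation scores (e.g., correctness, relevancy, and faithfulness) computed over the dataset's examples; this is the kind of baseline result called for in step 2 of the submission process above.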
Llama-datasets can also be downloaded directly using `llamaindex-cli`, which comes installed with the `llama-index` Python package:
```
llamaindex-cli download-llamadataset PaulGrahamEssayDataset --download-dir ./data
```
After downloading with `llamaindex-cli`, you can inspect the dataset and its source files (stored in a `source_files` directory) and then load them into Python:
```python
from llama_index import SimpleDirectoryReader
from llama_index.llama_dataset import LabelledRagDataset

rag_dataset = LabelledRagDataset.from_json("./data/rag_dataset.json")
documents = SimpleDirectoryReader(input_dir="./data/source_files").load_data()
```
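Once loaded, you can poke at the examples directly. A small sketch, assuming the attribute names defined on `LabelledRagDataset` (an `examples` list with per-example `query` and `reference_answer` fields):

```python
# summarize the dataset as a pandas DataFrame
print(rag_dataset.to_pandas().head())

# inspect an individual example's query and reference answer
example = rag_dataset.examples[0]
print(example.query)
print(example.reference_answer)
```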