# llama-datasets
This repo is a companion to the llama-hub repo and is meant to be the actual storage of the data files associated with a llama-dataset. Like tools, loaders, and llama-packs, llama-datasets are offered through llama-hub. You can view all of the available llama-hub artifacts conveniently on the llama-hub website.
The primary use of a llama-dataset is for evaluating the performance of a RAG system. In particular, it serves as a new test set (in traditional machine learning speak) for one to build a RAG system over, predict on, and subsequently perform evaluations comparing the predicted responses versus the reference responses.
## Contributing a llama-dataset

Similar to the process of adding a tool / loader / llama-pack, adding a llama-dataset also requires forking the llama-hub repo and making a Pull Request. However, for a llama-dataset, only its metadata is checked into the llama-hub repo. The actual dataset and its source files are instead checked into this particular repo. You will need to fork and clone that repo in addition to forking and cloning this one.
After forking this repo to your own GitHub account, the next step is to clone from your own fork. This repository is an LFS-configured repo, so without special care you may end up downloading large files to your local machine. As such, when the time comes to clone your fork, please ensure that you set the environment variable GIT_LFS_SKIP_SMUDGE prior to calling the git clone command:

```
# for bash
GIT_LFS_SKIP_SMUDGE=1 git clone git@github.com:<your-github-user-name>/llama-datasets.git    # for ssh
GIT_LFS_SKIP_SMUDGE=1 git clone https://github.com/<your-github-user-name>/llama-datasets.git  # for https

# for windows, it's done in two commands
set GIT_LFS_SKIP_SMUDGE=1
git clone git@github.com:<your-github-user-name>/llama-datasets.git    # for ssh

set GIT_LFS_SKIP_SMUDGE=1
git clone https://github.com/<your-github-user-name>/llama-datasets.git  # for https
```
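With the smudge filter skipped, any LFS-tracked data files are checked out as small pointer files rather than their full contents. If you later want the real contents of a particular dataset's files on your machine, you can fetch them explicitly with `git lfs pull` (optionally scoped to a single dataset's directory via its `--include` flag).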
To submit a llama-dataset, follow the submission template notebook.
The high-level steps are:
1. Create a `LabelledRagDataset` (the initial class of llama-dataset made available on llama-hub); a generation sketch follows this list
2. Generate a baseline result with a RAG system of your own choosing on the `LabelledRagDataset`
3. Prepare the dataset's metadata (`card.json` and `README.md`)
4. Submit a Pull Request to the llama-hub repo to check in the metadata
5. Submit a Pull Request to this llama-datasets repository to check in the `LabelledRagDataset` and the source files
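For step 1, if you are starting from raw documents, one way to draft a `LabelledRagDataset` is with the `RagDatasetGenerator` that ships with llama-index. Below is a minimal sketch, assuming the v0.9-style imports used elsewhere in this README and an LLM API key configured in your environment; the paths and `num_questions_per_chunk` value are illustrative, not prescribed:

```python
from llama_index import SimpleDirectoryReader
from llama_index.llama_dataset.generator import RagDatasetGenerator

# load the source files you plan to submit alongside the dataset
documents = SimpleDirectoryReader(input_dir="./source_files").load_data()

# draft query / reference-answer pairs over the documents
dataset_generator = RagDatasetGenerator.from_documents(
    documents=documents,
    num_questions_per_chunk=2,  # illustrative; tune for your corpus
)
rag_dataset = dataset_generator.generate_dataset_from_nodes()  # -> LabelledRagDataset

# serialize for submission; this JSON is what gets checked into this repo
rag_dataset.save_json("rag_dataset.json")
```

You may also curate or hand-edit the generated examples before submitting; the generator is simply a starting point.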
(NOTE: you can use the above process for submitting any of our other supported types of llama-datasets, such as the `LabelledEvaluatorDataset`.)
## Using a llama-dataset

(NOTE: in what follows we present the pattern for producing a RAG benchmark with the `RagEvaluatorPack` over a `LabelledRagDataset`. However, there are also other types of llama-datasets, such as the `LabelledEvaluatorDataset`, with corresponding llama-packs for producing benchmarks on their respective tasks. They all follow a similar usage pattern. Please refer to the READMEs to learn more about each type of llama-dataset.)
As mentioned earlier, llama-datasets are mainly used for evaluating RAG systems. To perform the evaluation, the recommended usage pattern involves the application of the `RagEvaluatorPack`. We recommend reading the docs for the "Evaluation" module for more information.
```python
from llama_index.llama_dataset import download_llama_dataset
from llama_index.llama_pack import download_llama_pack
from llama_index import VectorStoreIndex

# download and install dependencies for benchmark dataset
rag_dataset, documents = download_llama_dataset("PaulGrahamEssayDataset", "./data")

# build basic RAG system
index = VectorStoreIndex.from_documents(documents=documents)
query_engine = index.as_query_engine()

# evaluate using the RagEvaluatorPack
RagEvaluatorPack = download_llama_pack("RagEvaluatorPack", "./rag_evaluator_pack")
rag_evaluator_pack = RagEvaluatorPack(rag_dataset=rag_dataset, query_engine=query_engine)
benchmark_df = rag_evaluator_pack.run()  # async arun() supported as well
```
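If all goes well, `benchmark_df` is a pandas DataFrame summarizing the mean evaluation scores (e.g., correctness, relevancy, and faithfulness) computed over the dataset's examples; this is the kind of baseline result called for in step 2 of the submission process above.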
Llama-datasets can also be downloaded directly using `llamaindex-cli`, which comes installed with the `llama-index` Python package:
```
llamaindex-cli download-llamadataset PaulGrahamEssayDataset --download-dir ./data
```
After downloading with `llamaindex-cli`, you can inspect the dataset and its source files (stored in a `source_files` directory) and then load them into Python:
```python
from llama_index import SimpleDirectoryReader
from llama_index.llama_dataset import LabelledRagDataset

rag_dataset = LabelledRagDataset.from_json("./data/rag_dataset.json")
documents = SimpleDirectoryReader(input_dir="./data/source_files").load_data()
```
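Once loaded, you can poke at the examples directly. A small sketch, assuming the attribute names defined on `LabelledRagDataset` (an `examples` list with per-example `query` and `reference_answer` fields):

```python
# summarize the dataset as a pandas DataFrame
print(rag_dataset.to_pandas().head())

# inspect an individual example's query and reference answer
example = rag_dataset.examples[0]
print(example.query)
print(example.reference_answer)
```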