rohitgr7/embedchain (Public)

forked from mem0ai/mem0

Framework to easily create LLM powered bots over any dataset.

twitter.com/taranjeetio/status/1671539269775634437

License

Apache-2.0 license

embedchain

embedchain is a framework to easily create LLM powered bots over any dataset. If you want a JavaScript version, check out embedchain-js.

Latest Updates

Introduced a new app type called OpenSourceApp. It uses gpt4all as the LLM and the Sentence Transformers model all-MiniLM-L6-v2 as the embedding model. If you use this app, you don't have to pay for anything.

What is embedchain?

Embedchain abstracts the entire process of loading a dataset, chunking it, creating embeddings and then storing in a vector database.

You can add one or more datasets using the .add and .add_local functions, and then use the .query function to find an answer from the added datasets.

Suppose you want to create a Naval Ravikant bot from one YouTube video, one book as a PDF, two of his blog posts, and a question and answer pair that you supply. All you need to do is add the links to the video, PDF and blog posts, plus the QnA pair, and embedchain will create a bot for you.

from embedchain import App

naval_chat_bot = App()

# Embed Online Resources
naval_chat_bot.add("youtube_video", "https://www.youtube.com/watch?v=3qHkcs3kG44")
naval_chat_bot.add("pdf_file", "https://navalmanack.s3.amazonaws.com/Eric-Jorgenson_The-Almanack-of-Naval-Ravikant_Final.pdf")
naval_chat_bot.add("web_page", "https://nav.al/feedback")
naval_chat_bot.add("web_page", "https://nav.al/agi")

# Embed Local Resources
naval_chat_bot.add_local("qna_pair", ("Who is Naval Ravikant?", "Naval Ravikant is an Indian-American entrepreneur and investor."))

naval_chat_bot.query("What unique capacity does Naval argue humans possess when it comes to understanding explanations or concepts?")
# answer: Naval argues that humans possess the unique capacity to understand explanations or concepts to the maximum extent possible in this physical reality.

Getting Started

Installation

First make sure that you have the package installed. If not, install it using pip:

pip install embedchain

Usage

Creating a chatbot involves three steps:

import an app class and create an instance
add your datasets
query the datasets and get answers

App Types

We have two types of App.

1. App (uses OpenAI models, paid)

from embedchain import App

naval_chat_bot = App()

App uses OpenAI's models, which are paid. You will be charged for both embedding model usage and LLM usage.
App uses OpenAI's embedding model to create embeddings for chunks and the ChatGPT API as the LLM to get an answer given the relevant docs. Make sure that you have an OpenAI account and an API key. If you don't have an API key, you can create one by visiting this link.
Once you have the API key, set it in an environment variable called OPENAI_API_KEY:

import os

os.environ["OPENAI_API_KEY"] = "sk-xxxx"
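Because App needs the key at run time, it can help to fail early with a clear message when the variable is missing. The following is a minimal sketch and not part of embedchain: the helper name `require_openai_key` is hypothetical, and only the variable name `OPENAI_API_KEY` comes from the text above.

```python
import os


def require_openai_key(env=None):
    """Return the OpenAI API key, or raise a clear error if it is not set.

    Hypothetical helper, not an embedchain API. `env` defaults to
    os.environ; passing a plain dict makes the helper easy to test.
    """
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; set it before creating App()")
    return key
```

Calling `require_openai_key()` before `App()` turns a missing key into an immediate, readable error instead of a failure on the first request.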

2. OpenSourceApp (uses opensource models, free)

from embedchain import OpenSourceApp

naval_chat_bot = OpenSourceApp()

OpenSourceApp uses open source embedding and LLM models. It uses all-MiniLM-L6-v2 from the Sentence Transformers library as the embedding model and gpt4all as the LLM.
There is no need to set up any API keys. You just need to install the embedchain package, and these models will be installed automatically.
Once you have imported and instantiated the app, every functionality from here onwards is the same for either type of app.

Add data set and query

This step assumes that you have already created an app instance using either App or OpenSourceApp. We call our app instance naval_chat_bot.
Now use the .add function to add any dataset.

# naval_chat_bot = App() or
# naval_chat_bot = OpenSourceApp()

# Embed Online Resources
naval_chat_bot.add("youtube_video", "https://www.youtube.com/watch?v=3qHkcs3kG44")
naval_chat_bot.add("pdf_file", "https://navalmanack.s3.amazonaws.com/Eric-Jorgenson_The-Almanack-of-Naval-Ravikant_Final.pdf")
naval_chat_bot.add("web_page", "https://nav.al/feedback")
naval_chat_bot.add("web_page", "https://nav.al/agi")

# Embed Local Resources
naval_chat_bot.add_local("qna_pair", ("Who is Naval Ravikant?", "Naval Ravikant is an Indian-American entrepreneur and investor."))

If your script or app already uses the name App for something else, you can alias the import:

from embedchain import App as EmbedChainApp
from embedchain import OpenSourceApp as EmbedChainOSApp

# or
from embedchain import App as ECApp
from embedchain import OpenSourceApp as ECOSApp

Now your app is created. You can use the .query function to get the answer for any query.

print(naval_chat_bot.query("What unique capacity does Naval argue humans possess when it comes to understanding explanations or concepts?"))
# answer: Naval argues that humans possess the unique capacity to understand explanations or concepts to the maximum extent possible in this physical reality.

Format supported

We support the following formats:

Youtube Video

To add a YouTube video to your app, pass youtube_video as the data_type (the first argument to .add). E.g.:

app.add('youtube_video', 'a_valid_youtube_url_here')

PDF File

To add a PDF file, use pdf_file as the data_type. E.g.:

app.add('pdf_file', 'a_valid_url_where_pdf_file_can_be_accessed')

Note that password-protected PDFs are not supported.

Web Page

To add a web page, use web_page as the data_type. E.g.:

app.add('web_page', 'a_valid_web_page_url')

Text

To supply your own text, use text as the data_type and pass a string. The text is not processed, which makes this type very versatile. E.g.:

app.add_local('text', 'Seek wealth, not money or status. Wealth is having assets that earn while you sleep. Money is how we transfer time and wealth. Status is your place in the social hierarchy.')

Note: This type is not used in the examples above because in most cases you will supply a whole paragraph or file, which would not fit there.

QnA Pair

To supply your own QnA pair, use qna_pair as the data_type and pass a tuple. E.g.:

app.add_local('qna_pair', ("Question", "Answer"))

Reusing a Vector DB

The default behavior is to create a persistent vector DB in the directory ./db. You can split your application into two Python scripts: one to create a local vector DB and the other to reuse this local persistent vector DB. This is useful when you want to index hundreds of documents and implement a chat interface separately.

Create a local index:

from embedchain import App

naval_chat_bot = App()
naval_chat_bot.add("youtube_video", "https://www.youtube.com/watch?v=3qHkcs3kG44")
naval_chat_bot.add("pdf_file", "https://navalmanack.s3.amazonaws.com/Eric-Jorgenson_The-Almanack-of-Naval-Ravikant_Final.pdf")

You can reuse the local index with the same code, but without adding new documents:

from embedchain import App

naval_chat_bot = App()
print(naval_chat_bot.query("What unique capacity does Naval argue humans possess when it comes to understanding explanations or concepts?"))
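When the indexing and chat scripts are separate, the chat script may want to confirm that an index has been built before it answers queries. The following is a minimal sketch under stated assumptions: the helper `index_exists` is hypothetical and not an embedchain API, the default path ./db comes from the text above, and treating "directory exists and is not empty" as "index is present" is an assumption about how the vector DB persists its files.

```python
from pathlib import Path


def index_exists(db_dir="./db"):
    """Return True if db_dir is a directory containing at least one entry.

    Hypothetical helper: a non-empty directory is only a heuristic for
    "an index was created here"; it does not validate the contents.
    """
    path = Path(db_dir)
    return path.is_dir() and any(path.iterdir())
```

A chat script could call `index_exists()` at startup and print a hint to run the indexing script first when it returns False.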

More Formats coming soon

If you want to add any other format, please create an issue and we will add it to the list of supported formats.

Testing

Before you consume valuable tokens, you should make sure that the embedding works and that the query retrieves the correct document from the database.

For this you can use the dry_run method.

Following the example above, add this to your script:

print(naval_chat_bot.dry_run('Can you tell me who Naval Ravikant is?'))

'''
Use the following pieces of context to answer the query at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.
        Q: Who is Naval Ravikant?
A: Naval Ravikant is an Indian-American entrepreneur and investor.
        Query: Can you tell me who Naval Ravikant is?
        Helpful Answer:
'''

This confirms that the embedding works as expected: it returns the right document even though the question is phrased slightly differently. No prompt tokens have been consumed.

The dry run still consumes tokens to embed your query, but that is only about 1/15 of the prompt.
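The prompt printed by the dry run can be approximated as a fixed template with two slots, one for the retrieved context and one for the query. The sketch below is an illustration only: the wording is copied from the dry-run output shown above, while the names `PROMPT_TEMPLATE` and `build_prompt` and the exact whitespace are assumptions, not embedchain internals.

```python
# Wording taken from the dry-run output; whitespace is simplified (assumption).
PROMPT_TEMPLATE = (
    "Use the following pieces of context to answer the query at the end. "
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer.\n"
    "{context}\n"
    "Query: {query}\n"
    "Helpful Answer:"
)


def build_prompt(context, query):
    """Fill the template with the retrieved context and the user's query."""
    return PROMPT_TEMPLATE.format(context=context, query=query)


prompt = build_prompt(
    "Q: Who is Naval Ravikant?\n"
    "A: Naval Ravikant is an Indian-American entrepreneur and investor.",
    "Can you tell me who Naval Ravikant is?",
)
print(prompt)
```

Inspecting the assembled prompt this way shows exactly which context would be sent to the LLM, which is the check the dry run is meant to provide.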

How does it work?

Creating a chatbot over any dataset requires the following steps:

load the data
create meaningful chunks
create embeddings for each chunk
store the chunks in vector database
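To make the chunking step concrete, here is a minimal sketch of fixed-size character chunking with overlap. It is a simplified stand-in, not the splitter embedchain uses (the Tech Stack section says chunking is done through Langchain); the function name `chunk_text` and the default sizes are hypothetical.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into character windows of chunk_size that overlap by `overlap`.

    Illustrative only: real splitters usually respect sentence or
    paragraph boundaries instead of cutting at fixed offsets.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop once the remaining tail is already covered by the previous chunk.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


chunks = chunk_text("abcdefghij", chunk_size=4, overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

The overlap keeps a sentence that straddles a boundary at least partly visible in both neighbouring chunks, at the cost of storing some text twice.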

Whenever a user asks a query, the following process finds the answer:

create the embedding for query
find similar documents for this query from vector database
pass similar documents as context to LLM to get the final answer.
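The first two query steps can be sketched with a toy embedding and cosine similarity. This is a sketch under stated assumptions, not embedchain's implementation: a bag-of-words count vector stands in for a real embedding model, a Python list stands in for the vector database, and the names `embed`, `cosine` and `retrieve` are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, chunks, top_k=1):
    """Return the top_k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]


chunks = [
    "Naval Ravikant is an Indian-American entrepreneur and investor.",
    "Wealth is having assets that earn while you sleep.",
]
best = retrieve("Who is Naval Ravikant?", chunks)
print(best[0])
```

The retrieved chunk would then be passed to the LLM as context in the third step; a real embedding model also matches paraphrases that share no words with the stored text, which this word-count version cannot do.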

The process of loading the dataset and then querying it involves multiple steps, and each step has its own nuances:

How should I chunk the data? What is a meaningful chunk size?
How should I create embeddings for each chunk? Which embedding model should I use?
How should I store the chunks in vector database? Which vector database should I use?
Should I store meta data along with the embeddings?
How should I find similar documents for a query? Which ranking model should I use?

These questions may be trivial for some, but for many of us, answering them well takes research, experimentation and time.

embedchain is a framework which takes care of all these nuances and provides a simple interface to create bots over any dataset.

In the first release, we are making it easier for anyone to get a chatbot over any dataset up and running in less than a minute. All you need to do is create an app instance, add the datasets using the .add function and then use the .query function to get the relevant answer.

Tech Stack

embedchain is built on the following stack:

Langchain as an LLM framework to load, chunk and index data
OpenAI's Ada embedding model to create embeddings
OpenAI's ChatGPT API as LLM to get answers given the context
Chroma as the vector database to store embeddings
gpt4all as an open source LLM
sentence-transformers as open source embedding model

Author

Taranjeet Singh (@taranjeetio)

Citation

If you utilize this repository, please consider citing it with:

@misc{embedchain,
  author = {Taranjeet Singh},
  title = {Embedchain: Framework to easily create LLM powered bots over any dataset},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/embedchain/embedchain}},
}
