
phidata

phidata is a framework for building AI Assistants with long-term memory, contextual knowledge, and the ability to take actions using function calling. It helps turn general-purpose LLMs into specialized assistants tailored to your use case by extending their capabilities using memory, knowledge, and tools.

  • Memory: Stores chat history in a database and enables LLMs to have long-term conversations.
  • Knowledge: Stores information in a vector database and provides LLMs with business context. (Here we will use LanceDB.)
  • Tools: Enable LLMs to take actions like pulling data from an API, sending emails, or querying a database.

Example

Memory & knowledge make LLMs smarter while tools make them autonomous.

LanceDB is a vector database, and its integration into phidata makes it easy for us to provide a knowledge base to LLMs. It enables us to store information as embeddings and, given a query, search for the results most similar to it.

What is a Knowledge Base?

A knowledge base is a database of information that the assistant can search to improve its responses. This information is stored in a vector database and provides LLMs with business context, which makes them respond in a context-aware manner.

While any type of storage can act as a knowledge base, vector databases offer the best solution for retrieving relevant results from dense information quickly.

Let's see how using LanceDB inside phidata helps make an LLM more useful:

Prerequisites: install and import necessary dependencies

Create a virtual environment

  1. Install the virtualenv package
    pip install virtualenv
  2. Create a directory for your project, go into it, and create a virtual environment inside it.
    mkdir phi
    cd phi
    python -m venv phidata_

Activating the virtual environment

  1. From inside the project directory, run the following command to activate the virtual environment (this is the Windows form; on Linux/macOS use source phidata_/bin/activate).
    phidata_/Scripts/activate

Install the following packages in the virtual environment

pip install lancedb phidata youtube_transcript_api openai ollama numpy pandas

Create Python files and import the necessary libraries

You need to create two files - transcript.py, and either ollama_assistant.py or openai_assistant.py

# openai_assistant.py
import os, openai
from rich.prompt import Prompt
from phi.assistant import Assistant
from phi.knowledge.text import TextKnowledgeBase
from phi.vectordb.lancedb import LanceDb
from phi.llm.openai import OpenAIChat
from phi.embedder.openai import OpenAIEmbedder
from transcript import extract_transcript

if "OPENAI_API_KEY" not in os.environ:
    # OR set the key here as a variable
    openai.api_key = "sk-..."

# The code below creates a file "transcript.txt" in the directory; the txt file will be used below
youtube_url = "https://www.youtube.com/watch?v=Xs33-Gzl8Mo"
segment_duration = 20
transcript_text, dict_transcript = extract_transcript(youtube_url, segment_duration)
# ollama_assistant.py
from rich.prompt import Prompt
from phi.assistant import Assistant
from phi.knowledge.text import TextKnowledgeBase
from phi.vectordb.lancedb import LanceDb
from phi.llm.ollama import Ollama
from phi.embedder.ollama import OllamaEmbedder
from transcript import extract_transcript

# The code below creates a file "transcript.txt" in the directory; the txt file will be used below
youtube_url = "https://www.youtube.com/watch?v=Xs33-Gzl8Mo"
segment_duration = 20
transcript_text, dict_transcript = extract_transcript(youtube_url, segment_duration)
# transcript.py
from youtube_transcript_api import YouTubeTranscriptApi
import re

def smodify(seconds):
    # Format a number of seconds as HH:MM:SS
    hours, remainder = divmod(seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{int(hours):02}:{int(minutes):02}:{int(seconds):02}"

def extract_transcript(youtube_url, segment_duration):
    # Extract the video ID from the URL
    video_id = re.search(r'(?<=v=)[\w-]+', youtube_url)
    if not video_id:
        video_id = re.search(r'(?<=be/)[\w-]+', youtube_url)
    if not video_id:
        return None
    video_id = video_id.group(0)

    # Attempt to fetch the transcript
    try:
        # Try to get the official transcript
        transcript = YouTubeTranscriptApi.get_transcript(video_id, languages=['en'])
    except Exception:
        # If no official transcript is found, try to get an auto-generated transcript
        try:
            transcript_list = YouTubeTranscriptApi.list_transcripts(video_id)
            for transcript in transcript_list:
                transcript = transcript.translate('en').fetch()
        except Exception:
            return None

    # Format the transcript into segment_duration-second chunks
    transcript_text, dict_transcript = format_transcript(transcript, segment_duration)

    # Open the file in write mode, which creates it if it doesn't exist
    with open("transcript.txt", "w", encoding="utf-8") as file:
        file.write(transcript_text)

    return transcript_text, dict_transcript

def format_transcript(transcript, segment_duration):
    chunked_transcript = []
    chunk_dict = []
    current_chunk = []
    current_time = 0
    start_time_chunk = 0  # To track the start time of the current chunk

    for segment in transcript:
        start_time = segment['start']
        end_time_x = start_time + segment['duration']
        text = segment['text']

        # Add text to the current chunk
        current_chunk.append(text)

        # Update the elapsed time of the current chunk,
        # given by segment['start'] - start_time_chunk
        if current_chunk:
            current_time = start_time - start_time_chunk

        # If the current chunk duration reaches or exceeds segment_duration, save the chunk
        if current_time >= segment_duration:
            # Use the start time of the first segment in the current chunk as the timestamp
            chunked_transcript.append(f"[{smodify(start_time_chunk)} to {smodify(end_time_x)}] " + " ".join(current_chunk))
            current_chunk = re.sub(r'[\xa0\n]', lambda x: '' if x.group() == '\xa0' else ' ', "\n".join(current_chunk))
            chunk_dict.append({"timestamp": f"[{smodify(start_time_chunk)} to {smodify(end_time_x)}]", "text": "".join(current_chunk)})
            current_chunk = []  # Reset the chunk
            start_time_chunk = start_time + segment['duration']  # Update the start time for the next chunk
            current_time = 0  # Reset current time

    # Add any remaining text in the last chunk
    if current_chunk:
        chunked_transcript.append(f"[{smodify(start_time_chunk)} to {smodify(end_time_x)}] " + " ".join(current_chunk))
        current_chunk = re.sub(r'[\xa0\n]', lambda x: '' if x.group() == '\xa0' else ' ', "\n".join(current_chunk))
        chunk_dict.append({"timestamp": f"[{smodify(start_time_chunk)} to {smodify(end_time_x)}]", "text": "".join(current_chunk)})

    return "\n\n".join(chunked_transcript), chunk_dict

Warning

If creating the Ollama assistant, download and install Ollama from here, then run the Ollama instance in the background. Also download the required models using ollama pull <model-name>. Check out the models here.

Run the following command to deactivate the virtual environment if needed

deactivate

Step 1 - Create a Knowledge Base for AI Assistant using LanceDB

# Create the knowledge base with OpenAIEmbedder in LanceDB
knowledge_base = TextKnowledgeBase(
    path="transcript.txt",
    vector_db=LanceDb(
        embedder=OpenAIEmbedder(api_key=openai.api_key),
        table_name="transcript_documents",
        uri="./t3mp/.lancedb",
    ),
    num_documents=10,
)
# Create the knowledge base with OllamaEmbedder in LanceDB
knowledge_base = TextKnowledgeBase(
    path="transcript.txt",
    vector_db=LanceDb(
        embedder=OllamaEmbedder(model="nomic-embed-text", dimensions=768),
        table_name="transcript_documents",
        uri="./t2mp/.lancedb",
    ),
    num_documents=10,
)

Check out the list of embedders supported by phidata and their usage here.

Here we have used TextKnowledgeBase, which loads text/docx files into the knowledge base.

Let's see all the parameters that TextKnowledgeBase takes -

| Name | Type | Purpose | Default |
|---|---|---|---|
| path | Union[str, Path] | Path to text file(s). It can point to a single text file or a directory of text files. | provided by user |
| formats | List[str] | File formats accepted by this knowledge base. | [".txt"] |
| vector_db | VectorDb | Vector database for the knowledge base. phidata provides a wrapper around many vector DBs; you can import it like this: from phi.vectordb.lancedb import LanceDb | provided by user |
| num_documents | int | Number of results (documents/vectors) that a vector search should return. | 5 |
| reader | TextReader | phidata provides many types of reader objects that read data, clean it, create chunks of data, encapsulate each chunk inside an object of the Document class, and return List[Document]. | TextReader() |
| optimize_on | int | The number of documents on which to optimize the vector database; used to create an index. | 1000 |
What is the Document class?

Before storing the data in the vector DB, we need to split it into smaller chunks; embeddings are created for these chunks, and the embeddings along with the chunks are stored in the vector DB. When the user queries the vector DB, some of these embeddings are returned as results based on their semantic similarity to the query.

When the user queries the vector DB, the query is converted into an embedding, and a nearest-neighbor search is performed over the stored embeddings; this returns the embeddings that correspond to the most semantically similar chunks (parts of our data) in the vector DB.
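The nearest-neighbor step described above can be sketched in plain numpy. The 4-dimensional vectors below are made-up stand-ins for real embeddings (a real embedder such as OpenAIEmbedder produces much larger vectors), and cosine_top_k is an illustrative helper, not a phidata or LanceDB API:

```python
import numpy as np

# Toy "embeddings" for three stored chunks (hypothetical values)
chunk_vectors = np.array([
    [0.9, 0.1, 0.0, 0.1],   # chunk 0: e.g. about pricing
    [0.1, 0.8, 0.1, 0.2],   # chunk 1: e.g. about installation
    [0.0, 0.2, 0.9, 0.1],   # chunk 2: e.g. about troubleshooting
])

def cosine_top_k(query_vec, vectors, k=2):
    # Normalize, then rank stored vectors by cosine similarity to the query
    q = query_vec / np.linalg.norm(query_vec)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = v @ q
    return np.argsort(-sims)[:k]

# A query embedding close to chunk 1 retrieves chunk 1 first
query = np.array([0.2, 0.9, 0.0, 0.1])
print(cosine_top_k(query, chunk_vectors))  # chunk 1 ranked first
```

This is the same idea behind the num_documents parameter: the search returns the k most similar chunks, which are then passed to the LLM as context.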

Here, Document is a class in phidata. Since phidata can create and manage embeddings for us, it splits our data into smaller chunks (as expected), but it does not create embeddings on the raw chunks directly. Instead, it takes each chunk and encapsulates it inside an object of the Document class, along with various other metadata related to the chunk. Embeddings are then created for these Document objects and stored in the vector DB.

class Document(BaseModel):
    """Model for managing a document"""

    content: str  # <--- here the chunk's data is stored
    id: Optional[str] = None
    name: Optional[str] = None
    meta_data: Dict[str, Any] = {}
    embedder: Optional[Embedder] = None
    embedding: Optional[List[float]] = None
    usage: Optional[Dict[str, Any]] = None

However, using phidata you can load many other types of data into the knowledge base (other than text). Check out the phidata Knowledge Base docs for more information.

Let's dig deeper into the vector_db parameter and see what parameters LanceDb takes -

| Name | Type | Purpose | Default |
|---|---|---|---|
| embedder | Embedder | phidata provides many embedders that abstract the interaction with embedding APIs and use them to generate embeddings. Check out other embedders here. | OpenAIEmbedder |
| distance | List[str] | The distance metric used to calculate the similarity between vectors, which directly impacts search results and performance in vector databases. | Distance.cosine |
| connection | lancedb.db.LanceTable | The LanceTable can be accessed through .connection. You can connect to an existing LanceDB table, created outside of phidata, and use it. If not provided, a new table is created using the table_name parameter and added to connection. | None |
| uri | str | Specifies the directory location of the LanceDB database and establishes a connection that can be used to interact with it. | "/tmp/lancedb" |
| table_name | str | If connection is not provided, initializes and connects to a new LanceDB table with the specified (or default) name in the database at uri. | "phi" |
| nprobes | int | The number of partitions the search algorithm examines to find the nearest neighbors of a query vector. Higher values yield better recall (more likely to find vectors if they exist) at the expense of latency. | 20 |

Note

Since we have only just initialized the knowledge base, the vector DB table that corresponds to it is not yet populated with our data. It will be populated in Step 3, once we perform the load operation.

You can check the state of the LanceDB table using knowledge_base.vector_db.connection.to_pandas()

Now that the knowledge base is initialized, we can go to step 2.

Step 2 - Create an assistant with our choice of LLM and a reference to the knowledge base

# Define an assistant with the gpt-4o-mini LLM and a reference to the knowledge base created above
assistant = Assistant(
    llm=OpenAIChat(model="gpt-4o-mini", max_tokens=1000, temperature=0.3, api_key=openai.api_key),
    description="""You are an Expert in explaining youtube video transcripts. You are a bot that takes the transcript of a video and answers questions based on it.
    This is the transcript for the above timestamp: {relevant_document}
    The user input is: {user_input}
    Generate highlights only when asked.
    When asked to generate highlights from the video, understand the context for each timestamp and create key highlight points, answering in the following way -
    [timestamp] - highlight 1
    [timestamp] - highlight 2
    ... and so on
    Your task is to understand the user question and provide an answer using the provided contexts. Your answers are correct, high-quality, and written by a domain expert. If the provided context does not contain the answer, simply state, 'The provided context does not have the answer.'""",
    knowledge_base=knowledge_base,
    add_references_to_prompt=True,
)
# Define an assistant with the llama3.1 LLM and a reference to the knowledge base created above
assistant = Assistant(
    llm=Ollama(model="llama3.1"),
    description="""You are an Expert in explaining youtube video transcripts. You are a bot that takes the transcript of a video and answers questions based on it.
    This is the transcript for the above timestamp: {relevant_document}
    The user input is: {user_input}
    Generate highlights only when asked.
    When asked to generate highlights from the video, understand the context for each timestamp and create key highlight points, answering in the following way -
    [timestamp] - highlight 1
    [timestamp] - highlight 2
    ... and so on
    Your task is to understand the user question and provide an answer using the provided contexts. Your answers are correct, high-quality, and written by a domain expert. If the provided context does not contain the answer, simply state, 'The provided context does not have the answer.'""",
    knowledge_base=knowledge_base,
    add_references_to_prompt=True,
)

Assistants add memory, knowledge, and tools to LLMs. In this example we will add only knowledge.

Whenever we give a query to the LLM, the assistant will retrieve relevant information from our knowledge base (the table in LanceDB) and pass it to the LLM along with the user query in a structured way.

  • With add_references_to_prompt=True, information from the knowledge base is always added to the prompt, regardless of whether it is relevant to the question.
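Conceptually, the assistant assembles a prompt that combines the description, the retrieved chunks, and the user input. The sketch below illustrates this idea only; build_prompt and its layout are hypothetical, not phidata's actual prompt format:

```python
# Hypothetical sketch of how retrieved references end up in the LLM prompt.
def build_prompt(description: str, references: list, user_input: str) -> str:
    # References are the chunks returned by the vector search over the knowledge base
    context = "\n".join(f"- {ref}" for ref in references)
    return (
        f"{description}\n\n"
        f"Use the following references from the knowledge base if helpful:\n"
        f"{context}\n\n"
        f"User: {user_input}"
    )

prompt = build_prompt(
    "You are an expert in explaining youtube video transcripts.",
    ["[00:00:00 to 00:00:20] Welcome to the video...",
     "[00:00:20 to 00:00:40] Today we discuss..."],
    "Generate highlights from the video.",
)
print(prompt)
```

With add_references_to_prompt=True, the reference section is filled in on every turn, even when the retrieved chunks are not relevant to the question.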

To learn more about creating an assistant in phidata, check out the phidata docs here.

Step 3 - Load data into the Knowledge Base

# Load our data into the knowledge_base (populating the LanceTable)
assistant.knowledge_base.load(recreate=False)
The above code loads the data into the knowledge base (the LanceDB table), and it is now ready to be used by the assistant.

| Name | Type | Purpose | Default |
|---|---|---|---|
| recreate | bool | If True, drops the existing table and recreates it in the vector DB. | False |
| upsert | bool | If True and the vector DB supports upsert, documents are upserted into the vector DB. | False |
| skip_existing | bool | If True, skips documents that already exist in the vector DB when inserting. | True |
What is upsert?

Upsert is a database operation that combines "update" and "insert". It updates an existing record if a document with the same identifier exists, or inserts a new record if none exists. This is useful for keeping information current without manually checking for existence.
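The update-or-insert semantics can be shown with a minimal dict-based sketch (illustrative only; LanceDB implements this natively when upsert is supported):

```python
# A minimal sketch of upsert semantics using a dict keyed by document id.
def upsert(store: dict, doc_id: str, record: dict) -> str:
    if doc_id in store:
        store[doc_id].update(record)   # update the existing record
        return "updated"
    store[doc_id] = dict(record)       # insert a new record
    return "inserted"

store = {}
print(upsert(store, "doc-1", {"text": "first version"}))   # inserted
print(upsert(store, "doc-1", {"text": "second version"}))  # updated
print(store["doc-1"]["text"])                              # second version
```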

During the load operation, phidata interacts directly with the LanceDB library and loads the table with our data in the following steps -

  1. Creates and initializes the table if it does not exist.

  2. Then it splits our data into smaller chunks.

    How do they create chunks?

    phidata provides many types of knowledge bases based on the type of data. Most of them have a property method called document_lists of type Iterator[List[Document]]. During the load operation, this property method is invoked. It traverses the data provided by us (in this case, a text file or files) using a reader, which reads the data, creates chunks, encapsulates each chunk inside a Document object, and yields lists of Document objects that contain our data.

  3. Then embeddings are created for these chunks, which are inserted into the LanceDB table.

    How do they insert your data as different rows in the LanceDB table?

    The chunks of your data are in the form of lists of Document objects, yielded in the step above.

    For each Document in List[Document], it does the following operations:

    • Creates an embedding for the Document.
    • Cleans the content attribute of the Document (this is where the chunk of our data lives).
    • Prepares the data by creating an id and loading the payload with the metadata related to this chunk. (1)

      1. Three columns will be added to the table - "id", "vector", and "payload" (payload contains various metadata, including content)
    • Then it adds this data to the LanceTable.

  4. Now the internal state of knowledge_base has changed (embeddings are created and loaded in the table), and it is ready to be used by the assistant.
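The steps above can be sketched end-to-end under simplifying assumptions: fake_embed is a deterministic stand-in for a real embedder, the fixed-width character split stands in for phidata's chunking, and the row layout mirrors the "id"/"vector"/"payload" columns described in step 3:

```python
import hashlib

def fake_embed(text: str) -> list:
    # Stand-in for a real embedder: derives a deterministic 4-dim vector from the text
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255 for b in digest[:4]]

def load_documents(text: str, chunk_size: int = 40) -> list:
    # Step 2: split the raw text into chunks
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    rows = []
    for chunk in chunks:
        # Step 3: one table row per chunk with "id", "vector", and "payload" columns
        rows.append({
            "id": hashlib.md5(chunk.encode()).hexdigest(),
            "vector": fake_embed(chunk),
            "payload": {"content": chunk},
        })
    return rows

rows = load_documents("phidata turns general-purpose LLMs into specialized assistants.")
print(len(rows), sorted(rows[0].keys()))  # → 2 ['id', 'payload', 'vector']
```

A real load also handles the Document metadata (name, meta_data, usage) inside the payload, but the chunk-embed-insert shape is the same.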

Step 4 - Start a CLI chatbot with access to the Knowledge Base

# Start a CLI chatbot with access to the knowledge base
assistant.print_response("Ask me about something from the knowledge base")
while True:
    message = Prompt.ask("[bold] :sunglasses: User [/bold]")
    if message in ("exit", "bye"):
        break
    assistant.print_response(message, markdown=True)

For more information and cookbooks for phidata, read the phidata documentation, and also visit the LanceDB x phidata documentation.

