
YouTube transcript QA bot with NodeJS

Use LanceDB's JavaScript API and OpenAI to build a Q&A bot for YouTube transcripts

nodejs

This Q&A bot will allow you to search through YouTube transcripts using natural language! We'll show how to use LanceDB's JavaScript API to easily store and manage your data.

npm install vectordb
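The later snippets rely on a few imports and a constant for the input file that the original listing doesn't show. Here is a minimal setup sketch that matches them; the ES-module style (so top-level await works), the file path, and the exact import specifiers are assumptions, not part of the original example:

// Minimal setup sketch (assumed, not part of the original listing):
// Node's promise-based fs and readline APIs, the vectordb package installed above,
// and the v3 openai SDK (which exposes Configuration and OpenAIApi)
import * as fs from 'node:fs/promises'
import * as readline from 'node:readline/promises'
import { stdin as input, stdout as output } from 'node:process'
import * as lancedb from 'vectordb'
import { Configuration, OpenAIApi } from 'openai'

// Assumed location of the transcript sample downloaded in the next step
const INPUT_FILE_NAME = 'data/youtube-transcriptions_sample.jsonl'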

Download the data

For this example, we're using a sample of a HuggingFace dataset that contains YouTube transcriptions: jamescalam/youtube-transcriptions. Download this file and place it under the data folder:

wget -c https://eto-public.s3.us-west-2.amazonaws.com/datasets/youtube_transcript/youtube-transcriptions_sample.jsonl
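Each line of the JSONL file is a standalone JSON object describing one transcript chunk. The only fields this example relies on are video_id, title, and text; roughly, a record looks like the following (the values here are made up for illustration):

// Illustrative record shape (values are invented; only video_id, title and text are used below)
const exampleRecord = {
  video_id: 'abc123',
  title: 'Some talk about sentence transformers',
  text: 'a short chunk of the spoken transcript...'
}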

Prepare Context

Each item in the dataset contains just a short chunk of text. We'll need to merge a bunch of these chunks together on a rolling basis. For this demo, we'll look back 20 records to create a more complete context for each sentence.

First, we need to read and parse the input file.

const lines = (await fs.readFile(INPUT_FILE_NAME, 'utf-8'))
  .toString()
  .split('\n')
  .filter(line => line.length > 0)
  .map(line => JSON.parse(line))

const data = contextualize(lines, 20, 'video_id')

The contextualize function groups the transcripts by video_id and then creates the expanded context for each item.

function contextualize (rows, contextSize, groupColumn) {
  const grouped = []
  rows.forEach(row => {
    if (!grouped[row[groupColumn]]) {
      grouped[row[groupColumn]] = []
    }
    grouped[row[groupColumn]].push(row)
  })

  const data = []
  Object.keys(grouped).forEach(key => {
    for (let i = 0; i < grouped[key].length; i++) {
      const start = i - contextSize > 0 ? i - contextSize : 0
      grouped[key][i].context = grouped[key].slice(start, i + 1).map(r => r.text).join(' ')
    }
    data.push(...grouped[key])
  })
  return data
}
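To make the rolling window concrete, here is a tiny made-up input and the contexts that contextualize produces for it:

// Toy example (records are invented): with a context size of 1,
// each chunk's context is itself plus the previous chunk from the same video
const toy = [
  { video_id: 'v1', text: 'first chunk' },
  { video_id: 'v1', text: 'second chunk' },
  { video_id: 'v1', text: 'third chunk' }
]
console.log(contextualize(toy, 1, 'video_id').map(r => r.context))
// [ 'first chunk', 'first chunk second chunk', 'second chunk third chunk' ]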

Create the LanceDB Table

To load our data into LanceDB, we need to create embeddings (vectors) for each item. For this example, we will use the OpenAI embedding function, which has a native integration with LanceDB.

// You need to provide an OpenAI API key, here we read it from the OPENAI_API_KEY environment variable
const apiKey = process.env.OPENAI_API_KEY
// The embedding function will create embeddings for the 'context' column
const embedFunction = new lancedb.OpenAIEmbeddingFunction('context', apiKey)
// Connects to LanceDB
const db = await lancedb.connect('data/youtube-lancedb')
const tbl = await db.createTable('vectors', data, embedFunction)
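Note that createTable expects the table not to exist yet. If you re-run the script against the same data/youtube-lancedb directory, you can open the existing table instead of recreating it; with this version of the vectordb API the call should look roughly like the sketch below, but check the docs for the version you installed:

// Sketch (assumed signature): reuse the existing table on subsequent runs
const existingTbl = await db.openTable('vectors', embedFunction)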

Create and answer the prompt

We will accept questions in natural language and use our corpus stored in LanceDB to answer them. First, we need to set up the OpenAI client:

const configuration = new Configuration({ apiKey })
const openai = new OpenAIApi(configuration)
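The rl object used in the next snippet is a Node readline interface for reading questions from stdin. It isn't shown in the original listing, so here is one way to create it, assuming the promise-based readline API imported earlier:

// Assumed: promise-based readline, so rl.question() can be awaited
const rl = readline.createInterface({ input, output })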

Then we can prompt for a question and use LanceDB to retrieve the three most relevant transcripts for it.

const query = await rl.question('Prompt: ')
const results = await tbl
  .search(query)
  .select(['title', 'text', 'context'])
  .limit(3)
  .execute()

The query and the retrieved transcripts' context are then combined into a single prompt:

function createPrompt (query, context) {
  let prompt =
    'Answer the question based on the context below.\n\n' +
    'Context:\n'
  // need to make sure our prompt is not larger than max size
  prompt = prompt + context.map(c => c.context).join('\n\n---\n\n').substring(0, 3750)
  prompt = prompt + `\n\nQuestion: ${query}\nAnswer:`
  return prompt
}

We can now use the OpenAI Completion API to process our custom prompt and give us an answer.

const response = await openai.createCompletion({
  model: 'text-davinci-003',
  prompt: createPrompt(query, results),
  max_tokens: 400,
  temperature: 0,
  top_p: 1,
  frequency_penalty: 0,
  presence_penalty: 0
})
console.log(response.data.choices[0].text)

Let's put it all together now

Now we can provide queries and have them answered based on our local LanceDB data.
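As a rough sketch (this loop is not part of the original listing), the pieces above can be wired into a simple interactive session like this:

// Sketch: keep answering questions until the process is interrupted (e.g. Ctrl+C)
while (true) {
  const query = await rl.question('Prompt: ')
  const results = await tbl
    .search(query)
    .select(['title', 'text', 'context'])
    .limit(3)
    .execute()

  const response = await openai.createCompletion({
    model: 'text-davinci-003',
    prompt: createPrompt(query, results),
    max_tokens: 400,
    temperature: 0
  })
  console.log(response.data.choices[0].text)
}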

Prompt: who was the 12th person on the moon and when did they land?
The 12th person on the moon was Harrison Schmitt and he landed on December 11, 1972.

Prompt: Which training method should I use for sentence transformers when I only have pairs of related sentences?
NLI with multiple negative ranking loss.

That's a wrap

In this example, you learned how to use LanceDB to store and query embedding representations of your local data. The complete example code is on GitHub, and you can also download the LanceDB dataset using this link.

