Sharon Machlis
Contributing Writer

How to create your own RAG applications in R

how-to
Jul 17, 2025 | 20 mins

See how to query documents using natural language, LLMs, and R—including dplyr-like filtering on metadata. Plus, learn how to use an LLM to extract structured data for text filtering.

Credit: Koshiro K – shutterstock.com

One of the handiest tasks large language models can do for us is answer questions about a specific collection of information. This is often done using a technique called RAG, or retrieval augmented generation. Instead of relying on what the model knows from its training data, a RAG application searches for the most relevant parts of a document collection, then uses only those text chunks as context for the LLM’s response.

Now, thanks to some relatively new R packages, it’s easy to create your own RAG applications in R. You can even combine RAG with conventional dplyr-like filtering to make responses more relevant, although that requires additional setup and code.

This tutorial gets you started creating RAG applications in R. First, we’ll cover how to prepare, chunk, store, and query a document with basic RAG, using information about Workshops for Ukraine for our demo. You’ll quickly be able to ask general questions like “Tell me three workshops that would help me improve my R data visualization skills” and get a relevant response. Next, we’ll layer on some pre-filtering to answer slightly more specific questions like “What R-related workshops are happening next month?”

More about ragnar

See my introduction to 3 of the best LLM integration tools for R for an overview of RAG for R.

The 5 steps of building a RAG app

There are five basic steps for building a RAG application with the ragnar and ellmer R packages:

  1. Turn documents into a markdown format that ragnar can process.
  2. Split the markdown text into chunks, optionally adding any metadata you might want to filter on (we won’t do the optional part yet).
  3. Create a ragnar data store and insert your markdown chunks into the store. That insertion process automatically includes adding embeddings with each chunk (embeddings use a lengthy string of numbers to represent a text chunk’s semantic meaning).
  4. Embed a query and retrieve text chunks that are most relevant to that query.
  5. Send those chunks along with the original query to an LLM and ask the model to generate a response.

Let’s get started!

Set up your development environment

To start, you will need to install the ellmer and ragnar packages if you want to follow the examples. ellmer is the main tidyverse R package for using large language models in R. ragnar is specifically designed for RAG and works with ellmer.

I suggest installing the latest development versions of both, especially ragnar, since useful new features are being added somewhat frequently. You can do that with pak::pak("tidyverse/ragnar") and pak::pak("tidyverse/ellmer"). I’m also using the dplyr, purrr, stringr, and rio R packages, which can all be installed from CRAN with install.packages().
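For reference, here are those installation commands in one block:

# Development versions from GitHub
pak::pak("tidyverse/ragnar")
pak::pak("tidyverse/ellmer")

# Supporting packages from CRAN
install.packages(c("dplyr", "purrr", "stringr", "rio"))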

I’ll be using OpenAI models both to generate embeddings and ask questions, so you’ll need an OpenAI API key to use the example code. If you want to use an Anthropic or Google Gemini model to generate the final answers, you’ll also need an API key from that provider. While it’s possible to run the example with a local LLM using ollama, your results may not be as good.
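ellmer and ragnar look for your OpenAI key in the OPENAI_API_KEY environment variable. Here’s a minimal sketch of one way to set it (the key value is a placeholder; storing it in your .Renviron file is the more durable option):

# For the current R session only; replace the placeholder with your real key
Sys.setenv(OPENAI_API_KEY = "your-openai-api-key")

# Or add the line OPENAI_API_KEY=your-openai-api-key to ~/.Renviron
# (for example, via usethis::edit_r_environ()) and restart R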

ragnar updates

ragnar added a new data store architecture just prior to publication in July 2025, to support more sophisticated text chunking and retrieval. Thanks to package creator Tomasz Kalinowski at Posit for his help updating some of the code in this article.

Steps 1 and 2: Wrangle the ‘Workshops for Ukraine’ data

Workshops for Ukraine is a two-hour data science webinar series where volunteers teach a specific topic or skill, often R-related. The goal is to raise money for Ukraine, so participants donate at least $20 or €20 to one of several charities. Participants can attend live or get access to past recordings and materials.

The workshops are listed on a single web page hosted on Google Sites. Our first task is to import the web page using ragnar, which includes several functions for importing web pages and other document formats such as PDFs, Word, and Excel.

In the code below, read_as_markdown() converts the web page into markdown, then markdown_chunk() splits that into chunks. The segment_by_heading_levels = 3 argument splits the text using the original HTML H3 headers, so that each new row is a workshop.

library(ragnar)
library(dplyr, warn.conflicts = FALSE)
library(stringr)

workshop_url <- "https://sites.google.com/view/dariia-mykhailyshyna/main/r-workshops-for-ukraine"

ukraine_chunks <- workshop_url |>
  read_as_markdown() |>
  markdown_chunk(
    target_size = NA,
    segment_by_heading_levels = 3
  ) |>
  filter(str_starts(text, "### "))

Why did I use H3s to split the HTML text? Because I examined the workshop HTML page structure, and it looked like each workshop had its own H3 HTML header. Always check the page structure first, because other web pages may be formatted differently.

The final filter deletes any rows without a level-3 heading, because those aren’t workshops.

Figure 1: Data frame generated by the read_as_markdown() and markdown_chunk() functions.

The resulting data frame has columns for text, context (header and potentially other information), and start and end locations. The start and end locations help ragnar handle chunk overlapping, which can help retain semantic meaning across text segments.
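To confirm that structure on your own system, take a quick look with dplyr’s glimpse() (dplyr is already loaded):

glimpse(ukraine_chunks)
# Expect columns including start, end, context, and text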

Step 3: Create a data store and insert chunks

Now I’m ready to create a data store and add my chunks. The code below creates a simple ragnar data store that is set up to use OpenAI’s text-embedding-3-small model when creating embeddings for each chunk. (Using embed_ollama() instead would instruct the app to use a local ollama embedding model, if one is installed on your system.) ragnar uses DuckDB for its data store database.

store_file_location <- "ukraine_workshops.duckdb"

store <- ragnar_store_create(
  store_file_location,
  embed = \(x) ragnar::embed_openai(x, model = "text-embedding-3-small")
)

To add chunks to a store, use the syntax: ragnar_store_insert(store_object, chunk_dataframe). This single line of code saves the chunks, generates embeddings, and saves the embeddings:

ragnar_store_insert(store, ukraine_chunks)

If you’re having problems with any of this, as I did initially, make sure you have the latest version of the duckdb R package installed. I ended up having to uninstall it completely and reinstall with pak::pak("duckdb").

There’s one more quick step before you can use the store: Build the search index with the ragnar_store_build_index(store_object) function. Don’t forget this part, or you may find yourself wondering later why you’re not seeing any search results:

ragnar_store_build_index(store)

If you want to know what your store looks like, ragnar has a built-in function for inspecting the store in a browser: ragnar_store_inspect(store).

Figure 2: The ragnar package’s ragnar_store_inspect() function lets you view and search a data store.

ragnar comes with two search algorithms by default: BM25 and VSS. BM25 looks for close matches; e.g., “graphs” should match “graph” but not “plots” or “visualizations.”

VSS uses semantic similarity, so in theory, “graph” and “plot” should also match “data visualization.” The similarity matching may not always be as smart as you’d like, so try adding synonyms to your query if you need better results.

Despite the limitations of VSS, I prefer ragnar_retrieve_vss() when working with small text chunks, since there’s a reasonable risk that a workshop may talk about “visualization” and “plots” while a query may only say “graphs.” If you want both VSS and BM25, ragnar_retrieve() returns de-duplicated results from both algorithms.
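Here’s a minimal sketch comparing the three retrieval calls on the same query (I’m assuming the ragnar_retrieve_bm25() function name from current ragnar documentation; check your installed version):

query <- "workshops about graphs"

# Keyword-style matching only
bm25_chunks <- ragnar_retrieve_bm25(store, query, top_k = 6)

# Semantic similarity only
vss_chunks <- ragnar_retrieve_vss(store, query, top_k = 6)

# De-duplicated results from both algorithms
both_chunks <- ragnar_retrieve(store, query, top_k = 6)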

In addition to using ragnar_store_inspect() to view the data store, you can also query a ragnar data store as you would any other DuckDB database in R. Here’s one way to do this:

chunks_df <- tbl(store@con, "chunks") |>
  collect()
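Because store@con is a standard DuckDB connection, plain SQL via DBI works too; a small sketch, assuming the same chunks table name used above:

chunks_df <- DBI::dbGetQuery(store@con, "SELECT * FROM chunks LIMIT 10")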

You can close the store connection with DBI::dbDisconnect(store@con), which is a good habit to get into, since DuckDB can get finicky if you leave a database write connection open.

Step 4: Query your data store

Now we’re at the fun part!

We can use the store we’ve just created to retrieve text chunks related to a query. If you don’t already have a store connection in your R session, use ragnar_store_connect() to connect to the DuckDB file:

store <- ragnar::ragnar_store_connect("ukraine_workshops.duckdb", read_only = TRUE)

The following code retrieves the six chunks deemed most relevant to a query using VSS semantic searching (top_k sets how many chunks are returned):

query <- "What workshops would help me improve my R data visualization skills?"
similarity_chunks <- ragnar_retrieve_vss(store, query, top_k = 6)

Figure 3: Structure of data returned by a basic ragnar retrieval.

When I ran this code, I got back the following list of workshops: “Effective Data Visualization in R in Scientific Contexts,” “Advanced data visualization in R with ggplot,” “Effective Visual Communication with R,” “Data Visualization with ggplot,” “Color Palette Choice and Customization in R and ggplot2,” and “Efficient R – How to write faster code.” Most of these look pretty relevant to me.

You can easily view all the text in the similarity_chunks results data frame by entering something like the following R code:

similarity_chunks$text |> cat(sep = "\n=====\n")

Step 5: Generate your answer

The final step is to send the query and retrieved text chunks to an LLM, with instructions to use the retrieved text chunks to generate an answer.

The tidyverse way to do this is to register your ragnar data store as a tool for an ellmer chat. (Tools are functions that LLMs can access to give them additional capabilities. The ellmer documentation has a good overview of LLM tool calling.)

The ragnar_register_tool_retrieve() function is the easiest way to do a basic retrieval. Here’s an example:

# Create a chat object
library(ellmer)

my_chat <- chat_openai(
  system_prompt = "You are a helpful assistant who answers questions about Workshops for Ukraine. You use available tools to answer questions and do not use your own existing knowledge.",
  model = "gpt-4.1"
)

# Register your store as a tool, setting your desired number of chunks to return
ragnar_register_tool_retrieve(my_chat, store, top_k = 6)

# Ask your question
my_chat$chat("What workshops would help me improve my R data visualization skills?")

Here’s a look at the results in the console:

Figure 4: Results when using ellmer to query a ragnar store in the console.

The my_chat$chat() call runs the chat object’s chat method and returns results to your console. If you want a web chatbot interface instead, you can run ellmer’s live_browser() function on your chat object, which can be handy if you want to ask multiple questions: live_browser(my_chat).

Figure 5: Results in ellmer’s built-in simple web chatbot interface.

Add metadata filtering to the RAG R app

Basic RAG worked pretty well when I asked about topics, but not for questions involving time. Asking about workshops “next month,” even when I told the LLM the current date, didn’t return the correct workshops.

That’s because this basic RAG is just looking for text that’s most similar to a question. If you ask “What R data visualization events are happening next month?”, you might end up with a workshop in three months. Basic semantic search often misses required elements, which is why we have metadata filtering.

Metadata filtering “knows” what is essential to a query, at least if you’ve set it up that way. This type of filtering lets you specify that chunks must match certain requirements, such as a date range, and then performs semantic search only on those chunks. The items that don’t match your must-haves won’t be included.

To turn basic ragnar RAG code into a RAG app with metadata filtering, you need to add metadata as separate columns in your ragnar data store and make sure an LLM knows how and when to use that information.

For this example, we’ll need to do the following:

  1. Get the date of each workshop and add it as a column to the original text chunks.
  2. Create a data store that includes a date column.
  3. Create a custom ragnar retrieval tool that tells the LLM how to filter for dates if the user’s query includes a time component.

Let’s get to it!

Step 1: Add the new metadata

If you’re lucky, your data already has the metadata you want in a structured format. Alas, no such luck here, since the Workshops for Ukraine listings are HTML text. How can we get the date of each future workshop?

It’s possible to do some metadata parsing with regular expressions. But if you’re interested in using generative AI with R, it’s worth knowing how to ask LLMs to extract structured data. Let’s take a quick detour for that.

We can request structured data with ellmer’s parallel_chat_structured() in three steps:

  • Define the structure we want.
  • Create prompts.
  • Send those prompts to an LLM.

We can extract the workshop title with a regex, an easy task since all the titles start with ### and end with a line break:

ukraine_chunks <- ukraine_chunks |>
  mutate(title = str_extract(text, "^### (.+)\n", 1))

Define the desired structure

The first thing we’ll do is define the metadata structure we want an LLM to return for each workshop item. Most important is the date, which will be flagged as not required, since past workshops didn’t include them. ragnar creator Tomasz Kalinowski suggests we also include the speaker and speaker affiliation, which seems useful. We can save the resulting metadata structure as an ellmer “TypeObject” template:

type_workshop_metadata <- type_object(
  date = type_string(
    paste(
      "Date in yyyy-mm-dd format if it's an upcoming workshop, otherwise an empty string."
    )
  ),
  speaker_name = type_string(),
  speaker_affiliations = type_string(
    "comma separated listing of current and former affiliations listed in reverse chronological order"
  )
)

Create prompts to request that structured data

The code below uses ellmer’s interpolate() function to create a vector of prompts using that template, one for each text chunk:

prompts <- interpolate(
  "Extract the data for the workshops mentioned in the text below.

  Include the Date ONLY if it is a future workshop with a specific date (today is {{Sys.Date()}}). The Date must be in yyyy-mm-dd format.
  If the year is not included in the date, start by assuming the workshop is in the next 12 months and set the year accordingly.
  Next, find the day of week mentioned in the text and make sure the day-date combination exists! For example, if a workshop says 'Thursday, August 30' and you set the date to 2025-08-30, check if 2025-08-30 is on Thursday. If it isn't, set the date to null.

  {{ ukraine_chunks$text }}
  "
)

Send all the prompts to an LLM

This next bit of code creates a chat object and then uses parallel_chat_structured() to run all the prompts. The chat and prompts vector are required arguments. In this case, I also dialed back the default number of active requests and requests per minute with the max_active and rpm arguments, so I didn’t hit my API limits (which often happens on my OpenAI account at the defaults):

chat <- ellmer::chat_openai(model = "gpt-4.1")

extracted <- parallel_chat_structured(
  chat = chat,
  prompts = prompts,
  max_active = 4,
  rpm = 100,
  type = type_workshop_metadata
)

Finally, we add the extracted results to the ukraine_chunks data frame and save those results. That way, we won’t need to re-run all the code later if we need this data again:

ukraine_chunks <- ukraine_chunks |>
  mutate(!!!extracted, date = as.Date(date))

rio::export(ukraine_chunks, "ukraine_workshop_data_results.parquet")

If you’re unfamiliar with the splice operator (!!! in the above code), it’s unpacking individual columns in the extracted data frame and adding them as new columns to ukraine_chunks via the mutate() function.
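If the splice operator is new to you, here’s a tiny, self-contained illustration with toy data (unrelated to the workshop example):

library(dplyr)

df <- tibble(x = 1:3)
new_cols <- list(y = 4:6, z = c("a", "b", "c"))

# !!! splices each named element in as its own mutate() argument,
# equivalent to mutate(df, y = 4:6, z = c("a", "b", "c"))
mutate(df, !!!new_cols)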

The ukraine_chunks data frame now has the columns start, end, context, text, title, date, speaker_name, and speaker_affiliations.

I still ended up with a few old dates in my data. Since this tutorial’s main focus is RAG and not optimizing data extraction, I’ll call this good enough. As long as the LLM figured out that a workshop on “Thursday, September 12” wasn’t this year, we can delete past dates the old-fashioned way:

ukraine_chunks <- ukraine_chunks |>
  mutate(date = if_else(date >= Sys.Date(), date, NA))

We’ve got the metadata we need, structured how we want it. The next step is to set up the data store.

Step 2: Set up the data store with metadata columns

We want the ragnar data store to have columns for title, date, speaker_name, and speaker_affiliations, in addition to the defaults.

To add extra columns to a ragnar data store, you first create an empty data frame with the extra columns you want, and then use that data frame as an argument when creating the store. This process is simpler than it sounds, as you can see below:

my_extra_columns <- data.frame(
  title = character(),
  date = as.Date(character()),
  speaker_name = character(),
  speaker_affiliations = character()
)

store_file_location <- "ukraine_workshop_w_metadata.duckdb"

store <- ragnar_store_create(
  store_file_location,
  embed = \(x) ragnar::embed_openai(x, model = "text-embedding-3-small"),
  # overwrite = TRUE,
  extra_cols = my_extra_columns
)

Inserting text chunks from the metadata-augmented data frame into a ragnar data store is the same as before, using ragnar_store_insert() and ragnar_store_build_index():

ragnar_store_insert(store, ukraine_chunks)
ragnar_store_build_index(store)

If you’re trying to update existing items in a store instead of inserting new ones, you can use ragnar_store_update(). That should check the hash to see if the entry exists and whether it has been changed.
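A minimal sketch of an update, assuming you’ve modified some rows of ukraine_chunks since the original insert (rebuilding the index afterward is my assumption, mirroring the insert workflow):

# Update rather than insert; existing chunks are matched by hash
ragnar_store_update(store, ukraine_chunks)
ragnar_store_build_index(store)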

Step 3: Create a custom ragnar retrieval tool

As far as I know, you need to register a custom tool with ellmer when doing metadata filtering, instead of using ragnar’s simple ragnar_register_tool_retrieve(). You can do this by:

  • Creating an R function
  • Turning that function into a tool definition
  • Registering the tool with a chat object’s register_tool() method

First, you will write a conventional R function. The function below adds filtering if a start and/or end date is not NULL, and then performs chunk retrieval. It requires a store to be in your global environment. Don’t use store as an argument in this function; it won’t work.

This function first sets up a filter expression, depending on whether dates are specified, and then adds the filter expression as an argument to a ragnar retrieval function. Adding filtering to ragnar_retrieve() functions is a new feature as of this writing in July 2025.

Below is the function, largely suggested by Tomasz Kalinowski. Here we’re using ragnar_retrieve() to get both conventional and semantic search, instead of just VSS searching. I added "data-related" as the default query so the function can also handle time-related questions without a topic:

retrieve_workshops_filtered <- function(
  query = "data-related",
  start_date = NULL,
  end_date = NULL,
  top_k = 8
) {
  # Build filter expression based on provided dates
  if (!is.null(start_date) && !is.null(end_date)) {
    # Both dates provided
    start_date <- as.Date(start_date)
    end_date <- as.Date(end_date)
    filter_expr <- rlang::expr(between(
      date,
      !!as.Date(start_date),
      !!as.Date(end_date)
    ))
  } else if (!is.null(start_date)) {
    # Only start date
    filter_expr <- rlang::expr(date >= !!as.Date(start_date))
  } else if (!is.null(end_date)) {
    # Only end date
    filter_expr <- rlang::expr(date <= !!as.Date(end_date))
  } else {
    # No filter
    filter_expr <- NULL
  }

  # Perform retrieval
  ragnar_retrieve(
    store,
    query,
    top_k = top_k,
    filter = !!filter_expr
  ) |>
    select(title, date, speaker_name, speaker_affiliations, text)
}

Next, create a tool for ellmer based on that function using tool(), which needs the function name and a tool definition as arguments. The definition is important because the LLM uses it to decide whether or not to use the tool to answer a question:

workshop_retrieval_tool <- tool(
  retrieve_workshops_filtered,
  "Retrieve workshop information based on content query and optional date filtering. Only returns workshops that match both the content query and date constraints.",
  query = type_string(
    "The search query describing what kind of workshop content you're looking for (e.g., 'data visualization', 'data wrangling')"
  ),
  start_date = type_string(
    "Optional start date in YYYY-MM-DD format. Only workshops on or after this date will be returned.",
    required = FALSE
  ),
  end_date = type_string(
    "Optional end date in YYYY-MM-DD format. Only workshops on or before this date will be returned.",
    required = FALSE
  ),
  top_k = type_integer(
    "Number of workshops to retrieve (default: 8)",
    required = FALSE
  )
)

Now create an ellmer chat with a system prompt to help the LLM know when to use the tool. Then register the tool and try it out! My example is below.

my_system_prompt <- paste0(
  "You are a helpful assistant who only answers questions about Workshops for Ukraine from provided context. Do not also use your own existing knowledge. ",
  "Use the retrieve_workshops_filtered tool to search for workshops and workshop information. ",
  "When users mention time periods like 'next month', 'this month', 'upcoming', etc., ",
  "convert these to specific YYYY-MM-DD date ranges and pass them to the tool. ",
  "Past workshops do not have Date entries, so those would be NULL or NA. ",
  "Today's date is ",
  Sys.Date(),
  ". ",
  "If no workshops match the criteria, let the user know."
)

my_chat <- chat_openai(
  system_prompt = my_system_prompt,
  model = "gpt-4.1",
  params = params(temperature = 0.3)
)

# Register the tool
my_chat$register_tool(workshop_retrieval_tool)

# Test it out
my_chat$chat("What R-related workshops are happening next month?")

If there are indeed any R-related workshops next month, you should get the correct answer, thanks to your new advanced RAG app built entirely in R. You can also create a local chatbot interface with live_browser(my_chat).

And, once again, it’s good practice to close your connection when you’re finished, with DBI::dbDisconnect(store@con).

That’s it for this demo, but there’s a lot more you can do with R and RAG. Do you want a better interface, or one you can share? This sample R Shiny web app, written primarily by Claude Opus, might give you some ideas.

Sharon Machlis

Sharon Machlis was a longtime writer and editor at Computerworld and later the Director of Editorial Data & Analytics at parent company Foundry. She is also the author of Practical R for Mass Communications and Journalism.

Sharon's Do more with R video tutorials won a Jesse H. Neal award for Best Instructional Content.

Recently retired, Sharon is still passionate about R and generative AI, and also blogs about the retirement life. You can find her on Bluesky at @smachlis.bsky.social, Mastodon at @smach@masto.machlis.com, and LinkedIn.
