Movatterモバイル変換


[0]ホーム

URL:


Skip to main content

This browser is no longer supported.

Upgrade to Microsoft Edge to take advantage of the latest features, security updates, and technical support.

Download Microsoft EdgeMore info about Internet Explorer and Microsoft Edge
Table of contentsExit editor mode

Get started with evaluating answers in a chat app in JavaScript

Feedback

In this article

This article shows you how to evaluate a chat app's answers against a set of correct or ideal answers (known as ground truth). Whenever you change your chat application in a way that affects the answers, run an evaluation to compare the changes. This demo application offers tools that you can use today to make it easier to run evaluations.

By following the instructions in this article, you:

  • Use provided sample prompts tailored to the subject domain. These prompts are already in the repository.
  • Generate sample user questions and ground truth answers from your own documents.
  • Run evaluations by using a sample prompt with the generated user questions.
  • Review analysis of answers.

Note

This article uses one or moreAI app templates as the basis for the examples and guidance in the article. AI app templates provide you with well-maintained reference implementations that are easy to deploy. They help to ensure a high-quality starting point for your AI apps.

Architectural overview

Key components of the architecture include:

  • Azure-hosted chat app: The chat app runs in Azure App Service.
  • Microsoft AI Chat Protocol: The protocol provides standardized API contracts across AI solutions and languages. The chat app conforms to theMicrosoft AI Chat Protocol, which allows the evaluations app to run against any chat app that conforms to the protocol.
  • Azure AI Search: The chat app uses Azure AI Search to store the data from your own documents.
  • Sample questions generator: The tool can generate many questions for each document along with the ground truth answer. The more questions there are, the longer the evaluations.
  • Evaluator: The tool runs sample questions and prompts against the chat app and returns the results.
  • Review tool: The tool reviews the results of the evaluations.
  • Diff tool: The tool compares the answers between evaluations.

When you deploy this evaluation to Azure, the Azure OpenAI Service endpoint is created for theGPT-4 model with its owncapacity. When you evaluate chat applications, it's important that the evaluator has its own Azure OpenAI resource by usingGPT-4 with its own capacity.

Prerequisites

  • Azure subscription.Create one for free

  • Deploy a chat app.

  • These chat apps load the data into the Azure AI Search resource. This resource is required for the evaluations app to work. Don't complete theClean up resources section of the previous procedure.

    You need the following Azure resource information from that deployment, which is referred to as thechat app in this article:

    • Chat API URI: The service backend endpoint shown at the end of theazd up process.
    • Azure AI Search. The following values are required:
      • Resource name: The name of the Azure AI Search resource name, reported asSearch service during theazd up process.
      • Index name: The name of the Azure AI Search index where your documents are stored. You can find the index name in the Azure portal for the Search service.

    The Chat API URL allows the evaluations to make requests through your backend application. The Azure AI Search information allows the evaluation scripts to use the same deployment as your backend, loaded with the documents.

    After you collect this information, you don’t need to use thechat app development environment again. This article refers to thechat app several times to show how theEvaluations app uses it. Don’t delete thechat app resources until you finish all steps in this article.

  • Adevelopment container environment is available with all dependencies required to complete this article. You can run the development container in GitHub Codespaces (in a browser) or locally using Visual Studio Code.

    • GitHub account

Open a development environment

Follow these instructions to set up a preconfigured development environment with all the required dependencies to complete this article. Arrange your monitor workspace so that you can see this documentation and the development environment at the same time.

This article was tested with theswitzerlandnorth region for the evaluation deployment.

GitHub Codespaces runs a development container managed by GitHub withVisual Studio Code for the Web as the user interface. Use GitHub Codespaces for the easiest development environment. It comes with the right developer tools and dependencies preinstalled to complete this article.

Important

All GitHub accounts can use GitHub Codespaces for up to 60 hours free each month with two core instances. For more information, seeGitHub Codespaces monthly included storage and core hours.

  1. Start the process to create a new GitHub codespace on themain branch of theAzure-Samples/ai-rag-chat-evaluator GitHub repository.

  2. To display the development environment and the documentation available at the same time, right-click the following button, and selectOpen link in new window.

    Open in GitHub Codespaces.

  3. On theCreate codespace page, review the codespace configuration settings, and then selectCreate new codespace.

    Screenshot that shows the confirmation screen before you create a new codespace.

  4. Wait for the codespace to start. This startup process can take a few minutes.

  5. In the terminal at the bottom of the screen, sign in to Azure with the Azure Developer CLI:

    azd auth login --use-device-code
  6. Copy the code from the terminal and then paste it into a browser. Follow the instructions to authenticate with your Azure account.

  7. Provision the required Azure resource, Azure OpenAI Service, for the evaluations app:

    azd up

    ThisAZD command doesn't deploy the evaluations app, but it does create the Azure OpenAI resource with a requiredGPT-4 deployment to run the evaluations in the local development environment.

The remaining tasks in this article take place in the context of this development container.

The name of the GitHub repository appears in the search bar. This visual indicator helps you distinguish the evaluations app from the chat app. Thisai-rag-chat-evaluator repo is referred to as theevaluations app in this article.

Prepare environment values and configuration information

Update the environment values and configuration information with the information you gathered duringPrerequisites for the evaluations app.

  1. Create a.env file based on.env.sample.

    cp .env.sample .env
  2. Run this command to get the required values forAZURE_OPENAI_EVAL_DEPLOYMENT andAZURE_OPENAI_SERVICE from your deployed resource group. Paste those values into the.env file.

    azd env get-value AZURE_OPENAI_EVAL_DEPLOYMENTazd env get-value AZURE_OPENAI_SERVICE
  3. Add the following values from the chat app for its Azure AI Search instance to the.env file, which you gathered in thePrerequisites section.

    AZURE_SEARCH_SERVICE="<service-name>"AZURE_SEARCH_INDEX="<index-name>"

Use the Microsoft AI Chat Protocol for configuration information

The chat app and the evaluations app both implement the Microsoft AI Chat Protocol specification, an open-source, cloud, and language-agnostic AI endpoint API contract that's used for consumption and evaluation. When your client and middle-tier endpoints adhere to this API specification, you can consistently consume and run evaluations on your AI backends.

  1. Create a new file namedmy_config.json and copy the following content into it:

    {    "testdata_path": "my_input/qa.jsonl",    "results_dir": "my_results/experiment<TIMESTAMP>",    "target_url": "http://localhost:50505/chat",    "target_parameters": {        "overrides": {            "top": 3,            "temperature": 0.3,            "retrieval_mode": "hybrid",            "semantic_ranker": false,            "prompt_template": "<READFILE>my_input/prompt_refined.txt",            "seed": 1        }    }}

    The evaluation script creates themy_results folder.

    Theoverrides object contains any configuration settings that are needed for the application. Each application defines its own set of settings properties.

  2. Use the following table to understand the meaning of the settings properties that are sent to the chat app.

    Settings propertyDescription
    semantic_rankerWhether to usesemantic ranker, a model that reranks search results based on semantic similarity to the user's query. We disable it for this tutorial to reduce costs.
    retrieval_modeThe retrieval mode to use. The default ishybrid.
    temperatureThe temperature setting for the model. The default is0.3.
    topThe number of search results to return. The default is3.
    prompt_templateAn override of the prompt used to generate the answer based on the question and search results.
    seedThe seed value for any calls to GPT models. Setting a seed results in more consistent results across evaluations.
  3. Change thetarget_url value to the URI value of your chat app, which you gathered in thePrerequisites section. The chat app must conform to the chat protocol. The URI has the following format:https://CHAT-APP-URL/chat. Make sure the protocol and thechat route are part of the URI.

Generate sample data

To evaluate new answers, they must be compared to aground truth answer, which is the ideal answer for a particular question. Generate questions and answers from documents that are stored in Azure AI Search for the chat app.

  1. Copy theexample_input folder into a new folder namedmy_input.

  2. In a terminal, run the following command to generate the sample data:

    python -m evaltools generate --output=my_input/qa.jsonl --persource=2 --numquestions=14

The question-and-answer pairs are generated and stored inmy_input/qa.jsonl (inJSONL format) as input to the evaluator that's used in the next step. For a production evaluation, you would generate more question-and-answer pairs. More than 200 are generated for this dataset.

Note

Only a few questions and answers are generated per source so that you can quickly complete this procedure. It isn't meant to be a production evaluation, which should have more questions and answers per source.

Run the first evaluation with a refined prompt

  1. Edit themy_config.json configuration file properties.

    PropertyNew value
    results_dirmy_results/experiment_refined
    prompt_template<READFILE>my_input/prompt_refined.txt

    The refined prompt is specific about the subject domain.

    If there isn't enough information below, say you don't know. Do not generate answers that don't use the sources below. If asking a clarifying question to the user would help, ask the question.Use clear and concise language and write in a confident yet friendly tone. In your answers, ensure the employee understands how your response connects to the information in the sources and include all citations necessary to help the employee validate the answer provided.For tabular information, return it as an html table. Do not return markdown format. If the question is not in English, answer in the language used in the question.Each source has a name followed by a colon and the actual information. Always include the source name for each fact you use in the response. Use square brackets to reference the source, e.g. [info1.txt]. Don't combine sources, list each source separately, e.g. [info1.txt][info2.pdf].
  2. In a terminal, run the following command to run the evaluation:

    python -m evaltools evaluate --config=my_config.json --numquestions=14

    This script created a new experiment folder inmy_results/ with the evaluation. The folder contains the results of the evaluation.

    File nameDescription
    config.jsonA copy of the configuration file used for the evaluation.
    evaluate_parameters.jsonThe parameters used for the evaluation. Similar toconfig.json but includes other metadata like time stamp.
    eval_results.jsonlEach question and answer, along with the GPT metrics for each question-and-answer pair.
    summary.jsonThe overall results, like the average GPT metrics.

Run the second evaluation with a weak prompt

  1. Edit themy_config.json configuration file properties.

    PropertyNew value
    results_dirmy_results/experiment_weak
    prompt_template<READFILE>my_input/prompt_weak.txt

    That weak prompt has no context about the subject domain.

    You are a helpful assistant.
  2. In a terminal, run the following command to run the evaluation:

    python -m evaltools evaluate --config=my_config.json --numquestions=14

Run the third evaluation with a specific temperature

Use a prompt that allows for more creativity.

  1. Edit themy_config.json configuration file properties.

    ExistingPropertyNew value
    Existingresults_dirmy_results/experiment_ignoresources_temp09
    Existingprompt_template<READFILE>my_input/prompt_ignoresources.txt
    Newtemperature0.9

    The defaulttemperature is 0.7. The higher the temperature, the more creative the answers.

    Theignore prompt is short.

    Your job is to answer questions to the best of your ability. You will be given sources but you should IGNORE them. Be creative!
  2. The configuration object should look like the following example, except that you replacedresults_dir with your path:

    {    "testdata_path": "my_input/qa.jsonl",    "results_dir": "my_results/prompt_ignoresources_temp09",    "target_url": "https://YOUR-CHAT-APP/chat",    "target_parameters": {        "overrides": {            "temperature": 0.9,            "semantic_ranker": false,            "prompt_template": "<READFILE>my_input/prompt_ignoresources.txt"        }    }}
  3. In a terminal, run the following command to run the evaluation:

    python -m evaltools evaluate --config=my_config.json --numquestions=14

Review the evaluation results

You performed three evaluations based on different prompts and app settings. The results are stored in themy_results folder. Review how the results differ based on the settings.

  1. Use the review tool to see the results of the evaluations.

    python -m evaltools summary my_results
  2. The results looksomething like:

    Screenshot that shows the evaluations review tool showing the three evaluations.

    Each value is returned as a number and a percentage.

  3. Use the following table to understand the meaning of the values.

    ValueDescription
    GroundednessChecks how well the model's responses are based on factual, verifiable information. A response is considered grounded if it's factually accurate and reflects reality.
    RelevanceMeasures how closely the model's responses align with the context or the prompt. A relevant response directly addresses the user's query or statement.
    CoherenceChecks how logically consistent the model's responses are. A coherent response maintains a logical flow and doesn't contradict itself.
    CitationIndicates if the answer was returned in the format requested in the prompt.
    LengthMeasures the length of the response.
  4. The results should indicate that all three evaluations had high relevance while theexperiment_ignoresources_temp09 had the lowest relevance.

  5. Select the folder to see the configuration for the evaluation.

  6. EnterCtrl +C to exit the app and return to the terminal.

Compare the answers

Compare the returned answers from the evaluations.

  1. Select two of the evaluations to compare, and then use the same review tool to compare the answers.

    python -m evaltools diff my_results/experiment_refined my_results/experiment_ignoresources_temp09
  2. Review the results. Your results might vary.

    Screenshot that shows comparison of evaluation answers between evaluations.

  3. EnterCtrl +C to exit the app and return to the terminal.

Suggestions for further evaluations

  • Edit the prompts inmy_input to tailor the answers such as subject domain, length, and other factors.
  • Edit themy_config.json file to change the parameters such astemperature, andsemantic_ranker and rerun experiments.
  • Compare different answers to understand how the prompt and question affect the answer quality.
  • Generate a separate set of questions and ground truth answers for each document in the Azure AI Search index. Then rerun the evaluations to see how the answers differ.
  • Alter the prompts to indicate shorter or longer answers by adding the requirement to the end of the prompt. An example isPlease answer in about 3 sentences.

Clean up resources and dependencies

The following steps walk you through the process of cleaning up the resources you used.

Clean up Azure resources

The Azure resources created in this article are billed to your Azure subscription. If you don't expect to need these resources in the future, delete them to avoid incurring more charges.

To delete the Azure resources and remove the source code, run the following Azure Developer CLI command:

azd down --purge

Clean up GitHub Codespaces and Visual Studio Code

Deleting the GitHub Codespaces environment ensures that you can maximize the amount of free per-core hours entitlement that you get for your account.

Important

For more information about your GitHub account's entitlements, seeGitHub Codespaces monthly included storage and core hours.

  1. Sign in to theGitHub Codespaces dashboard.

  2. Locate your currently running codespaces that are sourced from theAzure-Samples/ai-rag-chat-evaluator GitHub repository.

    Screenshot that shows all the running codespaces, including their status and templates.

  3. Open the context menu for the codespace, and then selectDelete.

    Screenshot that shows the context menu for a single codespace with the Delete option highlighted.

Return to the chat app article to clean up those resources.

Related content


Feedback

Was this page helpful?

YesNoNo

Need help with this topic?

Want to try using Ask Learn to clarify or guide you through this topic?

Suggest a fix?

  • Last updated on

In this article

Was this page helpful?

YesNo
NoNeed help with this topic?

Want to try using Ask Learn to clarify or guide you through this topic?

Suggest a fix?