Standard Tools

Overview

Inspect has several standard tools built-in, including:

  • Web Search, which uses a search provider (either built in to the model or external) to execute and summarize web searches.

  • Bash and Python for executing arbitrary shell and Python code.

  • Bash Session for creating a stateful bash shell that retains its state across calls from the model.

  • Text Editor, which enables viewing, creating and editing text files.

  • Web Browser, which provides the model with a headless Chromium web browser that supports navigation, history, and mouse/keyboard interactions.

  • Computer, which provides the model with a desktop computer (viewed through screenshots) that supports mouse and keyboard interaction.

  • Think, which provides models the ability to include an additional thinking step as part of getting to their final answer.

Web Search

The web_search() tool provides models the ability to enhance their context window by performing a search. Web searches are executed using a provider. Providers are split into two categories:

  • Internal providers: "openai", "anthropic", "gemini", and "perplexity". These use the model’s built-in search capability and do not require separate API keys. They work only for their respective model provider (e.g. the “openai” search provider works only for openai/* models).

  • External providers: "tavily", "exa", and "google". These are external services that work with any model and require separate accounts and API keys. Note that “google” is different from “gemini”: “google” refers to Google’s Programmable Search Engine service, while “gemini” refers to Google’s built-in search capability for Gemini models.

Internal providers will be prioritized if running on the corresponding model (e.g., the “openai” provider will be used when running on openai models). If an internal provider is specified but the evaluation is run with a different model, a fallback external provider must also be specified.

You can configure the web_search() tool in various ways:

    from inspect_ai.tool import web_search

    # single provider
    web_search("tavily")

    # internal provider and fallback
    web_search(["openai", "tavily"])

    # multiple internal providers and fallback
    web_search(["openai", "anthropic", "gemini", "perplexity", "tavily"])

    # provider with specific options
    web_search({"tavily": {"max_results": 5}})

    # multiple providers with options
    web_search({
        "openai": True,
        "google": {"num_results": 5},
        "tavily": {"max_results": 5},
    })
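To make the prioritization rule concrete, here is a minimal sketch in plain Python of how a provider specification in any of the three forms (string, list, or dict) could be resolved for a given model. It is an illustration only, not Inspect's implementation: the function names normalize() and pick_provider() are hypothetical, and the model-name prefixes other than openai/ are assumptions.

```python
# Internal providers and the model-name prefix each is assumed to serve.
# Only the "openai/" prefix appears in the text; the others are assumptions.
INTERNAL_PREFIXES = {
    "openai": "openai/",
    "anthropic": "anthropic/",
    "gemini": "google/",
    "perplexity": "perplexity/",
}
EXTERNAL_PROVIDERS = ("tavily", "exa", "google")


def normalize(config):
    """Turn a str, list, or dict provider spec into {provider: options}."""
    if isinstance(config, str):
        return {config: {}}
    if isinstance(config, list):
        return {name: {} for name in config}
    # dict form: True means "use the provider with default options"
    return {
        name: ({} if options is True else dict(options))
        for name, options in config.items()
    }


def pick_provider(model, config):
    """Prefer a matching internal provider, else the first external one."""
    providers = normalize(config)
    for name in providers:
        prefix = INTERNAL_PREFIXES.get(name)
        if prefix is not None and model.startswith(prefix):
            return name
    for name in providers:
        if name in EXTERNAL_PROVIDERS:
            return name
    raise ValueError(f"no usable search provider for model {model!r}")


on_openai = pick_provider("openai/gpt-4o", ["openai", "tavily"])
on_other = pick_provider("anthropic/claude-3-7-sonnet-latest", ["openai", "tavily"])
```

In this sketch the same configuration resolves to the internal provider on an OpenAI model and to the external fallback elsewhere, which is why a fallback is needed whenever an evaluation may run on more than one model family.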

OpenAI Options

The web_search() tool can use OpenAI’s built-in search capability when running on a limited number of OpenAI models (currently “gpt-4o”, “gpt-4o-mini”, and “gpt-4.1”). This provider does not require any API keys beyond what’s needed for the model itself.

For more details on OpenAI’s web search parameters, see the OpenAI Web Search Documentation.

Note that when using the “openai” provider, you should also specify a fallback external provider (like “tavily”, “exa”, or “google”) if you are also running the evaluation with non-OpenAI models.

Anthropic Options

The web_search() tool can use Anthropic’s built-in search capability when running on a limited number of Anthropic models (currently “claude-opus-4-20250514”, “claude-sonnet-4-20250514”, “claude-3-7-sonnet-20250219”, “claude-3-5-sonnet-latest”, and “claude-3-5-haiku-latest”). This provider does not require any API keys beyond what’s needed for the model itself.

For more details on Anthropic’s web search parameters, see the Anthropic Web Search Documentation.

Note that when using the “anthropic” provider, you should also specify a fallback external provider (like “tavily”, “exa”, or “google”) if you are also running the evaluation with non-Anthropic models.

Gemini Options

The web_search() tool can use Google’s built-in search capability (called grounding) when running on Gemini 2.0 models and later. This provider does not require any API keys beyond what’s needed for the model itself.

This is distinct from the “google” provider (described below), which uses Google’s external Programmable Search Engine service and requires separate API keys.

For more details, see Grounding with Google Search.

Note that when using the “gemini” provider, you should also specify a fallback external provider (like “tavily”, “exa”, or “google”) if you are also running the evaluation with non-Gemini models.

Warning

Google’s search grounding does not currently support use with other tools. Attempting to use web_search("gemini") alongside other tools will result in an error.

Perplexity Options

The web_search() tool can use Perplexity’s built-in search capability when running on Perplexity models. This provider does not require any API keys beyond what’s needed for the model itself. Search parameters can be passed using the perplexity provider options and will be forwarded to the model API.

For more details, see the Perplexity API Documentation.

Note that when using the “perplexity” provider, you should also specify a fallback external provider (like “tavily”, “exa”, or “google”) if you are also running the evaluation with non-Perplexity models.

Tavily Options

The web_search() tool can use Tavily’s Research API. To use it you will need to set up your own Tavily account. Then, ensure that the following environment variable is defined:

  • TAVILY_API_KEY — Tavily Research API key

Tavily supports the following options:

Option                              Description
max_results                         Number of results to return
search_depth                        Can be “basic” or “advanced”
topic                               Can be “general” or “news”
include_domains / exclude_domains   Lists of domains to include or exclude
time_range                          Time range for search results (e.g., “day”, “week”, “month”)
max_connections                     Maximum number of concurrent connections

For more options, see the Tavily API Documentation.

Exa Options

The web_search() tool can use Exa’s Answer API. To use it you will need to set up your own Exa account. Then, ensure that the following environment variable is defined:

  • EXA_API_KEY — Exa API key

Exa supports the following options:

Option            Description
text              Whether to include text content in citations (defaults to true)
model             LLM model to use for generating the answer (“exa” or “exa-pro”)
max_connections   Maximum number of concurrent connections

For more details, see the Exa API Documentation.

Google Options

The web_search() tool can use Google Programmable Search Engine as an external provider. This is different from the “gemini” provider (described above), which uses Google’s built-in search capability for Gemini models.

To use the “google” provider you will need to set up your own Google Programmable Search Engine and also enable the Programmable Search Element Paid API. Then, ensure that the following environment variables are defined:

  • GOOGLE_CSE_ID — Google Custom Search Engine ID
  • GOOGLE_CSE_API_KEY — Google API key used to enable the Search API

Google supports the following options:

Option               Description
num_results          The number of relevant webpages whose contents are returned
max_provider_calls   Number of times to retrieve more links in case previous ones were irrelevant (defaults to 3)
max_connections      Maximum number of concurrent connections (defaults to 10)
model                Model to use to determine if search results are relevant (defaults to the model being evaluated)
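Since each external provider depends on its own environment variables, it can be useful to check for them before launching an evaluation. The following is a small sketch of such a check. The helper missing_env_vars() is hypothetical and not part of Inspect; the variable names are the ones listed in the sections above.

```python
import os

# Environment variables required by each external provider (from the text above).
REQUIRED_ENV = {
    "tavily": ["TAVILY_API_KEY"],
    "exa": ["EXA_API_KEY"],
    "google": ["GOOGLE_CSE_ID", "GOOGLE_CSE_API_KEY"],
}


def missing_env_vars(provider, env=None):
    """Return the required variables that are unset or empty for a provider.

    Internal providers ("openai", "anthropic", ...) need no extra keys,
    so they yield an empty list.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV.get(provider, []) if not env.get(name)]


# Example with an explicit mapping instead of the real environment:
example = missing_env_vars("google", {"GOOGLE_CSE_ID": "my-engine-id"})
```

Passing a mapping explicitly, as in the last line, keeps the check easy to test; calling missing_env_vars("tavily") with no second argument inspects the real process environment.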

Bash and Python

The bash() and python() tools enable execution of arbitrary shell commands and Python code, respectively. These tools require the use of a Sandbox Environment for the execution of untrusted code. For example, here is how you might use them in an evaluation where the model is asked to write code in order to solve capture the flag (CTF) challenges:

    from inspect_ai import Task, task
    from inspect_ai.scorer import includes
    from inspect_ai.solver import generate, system_message, use_tools
    from inspect_ai.tool import bash, python

    CMD_TIMEOUT = 180  # seconds

    @task
    def intercode_ctf():
        return Task(
            dataset=read_dataset(),  # dataset helper not shown here
            solver=[
                system_message("system.txt"),
                use_tools([
                    bash(CMD_TIMEOUT),
                    python(CMD_TIMEOUT)
                ]),
                generate(),
            ],
            scorer=includes(),
            message_limit=30,
            sandbox="docker",
        )

We specify a 3-minute timeout for execution of the bash and python tools to ensure that they don’t perform extremely long running operations.
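To illustrate what a command timeout guards against, here is a standalone sketch using only Python's standard library. It is not how Inspect runs commands (those execute inside the sandbox); the helper run_with_timeout() and its return message are hypothetical.

```python
import subprocess
import sys


def run_with_timeout(args, timeout):
    """Run a command, returning its stdout or a message if it ran too long."""
    try:
        result = subprocess.run(
            args, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child process before raising.
        return f"Command timed out after {timeout} seconds"
    return result.stdout


# A quick command finishes normally...
quick = run_with_timeout([sys.executable, "-c", "print('done')"], timeout=30)

# ...while a long-running one is cut off instead of stalling the caller.
slow = run_with_timeout(
    [sys.executable, "-c", "import time; time.sleep(5)"], timeout=0.5
)
```

Without a timeout, a single runaway command could block the rest of the sample; with one, the caller gets control back and can report the failure.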

See the Agents section for more details on how to build evaluations that allow models to take arbitrary actions over a longer time horizon.

Bash Session

The bash_session() tool provides a bash shell that retains its state across calls from the model (as distinct from the bash() tool, which executes each command in a fresh session). The prompt, working directory, and environment variables are all retained across calls. The tool also supports a restart action that enables the model to reset its state and work in a fresh session.

Note that a separate bash process is created within the sandbox for each instance of the bash session tool. See the bash_session() reference docs for details on customizing this behavior.
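The difference between a stateful session and a fresh shell per command can be seen in a small standalone sketch. This is only an illustration of the concept using a local sh process and the standard library; it is not the bash_session() implementation, and the ShellSession class, its sentinel string, and the variable name are all hypothetical.

```python
import subprocess


class ShellSession:
    """A long-lived `sh` process whose state persists between commands."""

    SENTINEL = "__COMMAND_DONE__"

    def __init__(self):
        self._start()

    def _start(self):
        self.proc = subprocess.Popen(
            ["sh"],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
        )

    def run(self, command):
        # Print a sentinel after the command so we know where its output ends.
        self.proc.stdin.write(f"{command}\necho\necho {self.SENTINEL}\n")
        self.proc.stdin.flush()
        lines = []
        for line in self.proc.stdout:
            if line.strip() == self.SENTINEL:
                break
            lines.append(line)
        return "".join(lines).strip()

    def restart(self):
        """Discard all state by replacing the shell process."""
        self.close()
        self._start()

    def close(self):
        self.proc.stdin.close()
        self.proc.wait()


session = ShellSession()
session.run("export INSPECT_DOC_DEMO_VAR=hello")
persisted = session.run("echo $INSPECT_DOC_DEMO_VAR")  # same shell: still set

# A fresh shell per command does not see the variable.
fresh = subprocess.run(
    ["sh", "-c", "echo $INSPECT_DOC_DEMO_VAR"], capture_output=True, text=True
).stdout.strip()

session.restart()
after_restart = session.run("echo $INSPECT_DOC_DEMO_VAR")  # state was reset
session.close()
```

The sketch assumes INSPECT_DOC_DEMO_VAR is not already set in the parent environment. It has no timeout handling, which is one reason a real session tool benefits from the timeout option described below.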

Configuration

Bash sessions require the use of a Sandbox Environment for the execution of untrusted code. In addition, you’ll need some dependencies installed in the sandbox container. Please see Sandbox Dependencies below for additional instructions.

You should add the following to your sandbox Dockerfile in order to use this tool:

    RUN apt-get update && apt-get install -y pipx && \
        apt-get clean && rm -rf /var/lib/apt/lists/* && \
        pipx ensurepath
    ENV PATH="$PATH:/root/.local/bin"
    RUN pipx install inspect-tool-support && inspect-tool-support post-install

Note that Playwright (used for the web_browser() tool) does not support some versions of Linux (e.g. Kali Linux). If this is the case for your Linux distribution, you should add the --no-web-browser option to the post-install command:

    RUN inspect-tool-support post-install --no-web-browser

If you don’t have a custom Dockerfile, you can alternatively use the pre-built aisiuk/inspect-tool-support image:

compose.yaml

    services:
      default:
        image: aisiuk/inspect-tool-support
        init: true

Task Setup

A task configured to use the bash session tool might look like this:

    from inspect_ai import Task, task
    from inspect_ai.scorer import includes
    from inspect_ai.solver import generate, system_message, use_tools
    from inspect_ai.tool import bash_session

    @task
    def intercode_ctf():
        return Task(
            dataset=read_dataset(),
            solver=[
                system_message("system.txt"),
                use_tools([bash_session(timeout=180)]),
                generate(),
            ],
            scorer=includes(),
            sandbox=("docker", "compose.yaml")
        )

Note that we provide a timeout for bash session commands (this is a best practice to guard against extremely long running commands).

Text Editor

The text_editor() tool enables viewing, creating and editing text files. The tool supports editing files within a protected Sandbox Environment, so tasks that use the text editor should have a sandbox defined and configured as described below.

Configuration

The text editor tool requires the use of a Sandbox Environment. In addition, you’ll need some dependencies installed in the sandbox container. Please see Sandbox Dependencies below for additional instructions.

You should add the following to your sandbox Dockerfile in order to use this tool:

    RUN apt-get update && apt-get install -y pipx && \
        apt-get clean && rm -rf /var/lib/apt/lists/* && \
        pipx ensurepath
    ENV PATH="$PATH:/root/.local/bin"
    RUN pipx install inspect-tool-support && inspect-tool-support post-install

Note that Playwright (used for the web_browser() tool) does not support some versions of Linux (e.g. Kali Linux). If this is the case for your Linux distribution, you should add the --no-web-browser option to the post-install command:

    RUN inspect-tool-support post-install --no-web-browser

If you don’t have a custom Dockerfile, you can alternatively use the pre-built aisiuk/inspect-tool-support image:

compose.yaml

    services:
      default:
        image: aisiuk/inspect-tool-support
        init: true

Task Setup

A task configured to use the text editor tool might look like this (note that this task is also configured to use the bash_session() tool):

    from inspect_ai import Task, task
    from inspect_ai.scorer import includes
    from inspect_ai.solver import generate, system_message, use_tools
    from inspect_ai.tool import bash_session, text_editor

    @task
    def intercode_ctf():
        return Task(
            dataset=read_dataset(),
            solver=[
                system_message("system.txt"),
                use_tools([
                    bash_session(timeout=180),
                    text_editor(timeout=180)
                ]),
                generate(),
            ],
            scorer=includes(),
            sandbox=("docker", "compose.yaml")
        )

Note that we provide a timeout for the bash session and text editor tools (this is a best practice to guard against extremely long running commands).

Tool Binding

The schema for the text_editor() tool is based on the standard Anthropic text editor tool type. The text_editor() tool works with all models that support tool calling, but when using Claude, the text editor tool will automatically bind to the native Claude tool definition.

Web Browser

The web browser tools provide models with the ability to browse the web using a headless Chromium browser. Navigation, history, and mouse/keyboard interactions are all supported.

Configuration

Under the hood, the web browser is an instance of Chromium orchestrated by Playwright, and runs in a Sandbox Environment. In addition, you’ll need some dependencies installed in the sandbox container. Please see Sandbox Dependencies below for additional instructions.

Note that Playwright (used for the web_browser() tool) does not support some versions of Linux (e.g. Kali Linux).

You should add the following to your sandbox Dockerfile in order to use this tool:

    RUN apt-get update && apt-get install -y pipx && \
        apt-get clean && rm -rf /var/lib/apt/lists/* && \
        pipx ensurepath
    ENV PATH="$PATH:/root/.local/bin"
    RUN pipx install inspect-tool-support && inspect-tool-support post-install

If you don’t have a custom Dockerfile, you can alternatively use the pre-built aisiuk/inspect-tool-support image:

compose.yaml

    services:
      default:
        image: aisiuk/inspect-tool-support
        init: true

Task Setup

A task configured to use the web browser tools might look like this:

    from inspect_ai import Task, task
    from inspect_ai.scorer import match
    from inspect_ai.solver import generate, use_tools
    from inspect_ai.tool import bash, python, web_browser

    @task
    def browser_task():
        return Task(
            dataset=read_dataset(),
            solver=[
                use_tools([bash(), python()] + web_browser()),
                generate(),
            ],
            scorer=match(),
            sandbox=("docker", "compose.yaml"),
        )

Unlike some other tool functions like bash(), the web_browser() function returns a list of tools. Therefore, we concatenate it with a list of the other tools we are using in the call to use_tools().

Note that a separate web browser process is created within the sandbox for each instance of the web browser tool. See the web_browser() reference docs for details on customizing this behavior.

Browsing

If you review the transcripts of a sample with access to the web browser tool, you’ll notice that there are several distinct tools made available for control of the web browser. These tools include:

Tool                                        Description
web_browser_go(url)                         Navigate the web browser to a URL.
web_browser_click(element_id)               Click an element on the page currently displayed by the web browser.
web_browser_type(element_id)                Type text into an input on a web browser page.
web_browser_type_submit(element_id, text)   Type text into a form input on a web browser page and press ENTER to submit the form.
web_browser_scroll(direction)               Scroll the web browser up or down by one page.
web_browser_forward()                       Navigate the web browser forward in the browser history.
web_browser_back()                          Navigate the web browser back in the browser history.
web_browser_refresh()                       Refresh the current page of the web browser.

The return value of each of these tools is a web accessibility tree for the page, which provides a clean view of the content, links, and form fields available on the page (you can look at the accessibility tree for any web page using Chrome Developer Tools).

Disabling Interactions

You can use the web browser tools with page interactions disabled by specifying interactive=False, for example:

    use_tools(web_browser(interactive=False))

In this mode, the interactive tools (web_browser_click(), web_browser_type(), and web_browser_type_submit()) are not made available to the model.
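The following sketch models the two points made in this section with plain strings standing in for tools: web_browser() yields a list that is concatenated with other tools, and interactive=False removes the three interactive tools. The function web_browser_tool_names() is a hypothetical stand-in, not the real web_browser(); the tool names are taken from the table above.

```python
# Tool names from the table above, in the same order.
ALL_BROWSER_TOOLS = [
    "web_browser_go",
    "web_browser_click",
    "web_browser_type",
    "web_browser_type_submit",
    "web_browser_scroll",
    "web_browser_forward",
    "web_browser_back",
    "web_browser_refresh",
]

# The tools withheld from the model when interactions are disabled.
INTERACTIVE_TOOLS = {
    "web_browser_click",
    "web_browser_type",
    "web_browser_type_submit",
}


def web_browser_tool_names(interactive=True):
    """Stand-in for web_browser(): returns a list, not a single tool."""
    if interactive:
        return list(ALL_BROWSER_TOOLS)
    return [name for name in ALL_BROWSER_TOOLS if name not in INTERACTIVE_TOOLS]


# Because the result is a list, it is concatenated rather than nested.
tools = ["bash", "python"] + web_browser_tool_names(interactive=False)
```

With interactions disabled, the model can still navigate, scroll, and move through history, but cannot click or type into pages.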

Computer

The computer() tool provides models with a computer desktop environment along with the ability to view the screen and perform mouse and keyboard gestures.

The computer tool works with any model that supports image input. It also binds directly to the internal computer tool definitions for Anthropic and OpenAI models tuned for computer use (currently anthropic/claude-3-7-sonnet-latest and openai/computer-use-preview).

Configuration

The computer() tool runs within a Docker container. To use it with a task you need to reference the aisiuk/inspect-computer-tool image in your Docker compose file. For example:

compose.yaml

    services:
      default:
        image: aisiuk/inspect-computer-tool

You can configure the container to not have Internet access as follows:

compose.yaml

    services:
      default:
        image: aisiuk/inspect-computer-tool
        network_mode: none

Note that if you’d like to be able to view the model’s interactions with the computer desktop in real time, you will also need to do some port mapping to enable a VNC connection with the container. See the VNC Client section below for details on how to do this.

The aisiuk/inspect-computer-tool image is based on the ubuntu:22.04 image and includes the following additional applications pre-installed:

  • Firefox
  • VS Code
  • Xpdf
  • Xpaint
  • galculator

Task Setup

A task configured to use the computer tool might look like this:

    from inspect_ai import Task, task
    from inspect_ai.scorer import match
    from inspect_ai.solver import generate, use_tools
    from inspect_ai.tool import computer

    @task
    def computer_task():
        return Task(
            dataset=read_dataset(),
            solver=[
                use_tools([computer()]),
                generate(),
            ],
            scorer=match(),
            sandbox=("docker", "compose.yaml"),
        )

To evaluate the task with models tuned for computer use:

    inspect eval computer.py --model anthropic/claude-3-7-sonnet-latest
    inspect eval computer.py --model openai/computer-use-preview

Options

The computer tool supports the following options:

Option            Description
max_screenshots   The maximum number of screenshots to play back to the model as input. Defaults to 1 (set to None to have no limit).
timeout           Timeout in seconds for computer tool actions. Defaults to 180 (set to None for no timeout).

For example:

    solver=[
        use_tools([computer(max_screenshots=2, timeout=300)]),
        generate()
    ]
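As a rough mental model of max_screenshots, the sketch below trims a message history so that only the most recent screenshots are replayed. It is an assumption-laden illustration: the message dictionaries and limit_screenshots() are hypothetical, and the real tool may treat older screenshots differently (for example by replacing rather than dropping them).

```python
def limit_screenshots(messages, max_screenshots=1):
    """Keep all text messages but only the most recent N screenshots.

    max_screenshots=None means no limit, mirroring the option's description.
    """
    if max_screenshots is None:
        return list(messages)
    kept = []
    remaining = max_screenshots
    # Walk backwards so the newest screenshots are the ones retained.
    for message in reversed(messages):
        if message["type"] == "screenshot":
            if remaining <= 0:
                continue
            remaining -= 1
        kept.append(message)
    kept.reverse()
    return kept


history = [
    {"type": "text", "content": "open the file manager"},
    {"type": "screenshot", "content": "shot-1"},
    {"type": "text", "content": "click Documents"},
    {"type": "screenshot", "content": "shot-2"},
    {"type": "screenshot", "content": "shot-3"},
]
trimmed = limit_screenshots(history, max_screenshots=2)
```

Screenshots are large inputs, so replaying fewer of them keeps the context smaller, at the cost of the model seeing less of the visual history.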

Examples

Two of the Inspect examples demonstrate basic computer use:

  • computer: Three simple computing tasks as a minimal demonstration of computer use.

        inspect eval examples/computer

  • intervention: Computer task driven interactively by a human operator.

        inspect eval examples/intervention -T mode=computer --display conversation

VNC Client

You can use a VNC connection to the container to watch computer use in real time. This requires some additional port mapping in the Docker compose file. You can define dynamic port ranges for VNC (5900) and a browser-based noVNC client (6080) with the following ports entries:

compose.yaml

    services:
      default:
        image: aisiuk/inspect-computer-tool
        ports:
          - "5900"
          - "6080"

To connect to the container for a given sample, locate the sample in the Running Samples UI and expand the sample info panel at the top.

Click on the link for the noVNC browser client, or use a native VNC client to connect to the VNC port. Note that the VNC server will take a few seconds to start up, so you should give it some time and attempt to reconnect as required if the first connection fails.

The browser-based client provides a view-only interface. If you use a native VNC client (for example, Real VNC Viewer) you should also set it to “view only” so as to not interfere with the model’s use of the computer.

Approval

If the container you are using is connected to the Internet, you may want to configure human approval for a subset of computer tool actions. Here are the possible actions (specified using the action parameter to the computer tool):

  • key: Press a key or key-combination on the keyboard.
  • type: Type a string of text on the keyboard.
  • cursor_position: Get the current (x, y) pixel coordinate of the cursor on the screen.
  • mouse_move: Move the cursor to a specified (x, y) pixel coordinate on the screen.
      Example: execute(action="mouse_move", coordinate=(100, 200))
  • left_click: Click the left mouse button.
  • left_click_drag: Click and drag the cursor to a specified (x, y) pixel coordinate on the screen.
  • right_click: Click the right mouse button.
  • middle_click: Click the middle mouse button.
  • double_click: Double-click the left mouse button.
  • screenshot: Take a screenshot.

Here is an approval policy that requires approval for key combos (e.g. Enter or a shortcut) and mouse clicks:

approval.yaml

    approvers:
      - name: human
        tools:
          - computer(action='key'
          - computer(action='left_click'
          - computer(action='middle_click'
          - computer(action='double_click'

      - name: auto
        tools: "*"

Note that since this is a prefix match and there could be other arguments, we don’t end the tool match pattern with a closing parenthesis.
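To see why the patterns are left open-ended, here is a simplified sketch of prefix matching against a rendered tool call. It is an assumption for illustration only: render_call() and first_matching_approver() are hypothetical, and Inspect's actual matching rules and call rendering may differ.

```python
def render_call(tool, arguments):
    """Render a tool call as text, e.g. computer(action='key', text='Return')."""
    rendered = ", ".join(f"{name}={value!r}" for name, value in arguments.items())
    return f"{tool}({rendered})"


def first_matching_approver(policy, tool, arguments):
    """Return the name of the first approver whose pattern prefixes the call."""
    call = render_call(tool, arguments)
    for approver in policy:
        patterns = approver["tools"]
        if isinstance(patterns, str):
            patterns = [patterns]
        for pattern in patterns:
            if pattern == "*" or call.startswith(pattern):
                return approver["name"]
    return None


# The same policy as approval.yaml, written as Python data.
policy = [
    {
        "name": "human",
        "tools": [
            "computer(action='key'",
            "computer(action='left_click'",
            "computer(action='middle_click'",
            "computer(action='double_click'",
        ],
    },
    {"name": "auto", "tools": "*"},
]

key_press = first_matching_approver(
    policy, "computer", {"action": "key", "text": "Return"}
)
mouse_move = first_matching_approver(
    policy, "computer", {"action": "mouse_move", "coordinate": (100, 200)}
)
drag = first_matching_approver(
    policy, "computer", {"action": "left_click_drag", "coordinate": (5, 5)}
)
```

Under these assumptions the key press goes to the human approver even though it carries an extra text argument, while mouse_move falls through to auto. Because the pattern includes the closing quote after left_click, it does not accidentally capture left_click_drag.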

You can apply this policy using the --approval command line option:

    inspect eval computer.py --approval approval.yaml

Tool Binding

The computer tool’s schema is a superset of the standard Anthropic and OpenAI computer tool schemas. When using models tuned for computer use (currently anthropic/claude-3-7-sonnet-latest and openai/computer-use-preview) the computer tool will automatically bind to the native computer tool definitions (as this presumably provides improved performance).

If you want to experiment with bypassing the native computer tool types and just register the computer tool as a normal function-based tool, then specify the --no-internal-tools generation option as follows:

    inspect eval computer.py --no-internal-tools

Think

The think() tool provides models with the ability to include an additional thinking step as part of getting to their final answer.

Note that the think() tool is not a substitute for reasoning and extended thinking, but rather an alternate way of letting models express thinking that is better suited to some tool use scenarios.

Usage

You should read the original think tool article in its entirety to understand where and where not to use the think tool. In summary, good contexts for the think tool include:

  1. Tool output analysis. When models need to carefully process the output of previous tool calls before acting and might need to backtrack in their approach;
  2. Policy-heavy environments. When models need to follow detailed guidelines and verify compliance; and
  3. Sequential decision making. When each action builds on previous ones and mistakes are costly (often found in multi-step domains).

Use the think() tool alongside other tools like this:

    from inspect_ai import Task, task
    from inspect_ai.scorer import includes
    from inspect_ai.solver import generate, system_message, use_tools
    from inspect_ai.tool import bash_session, text_editor, think

    @task
    def intercode_ctf():
        return Task(
            dataset=read_dataset(),
            solver=[
                system_message("system.txt"),
                use_tools([
                    bash_session(timeout=180),
                    text_editor(timeout=180),
                    think()
                ]),
                generate(),
            ],
            scorer=includes(),
            sandbox=("docker", "compose.yaml")
        )

Tool Description

In the original think tool article (which was based on experimenting with Claude), the authors found that providing clear instructions on when and how to use the think() tool for the particular problem domain could sometimes be helpful. For example, here’s the prompt they used with SWE-Bench:

    from textwrap import dedent

    from inspect_ai import Task, task
    from inspect_ai.scorer import includes
    from inspect_ai.solver import generate, system_message, use_tools
    from inspect_ai.tool import bash_session, text_editor, think

    @task
    def swe_bench():

        tools = [
            bash_session(timeout=180),
            text_editor(timeout=180),
            think(dedent("""
                Use the think tool to think about something. It will not obtain
                new information or make any changes to the repository, but just
                log the thought. Use it when complex reasoning or brainstorming
                is needed. For example, if you explore the repo and discover
                the source of a bug, call this tool to brainstorm several unique
                ways of fixing the bug, and assess which change(s) are likely to
                be simplest and most effective. Alternatively, if you receive
                some test results, call this tool to brainstorm ways to fix the
                failing tests.
            """))
        ]

        return Task(
            dataset=read_dataset(),
            solver=[
                system_message("system.txt"),
                use_tools(tools),
                generate(),
            ],
            scorer=includes(),
            sandbox=("docker", "compose.yaml")
        )
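The prompt above describes a tool that obtains no new information and changes nothing, but only logs the thought. A minimal stand-in with that behavior can be sketched as follows. This is an illustration of the idea, not the think() implementation in Inspect; the function record_thought() and the thought_log list are hypothetical.

```python
# Thoughts recorded so far (a stand-in for the evaluation transcript).
thought_log = []


def record_thought(thought: str) -> str:
    """Log a thought and return nothing useful to the caller.

    The tool has no side effects beyond the log: it fetches no data and
    modifies no files, so its value lies in giving the model a place to
    reason between other tool calls.
    """
    thought_log.append(thought)
    return ""


result = record_thought("Two tests fail with the same KeyError; check the fixture first.")
```

Because the call returns an empty result, any benefit comes from the act of writing the thought down, which is why the wording of the tool description matters.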

System Prompt

The article also reports that when tool instructions are long and/or complex, including instructions about the think() tool in the system prompt can be more effective than placing them in the tool description itself.

Here’s an example of moving the custom think() prompt into the system prompt (note that this was not done in the article’s SWE-Bench experiment; this is merely an example):

    from textwrap import dedent

    from inspect_ai import Task, task
    from inspect_ai.scorer import includes
    from inspect_ai.solver import generate, system_message, use_tools
    from inspect_ai.tool import bash_session, text_editor, think

    @task
    def swe_bench():

        think_system_message = system_message(dedent("""
            Use the think tool to think about something. It will not obtain
            new information or make any changes to the repository, but just
            log the thought. Use it when complex reasoning or brainstorming
            is needed. For example, if you explore the repo and discover
            the source of a bug, call this tool to brainstorm several unique
            ways of fixing the bug, and assess which change(s) are likely to
            be simplest and most effective. Alternatively, if you receive
            some test results, call this tool to brainstorm ways to fix the
            failing tests.
        """))

        return Task(
            dataset=read_dataset(),
            solver=[
                system_message("system.txt"),
                think_system_message,
                use_tools([
                    bash_session(timeout=180),
                    text_editor(timeout=180),
                    think(),
                ]),
                generate(),
            ],
            scorer=includes(),
            sandbox=("docker", "compose.yaml")
        )

Note that the effectiveness of using the system prompt will vary considerably across tasks, tools, and models, so it should be the subject of experimentation.

