
Add multimodal tool outputs #149


Open


domenic wants to merge 3 commits into main from multimodal-tools


domenic commented Aug 28, 2025 • edited by pr-previewbot

(Note that the PR diff involves moving the whole tool use section down below the multimodal inputs section. The new parts are in the "Tool return values" subsection.)

Potential points of discussion:

- How do we feel about the `expectedOutputs` design I added here? It reuses existing types and patterns, so is kind of nice. And it could be expanded in the future with `expectedOutputs: { schema: ... }` for "Tool-calling: would output schemas be useful?" (#137). (It's slightly displeasing to have a nested object instead of matching MCP's `outputSchema` though.)
- In my example I used a non-object for my input schema. I wonder if that will actually work with our current implementations; has anyone tested?
- IDL bikeshedding: I renamed the `{ type, value }` tuple from `LanguageModelMessageContent` to `LanguageModelMessageContentChunk`, so that we could use `LanguageModelMessageContent` for the typedef of `string or { type, value }`. Does that seem OK? (It's unobservable to web content, like all dictionary and typedef names.)

~~Note that we should probably merge this after #148, and then we can add a forward-reference discussing the connection between avoiding concurrency and the mutex pattern I use here.~~ Done


domenic mentioned this pull request

Aug 28, 2025

Tool calling: return types? #138

Open


michaelwasserman commented Aug 28, 2025

Heads up @FrankLi-MSFT, @sushraja-msft, @bokand, @bwalderman; your thoughts are appreciated!

michaelwasserman approved these changes

Aug 28, 2025


michaelwasserman left a comment


lgtm with mostly minor comments/qs, ty!

README.md Outdated


		Note how the output types need to be specified in the tool definition, so that session creation can fail early if the model doesn't support processing multimodal tool outputs. If the return value contains non-text components without them being present in the tool specification, then the tool call will fail at prompting time, even if the model could support it.

		Similarly, expected output languages can be provided (via `expectedOutputs: { languages: ["ja"] }`) or similar, to get an early failure if the model doesn't support processing tool outputs in those languages. However, unlike modalities, there is no prompt-time checking of the tool call result's languages.

This comment was marked as resolved.


michaelwasserman Aug 28, 2025


Can you clarify "there is no prompt-time checking of the tool call result's languages"?

IIUC: impls needn't check the language of tool response strings against the expected set? Also, impls can (and probably should?) check the tool `expectedOutputs` languages against the specified `expectedInputLanguages` in the call to `create()`, right?


domenic Sep 1, 2025


> IIUC: impls needn't check the language of tool response strings against the expected set?

Yes, that's what I meant. In more detail:

- If you have `expectedOutputs: { types: ["text"] }`, or just omit `expectedOutputs` so you get the default of only-text, and then your tool returns `[{ type: "image", value: whatever }]`, the implementation will fail the tool call.
- However, if you have `expectedOutputs: { languages: ["ja"] }`, and then your tool returns `"Hello this is English"`, the implementation will not fail your tool call.
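
The two cases above could be sketched roughly as follows. This is only an illustration of the described behavior, not the Prompt API or any real implementation: `checkToolResult` is a hypothetical helper, the `expectedOutputs` shape is taken from this discussion, and the use of `TypeError` for a failed tool call is an assumption (the thread does not say which error that is).

```js
// Hypothetical helper: mimics the prompt-time check described above.
function checkToolResult(expectedOutputs, result) {
  // Text is always allowed; other modalities must be declared up front.
  const allowedTypes = new Set(["text", ...(expectedOutputs?.types ?? [])]);

  // Treat a plain string return value as a single text chunk.
  const chunks =
    typeof result === "string" ? [{ type: "text", value: result }] : result;

  for (const chunk of chunks) {
    if (!allowedTypes.has(chunk.type)) {
      // Assumed error type; the discussion only says the tool call fails.
      throw new TypeError(`Undeclared tool output type "${chunk.type}"`);
    }
  }

  // expectedOutputs.languages is deliberately never consulted here:
  // the language of returned text is not checked at prompt time.
  return chunks;
}

// Declared Japanese, returned English: passes.
checkToolResult({ languages: ["ja"] }, "Hello this is English");

// Declared text only, returned an image: fails.
try {
  checkToolResult({ types: ["text"] }, [{ type: "image", value: null }]);
} catch (error) {
  console.log(error.message);
}
```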

> Also, impls can (and probably should?), check the tool expectedOutputs languages against the specified expectedInputLanguages in the call to create(), right?

I think they're separate. If your tool is a translation tool, for example, your expected prompt input languages and your expected tool output languages are quite different.


README.md

		minimum: 0,
		exclusiveMaximum: videoEl.duration
		},
		expectedOutputs: {


michaelwasserman Aug 28, 2025


I know we considered requiring expectedInputTypes to include the modalities returned by tools, should that be mentioned, and should this example follow that requirement/guidance?


domenic Sep 1, 2025


I don't think that's necessary. Similar to the above, prompt inputs and tool outputs are separate things. You seem to be thinking that tool outputs are a subset of prompt inputs, but I don't think that's the right model.

Both developer-supplied lists need to be checked to see if the overall prompt API implementation supports those modalities/languages. But one is not a subset of the other.

domenic added 2 commits

September 1, 2025 13:28

Add multimodal tool outputs

56caea0

Fix concatenation note

5bbaad4

domenic force-pushed the multimodal-tools branch from 69d7bbe to 5bbaad4

September 1, 2025 04:31

beaufortfrancois reviewed

Sep 2, 2025


README.md

		inputSchema: {
		type: "number",
		minimum: 0,
		exclusiveMaximum: videoEl.duration


beaufortfrancois Sep 2, 2025


I believe `video.currentTime = video.duration` is valid to get the last frame, so we should consider using `maximum` instead of `exclusiveMaximum`.
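
To make the difference concrete, here is a minimal sketch of how the two keywords treat the boundary value. It assumes current JSON Schema semantics, where `exclusiveMaximum` is a number; `matchesNumberSchema` is a hypothetical helper covering only the keywords in this example, not a real validator, and `duration` stands in for `videoEl.duration`.

```js
// Hypothetical helper: checks only the numeric keywords used in the example.
function matchesNumberSchema(schema, value) {
  if (typeof value !== "number" || Number.isNaN(value)) return false;
  if ("minimum" in schema && value < schema.minimum) return false;
  if ("maximum" in schema && value > schema.maximum) return false;
  if ("exclusiveMaximum" in schema && value >= schema.exclusiveMaximum) {
    return false;
  }
  return true;
}

const duration = 12.5; // stand-in for videoEl.duration

// With exclusiveMaximum, asking for the very last frame is rejected.
const exclusive = { type: "number", minimum: 0, exclusiveMaximum: duration };
console.log(matchesNumberSchema(exclusive, duration)); // false

// With maximum, the same timestamp is accepted.
const inclusive = { type: "number", minimum: 0, maximum: duration };
console.log(matchesNumberSchema(inclusive, duration)); // true
```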


README.md

		});
		```

		Note how the output types need to be specified in the tool definition, so that session creation can fail early if the model doesn't support processing multimodal tool outputs. If the return value contains non-text components without them being present in the tool specification, then the tool call will fail at prompting time, even if the model could support it.


beaufortfrancois Sep 2, 2025


Do we know already the type of error the session creation will fail with if the model doesn't support processing multimodal tool outputs?


domenic Sep 3, 2025


It would be a `"NotSupportedError"` `DOMException`. I'll incorporate that.


beaufortfrancois Sep 3, 2025


Thanks!
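
For readers following along, the early failure discussed in this thread might look something like the sketch below. Everything here except the `"NotSupportedError"` `DOMException` is an assumption: `SUPPORTED_TOOL_OUTPUT_TYPES`, `assertToolOutputsSupported`, and the `getVideoFrame` tool name are hypothetical, and the `expectedOutputs: { types: [...] }` shape follows the earlier discussion rather than a finalized API.

```js
// Hypothetical: modalities this imaginary implementation can process
// when they come back from a tool.
const SUPPORTED_TOOL_OUTPUT_TYPES = new Set(["text", "image"]);

// Hypothetical stand-in for the check performed during session creation.
function assertToolOutputsSupported(tools) {
  for (const tool of tools) {
    // Omitting expectedOutputs defaults to text only.
    for (const type of tool.expectedOutputs?.types ?? ["text"]) {
      if (!SUPPORTED_TOOL_OUTPUT_TYPES.has(type)) {
        throw new DOMException(
          `Tool "${tool.name}" expects unsupported output type "${type}"`,
          "NotSupportedError"
        );
      }
    }
  }
}

try {
  assertToolOutputsSupported([
    { name: "getVideoFrame", expectedOutputs: { types: ["audio"] } },
  ]);
} catch (error) {
  console.log(error.name); // "NotSupportedError"
}
```

The point of checking at creation time is that the developer learns about the unsupported modality before any prompt is sent or any tool is run.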

README.md

		const result = await session.prompt("Which of these locations currently has the highest temperature? Seattle, Tokyo, Berlin");
		```

		might call the above `"getWeather"` tool's `execute()` function three times. The model would wait for all tool call results to return, using the equivalent of `Promise.all()` internally, before it composes its final response.


beaufortfrancois Sep 2, 2025


If one of the tool calls fails, which error would be surfaced to the `prompt()` call?


domenic Sep 3, 2025


The error thrown by the tool. I think this is implied by the `Promise.all()` reference?


beaufortfrancois Sep 3, 2025


Then this means `session.prompt()` may fail with, for instance, a `NotSupportedError` that does not come from the prompt spec errors developers are currently expecting, but from the tool itself.
Is this a pattern that already exists in the web platform world?


domenic Sep 3, 2025


It would not fail with a `"NotSupportedError"` `DOMException`, unless that's what the web developer threw from their `execute()` function. It would fail with whatever exception the developers threw.

Rethrowing exceptions that developers throw is common, e.g., it's done by `setTimeout()` or other async scheduling functions.
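
The propagation described in this thread can be seen with plain `Promise.all()`. In the sketch below, `runToolCalls` is a hypothetical stand-in for whatever the implementation does internally, `WeatherServiceError` is a made-up developer-defined error, and the `execute({ location })` signature is assumed for illustration. The only thing it demonstrates is that the rejection is the developer's own error object, not something wrapped or replaced.

```js
// Hypothetical stand-in for how parallel tool calls might be awaited.
function runToolCalls(tool, argsList) {
  return Promise.all(argsList.map((args) => tool.execute(args)));
}

// Made-up error type a developer might throw from their tool.
class WeatherServiceError extends Error {}

const getWeather = {
  name: "getWeather",
  async execute({ location }) {
    if (location === "Tokyo") {
      throw new WeatherServiceError(`No data for ${location}`);
    }
    return `Sunny in ${location}`;
  },
};

runToolCalls(getWeather, [
  { location: "Seattle" },
  { location: "Tokyo" },
  { location: "Berlin" },
]).catch((error) => {
  // The first rejection surfaces unchanged.
  console.log(error instanceof WeatherServiceError); // true
  console.log(error.message); // "No data for Tokyo"
});
```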

types -> type

6e731c7

Co-authored-by: François Beaufort <beaufort.francois@gmail.com>


4 participants
