Add multimodal tool outputs #149

Open
domenic wants to merge 3 commits into main from multimodal-tools

@domenic (Contributor) commented Aug 28, 2025

(Note that the PR diff involves moving the whole tool use section down below the multimodal inputs section. The new parts are in the "Tool return values" subsection.)

Potential points of discussion:

  • How do we feel about the `expectedOutputs` design I added here? It reuses existing types and patterns, so is kind of nice. And it could be expanded in the future with `expectedOutputs: { schema: ... }` for "Tool-calling: would output schemas be useful?" (#137). (It's slightly displeasing to have a nested object instead of matching MCP's `outputSchema` though.)

  • In my example I used a non-object for my input schema. I wonder if that will actually work with our current implementations; has anyone tested?

  • IDL bikeshedding: I renamed the `{ type, value }` tuple from `LanguageModelMessageContent` to `LanguageModelMessageContentChunk`, so that we could use `LanguageModelMessageContent` for the typedef of `string or { type, value }`. Does that seem OK? (It's unobservable to web content, like all dictionary and typedef names.)

Note that we should probably merge this after #148, and then we can add a forward-reference discussing the connection between avoiding concurrency and the mutex pattern I use here. (Done.)



@michaelwasserman commented:

Heads up @FrankLi-MSFT, @sushraja-msft, @bokand, @bwalderman; your thoughts are appreciated!

@michaelwasserman left a comment

lgtm with mostly minor comments/qs, ty!

README.md Outdated

Note how the output types need to be specified in the tool definition, so that session creation can fail early if the model doesn't support processing multimodal tool outputs. If the return value contains non-text components without them being present in the tool specification, then the tool call will fail at prompting time, even if the model could support it.

Similarly, expected output languages can be provided (via `expectedOutputs: { languages: ["ja"] }` or similar), to get an early failure if the model doesn't support processing tool outputs in those languages. However, unlike modalities, there is no prompt-time checking of the tool call result's languages.

This comment was marked as resolved.


Can you clarify "there is no prompt-time checking of the tool call result's languages"?

IIUC: impls needn't check the language of tool response strings against the expected set? Also, impls can (and probably should?) check the tool `expectedOutputs` languages against the specified `expectedInputLanguages` in the call to `create()`, right?

@domenic (Contributor, Author) replied:

> IIUC: impls needn't check the language of tool response strings against the expected set?

Yes, that's what I meant. In more detail:

  • If you have `expectedOutputs: { types: ["text"] }`, or just omit `expectedOutputs` so you get the default of only-text, and then your tool returns `[{ type: "image", value: whatever }]`, the implementation will fail the tool call.

  • However, if you have `expectedOutputs: { languages: ["ja"] }`, and then your tool returns `"Hello this is English"`, the implementation will not fail your tool call.

> Also, impls can (and probably should?) check the tool `expectedOutputs` languages against the specified `expectedInputLanguages` in the call to `create()`, right?

I think they're separate. If your tool is a translation tool, for example, your expected prompt input languages and your expected tool output languages are quite different.
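The asymmetry described in this reply can be sketched roughly as follows. This is only an illustration, not the implementation: `checkToolResult` is a hypothetical helper, and the use of a plain `Error` is a placeholder because the thread does not say which exception a failed tool call produces.

```js
// Hypothetical helper, not part of the Prompt API. It illustrates that
// declared output types are enforced at prompting time, while declared
// output languages are not.
function checkToolResult(expectedOutputs = {}, result) {
  // Omitting `types` gives the default of text only.
  const allowedTypes = new Set(expectedOutputs.types ?? ["text"]);

  // A bare string is treated as a single text chunk.
  const chunks =
    typeof result === "string" ? [{ type: "text", value: result }] : result;

  for (const chunk of chunks) {
    if (!allowedTypes.has(chunk.type)) {
      // Placeholder error type (assumption).
      throw new Error(`Tool returned undeclared output type "${chunk.type}"`);
    }
  }

  // `expectedOutputs.languages` is deliberately not consulted here.
  return chunks;
}

// English text passes even though only Japanese was declared.
checkToolResult({ languages: ["ja"] }, "Hello this is English");
```

Passing `[{ type: "image", value: someImage }]` with `{ types: ["text"] }` would throw in this sketch, matching the first bullet above.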

    minimum: 0,
    exclusiveMaximum: videoEl.duration
  },
  expectedOutputs: {


I know we considered requiring expectedInputTypes to include the modalities returned by tools, should that be mentioned, and should this example follow that requirement/guidance?

@domenic (Contributor, Author) replied:

I don't think that's necessary. Similar to the above, prompt inputs and tool outputs are separate things. You seem to be thinking that tool outputs are a subset of prompt inputs, but I don't think that's the right model.

Both developer-supplied lists need to be checked to see if the overall prompt API implementation supports those modalities/languages. But one is not a subset of the other.
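One way to picture "both lists are checked, but neither is a subset of the other" is the sketch below. Everything in it is an assumption made for illustration: the `supported` table, the `findUnsupported` helper, and the `expectedInputTypes` option name (borrowed from the wording in this thread, which may not match the final API shape).

```js
// Hypothetical support table for an imaginary implementation.
const supported = {
  inputTypes: new Set(["text", "audio"]),
  toolOutputTypes: new Set(["text", "image"]),
};

// Hypothetical helper: checks each developer-supplied list against what the
// implementation supports. The lists are never compared with each other.
function findUnsupported({ expectedInputTypes = ["text"], tools = [] }) {
  const inputs = expectedInputTypes.filter(
    (type) => !supported.inputTypes.has(type)
  );
  const toolOutputs = tools
    .flatMap((tool) => tool.expectedOutputs?.types ?? ["text"])
    .filter((type) => !supported.toolOutputTypes.has(type));
  return { inputs, toolOutputs };
}

// A text-only prompt whose tool returns images: nothing is flagged, even
// though "image" does not appear among the prompt input types.
const report = findUnsupported({
  expectedInputTypes: ["text"],
  tools: [{ name: "getVideoFrame", expectedOutputs: { types: ["image"] } }],
});
console.log(report);
```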

  inputSchema: {
    type: "number",
    minimum: 0,
    exclusiveMaximum: videoEl.duration


I believe `video.currentTime = video.duration` is valid to get the last frame, so we should consider using `maximum` instead of `exclusiveMaximum`.
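The difference between the two keywords can be seen with a tiny bounds check. This is a minimal sketch covering only `minimum`, `maximum`, and `exclusiveMaximum` for numbers; `satisfiesBounds` is a hypothetical helper, not a JSON Schema validator, and `duration` stands in for `videoEl.duration`.

```js
// Hypothetical helper: applies only the numeric bound keywords discussed here.
function satisfiesBounds(schema, value) {
  if (typeof value !== "number") return false;
  if (schema.minimum !== undefined && value < schema.minimum) return false;
  if (schema.maximum !== undefined && value > schema.maximum) return false;
  if (
    schema.exclusiveMaximum !== undefined &&
    value >= schema.exclusiveMaximum
  ) {
    return false;
  }
  return true;
}

const duration = 12.5; // stand-in for videoEl.duration (made-up value)

// With exclusiveMaximum, a timestamp equal to the duration is rejected.
const exclusive = { type: "number", minimum: 0, exclusiveMaximum: duration };
const rejectsEnd = !satisfiesBounds(exclusive, duration);

// With maximum, the same timestamp is accepted.
const inclusive = { type: "number", minimum: 0, maximum: duration };
const acceptsEnd = satisfiesBounds(inclusive, duration);
```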

});
```

Note how the output types need to be specified in the tool definition, so that session creation can fail early if the model doesn't support processing multimodal tool outputs. If the return value contains non-text components without them being present in the tool specification, then the tool call will fail at prompting time, even if the model could support it.


Do we know already the type of error the session creation will fail with if the model doesn't support processing multimodal tool outputs?

@domenic (Contributor, Author) replied:

It would be a `"NotSupportedError"` `DOMException`. I'll incorporate that.


Thanks!
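From the caller's side, handling that early failure might look like the sketch below. `createSessionSketch` is a hypothetical stand-in for session creation (the real entry point is not reproduced here), and the set of supported tool output types is passed in explicitly only to keep the example self-contained. It assumes an environment where `DOMException` is a global, such as a browser or a recent Node.js.

```js
// Hypothetical stand-in for session creation. It rejects early when a tool
// declares an output type the (imaginary) model cannot process.
async function createSessionSketch({ tools = [] }, supportedToolOutputTypes) {
  for (const tool of tools) {
    for (const type of tool.expectedOutputs?.types ?? ["text"]) {
      if (!supportedToolOutputTypes.has(type)) {
        throw new DOMException(
          `Tool "${tool.name}" declares unsupported output type "${type}"`,
          "NotSupportedError"
        );
      }
    }
  }
  return { tools };
}

// Callers can branch on the exception name and fall back to text-only tools.
async function createWithFallback(options, textOnlyOptions, supportedTypes) {
  try {
    return await createSessionSketch(options, supportedTypes);
  } catch (e) {
    if (e.name !== "NotSupportedError") throw e;
    return createSessionSketch(textOnlyOptions, supportedTypes);
  }
}
```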

const result = await session.prompt("Which of these locations currently has the highest temperature? Seattle, Tokyo, Berlin");
```

might call the above `"getWeather"` tool's `execute()` function three times. The model would wait for all tool call results to return, using the equivalent of `Promise.all()` internally, before it composes its final response.
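A rough sketch of that waiting behavior, under stated assumptions: the `getWeather` tool below is a made-up stand-in (its argument shape and the temperatures are invented sample data), and `runToolCalls` only imitates the "wait for all results" step rather than any real model internals.

```js
// Made-up sample data and tool; not the tool definition from the README.
const sampleTemperatures = { Seattle: 18, Tokyo: 27, Berlin: 21 };

const getWeather = {
  name: "getWeather",
  async execute({ location }) {
    return JSON.stringify({ location, celsius: sampleTemperatures[location] });
  },
};

// Stand-in for the internal step: start every requested call, then wait
// for all of them before continuing.
function runToolCalls(tool, argsList) {
  return Promise.all(argsList.map((args) => tool.execute(args)));
}

const pendingResults = runToolCalls(getWeather, [
  { location: "Seattle" },
  { location: "Tokyo" },
  { location: "Berlin" },
]);

// Results arrive together, in the same order as the calls were issued.
pendingResults.then((results) => console.log(results));
```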


If one of the tool calls fails, which error would be surfaced to the `prompt()` call?

@domenic (Contributor, Author) replied:

The error thrown by the tool. I think this is implied by the `Promise.all()` reference?


Then this means `session.prompt()` may fail with, for instance, a `NotSupportedError` that does not come from the prompt spec errors developers currently expect, but from the tool itself.
Is this a pattern that already exists in the web platform world?

@domenic (Contributor, Author) replied:

It would not fail with a `"NotSupportedError"` `DOMException`, unless that's what the web developer threw from their `execute()` function. It would fail with whatever exception the developers threw.

Rethrowing exceptions that developers throw is common, e.g., it's done by `setTimeout()` or other async scheduling functions.
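The propagation described here follows from how `Promise.all()` rejects. In the sketch below, `promptSketch` is a hypothetical stand-in for the part of `prompt()` that awaits tool results, and `UpstreamUnavailableError` is an invented developer-defined error class.

```js
// Invented developer-defined error, thrown from one tool call.
class UpstreamUnavailableError extends Error {}
const toolError = new UpstreamUnavailableError("weather service is down");

// Three stand-in tool calls; the second one fails.
const toolCalls = [
  async () => "18 °C",
  async () => {
    throw toolError;
  },
  async () => "21 °C",
];

// Hypothetical stand-in for prompt(): it awaits all tool calls and does not
// wrap or translate a failure.
async function promptSketch(calls) {
  const results = await Promise.all(calls.map((call) => call()));
  return `final answer based on ${results.length} tool results`;
}

promptSketch(toolCalls).catch((err) => {
  // The caller sees the very object the tool threw, not a prompt-specific error.
  console.log(err === toolError, err.constructor.name);
});
```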

beaufortfrancois reacted with thumbs up emoji
Co-authored-by: François Beaufort <beaufort.francois@gmail.com>

Reviewers

@beaufortfrancois left review comments

@michaelwasserman approved these changes


4 participants

@domenic @michaelwasserman @beaufortfrancois
