Add auto-evaluation capability to the playground, allowing users to run evaluations on inference outputs directly from the playground interface. This feature is behind a feature flag and disabled by default.

Feature Flag

To enable playground evaluations, set the environment variable:

VITE_TENSORZERO_FF_PLAYGROUND_EVALS=1
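
As a rough illustration of how such a flag could be checked, here is a minimal sketch. The helper name `isPlaygroundEvalsEnabled` is hypothetical and is not taken from this PR, and treating every value other than "1" as disabled is an assumption. In a Vite app, variables prefixed with VITE_ are exposed on `import.meta.env`, so that object could be passed in as `env`.

```typescript
// Hypothetical helper, not the PR's actual implementation.
// `env` is any key/value bag, for example Vite's import.meta.env.
type Env = Record<string, unknown>;

function isPlaygroundEvalsEnabled(env: Env): boolean {
  // Assumption: only the documented value "1" enables the feature.
  // An unset variable leaves the feature disabled, matching the default.
  return env["VITE_TENSORZERO_FF_PLAYGROUND_EVALS"] === "1";
}
```

Taking the environment as a parameter keeps the check easy to unit test without a Vite build.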

Backend Changes

Update tensorzero-node to support datapointIds for evaluating specific datapoints
Add internalDynamicVariantConfig parameter for evaluating edited (non-builtin) variants
Add runNativeEvaluationStreaming wrapper in native_client.server.ts

API

Add /api/playground/evaluate streaming endpoint using NDJSON format
Configure cache modes: inference uses cache="on" (read from cache); evaluation also uses the cache for determinism
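
For readers unfamiliar with NDJSON (one JSON value per line), the following is a minimal sketch of how a client could decode a streamed response chunk by chunk. The class name `NdjsonDecoder` is hypothetical, and the event shape emitted by /api/playground/evaluate is not specified here, so decoded values are typed as `unknown`.

```typescript
// Hypothetical decoder, not code from this PR.
// Network chunks can end in the middle of a line, so partial text is buffered.
class NdjsonDecoder {
  private buffer = "";

  // Feed one decoded text chunk; returns every JSON value completed by it.
  push(chunk: string): unknown[] {
    this.buffer += chunk;
    const lines = this.buffer.split("\n");
    // The last element is an unfinished line, or "" if the chunk ended on "\n".
    this.buffer = lines.pop() ?? "";
    return lines
      .filter((line) => line.trim() !== "")
      .map((line) => JSON.parse(line) as unknown);
  }

  // Call once the stream has ended to parse a final line with no trailing newline.
  flush(): unknown[] {
    const rest = this.buffer.trim();
    this.buffer = "";
    return rest === "" ? [] : [JSON.parse(rest) as unknown];
  }
}
```

A caller would typically read text chunks from the response body, pass each one to `push`, and call `flush` when the stream closes.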

UI Components

Add EvaluationCombobox component for selecting evaluations with evaluator names annotation
Add MetricBadge component for displaying evaluation results with threshold-based red coloring
Add AnimatedEllipsis component for loading states
Refactor Combobox to support getItemAnnotation prop for showing annotations in both dropdown and input

Playground Integration

Add usePlaygroundEvaluation hook encapsulating evaluation streaming logic
Integrate evaluation selector into playground UI (behind feature flag)
Display evaluation metrics below each inference output
Support both builtin and edited variants
Auto-run evaluation when inference completes and evaluation is selected
Reset evaluation when function changes
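
The auto-run and reset rules listed above can be sketched as two pure functions. This is only an illustration: the `PlaygroundEvalState` shape and both function names are hypothetical, and the real usePlaygroundEvaluation hook may track state differently. Marking the inference as incomplete after a function change is also an assumption.

```typescript
// Hypothetical state shape, not the hook's actual state.
interface PlaygroundEvalState {
  functionName: string;
  selectedEvaluation: string | null; // null means no evaluation is selected
  inferenceComplete: boolean;
}

// Auto-run only when inference has completed and an evaluation is selected.
function shouldRunEvaluation(state: PlaygroundEvalState): boolean {
  return state.inferenceComplete && state.selectedEvaluation !== null;
}

// Evaluations are function-specific, so switching functions clears the selection.
function withFunction(
  state: PlaygroundEvalState,
  functionName: string,
): PlaygroundEvalState {
  if (functionName === state.functionName) return state;
  // Assumption: a new function also means no completed inference yet.
  return { functionName, selectedEvaluation: null, inferenceComplete: false };
}
```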

MetricBadge Threshold Coloring

Red background/text when value fails threshold (optimize="max" and value < cutoff, or optimize="min" and value > cutoff)
For booleans: optimize="max" wants true, optimize="min" wants false
Reuses isCutoffFailed from MetricValue for consistency
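
The threshold rule described above can be sketched as follows. This is not the actual isCutoffFailed from MetricValue, whose signature is not shown in this PR; the name `cutoffFailed` and its parameters are hypothetical, and ignoring the cutoff for boolean values is an assumption.

```typescript
type Optimize = "max" | "min";

// Hypothetical re-statement of the rule, not the real isCutoffFailed.
function cutoffFailed(
  value: number | boolean,
  optimize: Optimize,
  cutoff: number,
): boolean {
  if (typeof value === "boolean") {
    // Booleans: "max" wants true, "min" wants false (cutoff unused here).
    return optimize === "max" ? !value : value;
  }
  // Numbers: "max" fails below the cutoff, "min" fails above it.
  // A value exactly equal to the cutoff passes in both directions.
  return optimize === "max" ? value < cutoff : value > cutoff;
}
```

A badge component could then apply its red styling whenever this check returns true.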

Styling

Update muted-foreground color to be more neutral (remove blue tint)

Test plan

Set VITE_TENSORZERO_FF_PLAYGROUND_EVALS=1 and verify evaluation selector appears
Select an evaluation in playground and verify metrics appear below inference
Test with edited variants (should still work)
Test clear button (X) on evaluation selector
Verify evaluation re-runs when variant config changes
Verify evaluation resets when function changes
Test threshold coloring: values below cutoff (for max) or above cutoff (for min) show red

simeonlee added 2 commits

December 16, 2025 13:39

Add playground auto-evaluation support

4bd0d15

- Add EvaluationSelector component for selecting evaluations in playground
- Add /api/playground/evaluate streaming endpoint for running evaluations
- Add MetricBadge component for displaying evaluation results
- Update tensorzero-node to support datapointIds and dynamic variant configs
- Fix ThreadsafeFunction callback race condition by removing premature abort calls

Improve EvaluationSelector clear UX with X button

820a909

simeonlee changed the title from "Add playground auto-evaluation support" to "WIP: Add playground auto-evaluation support"

Dec 16, 2025

simeonlee marked this pull request as draft

December 16, 2025 18:44

simeonlee commented

Dec 16, 2025

internal/tensorzero-node/src/lib.rs (resolved)

simeonlee added 12 commits

December 16, 2025 14:03

Restore comments for refetch behavior and memo TODO

6864a61

Refactor to EvaluationCombobox using extended Combobox component

2a62312

- Extend Combobox with clearable and getItemSuffix props
- Add clear button support to ComboboxInput
- Create EvaluationCombobox using the extended Combobox
- Remove EvaluationSelector in favor of EvaluationCombobox

Simplify MetricBadge for PR1, fix comment

3d7b63d

Remove period from empty message

3a36905

Remove chevron transition, default monospace to true

5aa698f

Remove monospace prop, inline font-mono

8208d9d

Add MetricBadge storybook stories

c84504a

Rename suffix to annotation, show in input, fix X color, neutral muted-foreground

a8cdb60

Annotation in combobox input, fix alignment and spacing, neutral muted-foreground

89305cb

Fix annotation alignment in combobox input

67b7215

Improve MetricBadge styling: darker text, more gap

bd21fe1

Fix cache modes to ensure deterministic playground evaluation

200e980

- Inference: write_only (always fresh, write to cache for eval)
- Evaluation: read_only (use cached inference result)

This ensures the evaluation uses the exact same inference output that was displayed in the playground.

simeonlee marked this pull request as ready for review

December 16, 2025 20:17

simeonlee changed the title from "WIP: Add playground auto-evaluation support" to "Add playground auto-evaluation support"

Dec 16, 2025

simeonlee added 4 commits

December 16, 2025 15:22

Add e2e test for playground evaluation feature

cc6a0eb

Tests that selecting an evaluation in the playground shows evaluation results (evaluator names) after inference completes.

Sort evaluation badges alphabetically

95f499c

Clear evaluation when function changes, sort lists alphabetically

37e1f2c

- Clear evaluation param when function changes (evaluations are function-specific)
- Sort evaluation names in dropdown alphabetically
- Sort evaluator names in annotations alphabetically

Fix comment accuracy: dataset may not have datapoints for new function

9e562ef

virajmehta commented Dec 16, 2025

/regen-fixtures

Regenerate ModelInferenceCache fixtures

1ef2a63

virajmehta requested changes

Dec 16, 2025

internal/tensorzero-node/src/lib.rs (resolved)

ui/app/routes/playground/route.tsx (resolved)

ui/app/routes/playground/route.tsx (outdated, resolved)

ui/app/routes/playground/DatapointPlaygroundOutput.tsx (outdated, resolved)

ui/app/routes/api/playground/evaluate.ts (resolved)

ui/app/routes/api/tensorzero/inference.utils.tsx (outdated, resolved)

ui/e2e_tests/playground.spec.ts (outdated, resolved)

simeonlee added 4 commits

December 16, 2025 17:12

Add feature flag for playground evaluations

0cdb136

- Add VITE_TENSORZERO_FF_PLAYGROUND_EVALS feature flag
- Conditionally render EvaluationCombobox when flag is enabled
- Enable flag by default in e2e CI docker-compose

Address remaining PR feedback

9994a68

- Extract evaluation logic into usePlaygroundEvaluation hook
- Add documentation to evaluate.ts endpoint
- Revert inference cache change (keep cache="on" for now)
- Remove @credentials from e2e test (uses cached inferences)

Remove e2e test for feature-flagged playground evaluation

d1c40c4

Add threshold-based red coloring to MetricBadge

63ad790

- Add optimize and cutoff props for threshold comparison
- Import and reuse isCutoffFailed from MetricValue
- Red coloring when value fails threshold (max: value < cutoff, min: value > cutoff)
- For booleans: max wants true, min wants false
- Add Storybook stories for threshold variants

Swap red colors in MetricBadge threshold styling

02804b7

Label now uses darker red-700, value uses lighter red-500

GabrielBianconi assigned virajmehta

Dec 17, 2025


Add playground auto-evaluation support #5224