Add playground auto-evaluation support #5224


Open

simeonlee wants to merge 24 commits into main from simeonlee/playground-auto-eval-1-core

Conversation

@simeonlee (Collaborator) commented on Dec 16, 2025 (edited)

Summary

Add auto-evaluation capability to the playground, allowing users to run evaluations on inference outputs directly from the playground interface. This feature is behind a feature flag and disabled by default.

Feature Flag

To enable playground evaluations, set the environment variable:

VITE_TENSORZERO_FF_PLAYGROUND_EVALS=1
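For orientation, here is a minimal sketch of how such a flag check could look in client code, assuming the standard Vite convention of exposing `VITE_`-prefixed variables via `import.meta.env`; the PR's actual gating code is not shown in this description.

```typescript
// Sketch only: assumes the flag is read through Vite's import.meta.env,
// as is standard for VITE_-prefixed variables.
const playgroundEvalsEnabled: boolean =
  import.meta.env.VITE_TENSORZERO_FF_PLAYGROUND_EVALS === "1";

// Only render the evaluation selector when the flag is set; the feature is
// disabled by default.
export function whenPlaygroundEvalsEnabled<T>(render: () => T): T | null {
  return playgroundEvalsEnabled ? render() : null;
}
```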

Backend Changes

  • Update tensorzero-node to support datapointIds for evaluating specific datapoints
  • Add internalDynamicVariantConfig parameter for evaluating edited (non-builtin) variants
  • Add runNativeEvaluationStreaming wrapper in native_client.server.ts (see the sketch after this list)
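The description doesn't include the actual tensorzero-node signatures, so the following is only a rough TypeScript sketch of how a streaming wrapper with the new parameters could be shaped; everything beyond the names datapointIds, internalDynamicVariantConfig, and runNativeEvaluationStreaming is an illustrative assumption.

```typescript
// Hypothetical shapes only; not the real tensorzero-node API surface.
interface PlaygroundEvaluationRequest {
  evaluationName: string;
  // Evaluate only these specific datapoints instead of a whole dataset.
  datapointIds?: string[];
  // Config for an edited (non-builtin) variant being evaluated.
  internalDynamicVariantConfig?: Record<string, unknown>;
}

// A wrapper like runNativeEvaluationStreaming could expose results as an
// async iterator so the UI can render evaluator outputs as they arrive.
async function* runEvaluationStreamingSketch(
  request: PlaygroundEvaluationRequest,
): AsyncGenerator<Record<string, unknown>> {
  // The real wrapper would delegate to the native client; this placeholder
  // event only illustrates the streaming shape.
  yield { status: "started", evaluation: request.evaluationName };
}
```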

API

  • Add /api/playground/evaluate streaming endpoint using the NDJSON format (see the sketch after this list)
  • Configure cache modes: inference uses on (read from cache), and evaluation also uses the cache for determinism
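Because the endpoint streams NDJSON (one JSON object per line), a client can read the response body incrementally. Below is a minimal consumer sketch: the route and line-delimited format come from this PR, while the request body and per-line event shape are assumptions.

```typescript
// Minimal NDJSON consumer sketch for the /api/playground/evaluate endpoint.
async function streamPlaygroundEvaluation(body: unknown): Promise<void> {
  const response = await fetch("/api/playground/evaluate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  });
  const reader = response.body!.getReader();
  const decoder = new TextDecoder();
  let buffer = "";
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split("\n");
    buffer = lines.pop() ?? ""; // keep any partial trailing line for the next chunk
    for (const line of lines) {
      if (line.trim().length > 0) {
        const event = JSON.parse(line); // one JSON event per NDJSON line
        console.log(event);
      }
    }
  }
}
```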

UI Components

  • Add EvaluationCombobox component for selecting evaluations, annotated with evaluator names
  • Add MetricBadge component for displaying evaluation results with threshold-based red coloring
  • Add AnimatedEllipsis component for loading states
  • Refactor Combobox to support a getItemAnnotation prop for showing annotations in both the dropdown and the input (see the sketch after this list)
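For illustration, the getItemAnnotation idea could be typed roughly as below; the real Combobox prop types are not shown in this PR description, so treat this shape as an assumption.

```typescript
// Illustrative prop shape only; not the actual Combobox implementation.
interface ComboboxPropsSketch<T> {
  items: T[];
  getItemLabel: (item: T) => string;
  // Optional annotation rendered next to an item in both the dropdown and
  // the selected-value input, e.g. an evaluation's evaluator names.
  getItemAnnotation?: (item: T) => string | undefined;
}
```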

Playground Integration

  • Add usePlaygroundEvaluation hook encapsulating the evaluation streaming logic
  • Integrate evaluation selector into the playground UI (behind the feature flag)
  • Display evaluation metrics below each inference output
  • Support both builtin and edited variants
  • Auto-run evaluation when inference completes and an evaluation is selected
  • Reset evaluation when the function changes (see the sketch after this list)
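The auto-run and reset rules can be summarized as follows; this is a standalone sketch of the described behavior with illustrative names, not the actual usePlaygroundEvaluation hook.

```typescript
// Standalone sketch of the auto-run/reset rules described above.
interface PlaygroundEvalState {
  selectedEvaluation: string | null;
  functionName: string;
  inferenceComplete: boolean;
}

// Auto-run only when an evaluation is selected and inference has finished.
function shouldAutoRunEvaluation(state: PlaygroundEvalState): boolean {
  return state.selectedEvaluation !== null && state.inferenceComplete;
}

// Evaluations are function-specific, so changing the function clears the
// current selection and any pending results.
function onFunctionChange(
  state: PlaygroundEvalState,
  newFunctionName: string,
): PlaygroundEvalState {
  return {
    selectedEvaluation: null,
    functionName: newFunctionName,
    inferenceComplete: false,
  };
}
```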

MetricBadge Threshold Coloring

  • Red background/text when value fails threshold (optimize="max" and value < cutoff, or optimize="min" and value > cutoff)
  • For booleans: optimize="max" wants true, optimize="min" wants false
  • Reuses isCutoffFailed from MetricValue for consistency (see the sketch after this list)
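Spelled out, the threshold rule amounts to the check below; the PR itself reuses isCutoffFailed from MetricValue rather than reimplementing it, so this standalone version is only to make the logic explicit.

```typescript
// Standalone illustration of the threshold rule used for MetricBadge coloring.
type Optimize = "max" | "min";

function isCutoffFailedSketch(
  value: number | boolean,
  optimize: Optimize,
  cutoff?: number,
): boolean {
  if (typeof value === "boolean") {
    // optimize="max" wants true, optimize="min" wants false.
    return optimize === "max" ? !value : value;
  }
  if (cutoff === undefined) {
    return false; // no threshold configured, nothing to fail
  }
  return optimize === "max" ? value < cutoff : value > cutoff;
}
```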

Styling

  • Update muted-foreground color to be more neutral (remove blue tint)

Test plan

  • Set VITE_TENSORZERO_FF_PLAYGROUND_EVALS=1 and verify the evaluation selector appears
  • Select an evaluation in playground and verify metrics appear below inference
  • Test with edited variants (should still work)
  • Test clear button (X) on evaluation selector
  • Verify evaluation re-runs when variant config changes
  • Verify evaluation resets when function changes
  • Test threshold coloring: values below cutoff (for max) or above cutoff (for min) show red

- Add EvaluationSelector component for selecting evaluations in playground
- Add /api/playground/evaluate streaming endpoint for running evaluations
- Add MetricBadge component for displaying evaluation results
- Update tensorzero-node to support datapointIds and dynamic variant configs
- Fix ThreadsafeFunction callback race condition by removing premature abort calls
@simeonlee changed the title from "Add playground auto-evaluation support" to "WIP: Add playground auto-evaluation support" on Dec 16, 2025
@simeonlee marked this pull request as draft on December 16, 2025 18:44
- Extend Combobox with clearable and getItemSuffix props
- Add clear button support to ComboboxInput
- Create EvaluationCombobox using the extended Combobox
- Remove EvaluationSelector in favor of EvaluationCombobox
- Inference: write_only (always fresh, write to cache for eval)
- Evaluation: read_only (use cached inference result)

This ensures the evaluation uses the exact same inference output that was displayed in the playground.
@simeonlee marked this pull request as ready for review on December 16, 2025 20:17
@simeonlee changed the title from "WIP: Add playground auto-evaluation support" to "Add playground auto-evaluation support" on Dec 16, 2025
Tests that selecting an evaluation in the playground shows evaluation results (evaluator names) after inference completes.

- Clear evaluation param when function changes (evaluations are function-specific)
- Sort evaluation names in dropdown alphabetically
- Sort evaluator names in annotations alphabetically
@virajmehta (Member) commented:

/regen-fixtures


- Add VITE_TENSORZERO_FF_PLAYGROUND_EVALS feature flag
- Conditionally render EvaluationCombobox when flag is enabled
- Enable flag by default in e2e CI docker-compose
- Extract evaluation logic into usePlaygroundEvaluation hook
- Add documentation to evaluate.ts endpoint
- Revert inference cache change (keep cache="on" for now)
- Remove @credentials from e2e test (uses cached inferences)
- Add optimize and cutoff props for threshold comparison
- Import and reuse isCutoffFailed from MetricValue
- Red coloring when value fails threshold (max: value < cutoff, min: value > cutoff)
- For booleans: max wants true, min wants false
- Add Storybook stories for threshold variants
Label now uses darker red-700, value uses lighter red-500

Reviewers

@virajmehta requested changes

Requested changes must be addressed to merge this pull request.

Assignees

@virajmehta


3 participants

@simeonlee, @virajmehta
