Add playground auto-evaluation support #5224
- Add EvaluationSelector component for selecting evaluations in playground
- Add /api/playground/evaluate streaming endpoint for running evaluations
- Add MetricBadge component for displaying evaluation results
- Update tensorzero-node to support datapointIds and dynamic variant configs
- Fix ThreadsafeFunction callback race condition by removing premature abort calls
- Extend Combobox with clearable and getItemSuffix props
- Add clear button support to ComboboxInput
- Create EvaluationCombobox using the extended Combobox
- Remove EvaluationSelector in favor of EvaluationCombobox

- Inference: write_only (always fresh, write to cache for eval)
- Evaluation: read_only (use cached inference result)

This ensures the evaluation uses the exact same inference output that was displayed in the playground.

Tests that selecting an evaluation in the playground shows evaluation results (evaluator names) after inference completes.

- Clear evaluation param when function changes (evaluations are function-specific)
- Sort evaluation names in dropdown alphabetically
- Sort evaluator names in annotations alphabetically
virajmehta commented Dec 16, 2025
/regen-fixtures
- Add VITE_TENSORZERO_FF_PLAYGROUND_EVALS feature flag
- Conditionally render EvaluationCombobox when flag is enabled
- Enable flag by default in e2e CI docker-compose

- Extract evaluation logic into usePlaygroundEvaluation hook
- Add documentation to evaluate.ts endpoint
- Revert inference cache change (keep cache="on" for now)
- Remove @credentials from e2e test (uses cached inferences)

- Add optimize and cutoff props for threshold comparison
- Import and reuse isCutoffFailed from MetricValue
- Red coloring when value fails threshold (max: value < cutoff, min: value > cutoff)
- For booleans: max wants true, min wants false
- Add Storybook stories for threshold variants
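The threshold rule in the commit above can be sketched as a small predicate. This is a hypothetical stand-in, not the actual isCutoffFailed from MetricValue: the name cutoffFailed, the Optimize type, and the choice to judge booleans by direction alone (without consulting the cutoff) are assumptions made for illustration.

```typescript
// Hypothetical sketch of the threshold check; the real isCutoffFailed
// in MetricValue may have a different signature and behavior.
type Optimize = "max" | "min";

function cutoffFailed(
  value: number | boolean,
  optimize: Optimize,
  cutoff?: number,
): boolean {
  if (typeof value === "boolean") {
    // Assumption: booleans are judged by direction only.
    // "max" wants true, "min" wants false.
    return optimize === "max" ? !value : value;
  }
  // Assumption: with no cutoff configured, a number cannot fail.
  if (cutoff === undefined) return false;
  // "max": fail when value < cutoff; "min": fail when value > cutoff.
  return optimize === "max" ? value < cutoff : value > cutoff;
}

// Example: decide whether a badge should use the red styling.
const shouldBeRed = cutoffFailed(0.4, "max", 0.5);
console.log(shouldBeRed);
```

A value exactly equal to the cutoff passes in both directions under this reading, since the commit message states strict inequalities.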
Label now uses darker red-700, value uses lighter red-500
Summary
Add auto-evaluation capability to the playground, allowing users to run evaluations on inference outputs directly from the playground interface. This feature is behind a feature flag and disabled by default.
Feature Flag
To enable playground evaluations, set the environment variable VITE_TENSORZERO_FF_PLAYGROUND_EVALS=1.
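How the UI interprets VITE_TENSORZERO_FF_PLAYGROUND_EVALS is not shown in this description. A minimal sketch, assuming the flag counts as enabled only when its value is "1" (the value used in the test plan); the helper name isPlaygroundEvalsEnabled is hypothetical:

```typescript
// Hypothetical helper: treats the flag as enabled only for the literal "1".
// The accepted values are an assumption; the PR only shows "=1".
function isPlaygroundEvalsEnabled(raw: string | undefined): boolean {
  return raw !== undefined && raw.trim() === "1";
}

// Unset or any other value leaves the feature off, matching
// "disabled by default".
console.log(isPlaygroundEvalsEnabled(undefined));
console.log(isPlaygroundEvalsEnabled("1"));
```

In a Vite app, variables prefixed with VITE_ are exposed on import.meta.env, so the raw value would typically be read from there before being passed to a helper like this.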
Backend Changes
- datapointIds for evaluating specific datapoints
- internalDynamicVariantConfig parameter for evaluating edited (non-builtin) variants
- runNativeEvaluationStreaming wrapper in native_client.server.ts
API
- /api/playground/evaluate streaming endpoint using NDJSON format
- Inference cache on (read from cache); evaluation also uses cache for determinism
UI Components
- EvaluationCombobox component for selecting evaluations with evaluator names annotation
- MetricBadge component for displaying evaluation results with threshold-based red coloring
- AnimatedEllipsis component for loading states
- Extend Combobox to support getItemAnnotation prop for showing annotations in both dropdown and input
Playground Integration
- usePlaygroundEvaluation hook encapsulating evaluation streaming logic
MetricBadge Threshold Coloring
- Reuse isCutoffFailed from MetricValue for consistency
Styling
- Make muted-foreground color more neutral (remove blue tint)
Test plan
- Set VITE_TENSORZERO_FF_PLAYGROUND_EVALS=1 and verify evaluation selector appears
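
For reviewers unfamiliar with NDJSON, the streaming format named under API is one JSON object per line, and a client has to cope with chunks that end mid-line. The sketch below shows one way to buffer and parse such a stream. The class name NdjsonBuffer and the event fields in the example are hypothetical; the actual payload shape of /api/playground/evaluate is not described here.

```typescript
// Hypothetical helper: accumulates text chunks and yields parsed objects
// for every complete line. Not the PR's implementation.
class NdjsonBuffer {
  private pending = "";

  // Add a chunk and return the objects for all lines completed so far.
  push(chunk: string): unknown[] {
    this.pending += chunk;
    const lines = this.pending.split("\n");
    // The last element is an incomplete line (or "" if the chunk ended on "\n").
    this.pending = lines.pop() ?? "";
    return lines
      .filter((line) => line.trim() !== "")
      .map((line) => JSON.parse(line));
  }

  // Parse whatever remains once the stream has ended.
  flush(): unknown[] {
    const rest = this.pending.trim();
    this.pending = "";
    return rest === "" ? [] : [JSON.parse(rest)];
  }
}

// Example with made-up event fields, split across two chunks.
const buffer = new NdjsonBuffer();
const first = buffer.push('{"evaluator":"exact_match","value":1}\n{"evalu');
const second = buffer.push('ator":"judge","value":0.5}\n');
console.log(first.length, second.length, buffer.flush().length);
```

With fetch, the chunks would typically come from reading response.body through a TextDecoder and passing each decoded string to push.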