Create durable runner for top-k variant selection #5245


Draft


amishler wants to merge 40 commits into alan/evals-split-file-add-tests from alan/evals-topk-runner-rebased

+9,646 −1,688

Conversation


Member

amishler commented Dec 17, 2025

No description provided.

Aaron1011 and others added 30 commits

December 16, 2025 01:43

Only write streaming provider-proxy cache body on finish (#5200)

84da5c0

* Only write streaming provider-proxy cache body on finish

  We were previously writing the current body when the stream was dropped, even if it never finished (e.g. due to a timeout). As a result, we could end up writing an invalid body to disk due to an *earlier* request from the e2e tests (if the e2e tests hit an internal timeout and dropped the stream), overwriting the good cache line from a later successful test run.

* Wait for provider-proxy write before sending back response

  This should help prevent race conditions involving client-side retries.

* Fix typo

  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Fix clippy

* Forward Cloudflare R2 env vars into fixtures container

  Since these were unset inside the container, we were falling back to the slow dev-endpoint download, rather than the fast aws cli download.

* Install aws cli in fixtures Dockerfile

* added multi-threaded runtime for test that was exploding

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Viraj Mehta <viraj@tensorzero.com>
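The "write only on finish" behavior described in this commit message can be illustrated with a minimal Rust sketch. This is not the provider-proxy code: `CachedStream`, `push_chunk`, and `finish` are hypothetical names, and an in-memory `HashMap` stands in for the on-disk cache. The point is only that a stream dropped before completion leaves the existing cache entry untouched.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a streaming response whose body is cached.
/// The real provider-proxy writes to disk; a HashMap is used here for brevity.
struct CachedStream<'a> {
    key: String,
    body: Vec<u8>,
    cache: &'a mut HashMap<String, Vec<u8>>,
}

impl<'a> CachedStream<'a> {
    fn new(key: &str, cache: &'a mut HashMap<String, Vec<u8>>) -> Self {
        CachedStream { key: key.to_string(), body: Vec::new(), cache }
    }

    /// Accumulate a chunk in memory; nothing is written to the cache yet.
    fn push_chunk(&mut self, chunk: &[u8]) {
        self.body.extend_from_slice(chunk);
    }

    /// Commit the full body. This is the only place a cache write happens,
    /// so a stream that is dropped early never overwrites a good entry.
    fn finish(self) {
        let CachedStream { key, body, cache } = self;
        cache.insert(key, body);
    }
}
```

Because there is no `Drop` implementation that writes, abandoning a `CachedStream` (for example after a timeout) discards the partial body instead of persisting it.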

Bump to 2025.12.3 (#5206)

b96b7d2

added json schema derives to uninitialized variant and evaluation typ…

dab2c58

…es (#5189)

* added a get config route
* added migration for tags and missing file
* removed failing assertion
* added json schema derives to uninitialized variant and evaluation types
* made the migration public from tensorzero-core crate

Stream inferences and datapoints in UI (#5183)

5ba54a3

* Stream inferences and datapoints in UI
* Stream inferences and datapoints in UI
* Stream inferences and datapoints in UI
* Stream inferences and datapoints in UI
* Stream inferences and datapoints in UI

---------

Co-authored-by: Viraj Mehta <viraj@tensorzero.com>

Exclude more thought blocks in test_short_inference_request (#5211)

c6d3637

This should reduce the flakiness of this test (we were previously only discarding the thought blocks in some cases).

Add an action route (#5193)

9f23171

* added a get config route
* added migration for tags and missing file
* removed failing assertion
* made the migration public from tensorzero-core crate
* added generic action handler
* added e2e tests
* added rust client method
* use rust client in tests
* fixed PR comments

ci: enforce EOF newlines (#5187)

08841f2

Closes #4783.

Co-authored-by: Gabriel Bianconi <1275491+GabrielBianconi@users.noreply.github.com>

Add support for extra_headers/extra_body in relay mode (#5119)

7e4ebac

* Add support for extra_headers/extra_body in relay mode

  The relay gateway now forwards these options to the downstream gateway (after performing variant-level filtering on the relay gateway).

* Fix typo

  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Remove collect

  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Run fmt
* Add some unit tests
* Add more unit tests

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Make git an optional feature (#5161)

6777d51

Co-authored-by: Aaron Hill <aaron@tensorzero.com>

Add searchable evaluation and variant selectors to Launch Evaluation …

e04c2c3

…modal (#5197)

* Add searchable evaluation and variant selectors to Launch Evaluation modal
  - Add generic combobox UI components (Combobox, ComboboxInput, ComboboxContent, useCombobox hook)
  - Add EvaluationSelector component that wraps the generic Combobox
  - Add VariantSelector component that wraps the generic Combobox
  - Update LaunchEvaluationModal to use searchable selectors instead of static Select
  - Update Input component focus style to use border darkening instead of ring
* Add documentation comment to useCombobox hook
* Update e2e tests for Combobox selectors in Launch Evaluation modal

e2e tests: feedback: construct payloads using structs + serialize (#5191)

91553bf

* e2e tests: feedback: construct payloads using structs + serialize

  Contributes to #4710.

* rely on Serialize

---------

Co-authored-by: Aaron Hill <aaron@tensorzero.com>

Prevent duplicate fake-path templates from being loaded (#5209)

f1cc226

* Prevent duplicate fake-path templates from being loaded

  When loading templates from the config, any fake-path keys should be unique (these potentially come from agent-generated variant configs, which will have inline templates).

* Fix typo

  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Fix test

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Clean up Docker Compose for UI (#5180)

5e122af

* Clean up Docker Compose for UI
* Lower log level for config message

Tag filter should require the tag to be set (#5210)

bc18ea9

* Tag filter for != should still require the tag to be set
* Add test
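One way to read the intended `!=` semantics, as a small Rust sketch. The actual change is not shown on this page and presumably lives in query-building code; `tag_not_equal` and the `HashMap` representation of tags are assumptions made purely for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical helper: a `key != value` tag filter that only matches
/// rows where the tag is actually set.
fn tag_not_equal(tags: &HashMap<String, String>, key: &str, value: &str) -> bool {
    match tags.get(key) {
        // Tag is set: match when its value differs from the filter value.
        Some(actual) => actual.as_str() != value,
        // Tag is missing: do not match, even though "missing != value"
        // might naively look true.
        None => false,
    }
}
```

With tags `{"env": "prod"}`, the filter `env != dev` matches, `env != prod` does not, and `region != dev` does not match because `region` is unset.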

e2e tests: datapoints: construct payloads using structs + serialize (#…

0c6f90f

…5201)

Contributes to #4710.

Forward R2 credentials into all 'fixtures' docker compose services (#…

029558d

…5219)

Fix clippy: explicit_into_iter_loop (#5222)

0ea08b9

Factor some evaluation functionality into helpers (#5163)

af0d0e4

* Initial partial implementation of betting-based confidence sequences
* Fix bets to use previous time step's variance estimate
* Add updates to wealth processes
* Add last step of confidence sequence calculation from hedged wealth process
* Change product to log-sum-exp for numerical stability
* Add computation of point estimator for the mean
* Combine weighting and hedged wealth process computation
* Move bet calculation to helper function, add bet truncation
* Tweak comment string for clarity
* Add module-level and type-level docstrings
* Remove max vs combo hedging choice, explain use of max in docstring
* For computing mean estimate, restrict search to inside confidence interval for efficiency
* Add unit tests, fix search for interval endpoints
* Tweak test and test comments
* Fix regularized mean value, add regression tests with known values
* Move test, clarify comment strings in another test
* Add test with known confidence sequence values
* Change tests to use non-constant observations for variance fluctuations
* Rename module to contrast with asymptotic_confidence_sequences.rs
* Add input validation and associated tests
* Remove unnecessary iterator method for m-values
* Make return type a Result, use anyhow for errors
* Create enums and structs
* Add check_topk_stopping(), allow dead code as needed
* Add epsilon argument, changed from allow to expect dead_code
* Add wip note to docstring
* Add tests using non-zero epsilon tolerance
* Add more tests with k_min not equal to k_max
* Add docstrings to tests
* Change return value when k_min > num_variants
* Tweak docstrings
* Remove enum that will be included in a separate PR
* Add VariantStatus enum
* Refactor evaluations to accept a vector of variants for batch processing
* Remove changes from topk.rs for now
* Use partition_point() instead of binary search for finding confidence sequence bounds
* Make enum for specifying grid of points where wealth processes are calculated
* Move argument validation for WealthProcessGridPoints to new constructor methods
* Fix bug where variant could count itself when checking how many variants it beats
* Add variant_names option to CLI and python client
* Fix tests to use new WealthProcessGridPoints enum
* Don't multiply num_datapoints by num_variants
* Update some tests to use variant_names instead of variant_name
* Add deprecation warning for variant_name argument
* Add deprecation warning for variant_name arg to python client
* Factor out batch evals functionality into helper function
* Revert change to allow passing multiple variants to run_evaluations()
* Add variant back into tracing span
* Pre-resolve datapoint inputs to minimize clones
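Two of the techniques named in this commit list, log-sum-exp and `partition_point()`, can be sketched generically in Rust. The commit's own code is not shown on this page, so the function names `log_sum_exp` and `first_index_at_or_above` and the example grid are illustrative assumptions, not the PR's implementation.

```rust
/// Numerically stable ln(sum(exp(x_i))): subtract the maximum before
/// exponentiating so that large log-values do not overflow.
fn log_sum_exp(xs: &[f64]) -> f64 {
    let max = xs.iter().copied().fold(f64::NEG_INFINITY, f64::max);
    if !max.is_finite() {
        // Empty input (or all -inf) yields -inf; a +inf entry yields +inf.
        return max;
    }
    max + xs.iter().map(|x| (x - max).exp()).sum::<f64>().ln()
}

/// Index of the first point in an ascending grid that is >= `lower`.
/// `partition_point` requires the predicate to be true for a prefix of the
/// slice and false for the rest, which holds for a sorted grid.
fn first_index_at_or_above(grid: &[f64], lower: f64) -> usize {
    grid.partition_point(|&m| m < lower)
}
```

For instance, `log_sum_exp(&[1000.0, 1000.0])` stays finite (it equals `1000 + ln 2`), whereas exponentiating each term directly would overflow to infinity. On the grid `[0.0, 0.25, 0.5, 0.75, 1.0]`, `first_index_at_or_above(&grid, 0.3)` returns `2`.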

Migrate getWorkflowEvaluationProjects (#5173)

c227dd7

* Migrate getWorkflowEvaluationProjects
* Add clickhouse e2e tests

added the key info to the request extensions from tensorzero-auth (#5217)

5fea119

* added the key info to the request extensions from tensorzero-auth
* put in a single request extension

Bump the rust-dependencies group across 1 directory with 4 updates (#…

f2b2917

…5216)

Bumps the rust-dependencies group with 4 updates in the / directory: [reqwest](https://github.com/seanmonstar/reqwest), [minijinja](https://github.com/mitsuhiko/minijinja), [rcgen](https://github.com/rustls/rcgen) and [tree-sitter](https://github.com/tree-sitter/tree-sitter).

Updates `reqwest` from 0.12.25 to 0.12.26
- [Release notes](https://github.com/seanmonstar/reqwest/releases)
- [Changelog](https://github.com/seanmonstar/reqwest/blob/master/CHANGELOG.md)
- [Commits](seanmonstar/reqwest@v0.12.25...v0.12.26)

Updates `minijinja` from 2.13.0 to 2.14.0
- [Release notes](https://github.com/mitsuhiko/minijinja/releases)
- [Changelog](https://github.com/mitsuhiko/minijinja/blob/main/CHANGELOG.md)
- [Commits](mitsuhiko/minijinja@2.13.0...2.14.0)

Updates `rcgen` from 0.14.5 to 0.14.6
- [Release notes](https://github.com/rustls/rcgen/releases)
- [Commits](rustls/rcgen@v0.14.5...v0.14.6)

Updates `tree-sitter` from 0.25.10 to 0.26.3
- [Release notes](https://github.com/tree-sitter/tree-sitter/releases)
- [Commits](tree-sitter/tree-sitter@v0.25.10...v0.26.3)

---
updated-dependencies:
- dependency-name: reqwest
  dependency-version: 0.12.26
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: rust-dependencies
- dependency-name: minijinja
  dependency-version: 2.14.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: rust-dependencies
- dependency-name: rcgen
  dependency-version: 0.14.6
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: rust-dependencies
- dependency-name: tree-sitter
  dependency-version: 0.26.3
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: rust-dependencies
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

Claude 3.7 -> 4.5 (#5213)

041c27e

* Claude 3.7 -> 4.5
* Claude 3.7 -> 4.5
* Claude 3.7 -> 4.5
* Claude 3.7 -> 4.5

Fix additional filters (#5218)

6805c8a

* Tag filter for != should still require the tag to be set
* Add test
* Fix other NULL filters
* Fix
* Fix

Migrate countDatapointsForEvaluation (#5135)

499dffe

Migrate countInferencesForEpisode (#5132)

c3ae0cf

* Migrate countInferencesForEpisode
* Update route

Log async_writes setting on startup (#5230)

5629875

This setting has significant performance implications, and is useful to know when viewing the soon-to-be-added 'overhead' metric.

Migrate queryLatestFeedbackIdByMetric to gateway (#5145)

807f75c

Frontend

implement `__repr__` for optimization-related PyClass structs (#5235)

74c9666

Contributes to #3460.

Disable test_streaming_invalid_request on Fireworks (#5238)

d6ed953

This test is constantly flaking

Fix clippy: cloned_instead_of_copied (#5239)

fa59eca

amishler added 7 commits

December 17, 2025 11:31

Squash commits and rebase so github recognizes file lineages

e488eb8

Add durable dependency

cdafd1e

Add durable migrations to manual migrations function

c8ad5db

Add durable migration for top-k variant selection with evals

68e5dfc

Remove durable feature flag

ff2bf3b

Merge branch 'alan/enable-durable' into alan/evals-topk-runner-rebased

9f56849

Make durable and rand dependencies non-optional.

8c735d2

amishler changed the base branch from alan/evals-split-file-add-tests to main

December 17, 2025 19:57

amishler changed the base branch from main to alan/evals-split-file-add-tests

December 17, 2025 19:58

amishler added 3 commits

December 17, 2025 15:18

Add helper to create Durable client

ed29f50

Add e2e tests for durable top-k variant selection

b823494

Change e2e tests to use deterministic evaluators

570847a

github-actions bot added the has-merge-conflicts label

Dec 18, 2025

Labels

has-merge-conflicts


9 participants