Create durable runner for top-k variant selection #5245
Draft
amishler wants to merge 40 commits into alan/evals-split-file-add-tests from alan/evals-topk-runner-rebased
+9,646 −1,688
* Only write streaming provider-proxy cache body on finish

  We were previously writing the current body when the stream was dropped, even if it never finished (e.g. due to a timeout). As a result, we could end up writing an invalid body to disk due to an *earlier* request from the e2e tests (if the e2e tests hit an internal timeout and dropped the stream), overwriting the good cache line from a later successful test run

* Wait for provider-proxy write before sending back response

  This should help prevent race conditions involving client-side retries

* Fix typo

  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Fix clippy
* Forward Cloudflare R2 env vars into fixtures container

  Since these were unset inside the container, we were falling back to the slow dev-endpoint download, rather than the fast aws cli download

* Install aws cli in fixtures Dockerfile
* added multi-threaded runtime for test that was exploding

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Viraj Mehta <viraj@tensorzero.com>
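The "write only on finish" behaviour described in this commit message can be sketched as follows. This is a minimal illustration and not the provider-proxy's actual code: `Cache`, `StreamRecorder`, `push_chunk`, and `finish` are hypothetical names, and an in-memory map stands in for the on-disk cache.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory stand-in for the provider-proxy's on-disk cache.
#[derive(Default)]
struct Cache {
    entries: HashMap<String, Vec<u8>>,
}

/// Buffers streamed chunks for one request (hypothetical type).
/// Nothing reaches the cache unless `finish` is called.
struct StreamRecorder {
    key: String,
    body: Vec<u8>,
}

impl StreamRecorder {
    fn new(key: &str) -> Self {
        StreamRecorder { key: key.to_string(), body: Vec::new() }
    }

    /// Appends one streamed chunk to the buffered body.
    fn push_chunk(&mut self, chunk: &[u8]) {
        self.body.extend_from_slice(chunk);
    }

    /// Called only when the upstream stream ended normally.
    /// Consumes the recorder and commits the complete body.
    fn finish(self, cache: &mut Cache) {
        cache.entries.insert(self.key, self.body);
    }
}
```

Because there is no `Drop` implementation that writes, a recorder that is dropped early (for example after a timeout) leaves any existing cache entry for the same key untouched.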
…es (#5189)

* added a get config route
* added migration for tags and missing file
* removed failing assertion
* added json schema derives to uninitialized variant and evaluation types
* made the migration public from tensorzero-core crate
* Stream inferences and datapoints in UI
* Stream inferences and datapoints in UI
* Stream inferences and datapoints in UI
* Stream inferences and datapoints in UI
* Stream inferences and datapoints in UI

---------

Co-authored-by: Viraj Mehta <viraj@tensorzero.com>
This should reduce the flakiness of this test (we were previously only discarding the thought blocks in some cases)
* added a get config route
* added migration for tags and missing file
* removed failing assertion
* made the migration public from tensorzero-core crate
* added generic action handler
* added e2e tests
* added rust client method
* use rust client in tests
* fixed PR comments
Closes #4783.

Co-authored-by: Gabriel Bianconi <1275491+GabrielBianconi@users.noreply.github.com>
* Add support for extra_headers/extra_body in relay mode

  The relay gateway now forwards these options to the downstream gateway (after performing variant-level filtering on the relay gateway).

* Fix typo

  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Remove collect

  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Run fmt
* Add some unit tests
* Add more unit tests

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Aaron Hill <aaron@tensorzero.com>
…modal (#5197)

* Add searchable evaluation and variant selectors to Launch Evaluation modal
  - Add generic combobox UI components (Combobox, ComboboxInput, ComboboxContent, useCombobox hook)
  - Add EvaluationSelector component that wraps the generic Combobox
  - Add VariantSelector component that wraps the generic Combobox
  - Update LaunchEvaluationModal to use searchable selectors instead of static Select
  - Update Input component focus style to use border darkening instead of ring
* Add documentation comment to useCombobox hook
* Update e2e tests for Combobox selectors in Launch Evaluation modal
)

* e2e tests: feedback: construct payloads using structs + serialize

  Contributes to #4710.

* rely on Serialize

---------

Co-authored-by: Aaron Hill <aaron@tensorzero.com>
* Prevent duplicate fake-path templates from being loaded

  When loading templates from the config, any fake-path keys should be unique (these potentially come from agent-generated variant configs, which will have inline templates)

* Fix typo

  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Fix test

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
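A uniqueness check of the kind this commit message describes might look like the sketch below. The function name `load_templates`, the `(key, body)` pair representation, and the `String` error type are assumptions made for illustration; the commit's real loader is not shown on this page.

```rust
use std::collections::HashMap;

/// Loads `(key, body)` template pairs into a map, rejecting any repeated key.
/// Hypothetical helper: names and error type are illustrative only.
fn load_templates(
    templates: Vec<(String, String)>,
) -> Result<HashMap<String, String>, String> {
    let mut loaded = HashMap::new();
    for (key, body) in templates {
        // Fail loudly instead of silently overwriting an earlier template.
        if loaded.contains_key(&key) {
            return Err(format!("duplicate template key: {key}"));
        }
        loaded.insert(key, body);
    }
    Ok(loaded)
}
```

Returning an error rather than keeping the first or last entry makes a collision between two inline templates visible at load time.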
* Clean up Docker Compose for UI
* Lower log level for config message
* Tag filter for != should still require the tag to be set
* Add test
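The intended `!=` semantics can be illustrated with a small predicate over an in-memory tag map. This is only a sketch of the rule stated in the commit title: `tag_not_equal` is a hypothetical name, and the real filter presumably runs as a database query rather than in Rust.

```rust
use std::collections::HashMap;

/// Returns true only when `key` is present in `tags` AND its value differs from `value`.
/// A row without the tag does not match a `!=` filter (hypothetical helper).
fn tag_not_equal(tags: &HashMap<String, String>, key: &str, value: &str) -> bool {
    match tags.get(key) {
        Some(actual) => actual.as_str() != value,
        None => false, // tag unset: excluded rather than treated as "not equal"
    }
}
```

For a row tagged `env = prod`, the filter `env != dev` matches, `env != prod` does not, and `region != us` does not match either, because `region` is unset.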
* Initial partial implementation of betting-based confidence sequences
* Fix bets to use previous time step's variance estimate
* Add updates to wealth processes
* Add last step of confidence sequence calculation from hedged wealth process
* Change product to log-sum-exp for numerical stability
* Add computation of point estimator for the mean
* Combine weighting and hedged wealth process computation
* Move bet calculation to helper function, add bet truncation
* Tweak comment string for clarity
* Add module-level and type-level docstrings
* Remove max vs combo hedging choice, explain use of max in docstring
* For computing mean estimate, restrict search to inside confidence interval for efficiency
* Add unit tests, fix search for interval endpoints
* Tweak test and test comments
* Fix regularized mean value, add regression tests with known values
* Move test, clarify comment strings in another test
* Add test with known confidence sequence values
* Change tests to use non-constant observations for variance fluctuations
* Rename module to contrast with asymptotic_confidence_sequences.rs
* Add input validation and associated tests
* Remove unnecessary iterator method for m-values
* Make return type a Result, use anyhow for errors
* Create enums and structs
* Add check_topk_stopping(), allow dead code as needed
* Add epsilon argument, changed from allow to expect dead_code
* Add wip note to docstring
* Add tests using non-zero epsilon tolerance
* Add more tests with k_min not equal to k_max
* Add docstrings to tests
* Change return value when k_min > num_variants
* Tweak docstrings
* Remove enum that will be included in a separate PR
* Add VariantStatus enum
* Refactor evaluations to accept a vector of variants for batch processing
* Remove changes from topk.rs for now
* Use partition_point() instead of binary search for finding confidence sequence bounds
* Make enum for specifying grid of points where wealth processes are calculated
* Move argument validation for WealthProcessGridPoints to new constructor methods
* Fix bug where variant could count itself when checking how many variants it beats
* Add variant_names option to CLI and python client
* Fix tests to use new WealthProcessGridPoints enum
* Don't multiply num_datapoints by num_variants
* Update some tests to use variant_names instead of variant_name
* Add deprecation warning for variant_name argument
* Add deprecation warning for variant_name arg to python client
* Factor out batch evals functionality into helper function
* Revert change to allow passing multiple variants to run_evaluations()
* Add variant back into tracing span
* Pre-resolve datapoint inputs to minimize clones
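The "log-sum-exp for numerical stability" step mentioned in this commit list refers to a standard trick for combining quantities kept in log space. The sketch below shows the generic technique only; `log_sum_exp` is a hypothetical helper, and how the PR actually applies it to its wealth processes is not visible on this page.

```rust
/// Computes ln(sum_i exp(x_i)) without overflowing, by factoring out the maximum:
/// ln(sum_i exp(x_i)) = m + ln(sum_i exp(x_i - m)), where m = max_i x_i.
/// Hypothetical helper, not taken from the PR.
fn log_sum_exp(log_values: &[f64]) -> f64 {
    let max = log_values.iter().copied().fold(f64::NEG_INFINITY, f64::max);
    // An empty slice or all -inf inputs yields -inf; a +inf input yields +inf.
    if !max.is_finite() {
        return max;
    }
    // Every shifted exponent is <= 0, so each exp() lies in (0, 1].
    let sum: f64 = log_values.iter().map(|v| (v - max).exp()).sum();
    max + sum.ln()
}
```

With inputs such as `[1000.0, 1000.0]`, exponentiating directly overflows to infinity, while the shifted form returns `1000 + ln 2`.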
* Migrate getWorkflowEvaluationProjects
* Add clickhouse e2e tests
…5216)

Bumps the rust-dependencies group with 4 updates in the / directory: [reqwest](https://github.com/seanmonstar/reqwest), [minijinja](https://github.com/mitsuhiko/minijinja), [rcgen](https://github.com/rustls/rcgen) and [tree-sitter](https://github.com/tree-sitter/tree-sitter).

Updates `reqwest` from 0.12.25 to 0.12.26
- [Release notes](https://github.com/seanmonstar/reqwest/releases)
- [Changelog](https://github.com/seanmonstar/reqwest/blob/master/CHANGELOG.md)
- [Commits](seanmonstar/reqwest@v0.12.25...v0.12.26)

Updates `minijinja` from 2.13.0 to 2.14.0
- [Release notes](https://github.com/mitsuhiko/minijinja/releases)
- [Changelog](https://github.com/mitsuhiko/minijinja/blob/main/CHANGELOG.md)
- [Commits](mitsuhiko/minijinja@2.13.0...2.14.0)

Updates `rcgen` from 0.14.5 to 0.14.6
- [Release notes](https://github.com/rustls/rcgen/releases)
- [Commits](rustls/rcgen@v0.14.5...v0.14.6)

Updates `tree-sitter` from 0.25.10 to 0.26.3
- [Release notes](https://github.com/tree-sitter/tree-sitter/releases)
- [Commits](tree-sitter/tree-sitter@v0.25.10...v0.26.3)

---
updated-dependencies:
- dependency-name: reqwest
  dependency-version: 0.12.26
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: rust-dependencies
- dependency-name: minijinja
  dependency-version: 2.14.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: rust-dependencies
- dependency-name: rcgen
  dependency-version: 0.14.6
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: rust-dependencies
- dependency-name: tree-sitter
  dependency-version: 0.26.3
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: rust-dependencies
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Claude 3.7 -> 4.5
* Claude 3.7 -> 4.5
* Claude 3.7 -> 4.5
* Claude 3.7 -> 4.5
* Tag filter for != should still require the tag to be set
* Add test
* Fix other NULL filters
* Fix
* Fix
* Migrate countInferencesForEpisode
* Update route
This setting has significant performance implications, and is useful to know when viewing the soon-to-be-added 'overhead' metric
This test is constantly flaking
No description provided.