NotificationsYou must be signed in to change notification settings
Fork132
Star1.5k

Implement semantic map + filter#209

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Jump to bottom

Open

IsmaelKabore wants to merge3 commits intolotus-data:main

base:main

Choose a base branch

fromIsmaelKabore:feature/semantic-map-filter

Open

Implement semantic map + filter#209

IsmaelKabore wants to merge3 commits intolotus-data:mainfromIsmaelKabore:feature/semantic-map-filter

+587 −16

Conversation

Copy link

IsmaelKabore commentedSep 4, 2025•
edited
Loading

Purpose

This PR implements the requested enhancements for sem_filter, addressing [Issue#200]

Specifically:

n_sample parameter for repeated sampling at test time.

Ensembling strategies (e.g., majority_vote, average_prob/mean_prob) to stabilize outputs.

Stores per-run outputs:preds,parsed_outputs,explanations,logprobs

Placed before "SemanticFilterOutput" to avoid forward reference issues

New .github/tests e2e tests using Ollama for df.sem_filter().

Summary of Changes

New: lotus/sampling_utils.py

resample_batch(call_once, n_sample, *args, **kwargs) – generic, args/kwargs forwarded (no fixed signature).

apply_ensemble(strategy: EnsembleStrategy, ...) – collapses [n_sample][batch] → [batch].

Supports PICK_FIRST, MAJORITY (generic hashable), and MEAN_BOOL (booleans only).

Updated:

Adds n_sample, ensemble: EnsembleStrategy, temperature.

Validates n_sample > 0.

Uses resample_batch → postprocess each run → apply_ensemble (returns labels + chosen indices).

Raw outputs, explanations, and (optionally) logprobs are taken from the chosen run per item.

Preserves existing cascade logic; helper model path also honors sampling and ensembling for its own pass.

Returns complete rollout data inSemanticFilterOutput.raw_outputs_all_runs

Preserves existing cascade logic; helper model path also honors sampling and ensembling

Test Plan

I tested the new sem_filter and sem_map features by running the full pytest suite with the required environment variables set.

Ran .github/tests/test_sem_filter.py to validate df.sem_filter() on simple sentiment filtering and the sampling path (n_sample, temperature).

Behavioral checks:

Ensured outputs exist, are strings, and negative examples are excluded in filtering.

Confirmed tolerant assertions .

Verified new parameters (n_sample, ensemble, temperature) integrate without breaking defaults.

Static checks:

Passed ruff, ruff-format, and mypy locally.

Test Results

All tests for sem_filter and sem_map passed locally, confirming correct behavior and stability

(Optional) Documentation Update

Type of Change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update
Performance improvement
Refactoring (no functional changes)

Checklist

My code follows the style guidelines of this project
I have performed a self-review of my own code
I have commented my code, updating docstrings
I have made corresponding changes to the documentation
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

BEFORE SUBMITTING, PLEASE READhttps://github.com/lotus-data/lotus/blob/main/CONTRIBUTING.md
anything written below this line will be removed by GitHub Actions

Implement semantic map + filter

0be21c7

IsmaelKabore force-pushed thefeature/semantic-map-filter branch fromccbd704 to0be21c7Compare

October 17, 2025 00:08

harshitgupta412 requested changes

Oct 18, 2025

View reviewed changes

lotus/models/lm.py Outdated


		# Only sleep if the batch was faster than the required time
		# Each request should be spaced by min_interval_per_request
		required_time_for_batch=len(sub_batch)*min_interval_per_request

Copy link

Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

repeated lines

lotus/sem_ops/sem_filter.py Outdated

		ifsafe_mode:
		estimated_total_calls=len(docs)
		estimated_total_cost=sum(model.count_tokens(input)forinputininputs)
		estimated_total_calls=len(docs)*max(1,n_sample)

Copy link

Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

we should raise a value error if n_sample is <= 0 since that does not make sense

lotus/sampling_utils.py Outdated

		return"yes"ifmean>=0.5else"no"


		defapply_ensemble(strategy:Optional[str],all_outputs:List[List[str]],*,default_yes:bool)->List[str]:

Copy link

Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Its better to have Enums for the stategy similar to WebSearchCorpus. That makes the values standardized and its clear to user what are the available choices.

Copy link

Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Also I think strategy should not be allowed to be None. It does not make sense to apply "None" ensemble. If needed we can add "pick first" strategy and make it default

Copy link

Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Further, the strategies will probably be different for different sem operators (majority vote is not making sense to me for sem_map since it is just asking LM for a generation and can have lot of variations). We should validate this as well