Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

Implement semantic map + filter#209

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Open
IsmaelKabore wants to merge3 commits intolotus-data:main
base:main
Choose a base branch
Loading
fromIsmaelKabore:feature/semantic-map-filter

Conversation

@IsmaelKabore
Copy link

@IsmaelKaboreIsmaelKabore commentedSep 4, 2025
edited
Loading

Purpose

This PR implements the requested enhancements for sem_filter, addressing [Issue#200]

Specifically:

n_sample parameter for repeated sampling at test time.

Ensembling strategies (e.g., majority_vote, average_prob/mean_prob) to stabilize outputs.

Stores per-run outputs:preds,parsed_outputs,explanations,logprobs

Placed before "SemanticFilterOutput" to avoid forward reference issues

New .github/tests e2e tests using Ollama for df.sem_filter().

Summary of Changes

New: lotus/sampling_utils.py

resample_batch(call_once, n_sample, *args, **kwargs) – generic, args/kwargs forwarded (no fixed signature).

apply_ensemble(strategy: EnsembleStrategy, ...) – collapses [n_sample][batch] → [batch].

Supports PICK_FIRST, MAJORITY (generic hashable), and MEAN_BOOL (booleans only).

Updated:

Adds n_sample, ensemble: EnsembleStrategy, temperature.

Validates n_sample > 0.

Uses resample_batch → postprocess each run → apply_ensemble (returns labels + chosen indices).

Raw outputs, explanations, and (optionally) logprobs are taken from the chosen run per item.

Preserves existing cascade logic; helper model path also honors sampling and ensembling for its own pass.

Returns complete rollout data inSemanticFilterOutput.raw_outputs_all_runs

Preserves existing cascade logic; helper model path also honors sampling and ensembling

Test Plan

I tested the new sem_filter and sem_map features by running the full pytest suite with the required environment variables set.

Ran .github/tests/test_sem_filter.py to validate df.sem_filter() on simple sentiment filtering and the sampling path (n_sample, temperature).

Behavioral checks:

Ensured outputs exist, are strings, and negative examples are excluded in filtering.

Confirmed tolerant assertions .

Verified new parameters (n_sample, ensemble, temperature) integrate without breaking defaults.

Static checks:

Passed ruff, ruff-format, and mypy locally.

Test Results

All tests for sem_filter and sem_map passed locally, confirming correct behavior and stability

(Optional) Documentation Update

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Performance improvement
  • Refactoring (no functional changes)

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, updating docstrings
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

BEFORE SUBMITTING, PLEASE READhttps://github.com/lotus-data/lotus/blob/main/CONTRIBUTING.md
anything written below this line will be removed by GitHub Actions

@IsmaelKaboreIsmaelKaboreforce-pushed thefeature/semantic-map-filter branch fromccbd704 to0be21c7CompareOctober 17, 2025 00:08

# Only sleep if the batch was faster than the required time
# Each request should be spaced by min_interval_per_request
required_time_for_batch=len(sub_batch)*min_interval_per_request
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

repeated lines

ifsafe_mode:
estimated_total_calls=len(docs)
estimated_total_cost=sum(model.count_tokens(input)forinputininputs)
estimated_total_calls=len(docs)*max(1,n_sample)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

we should raise a value error if n_sample is <= 0 since that does not make sense

return"yes"ifmean>=0.5else"no"


defapply_ensemble(strategy:Optional[str],all_outputs:List[List[str]],*,default_yes:bool)->List[str]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Its better to have Enums for the stategy similar to WebSearchCorpus. That makes the values standardized and its clear to user what are the available choices.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Also I think strategy should not be allowed to be None. It does not make sense to apply "None" ensemble. If needed we can add "pick first" strategy and make it default

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Further, the strategies will probably be different for different sem operators (majority vote is not making sense to me for sem_map since it is just asking LM for a generation and can have lot of variations). We should validate this as well



defresample_batch(
call_once:Callable[[bool,Optional[float],bool,str],Any],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

we can make this general to accept any input and use args+kwargs in resample_batch as arguments. Is there a reason to limit it?


# call model
lm_output:LMOutput=model(inputs,progress_bar_desc=progress_bar_desc,**model_kwargs)
def_call_once(_want_logprobs,temp,spb,desc):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

why are there arguments in this function? We can just capture it from parent since its defined inside the sem_* function

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

unless there is something changing with different runs, there is no need for args

iflen(run)!=batch:
raiseValueError(f"Inconsistent batch sizes across runs: expected{batch}, got{len(run)}")

chosen:Optional[List[Any]]=logs[0]if (want_logprobsandlen(logs)>0)elseNone
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

We should decide the logprobs based on which output was finally chosen.

README.md Outdated
url = {https://doi.org/10.14778/3749646.3749685},
}
@article{patel2024semanticoperators,
@misc{patel2024semanticoperators,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

why is this changed?

return"yes"ifmean>=0.5else"no"


defapply_ensemble(strategy:Optional[str],all_outputs:List[List[str]],*,default_yes:bool)->List[str]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Why are extra parameters allowed when they are not used?

fromtypingimportAny,Callable,List,Optional,Sequence,Tuple


def_norm(x:Any)->str:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

this is redoing the job of postprocessors. If needed, postprocessors should be applied before emsembling

)

assertisinstance(filtered,pd.DataFrame)
assert"Text"infiltered.columns
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

we should add assertions checking the output columns are as expected

Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment

Reviewers

@liana313liana313liana313 left review comments

@harshitgupta412harshitgupta412harshitgupta412 requested changes

Requested changes must be addressed to merge this pull request.

Assignees

No one assigned

Labels

None yet

Projects

None yet

Milestone

No milestone

Development

Successfully merging this pull request may close these issues.

3 participants

@IsmaelKabore@liana313@harshitgupta412

[8]ページ先頭

©2009-2025 Movatter.jp