Add Hybrid Filtering Approach (sem_filter_hybrid) #165

Draft
Selich wants to merge 1 commit into lotus-data:main from Selich:nikola/sem_filter_hybrid

Selich commented Apr 5, 2025 (edited)

This PR implements and evaluates a hybrid filtering approach that uses keyword generation. The method significantly reduces LLM calls while maintaining comparable accuracy, with efficiency improvements becoming more pronounced as dataset size increases.
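
A minimal sketch of how a keyword-assisted hybrid filter could reduce LLM calls. The routing rule used here (accept rows that contain a keyword, send all other rows to the model) and the names `hybrid_filter` and `fake_llm_judge` are illustrative assumptions, not the logic implemented in this PR.

```python
from typing import Callable, List, Tuple


def hybrid_filter(
    rows: List[str],
    keywords: List[str],
    llm_judge: Callable[[str], bool],
) -> Tuple[List[bool], int]:
    """Return per-row decisions and the number of LLM calls made.

    Assumed routing rule (hypothetical): a row containing any keyword is
    accepted without a model call; every other row is sent to the LLM.
    """
    lowered = [k.lower() for k in keywords]
    decisions: List[bool] = []
    llm_calls = 0
    for text in rows:
        if any(k in text.lower() for k in lowered):
            decisions.append(True)  # cheap keyword hit, no LLM call
        else:
            llm_calls += 1
            decisions.append(llm_judge(text))  # fall back to the model
    return decisions, llm_calls


def fake_llm_judge(text: str) -> bool:
    """Stand-in for a real model call so the sketch runs offline."""
    return "enjoyed" in text.lower()


reviews = ["Great product!", "I didn't like it", "Amazing experience", "Really enjoyed it"]
decisions, llm_calls = hybrid_filter(reviews, ["great", "amazing"], fake_llm_judge)
print(decisions, llm_calls)  # [True, False, True, True] 2
```

In this toy run only two of the four rows reach the stand-in judge, which is the kind of saving the description refers to.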

Comparison Methods

  • Standard: semantic filtering with one LLM call per data point
  • Hybrid: the proposed method, evaluated at several accuracy_cost_preference values

sem_filter_hybrid

Usage Example

    import pandas as pd
    import lotus
    from lotus.models import LM

    # Create or load your dataframe
    df = pd.DataFrame({"reviews": ["Great product!", "I didn't like it", "Amazing experience"]})

    # Configure Lotus
    lm = LM(model="gpt-4o-mini")  # or your preferred model
    lotus.settings.configure(lm=lm)

    # Use sem_filter_hybrid to extract positive reviews
    result = df.sem_filter_hybrid(
        "Extract positive reviews from {reviews}",
        return_explanations=True,
        accuracy_cost_preference=0.5,  # Balance between accuracy and cost savings
    )
    print(result)

Hybrid Filter Parameters

  • accuracy_cost_preference (float, default=0.5)

    • Balances accuracy (1.0) against cost savings (0.0)
    • Higher values prioritize accuracy; lower values prioritize cost and speed
  • sample_percentage_for_keywords (float, default=0.1, i.e. 10%)

    • Fraction of the dataframe to sample for keyword generation
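
The sampling step can be pictured as follows. This is a sketch under assumptions: `sample_rows_for_keywords` is a hypothetical helper, and the rounding and the minimum of one row are choices made for illustration, not behaviour taken from the PR.

```python
import random
from typing import List, Optional


def sample_rows_for_keywords(
    rows: List[str],
    sample_percentage_for_keywords: float = 0.1,
    seed: Optional[int] = None,
) -> List[str]:
    """Pick a random subset of rows to show the model when generating keywords."""
    if not 0.0 < sample_percentage_for_keywords <= 1.0:
        raise ValueError("sample_percentage_for_keywords must be in (0, 1]")
    if not rows:
        return []
    # Assumption: round to the nearest row count and always keep at least one row.
    k = max(1, round(len(rows) * sample_percentage_for_keywords))
    return random.Random(seed).sample(rows, k)


rows = [f"review {i}" for i in range(100)]
sample = sample_rows_for_keywords(rows, 0.1, seed=0)
print(len(sample))  # 10
```

With a pandas dataframe, `df.sample(frac=0.1, random_state=0)` performs a comparable draw.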

code_generation.py

The core component of this module is the CodeGenerator class, which generates keywords for filtering operations.

Usage Example

    from lotus.hybrid_optimizer.code_generator import CodeGenerator
    import pandas as pd
    from lotus.models import LM

    # Create a code generator
    code_generator = CodeGenerator()

    # Sample data
    df = pd.DataFrame({"reviews": ["Great product!", "Terrible experience", "Amazing service"]})

    # Generate keywords for positive sentiment detection
    lm = LM(model="gpt-4o-mini")
    keywords = code_generator.generate_keywords(
        df=df,
        instruction="Extract positive reviews",
        example_data=df.to_string(),
        model=lm,
        num_keyword_calls=1,
    )
    print(keywords)  # ['great', 'amazing', 'positive', ...]
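
Once a keyword list is available, one cheap way to apply it is a case-insensitive whole-word match. The sketch below is illustrative only: `build_keyword_pattern` and `matches_keywords` are hypothetical names, and the PR may apply the generated keywords differently.

```python
import re
from typing import List


def build_keyword_pattern(keywords: List[str]) -> re.Pattern:
    """Compile one case-insensitive regex matching any keyword as a whole word."""
    escaped = [re.escape(k) for k in keywords if k]
    if not escaped:
        return re.compile(r"(?!)")  # empty keyword list: pattern that never matches
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)


def matches_keywords(texts: List[str], keywords: List[str]) -> List[bool]:
    """Flag each text that contains at least one keyword."""
    pattern = build_keyword_pattern(keywords)
    return [bool(pattern.search(t)) for t in texts]


reviews = ["Great product!", "Terrible experience", "Amazing service", "Greatly overpriced"]
flags = matches_keywords(reviews, ["great", "amazing", "positive"])
print(flags)  # [True, False, True, False]
```

Whole-word matching keeps "great" from firing on "Greatly", but it cannot handle negation such as "not great", which is one reason a model call may still be needed for some rows.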

Templates

Prompt messages and templates are in the /lotus/hybrid_optimizer/templates folder.

Visualization Highlights

F1 Score and Execution Time Comparison: Comparison of F1 Score (left) and Execution Time (right) for Standard and Hybrid approaches with different keyword sampling percentages. (N=100)

Llama 3.1 Performance: Semantic Filter Performance Comparison for the ollama/llama3.1 model. (N=100)
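
For reference, F1 is the harmonic mean of precision and recall. The helper below is a generic sketch of that metric, not the evaluation script used for this PR, and the labels in the example are made-up toy values rather than results from the plots.

```python
from typing import Sequence


def f1_score(y_true: Sequence[bool], y_pred: Sequence[bool]) -> float:
    """Compute F1 for binary keep/drop decisions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0  # no true positives: define F1 as zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels for illustration only.
y_true = [True, False, True, True]
y_pred = [True, False, False, True]
score = f1_score(y_true, y_pred)
print(round(score, 3))  # 0.8
```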

Selich marked this pull request as draft on April 5, 2025, 14:27