
Add Hybrid Filtering Approach (sem_filter_hybrid) #165


Draft


Selich wants to merge 1 commit into lotus-data:main from Selich:nikola/sem_filter_hybrid


Selich commented Apr 5, 2025 (edited)

This PR implements and evaluates sem_filter_hybrid, a hybrid filtering approach built on keyword generation. Compared with standard semantic filtering, it makes significantly fewer LLM calls while keeping accuracy comparable, and the efficiency gain grows as the dataset gets larger.

Comparison Methods

Standard: Semantic filtering with one LLM call per data point
Hybrid: The proposed method, evaluated with varying preference values

sem_filter_hybrid

Usage Example

```python
import pandas as pd
import lotus
from lotus.models import LM

# Create or load your dataframe
df = pd.DataFrame({"reviews": ["Great product!", "I didn't like it", "Amazing experience"]})

# Configure Lotus
lm = LM(model="gpt-4o-mini")  # or your preferred model
lotus.settings.configure(lm=lm)

# Use sem_filter_hybrid to extract positive reviews
result = df.sem_filter_hybrid(
    "Extract positive reviews from {reviews}",
    return_explanations=True,
    accuracy_cost_preference=0.5,  # Balance between accuracy and cost savings
)
print(result)
```

Hybrid Filter Parameters

accuracy_cost_preference: (float, default=0.5)
- Balance between accuracy (1.0) and cost savings (0.0)
- Higher values prioritize accuracy; lower values prioritize cost and speed
sample_percentage_for_keywords: (float, default=0.1, i.e. 10%)
- Fraction of the dataframe to sample for keyword generation
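
To make sample_percentage_for_keywords concrete, here is a minimal sketch of drawing a 10% sample of rows before keyword generation. How the PR actually samples is not shown in this description; the helper name sample_rows, the fixed seed, and the at-least-one-row floor are assumptions made for illustration.

```python
import random


def sample_rows(rows, sample_percentage=0.1, seed=0):
    """Return a random subset of rows (hypothetical helper, not part of Lotus).

    sample_percentage is treated as a fraction in [0, 1], matching the
    default of 0.1 (10%). At least one row is kept so that a small
    dataset still yields something to generate keywords from.
    """
    if not 0.0 <= sample_percentage <= 1.0:
        raise ValueError("sample_percentage must be between 0 and 1")
    if not rows:
        return []
    k = max(1, round(len(rows) * sample_percentage))
    # A seeded generator keeps the sample reproducible between runs.
    return random.Random(seed).sample(rows, k)


reviews = [f"review {i}" for i in range(50)]
sample = sample_rows(reviews, sample_percentage=0.1)
print(len(sample))  # 5 of the 50 rows
```

With a pandas dataframe, df.sample(frac=0.1, random_state=0) draws a comparable reproducible sample.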

code_generation.py

The core component of this module is the CodeGenerator class, which helps generate keywords for filtering operations.

Usage Example

```python
from lotus.hybrid_optimizer.code_generator import CodeGenerator
import pandas as pd
from lotus.models import LM

# Create a code generator
code_generator = CodeGenerator()

# Sample data
df = pd.DataFrame({"reviews": ["Great product!", "Terrible experience", "Amazing service"]})

# Generate keywords for positive sentiment detection
lm = LM(model="gpt-4o-mini")
keywords = code_generator.generate_keywords(
    df=df,
    instruction="Extract positive reviews",
    example_data=df.to_string(),
    model=lm,
    num_keyword_calls=1,
)
print(keywords)  # ['great', 'amazing', 'positive', ...]
```
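
This description does not spell out how the generated keywords are combined with LLM calls. One plausible arrangement, sketched here purely as an assumption, accepts rows that match a keyword outright and sends only the remaining rows to the model. The names hybrid_filter and llm_says_positive are hypothetical; the latter is a stub standing in for a real LLM call so that the number of calls can be observed.

```python
import re


def hybrid_filter(texts, keywords, llm_predicate):
    """Keyword pre-filter with an LLM fallback (illustrative assumption only).

    Returns (mask, llm_calls): one boolean per text, plus the number of
    times llm_predicate was invoked.
    """
    # Whole-word, case-insensitive match on any of the keywords.
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b",
        re.IGNORECASE,
    )
    mask, llm_calls = [], 0
    for text in texts:
        if keywords and pattern.search(text):
            mask.append(True)  # resolved by keyword, no LLM call needed
        else:
            llm_calls += 1  # unresolved rows fall back to the model
            mask.append(bool(llm_predicate(text)))
    return mask, llm_calls


def llm_says_positive(text):
    """Stub standing in for an LLM call; not a real model."""
    return "love" in text.lower()


reviews = ["Great product!", "Terrible experience", "Amazing service", "I love it"]
mask, calls = hybrid_filter(reviews, ["great", "amazing"], llm_says_positive)
print(mask, calls)  # [True, False, True, True] 2
```

In this toy run two of the four rows are settled by keywords alone, so the stub is called twice instead of four times.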

Templates

Prompt messages and templates are in the /lotus/hybrid_optimizer/templates folder.

Visualization Highlights

Comparison of F1 Score (left) and Execution Time (right) for Standard and Hybrid approaches with different keyword sampling percentages. (N=100)

Semantic Filter Performance Comparison for the ollama/llama3.1 model. (N=100)
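
For reference, the F1 score in these comparisons is presumably the standard harmonic mean of precision and recall. The sketch below computes it for a predicted filter mask against a reference mask. The function name f1_score and the example masks are assumptions; the PR's own evaluation code is not shown here.

```python
def f1_score(reference, predicted):
    """F1 = 2*TP / (2*TP + FP + FN) over two equal-length boolean masks."""
    if len(reference) != len(predicted):
        raise ValueError("masks must have the same length")
    tp = sum(1 for r, p in zip(reference, predicted) if r and p)
    fp = sum(1 for r, p in zip(reference, predicted) if not r and p)
    fn = sum(1 for r, p in zip(reference, predicted) if r and not p)
    denominator = 2 * tp + fp + fn
    # With no positives in either mask, F1 is undefined; report 0.0 here.
    return 2 * tp / denominator if denominator else 0.0


reference = [True, False, True, True]   # illustrative labels, not PR data
predicted = [True, False, False, True]  # illustrative filter output
score = f1_score(reference, predicted)
print(score)  # 0.8 (TP=2, FP=0, FN=1)
```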

add basic sem_filter_hybrid

0fe3640

Selich marked this pull request as draft

April 5, 2025 14:27

