Added utility function in utils.py which gets cache path for semantic index and reuses the index if the data has not changed
Improved logic inside of sem_index.py to enable the user to specify the index directory or let my helper function create a directory in the cache
Updated the function call to sem_index inside of sem_join.py to ensure double index creations are not made and the cache is being hit

Test Plan

Created four test cases to ensure full functionality and of the code. The first test case, test_backwards_compatibility, ensures backwards compatibility in the code where the index_dir parameter still works as before, creating indices in the specified directory rather than the cache. The next tests test the caching functionality where test_cache_directory_location tests cache paths are created in ~/.cache/lotus/indices/, test_cache_files_exist checks that FAISS index and vec files are created in the cache directory, and finally the last test test_dir_name tests the caching logic where identical data produces identical cache paths due to the same SHA256 hash whereas different data produces different paths as expected.

Here are the commands inside of tests/test_index_cache.py that have been added and their expected inputs and outcomes:

test_backwards_compatibility(self, sample_df, sample_df_2): Test backward compatibility where the index directory is passed.
test_cache_directory_location(self, sample_df): Test that cache is created in ~/.cache/lotus/indices/
test_cache_files_exist(self, sample_df): Test that cache creates both 'index' and 'vecs' files.
test_dir_name(self, sample_df, sample_df_2): Test that calling sem_index twice with same data reuses the cache, while
different data uses different caches.

Test Results

All four added tests inside of test_index_cache.py passed locally, confirming correct caching behavior for sem_index.

Documentation Update

The documentation for sem_index may need updates as now my PR eliminates the need for explicitly passing in an index directory.

Type of Change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update
Performance improvement
Refactoring (no functional changes)

Checklist

My code follows the style guidelines of this project
I have performed a self-review of my own code
I have commented my code, updating docstrings
I have made corresponding changes to the documentation
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

BEFORE SUBMITTING, PLEASE READhttps://github.com/lotus-data/lotus/blob/main/CONTRIBUTING.md
anything written below this line will be removed by GitHub Actions

Added utility function in utils.py which gets cache path for semantic…

4c1c2ab

… index and reuses the index if the data has not changed. Improved logic inside of sem_index.py to enable the user to specify the index directory or let my helper function create a directory in the cache. Updated the function call to sem_index inside of sem_join.py to ensure double index creations are not made and the cache is being hit.

harshitgupta412 reviewed

Oct 14, 2025

View reviewed changes

lotus/sem_ops/sem_index.py Outdated

		cache_directory=get_index_cache(col_name,data,model_name)

		# Check if index already exists (verify the index FILE exists, not just directory)
		cache_path=Path(cache_directory)

Copy link

Collaborator

harshitgupta412Oct 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

if this is dependent on the vector store used, this should be shifted inside VectorStore and then call the function here.

lotus/sem_ops/sem_index.py Outdated

		vs.load_index(cache_directory)
		lotus.logger.info(f"cache hit at{cache_directory}")
		else:
		# previous code

Copy link

Collaborator

harshitgupta412Oct 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

please add comments so that it makes sense for the code as a whole and not just for this PR.

Copy link

Collaborator

harshitgupta412Oct 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

for e.g., this should be something like "Index does not exist. Creating new one"

lotus/sem_ops/sem_index.py Outdated


		# Check if index already exists (verify the index FILE exists, not just directory)
		cache_path=Path(cache_directory)
		ifcache_path.exists()and (cache_path/"index").exists():# currently only works for FAISS

Copy link

Collaborator

harshitgupta412Oct 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Just checking Cache path exists is not enough. We should also check if the data is consistent.

Adding a flag to the function will be helpful to denote whether to override the index or not. (if data is not consistent, then, depending on the fla,g it should override the path or raise an error)

lotus/utils.py Outdated

		>>> path = get_index_cache('category', data, 'text-embedding-3-small')
		>>> # Returns: "~/.cache/lotus/indices/category_298bba13f072bd92cb40396e87cb2aaa"
		"""
		content=str(sorted(data))

Copy link

Collaborator

harshitgupta412Oct 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

This will create multiple copies of the index as data gets added. I would avoid this, especially given that we don't have a way to clean them without errors

lotus/sem_ops/sem_index.py Outdated


		@operator_cache
		def__call__(self,col_name:str,index_dir:str)->pd.DataFrame:
		def__call__(self,col_name:str,index_dir:str\|None=None)->pd.DataFrame:

Copy link

Collaborator

harshitgupta412Oct 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I don't think it's a good idea to allow index_dir to be None here. col_name can be the same across multiple tables, and that will cause an issue. While I can see utility for having it None here, still forcing the user to give a directory is better.

@liana313 wdyt?

Copy link

Collaborator

liana313Oct 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

agreed

Ohm-Rajpal added2 commits

October 25, 2025 02:10

Moved cache existence to VectorStore class, added mo

b622add

re comments for readability, improved tests to check consistent data, fixed caching issues and added deterministic caching in run_sem_sim_join, enforced index_dir variable inside of operator cache __call__ method to only be a string.

Cleaning up test cases

52702be

harshitgupta412 reviewed

Oct 31, 2025

View reviewed changes

lotus/sem_ops/sem_index.py

		vs.index(self._obj[col_name],embeddings,index_dir)

		# Store metadata for data consistency checking (FAISS only).
		ifhasattr(vs,"_store_metadata"):

Copy link

Collaborator

harshitgupta412Oct 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

this should be called directly inindex function.
Definitely not good to call a private function outside class.

lotus/vector_store/vs.py

		content+=f"__{model_name}"
		current_hash=hashlib.sha256(content.encode()).hexdigest()[:32]

		returnmetadata.get("data_hash")==current_hash

Copy link

Collaborator

harshitgupta412Oct 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

we should verify the index configs as well.

The data in vector db depends on the original data, model to create embeddings, and the vector store configs.

lotus/sem_ops/sem_index.py

		)

		# Create data hash for consistency checking.
		content=str(sorted(data))

Copy link

Collaborator

harshitgupta412Oct 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Hash creation is better as a function in the vector store class so that its not repeated everywhere.

Copy link

Collaborator

harshitgupta412Oct 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

we can just pass the data in the index function and handle hash creation inside the class itself.
Creating a common function likehash_data that is used in both store_metadata and is_data_consistent would simplify things

Labels

None yet

3 participants

Movatterモバイル変換

Fix duplicate index creation in sem_join#218

Are you sure you want to change the base?

Fix duplicate index creation in sem_join#218

Conversation

Ohm-Rajpal commentedOct 7, 2025

Purpose

Test Plan

Test Results

Documentation Update

Type of Change

Checklist

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants