- Notifications
You must be signed in to change notification settings - Fork132
Fix duplicate index creation in sem_join#218
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
base:main
Are you sure you want to change the base?
Fix duplicate index creation in sem_join#218
Conversation
… index and reuses the index if the data has not changed. Improved logic inside of sem_index.py to enable the user to specify the index directory or let my helper function create a directory in the cache. Updated the function call to sem_index inside of sem_join.py to ensure double index creations are not made and the cache is being hit.
lotus/sem_ops/sem_index.py Outdated
| cache_directory=get_index_cache(col_name,data,model_name) | ||
| # Check if index already exists (verify the index FILE exists, not just directory) | ||
| cache_path=Path(cache_directory) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
if this is dependent on the vector store used, this should be shifted inside VectorStore and then call the function here.
lotus/sem_ops/sem_index.py Outdated
| vs.load_index(cache_directory) | ||
| lotus.logger.info(f"cache hit at{cache_directory}") | ||
| else: | ||
| # previous code |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
please add comments so that it makes sense for the code as a whole and not just for this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
for e.g., this should be something like "Index does not exist. Creating new one"
lotus/sem_ops/sem_index.py Outdated
| # Check if index already exists (verify the index FILE exists, not just directory) | ||
| cache_path=Path(cache_directory) | ||
| ifcache_path.exists()and (cache_path/"index").exists():# currently only works for FAISS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Just checking Cache path exists is not enough. We should also check if the data is consistent.
Adding a flag to the function will be helpful to denote whether to override the index or not. (if data is not consistent, then, depending on the fla,g it should override the path or raise an error)
lotus/utils.py Outdated
| >>> path = get_index_cache('category', data, 'text-embedding-3-small') | ||
| >>> # Returns: "~/.cache/lotus/indices/category_298bba13f072bd92cb40396e87cb2aaa" | ||
| """ | ||
| content=str(sorted(data)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
This will create multiple copies of the index as data gets added. I would avoid this, especially given that we don't have a way to clean them without errors
lotus/sem_ops/sem_index.py Outdated
| @operator_cache | ||
| def__call__(self,col_name:str,index_dir:str)->pd.DataFrame: | ||
| def__call__(self,col_name:str,index_dir:str|None=None)->pd.DataFrame: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I don't think it's a good idea to allow index_dir to be None here. col_name can be the same across multiple tables, and that will cause an issue. While I can see utility for having it None here, still forcing the user to give a directory is better.
@liana313 wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
agreed
re comments for readability, improved tests to check consistent data, fixed caching issues and added deterministic caching in run_sem_sim_join, enforced index_dir variable inside of operator cache __call__ method to only be a string.
| vs.index(self._obj[col_name],embeddings,index_dir) | ||
| # Store metadata for data consistency checking (FAISS only). | ||
| ifhasattr(vs,"_store_metadata"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
this should be called directly inindex function.
Definitely not good to call a private function outside class.
| content+=f"__{model_name}" | ||
| current_hash=hashlib.sha256(content.encode()).hexdigest()[:32] | ||
| returnmetadata.get("data_hash")==current_hash |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
we should verify the index configs as well.
The data in vector db depends on the original data, model to create embeddings, and the vector store configs.
| ) | ||
| # Create data hash for consistency checking. | ||
| content=str(sorted(data)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Hash creation is better as a function in the vector store class so that its not repeated everywhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
we can just pass the data in the index function and handle hash creation inside the class itself.
Creating a common function likehash_data that is used in both store_metadata and is_data_consistent would simplify things
Purpose
The purpose of this PR is to fix the duplicated index and inefficiency, addressingIssue #70
Test Plan
Created four test cases to ensure full functionality and of the code. The first test case, test_backwards_compatibility, ensures backwards compatibility in the code where the index_dir parameter still works as before, creating indices in the specified directory rather than the cache. The next tests test the caching functionality where test_cache_directory_location tests cache paths are created in ~/.cache/lotus/indices/, test_cache_files_exist checks that FAISS index and vec files are created in the cache directory, and finally the last test test_dir_name tests the caching logic where identical data produces identical cache paths due to the same SHA256 hash whereas different data produces different paths as expected.
Here are the commands inside of tests/test_index_cache.py that have been added and their expected inputs and outcomes:
different data uses different caches.
Test Results
All four added tests inside of test_index_cache.py passed locally, confirming correct caching behavior for sem_index.
Documentation Update
The documentation for sem_index may need updates as now my PR eliminates the need for explicitly passing in an index directory.
Type of Change
Checklist
BEFORE SUBMITTING, PLEASE READhttps://github.com/lotus-data/lotus/blob/main/CONTRIBUTING.md
anything written below this line will be removed by GitHub Actions