Fix duplicate index creation in sem_join #218


Open
Ohm-Rajpal wants to merge 3 commits into lotus-data:main from Ohm-Rajpal:bugfix/duplicate-index-join-fix

Conversation

@Ohm-Rajpal

Purpose

This PR fixes the duplicate index creation and the resulting inefficiency, addressing Issue #70.

  • Added a utility function in utils.py that computes a cache path for a semantic index and reuses the existing index when the data has not changed (see the sketch after this list).
  • Improved the logic inside sem_index.py so the user can either specify an index directory or let the helper function create one in the cache.
  • Updated the sem_index call inside sem_join.py so that indices are not created twice and the cache is hit.
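
For context, here is a minimal sketch of what such a helper could look like, pieced together from the review snippets further down (the SHA256-of-sorted-data hashing and the ~/.cache/lotus/indices/ layout); the exact signature and behavior in the PR may differ.

import hashlib
from pathlib import Path

def get_index_cache(col_name: str, data, model_name: str) -> str:
    """Return a deterministic cache path for a semantic index.

    The path depends on the column name, the data contents, and the
    embedding model, so unchanged inputs always map to the same directory.
    """
    # Hash the sorted data together with the model name so that a change
    # in either produces a different cache directory (assumed scheme).
    content = str(sorted(data)) + f"__{model_name}"
    data_hash = hashlib.sha256(content.encode()).hexdigest()[:32]
    return str(Path.home() / ".cache" / "lotus" / "indices" / f"{col_name}_{data_hash}")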

Test Plan

Created four test cases to ensure full functionality of the code. The first, test_backwards_compatibility, verifies backward compatibility: the index_dir parameter still works as before, creating indices in the specified directory rather than in the cache. The remaining tests exercise the caching functionality: test_cache_directory_location checks that cache paths are created under ~/.cache/lotus/indices/, test_cache_files_exist checks that the FAISS index and vecs files are created in the cache directory, and test_dir_name checks that identical data produces identical cache paths (same SHA256 hash) while different data produces different paths.

Here are the tests that have been added in tests/test_index_cache.py and their expected behavior:

  • test_backwards_compatibility(self, sample_df, sample_df_2): Test backward compatibility when an index directory is passed explicitly.
  • test_cache_directory_location(self, sample_df): Test that the cache is created in ~/.cache/lotus/indices/.
  • test_cache_files_exist(self, sample_df): Test that caching creates both 'index' and 'vecs' files.
  • test_dir_name(self, sample_df, sample_df_2): Test that calling sem_index twice with the same data reuses the cache, while different data uses a different cache (see the sketch after this list).
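
For illustration, a simplified version of the last test is sketched below; it calls the (assumed) get_index_cache helper directly instead of running the full sem_index pipeline, so the data and assertions here are stand-ins rather than the actual test code.

from lotus.utils import get_index_cache  # assumed location of the new helper

def test_dir_name():
    data_a = ["apple", "banana", "cherry"]
    data_b = ["dog", "cat", "bird"]
    model = "text-embedding-3-small"

    # Identical data and model should resolve to the identical cache path.
    assert get_index_cache("fruit", data_a, model) == get_index_cache("fruit", data_a, model)

    # Different data should resolve to a different cache path.
    assert get_index_cache("fruit", data_a, model) != get_index_cache("fruit", data_b, model)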

Test Results

All four tests added in test_index_cache.py passed locally, confirming correct caching behavior for sem_index.

Documentation Update

The documentation for sem_index may need updates, since this PR eliminates the need to explicitly pass an index directory.

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Performance improvement
  • Refactoring (no functional changes)

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, updating docstrings
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

cache_directory = get_index_cache(col_name, data, model_name)

# Check if index already exists (verify the index FILE exists, not just directory)
cache_path = Path(cache_directory)
Collaborator:

If this is dependent on the vector store used, this check should be moved inside VectorStore and then called from here.
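
A minimal sketch of what that refactor could look like, assuming a FAISS-backed store; the class and method names here are illustrative, not the actual lotus API:

from pathlib import Path

class FaissVS:  # illustrative stand-in for the FAISS-backed vector store
    def index_exists(self, index_dir: str) -> bool:
        """Return True if a previously built index is present in index_dir.

        Keeps the FAISS-specific file layout (the "index" file) inside the
        vector store instead of leaking it into sem_index.py.
        """
        path = Path(index_dir)
        return path.exists() and (path / "index").exists()

sem_index.py could then call vs.index_exists(cache_directory) before vs.load_index(cache_directory), without knowing which backend is in use.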

    vs.load_index(cache_directory)
    lotus.logger.info(f"cache hit at {cache_directory}")
else:
    # previous code
Collaborator:

Please add comments so that they make sense for the code as a whole and not just for this PR.

Collaborator:

For example, this should be something like "Index does not exist. Creating a new one".


# Check if index already exists (verify the index FILE exists, not just directory)
cache_path = Path(cache_directory)
if cache_path.exists() and (cache_path / "index").exists():  # currently only works for FAISS
Collaborator:

Just checking that the cache path exists is not enough. We should also check that the data is consistent.

Adding a flag to the function would be helpful to denote whether to override the index or not (if the data is not consistent, then, depending on the flag, it should either override the path or raise an error).
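
A rough sketch of the suggested behavior, assuming the index directory also stores a metadata file with the data hash (as in the snippets below); the function name, metadata filename, and overwrite parameter are assumptions for illustration:

import hashlib
import json
from pathlib import Path

def can_reuse_index(cache_directory: str, data, model_name: str, overwrite: bool = False) -> bool:
    """Return True if the cached index is consistent and can be reused, False if it must be (re)built."""
    cache_path = Path(cache_directory)
    meta_file = cache_path / "metadata.json"  # assumed metadata location
    if not (cache_path / "index").exists() or not meta_file.exists():
        return False  # nothing usable cached yet; build a new index

    metadata = json.loads(meta_file.read_text())
    content = str(sorted(data)) + f"__{model_name}"
    current_hash = hashlib.sha256(content.encode()).hexdigest()[:32]

    if metadata.get("data_hash") == current_hash:
        return True  # cache hit: same data and model
    if overwrite:
        return False  # stale index: caller rebuilds into the same path
    raise ValueError(
        f"Cached index at {cache_directory} was built from different data; "
        "pass overwrite=True to rebuild it."
    )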

>>> path = get_index_cache('category', data, 'text-embedding-3-small')
>>> # Returns: "~/.cache/lotus/indices/category_298bba13f072bd92cb40396e87cb2aaa"
"""
content = str(sorted(data))
Collaborator:

This will create multiple copies of the index as data gets added. I would avoid this, especially given that we don't have a way to clean them up without errors.


@operator_cache
def __call__(self, col_name: str, index_dir: str) -> pd.DataFrame:
def __call__(self, col_name: str, index_dir: str | None = None) -> pd.DataFrame:
Collaborator:

I don't think it's a good idea to allow index_dir to be None here. col_name can be the same across multiple tables, and that will cause an issue. While I can see the utility of allowing None here, it is still better to force the user to give a directory.

@liana313 wdyt?

Collaborator:

agreed

Re: comments for readability — improved tests to check data consistency, fixed caching issues, added deterministic caching in run_sem_sim_join, and enforced that the index_dir variable inside the operator cache __call__ method is always a string.
vs.index(self._obj[col_name], embeddings, index_dir)

# Store metadata for data consistency checking (FAISS only).
if hasattr(vs, "_store_metadata"):
Collaborator:

This should be called directly in the index function.
It is definitely not good to call a private function from outside the class.

content += f"__{model_name}"
current_hash = hashlib.sha256(content.encode()).hexdigest()[:32]

return metadata.get("data_hash") == current_hash
Collaborator:

We should verify the index configs as well.

The data in the vector DB depends on the original data, the model used to create the embeddings, and the vector store configs.

)

# Create data hash for consistency checking.
content = str(sorted(data))
Collaborator:

Hash creation is better as a function in the vector store class so that it's not repeated everywhere.

Collaborator:

We can just pass the data to the index function and handle hash creation inside the class itself.
Creating a common function like hash_data that is used by both store_metadata and is_data_consistent would simplify things.
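
A minimal sketch of that refactor, reusing the hash scheme from the snippet above; the class layout and method names are assumptions rather than the actual lotus VectorStore API:

import hashlib
import json
from pathlib import Path

class VectorStore:  # illustrative base class, not the actual lotus implementation
    def __init__(self, model_name: str):
        self.model_name = model_name

    def hash_data(self, data) -> str:
        # Single source of truth for the data hash, shared by metadata
        # storage and the consistency check.
        content = str(sorted(data)) + f"__{self.model_name}"
        return hashlib.sha256(content.encode()).hexdigest()[:32]

    def store_metadata(self, index_dir: str, data) -> None:
        # Persist the hash next to the index so later calls can compare it.
        metadata = {"data_hash": self.hash_data(data), "model": self.model_name}
        (Path(index_dir) / "metadata.json").write_text(json.dumps(metadata))

    def is_data_consistent(self, index_dir: str, data) -> bool:
        meta_file = Path(index_dir) / "metadata.json"
        if not meta_file.exists():
            return False
        metadata = json.loads(meta_file.read_text())
        return metadata.get("data_hash") == self.hash_data(data)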


Reviewers

@liana313 left review comments

@harshitgupta412 left review comments

At least 1 approving review is required to merge this pull request.


3 participants

@Ohm-Rajpal, @liana313, @harshitgupta412
