Feat: Add Workspace Isolation for Pipeline Status and In-memory Storage #2369
danielaskdd merged 86 commits into main
Problem:
In multi-tenant scenarios, different workspaces share a single global pipeline_status namespace, causing pipelines from different tenants to block each other, severely impacting concurrent processing performance.

Solution:
- Extended get_namespace_data() to recognize workspace-specific pipeline namespaces with pattern "{workspace}:pipeline" (following GraphDB pattern)
- Added workspace parameter to initialize_pipeline_status() for per-tenant isolated pipeline namespaces
- Updated all 7 call sites to use workspace-aware locks:
  * lightrag.py: process_document_queue(), aremove_document()
  * document_routes.py: background_delete_documents(), clear_documents(), cancel_pipeline(), get_pipeline_status(), delete_documents()

Impact:
- Different workspaces can process documents concurrently without blocking
- Backward compatible: empty workspace defaults to "pipeline_status"
- Maintains fail-fast: uninitialized pipeline raises clear error
- Expected N× performance improvement for N concurrent tenants

Bug fixes:
- Fixed AttributeError by using self.workspace instead of self.global_config
- Fixed pipeline status endpoint to show workspace-specific status
- Fixed delete endpoint to check workspace-specific busy flag

Code changes: 4 files, 141 insertions(+), 28 deletions(-)
Testing: All syntax checks passed, comprehensive workspace isolation tests completed

Fixes two compatibility issues in workspace isolation:

1. Problem: lightrag_server.py calls initialize_pipeline_status() without workspace parameter, causing pipeline to initialize in global namespace instead of rag's workspace.
   Solution: Add set_default_workspace() mechanism in shared_storage. LightRAG.initialize_storages() now sets default workspace, which initialize_pipeline_status() uses when called without parameters.
2. Problem: /health endpoint hardcoded to use "pipeline_status", cannot return workspace-specific status or support frontend workspace selection.
   Solution: Add LIGHTRAG-WORKSPACE header support. Endpoint now extracts workspace from header or falls back to server default, returning correct workspace-specific pipeline status.

Changes:
- lightrag/kg/shared_storage.py: Add set/get_default_workspace()
- lightrag/lightrag.py: Call set_default_workspace() in initialize_storages()
- lightrag/api/lightrag_server.py: Add get_workspace_from_request() helper, update /health endpoint to support LIGHTRAG-WORKSPACE header

Testing:
- Backward compatibility: Old code works without modification
- Multi-instance safety: Explicit workspace passing preserved
- /health endpoint: Supports both default and header-specified workspaces

Related: #2353
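The header fallback and the namespace rule from the two commit messages above can be sketched as follows. This is a minimal illustration, not the code from lightrag_server.py: `resolve_workspace` and `pipeline_namespace` are hypothetical names, request headers are modeled as a plain dict, and the `"{workspace}:pipeline"` pattern is copied from the first commit message.

```python
from typing import Mapping

# Name used when no workspace is set (stated in the commit message).
DEFAULT_PIPELINE_NAMESPACE = "pipeline_status"


def resolve_workspace(headers: Mapping[str, str], server_default: str = "") -> str:
    """Pick the workspace from the LIGHTRAG-WORKSPACE header, else the server default.

    Hypothetical helper: the real get_workspace_from_request() may differ.
    """
    # HTTP header names are case-insensitive, so compare in lower case.
    lowered = {name.lower(): value for name, value in headers.items()}
    value = lowered.get("lightrag-workspace", "").strip()
    return value or server_default


def pipeline_namespace(workspace: str) -> str:
    """Map a workspace to its pipeline namespace; empty workspace keeps the old name."""
    if not workspace:
        return DEFAULT_PIPELINE_NAMESPACE
    return f"{workspace}:pipeline"
```

With this shape, a request without the header behaves exactly like the pre-change server, which is the backward-compatibility property the commit describes.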
…heavy `chunking_func` is passed in by user
- Add Awaitable and Union type imports
- Update chunking_func type annotation
- Handle coroutine results with await
- Add return type validation
- Update docstring for async support
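A rough sketch of how a caller can accept both synchronous and asynchronous chunking functions, as this commit describes. The single-argument signature, the `run_chunking` name, and the chunk shape are simplifying assumptions for illustration; the real `chunking_func` in LightRAG takes more parameters.

```python
import asyncio
import inspect
from typing import Any, Awaitable, Callable, Dict, List, Union

ChunkList = List[Dict[str, Any]]
# Assumed simplified signature: text in, chunks (or an awaitable of chunks) out.
ChunkingFunc = Callable[[str], Union[ChunkList, Awaitable[ChunkList]]]


async def run_chunking(chunking_func: ChunkingFunc, text: str) -> ChunkList:
    """Call a user-supplied chunker that may be sync or async."""
    result = chunking_func(text)
    # Coroutines and other awaitables are awaited; plain values pass through.
    if inspect.isawaitable(result):
        result = await result
    # Validate the return type so a bad chunker fails early with a clear error.
    if not isinstance(result, (list, tuple)):
        raise TypeError(
            f"chunking_func must return a list or tuple, got {type(result).__name__}"
        )
    return list(result)


def sync_chunker(text: str) -> ChunkList:
    return [{"content": part} for part in text.split("\n\n")]


async def async_chunker(text: str) -> ChunkList:
    await asyncio.sleep(0)  # stands in for heavy asynchronous work
    return sync_chunker(text)
```

Both `asyncio.run(run_chunking(sync_chunker, text))` and `asyncio.run(run_chunking(async_chunker, text))` go through the same code path, so existing synchronous chunkers keep working.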
- Update import from PyPDF2 to pypdf
- Change dependency to pypdf>=6.1.0
- Update all requirements files
- Remove PyPDF2 from lock file
- Use modern pypdf library

• Add _sanitize_json_data helper function
• Recursively clean strings in data
• Sanitize before JSON serialization
• Prevent encoding-related crashes
• Use existing sanitize_text_for_encoding

• Remove surrogate characters (U+D800-DFFF)
• Filter Unicode non-characters
• Direct char-by-char filtering

- Sanitize dictionary keys
- Preserve tuple types
- Handle nested structures better
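The three sanitization commits above can be illustrated with a small recursive cleaner. This is a sketch under assumptions: `sanitize` and `to_utf8_json` are hypothetical names, and the character filter is written inline rather than calling the project's `sanitize_text_for_encoding`.

```python
import json
from typing import Any


def _is_unsafe(ch: str) -> bool:
    """True for lone surrogates and Unicode non-characters."""
    cp = ord(ch)
    return (
        0xD800 <= cp <= 0xDFFF          # surrogate range
        or 0xFDD0 <= cp <= 0xFDEF       # non-character block
        or (cp & 0xFFFE) == 0xFFFE      # U+xFFFE / U+xFFFF in every plane
    )


def sanitize(obj: Any) -> Any:
    """Recursively strip unsafe characters from strings, including dict keys."""
    if isinstance(obj, str):
        return "".join(ch for ch in obj if not _is_unsafe(ch))
    if isinstance(obj, dict):
        return {sanitize(key): sanitize(value) for key, value in obj.items()}
    if isinstance(obj, list):
        return [sanitize(item) for item in obj]
    if isinstance(obj, tuple):
        return tuple(sanitize(item) for item in obj)  # keep tuples as tuples
    return obj


def to_utf8_json(data: Any) -> bytes:
    """Serialize after cleaning so UTF-8 encoding cannot hit a lone surrogate."""
    return json.dumps(sanitize(data), ensure_ascii=False).encode("utf-8")
```

Without the cleaning step, encoding a string that contains a lone surrogate to UTF-8 raises `UnicodeEncodeError`, which is the kind of crash the commit messages refer to.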
- Bump API version to 0254
- Remove response format UI controls
- Hard-code response_type in query params
- Add migration for version 19
- Clean up settings store structure

- Fast path for clean data (no sanitization)
- Slow path sanitizes during encoding
- Reload shared memory after sanitization
- Custom encoder avoids deep copies
- Comprehensive test coverage

- Precompile regex pattern at module level
- Zero-copy path for clean strings
- Use C-level regex for performance
- Remove deprecated _sanitize_json_data
- Fast detection for common case
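The fast-path idea in the last commit above might look roughly like this. It is a sketch, not the project's code: `_UNSAFE_CHARS` and `clean_string` are hypothetical names, and the character class covers only surrogates and Basic Multilingual Plane non-characters.

```python
import re

# Compiled once at import time; matching then runs in the C regex engine.
_UNSAFE_CHARS = re.compile("[\ud800-\udfff\ufdd0-\ufdef\ufffe\uffff]")


def clean_string(text: str) -> str:
    """Return the input unchanged when clean, otherwise a filtered copy."""
    if _UNSAFE_CHARS.search(text) is None:
        return text  # common case: same object back, no new string built
    return _UNSAFE_CHARS.sub("", text)
```

Because clean input is returned as the same object, callers can use an identity check (`result is text`) to detect whether any sanitization happened.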
• Reload cleaned data after sanitization
• Update shared memory with clean data
• Add specific surrogate char tests
• Test migration sanitization flow
• Prevent dirty data in memory

• Replace truthy checks with `is not None`
• Handle empty dict edge case properly
• Prevent data reload failures
• Add comprehensive test coverage
• Fix JsonKVStorage and DocStatusStorage
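The empty-dict edge case in the commit above comes down to the difference between a truthy check and an explicit `None` check. A minimal sketch, with hypothetical function names and the assumption that `None` means "nothing was reloaded":

```python
from typing import Dict, Optional


def pick_truthy(reloaded: Optional[Dict], current: Dict) -> Dict:
    """Buggy variant: an empty dict is falsy, so valid empty data is ignored."""
    return reloaded if reloaded else current


def pick_explicit(reloaded: Optional[Dict], current: Dict) -> Dict:
    """Fixed variant: only None means there is no reloaded data."""
    return reloaded if reloaded is not None else current
```

With `stale = {"doc1": "dirty"}`, `pick_truthy({}, stale)` keeps the stale data, while `pick_explicit({}, stale)` correctly switches to the empty dict.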
- Merge offline-docs into api extras
- Remove pipmaster dynamic installs
- Add async document processing
- Pre-check docling availability
- Update offline deployment docs

• Add lazy config initialization
• Maintain backward compatibility
• Support programmatic usage
• Add gunicorn dependency
• Explicit config in entry points
… host for cloud models
- Add --docling CLI flag for easier setup
- Add numpy version constraints
- Exclude docling on macOS (fork-safety)

- Initialize result vars to None
- Add null checks before consume calls
- Prevent crashes in except blocks
- Apply fix to both Neo4J and Memgraph
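The result-handling fix above follows a common pattern: bind the variable before the `try`, then guard the cleanup call. The sketch below uses a stand-in `FakeResult` class and a hypothetical `run_query` helper; it does not use the real Neo4j or Memgraph driver API.

```python
import asyncio
from typing import Awaitable, Callable, List


class FakeResult:
    """Stand-in for a driver result object (assumption, not a real driver class)."""

    def __init__(self, fail_on_read: bool = False) -> None:
        self.fail_on_read = fail_on_read
        self.consumed = False

    async def read(self) -> List[str]:
        if self.fail_on_read:
            raise ValueError("read failed")
        return ["row"]

    async def consume(self) -> None:
        self.consumed = True


async def run_query(open_result: Callable[[], Awaitable[FakeResult]]) -> List[str]:
    result = None  # bound up front so the except block can always test it
    try:
        result = await open_result()
        rows = await result.read()
        await result.consume()
        return rows
    except Exception:
        # Only consume when a result object was actually created; otherwise
        # the cleanup itself would raise and mask the original error.
        if result is not None:
            await result.consume()
        raise
```

If `open_result()` itself fails, the original exception propagates unchanged instead of being replaced by an error from the cleanup code.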
- Add max_token_size=8192 to all embed funcs
- Move siliconcloud to deprecated folder
- Import wrap_embedding_func_with_attrs
- Update EmbeddingFunc docstring
- Fix langfuse import type annotation

• Add specific exception types
• Implement proper retry mechanism
• Better error classification
• Enhanced logging and validation
• Enable embedding retry decorator

- Add EMBEDDING_TOKEN_LIMIT env var
- Set max_token_size on embedding func
- Add token limit property to LightRAG
- Validate summary length vs limit
- Log warning when limit exceeded
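One way the environment-variable limit and the length check from the last commit above could fit together is sketched here. Only the `EMBEDDING_TOKEN_LIMIT` name comes from the commit message; the helper names, the "unset means no limit" behavior, and the use of a precomputed token count are assumptions.

```python
import logging
import os
from typing import Optional

logger = logging.getLogger("embedding_limit_sketch")


def read_embedding_token_limit(default: Optional[int] = None) -> Optional[int]:
    """Read EMBEDDING_TOKEN_LIMIT; unset or blank falls back to the default."""
    raw = os.environ.get("EMBEDDING_TOKEN_LIMIT", "").strip()
    return int(raw) if raw else default


def summary_fits(summary_token_count: int, limit: Optional[int]) -> bool:
    """Warn (without failing) when a summary is longer than the embedding limit."""
    if limit is not None and summary_token_count > limit:
        logger.warning(
            "Summary length %d exceeds embedding token limit %d",
            summary_token_count,
            limit,
        )
        return False
    return True
```

The check only logs and reports; what the caller does with an oversized summary (truncate, re-summarize, or proceed) is left open here.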
• Change warning to info level
• Simplify workspace mismatch wording

• Add async sleep to mock functions
• Test concurrent ainsert operations
• Use asyncio.gather for parallel exec
• Measure concurrent execution time
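The concurrency test described above can be approximated with a mock that sleeps asynchronously and a timer around `asyncio.gather`. `mock_ainsert` and `timed_concurrent_inserts` are hypothetical stand-ins, not the project's test code.

```python
import asyncio
import time
from typing import List, Sequence, Tuple


async def mock_ainsert(doc: str, delay: float) -> str:
    """Mock insert: the async sleep yields control so other inserts can run."""
    await asyncio.sleep(delay)
    return f"inserted:{doc}"


async def timed_concurrent_inserts(
    docs: Sequence[str], delay: float = 0.1
) -> Tuple[List[str], float]:
    """Run all inserts concurrently and return results plus wall-clock time."""
    start = time.perf_counter()
    results = await asyncio.gather(*(mock_ainsert(doc, delay) for doc in docs))
    return list(results), time.perf_counter() - start
```

If the inserts really overlap, the elapsed time stays close to one `delay` rather than `len(docs) * delay`, which is the signal such a test looks for.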
• Add pytest command-line options
• Create session-scoped fixtures
• Remove hardcoded environment vars
• Update test function signatures
• Improve configuration priority

• Remove unused tempfile import
• Use consistent project temp/ structure
• Clean up existing directories first
• Create directories with os.makedirs
• Use descriptive test directory names

- Add GitHub Actions workflow for CI
- Mark integration tests requiring services
- Add offline test markers for isolated tests
- Skip integration tests by default
- Configure pytest markers and collection

* Define pytest marker patterns
* Document CI/CD test execution
* Specify offline vs integration tests
* Add test isolation best practices
* Reference testing guidelines doc
danielaskdd commented Nov 18, 2025
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
• Remove pytest group dependency
• Add explicit pytest>=8.4.2
• Add pytest-asyncio>=1.2.0
• Add pre-commit directly
• Fix potential circular dependency

• Use rsplit instead of split
• Handle colons in workspace names
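The `rsplit` change matters because a workspace name may itself contain colons. A small sketch, assuming a combined name of the form `<workspace>:<namespace>` and a hypothetical helper name:

```python
from typing import Tuple


def split_final_namespace(final_namespace: str) -> Tuple[str, str]:
    """Split '<workspace>:<namespace>' at the LAST colon only."""
    if ":" not in final_namespace:
        return "", final_namespace  # no workspace prefix at all
    workspace, namespace = final_namespace.rsplit(":", 1)
    return workspace, namespace
```

For `"tenant:eu:pipeline_status"`, splitting from the right keeps the workspace as `"tenant:eu"`, whereas `split(":")` would cut it into three pieces and lose the colon inside the workspace name.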
danielaskdd commented Nov 18, 2025
@codex review
danielaskdd commented Nov 18, 2025
@codex review
• Track if we acquired the pipeline lock
• Auto-acquire pipeline when idle
• Only release if we acquired it
• Prevent concurrent deletion conflicts
• Improve deletion job validation
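The acquire-and-release bookkeeping from this commit can be sketched as below. Everything here is an assumption made for illustration: the `busy` and `job_name` keys, the job-name check, and the `delete_document` helper are hypothetical and may not match the real deletion code.

```python
import asyncio
from typing import Awaitable, Callable, Dict


async def delete_document(
    pipeline_status: Dict,
    status_lock: asyncio.Lock,
    do_delete: Callable[[], Awaitable[None]],
) -> None:
    """Claim the pipeline when idle and release it only if this call claimed it."""
    we_acquired = False
    async with status_lock:
        if not pipeline_status.get("busy", False):
            # Idle pipeline: claim it for this deletion.
            pipeline_status["busy"] = True
            pipeline_status["job_name"] = "Deleting 1 document"
            we_acquired = True
        elif "deleting" not in pipeline_status.get("job_name", "").lower():
            # Busy with something that is not a deletion job: refuse to run.
            raise RuntimeError("pipeline is busy with a non-deletion job")
    try:
        await do_delete()
    finally:
        # A caller that already held the pipeline (for example a batch job)
        # stays responsible for releasing it.
        if we_acquired:
            async with status_lock:
                pipeline_status["busy"] = False
```

Tracking `we_acquired` keeps a single deletion that runs inside a batch job from clearing the busy flag that the batch job still relies on.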
danielaskdd commented Nov 18, 2025
@codex review

• TestResult → ExecutionResult
• TestStats → ExecutionStats
• Update class docstrings
• Update type hints
• Update variable references
• Clarify job naming for single vs batch deletion
• Update job name validation in busy pipeline check

danielaskdd commented Nov 18, 2025
@codex review
Codex Review: Didn't find any major issues. What shall we delve into next?
Merged commit dfbc973 into main
Feat: Add Workspace Isolation for Pipeline Status and In-memory Storage
🎯 Problem Statement
When multiple LightRAG objects with different `workspace` values are instantiated simultaneously, the following issues occur:

- All workspaces share one global `pipeline_status`, causing pipeline states from different workspaces to interfere with each other
- The global locks (`_pipeline_status_lock`, `_graph_db_lock`, `_storage_lock`) are not workspace-isolated, causing operations from different workspaces to block each other unnecessarily

✨ Solution
1. Workspace Isolation for Pipeline Status

- Treat `pipeline_status` as a special namespace (storage type), similar to KV storage but without persistence
- Namespace format: `<workspace>:pipeline_status`

2. Unified Workspace-Based Lock Mechanism
- Legacy global locks: `_pipeline_status_lock`, `_graph_db_lock`, `_storage_lock`
- `_storage_keyed_lock`
- Lock naming: `<workspace>:<storage_type>`
- `default_key`

3. New `get_namespace_lock()` Function

4. Add Workspace Parameter to All Namespace Operations
Updated function signatures to support workspace parameter:

- `initialize_pipeline_status(workspace: str | None = None)`
- `get_namespace_data(namespace: str, first_init: bool = False, workspace: str | None = None)`
- `get_update_flag(namespace: str, workspace: str | None = None)`
- `set_all_update_flags(namespace: str, workspace: str | None = None)`
- `clear_all_update_flags(namespace: str, workspace: str | None = None)`
- `get_all_update_flags_status(workspace: str | None = None)`
- `try_initialize_namespace(namespace: str, workspace: str | None = None)`

5. Default Workspace Support (Backward Compatibility)
- `_default_workspace`
- `set_default_workspace(workspace: str | None = None)`
- `get_default_workspace() -> str`

6. Unified Namespace Naming Convention
Added `get_final_namespace()` function:

- `<workspace>:<namespace>` or `<namespace>` (when workspace is empty)

7. Standardize empty workspace handling from "_" to "" across storage
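The default-workspace fallback, the naming convention, and the per-namespace locks described in the solution items up to this point can be sketched together. This is a stand-alone illustration that reuses the function names from this description; the bodies, the `get_final_namespace` and `get_namespace_lock` parameter lists, and the plain `asyncio.Lock` registry are assumptions, not the code in `lightrag/kg/shared_storage.py`.

```python
import asyncio
from typing import Dict, Optional

_default_workspace: str = ""
_namespace_locks: Dict[str, asyncio.Lock] = {}


def set_default_workspace(workspace: Optional[str] = None) -> None:
    """Remember the workspace used when callers do not pass one."""
    global _default_workspace
    _default_workspace = workspace or ""


def get_default_workspace() -> str:
    return _default_workspace


def get_final_namespace(namespace: str, workspace: Optional[str] = None) -> str:
    """Build '<workspace>:<namespace>', or just '<namespace>' for an empty workspace."""
    if workspace is None:  # assumed: None means "use the default workspace"
        workspace = get_default_workspace()
    return f"{workspace}:{namespace}" if workspace else namespace


def get_namespace_lock(namespace: str, workspace: Optional[str] = None) -> asyncio.Lock:
    """Return one lock per final namespace, so workspaces never share a lock."""
    key = get_final_namespace(namespace, workspace)
    if key not in _namespace_locks:
        _namespace_locks[key] = asyncio.Lock()
    return _namespace_locks[key]
```

Keying the lock registry by the final namespace is what lets two workspaces hold their own `pipeline_status` lock at the same time, while code that never sets a workspace keeps using the unprefixed name.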
8. Auto-initialize pipeline status in `initialize_storages()`

📝 Key Modified Files
- `lightrag/kg/shared_storage.py`: Core modification file
  - `get_namespace_lock()`
  - `get_final_namespace()`

Storage Implementation Files (using new lock mechanism):

- `lightrag/kg/json_kv_impl.py`
- `lightrag/kg/json_doc_status_impl.py`
- `lightrag/kg/nano_vector_db_impl.py`
- `lightrag/kg/faiss_impl.py`
- `lightrag/kg/networkx_impl.py`
- Use `get_namespace_lock()` instead of legacy locks

API and Core Logic Files:

- `lightrag/lightrag.py`: Set default workspace
- `lightrag/api/lightrag_server.py`: Pipeline status initialization
- `lightrag/api/routers/document_routes.py`: Use new namespace lock interface

🧪 Testing Recommendations
🎉 Expected Outcomes