Add PostgreSQL Connection Retry Mechanism with Network Robustness
Summary
Adds automatic retry logic with exponential backoff for PostgreSQL database operations to handle transient network failures gracefully. This enhancement significantly improves system reliability in production environments with unstable network connections.
Problem Statement
Currently, the PostgreSQL implementation lacks resilience against transient network issues:
- ❌ Single connection failure causes immediate operation failure
- ❌ No automatic recovery from temporary network interruptions
- ❌ Connection pool issues can cascade into system-wide failures
- ❌ No protection against connection pool deadlocks during cleanup
This makes the system fragile in production environments with:
- Unstable network connections
- Database server restarts
- Network infrastructure maintenance
- Temporary connection pool exhaustion
Solution
Implements a comprehensive retry mechanism using the tenacity library with the following features:
Core Retry Logic
- ✅ Automatic retries on transient failures (configurable, default: 3 attempts)
- ✅ Exponential backoff with configurable min/max delays
- ✅ Connection pool reset after failures to ensure fresh connections
- ✅ Thread-safe pool management with asyncio locks
- ✅ Timeout protection for pool cleanup operations
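To make the backoff behaviour concrete, here is a minimal, stdlib-only sketch of an exponential delay schedule capped at a maximum. The function name `backoff_delays` and the doubling factor are assumptions for illustration; the PR delegates scheduling to tenacity, whose exact delays may differ.

```python
from typing import List


def backoff_delays(attempts: int, initial: float = 0.5, maximum: float = 5.0) -> List[float]:
    """Return the sleep taken before each retry.

    `attempts` counts total tries, so there are `attempts - 1` sleeps.
    Each delay doubles from `initial` and is capped at `maximum`.
    """
    delays = []
    for retry_index in range(max(0, attempts - 1)):
        delays.append(min(maximum, initial * (2 ** retry_index)))
    return delays
```

With the defaults listed in this PR (3 attempts, 0.5s initial, 5.0s cap), this sketch would sleep twice, for 0.5s and then 1.0s, before giving up.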
Transient Error Handling
Automatically retries on:
- asyncio.TimeoutError / TimeoutError
- ConnectionError / OSError
- asyncpg.exceptions.InterfaceError
- asyncpg.exceptions.TooManyConnectionsError
- asyncpg.exceptions.CannotConnectNowError
- asyncpg.exceptions.PostgresConnectionError
- asyncpg.exceptions.ConnectionDoesNotExistError
- asyncpg.exceptions.ConnectionFailureError
Non-Transient Errors
Correctly handles without retry:
- UniqueViolationError
- DuplicateTableError
- Other application-level errors
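The split between transient and non-transient errors can be sketched as a simple type check. This is an illustration only: `is_transient` and the `UniqueViolation` stand-in class are hypothetical names, and the asyncpg exception types are left out so the snippet runs without asyncpg installed.

```python
import asyncio

# Stdlib exception types from the retry list above. In a real deployment the
# asyncpg types would be appended to this tuple; they are omitted here so the
# sketch has no third-party dependency.
TRANSIENT_EXCEPTIONS = (asyncio.TimeoutError, TimeoutError, ConnectionError, OSError)


class UniqueViolation(Exception):
    """Hypothetical stand-in for an application-level error that must not be retried."""


def is_transient(exc: BaseException) -> bool:
    """Return True when the error is worth retrying."""
    return isinstance(exc, TRANSIENT_EXCEPTIONS)
```

A dropped socket (`ConnectionResetError`, a subclass of `ConnectionError`) would be retried under this check, while a constraint violation would surface to the caller on the first attempt.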
Technical Implementation
Key Changes to lightrag/kg/postgres_impl.py
New Retry Configuration (lines 82-119)
    # Configurable via environment variables
    self.connection_retry_attempts = max(1, min(10, int(os.environ.get("POSTGRES_CONNECTION_RETRIES", 3))))
    self.connection_retry_backoff = max(0.1, min(5.0, float(os.environ.get("POSTGRES_CONNECTION_RETRY_BACKOFF", 0.5))))
    self.connection_retry_backoff_max = max(self.connection_retry_backoff, min(60.0, float(...)))
    self.pool_close_timeout = max(1.0, min(30.0, float(os.environ.get("POSTGRES_POOL_CLOSE_TIMEOUT", 5.0))))
Central Retry Method _run_with_retry() (lines 298-346)
- Orchestrates retry logic for all database operations
- Ensures pool availability before operations
- Handles AGE configuration when needed
- Implements exponential backoff strategy
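The responsibilities listed above can be sketched as a small retry loop. This is a hand-rolled, stdlib-only stand-in, not the PR's code: the PR uses tenacity, and the name `run_with_retry`, its parameters, and the `before_sleep` hook signature are assumptions made for illustration.

```python
import asyncio
from typing import Awaitable, Callable, Optional, Tuple, Type, TypeVar

T = TypeVar("T")


async def run_with_retry(
    operation: Callable[[], Awaitable[T]],
    *,
    attempts: int = 3,
    backoff: float = 0.5,
    backoff_max: float = 5.0,
    transient: Tuple[Type[BaseException], ...] = (TimeoutError, ConnectionError, OSError),
    before_sleep: Optional[Callable[[int, BaseException], Awaitable[None]]] = None,
) -> T:
    """Call `operation` until it succeeds or `attempts` tries are used up."""
    for attempt in range(1, attempts + 1):
        try:
            return await operation()
        except transient as exc:
            if attempt == attempts:
                raise  # out of attempts: surface the last transient error
            if before_sleep is not None:
                # Hook for logging and resetting the pool before the next try.
                await before_sleep(attempt, exc)
            # Exponential backoff, capped at backoff_max.
            await asyncio.sleep(min(backoff_max, backoff * 2 ** (attempt - 1)))
    raise RuntimeError("attempts must be >= 1")
```

Errors outside the `transient` tuple are not caught, so they propagate on the first attempt, which matches the non-retry behaviour described for application-level errors.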
Pool Management Methods
- _ensure_pool(): Lazy pool initialization with double-check locking
- _reset_pool(): Safe pool cleanup with timeout protection
- _before_sleep(): Retry callback with logging and pool reset
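A rough sketch of the two pool-management ideas named above, double-check locking and a time-bounded close, might look like this. The class `PoolHolder` and its `create_pool` factory are hypothetical; the only assumption about the pool object is that it exposes an awaitable `close()`.

```python
import asyncio
from typing import Any, Awaitable, Callable, Optional


class PoolHolder:
    """Holds one lazily created pool; `create_pool` is a hypothetical factory."""

    def __init__(self, create_pool: Callable[[], Awaitable[Any]], close_timeout: float = 5.0):
        self._create_pool = create_pool
        self._close_timeout = close_timeout
        self._pool: Optional[Any] = None
        self._lock = asyncio.Lock()

    async def ensure_pool(self) -> Any:
        pool = self._pool
        if pool is None:                      # first check, no lock taken
            async with self._lock:
                if self._pool is None:        # second check, under the lock
                    self._pool = await self._create_pool()
                pool = self._pool
        return pool

    async def reset_pool(self) -> None:
        async with self._lock:
            # Detach the pool first so new callers create a fresh one.
            pool, self._pool = self._pool, None
        if pool is not None:
            try:
                # Bound the close so a hung connection cannot block recovery.
                await asyncio.wait_for(pool.close(), timeout=self._close_timeout)
            except asyncio.TimeoutError:
                pass  # graceful close timed out; the old pool is abandoned
```

The unlocked first check keeps the success path cheap, while the second check under the lock prevents two concurrent callers from both creating a pool.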
Enhanced Error Reporting (lines 3296-3299)
    "detail": repr(e),  # Better debugging info
    "error_type": e.__class__.__name__,  # Exception type tracking
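A small sketch of why `repr()` helps here: some exceptions carry no message, so `str()` yields an empty string. The two keys come from the snippet above; the helper name `describe_error` is hypothetical.

```python
def describe_error(exc: BaseException) -> dict:
    """Build the two error fields shown above from an exception instance."""
    return {
        "detail": repr(exc),                    # keeps the type even when the message is empty
        "error_type": exc.__class__.__name__,   # stable key for filtering and metrics
    }


empty = TimeoutError()
print(repr(str(empty)))       # ''  (str() loses all information here)
print(describe_error(empty))  # {'detail': 'TimeoutError()', 'error_type': 'TimeoutError'}
```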
Refactored Methods
- query(): Now uses _run_with_retry()
- execute(): Improved error handling with proper logging
- initdb(): Retry logic for initialization
Configuration
All parameters configurable via environment variables with safe defaults:
| Environment Variable | Default | Min | Max | Description |
|---|---|---|---|---|
| POSTGRES_CONNECTION_RETRIES | 3 | 1 | 10 | Number of retry attempts |
| POSTGRES_CONNECTION_RETRY_BACKOFF | 0.5s | 0.1s | 5.0s | Initial backoff delay |
| POSTGRES_CONNECTION_RETRY_BACKOFF_MAX | 5.0s | backoff | 60.0s | Maximum backoff delay |
| POSTGRES_POOL_CLOSE_TIMEOUT | 5.0s | 1.0s | 30.0s | Pool close timeout |
Example configuration:
    POSTGRES_CONNECTION_RETRIES=3
    POSTGRES_CONNECTION_RETRY_BACKOFF=0.5
    POSTGRES_CONNECTION_RETRY_BACKOFF_MAX=5.0
    POSTGRES_POOL_CLOSE_TIMEOUT=5.0
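The min/max columns in the table describe clamping: out-of-range values are pulled back to the nearest bound. A minimal sketch of that pattern follows; `clamped_float` is a hypothetical helper, written to take any mapping so it can be exercised without touching the real environment.

```python
import os
from typing import Mapping


def clamped_float(env: Mapping[str, str], name: str, default: float, low: float, high: float) -> float:
    """Read `name` from `env`, fall back to `default`, and clamp into [low, high].

    A non-numeric value raises ValueError, as a bare float() call would.
    """
    return max(low, min(high, float(env.get(name, default))))


# Example: resolve the pool close timeout from the process environment.
pool_close_timeout = clamped_float(os.environ, "POSTGRES_POOL_CLOSE_TIMEOUT", 5.0, 1.0, 30.0)
```

Under this sketch, setting POSTGRES_POOL_CLOSE_TIMEOUT=100 would resolve to 30.0, and setting it to 0 would resolve to 1.0.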
Behavior Changes
Improved Reliability
- Before: Single network hiccup → immediate failure
- After: Automatic retry with exponential backoff → the operation succeeds once the transient fault clears, or fails after the configured attempts are exhausted
Better Logging
- Warning logs for each retry attempt with attempt number
- Detailed error information with repr() instead of str()
- Exception type tracking for better debugging
Configuration Visibility
Startup log shows active retry configuration:
INFO: PostgreSQL, Retry config: attempts=3, backoff=0.50s, backoff_max=5.00s, pool_close_timeout=5.00s
Breaking Changes
None. This is a backward-compatible enhancement:
- ✅ All existing code continues to work unchanged
- ✅ Successful operations behave as before; retries are enabled by default and only change what happens on transient failures
- ✅ Optional environment variables for customization
- ✅ No API changes
Performance Impact
- Normal operations: Negligible overhead (single pool check)
- Failure scenarios: Additional latency from retries (expected trade-off)
- Concurrent operations: No additional overhead (lock-free in success path)