Adding rank based logging for torch distributed examples #3897
base: abose/trt_llm_installation_dist
Conversation
31666e3 to 52ae92a (force-push)

```python
return device_mesh, world_size, rank
```

```python
# Set C++ TensorRT runtime log level based on most verbose handler
# this is similar to set_log_level()
cpp_level = min(file_level_int, console_level_int)
```
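The `min()` here relies on the ordering of Python logging levels (DEBUG=10 < INFO=20 < WARNING=30): the numerically lowest handler level is the most verbose, so the C++ runtime is configured to emit at least as much as any Python handler. A minimal standalone sketch of that idea (the handler setup below is illustrative, not the PR's actual code):

```python
import logging

def most_verbose_level(handlers):
    """Pick the most verbose (numerically lowest) level among handlers.

    DEBUG(10) < INFO(20) < WARNING(30), so min() selects the handler
    that lets the most messages through.
    """
    if not handlers:
        return logging.WARNING  # assumption: a conservative default
    return min(h.level for h in handlers)

# Stand-ins for a file handler and a console handler:
file_handler = logging.Handler(logging.DEBUG)
console_handler = logging.Handler(logging.INFO)

cpp_level = most_verbose_level([file_handler, console_handler])
print(cpp_level)  # 10, i.e. logging.DEBUG
```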
Don't we have an API that abstracts needing to detect if the C++ runtime is available? If not, we should add one.
I have added a function in _features.py for the above, and also moved all this to logging.py. Let me know if that function placement works.
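For reference, a feature check of this kind is usually an import probe cached at module load. Everything below is a hedged sketch: the helper names (`has_tensorrt_runtime`, `set_cpp_log_level`) and the probed module name are assumptions modeled on the review comment, not the PR's actual `_features.py` code.

```python
import importlib.util
from functools import lru_cache

@lru_cache(maxsize=None)
def has_tensorrt_runtime() -> bool:
    """Hypothetical feature flag: True when the C++ TensorRT runtime
    bindings are importable (the module name here is an assumption)."""
    return importlib.util.find_spec("tensorrt") is not None

def set_cpp_log_level(level: int) -> bool:
    """Forward a Python logging level to the C++ runtime, but only
    when the runtime is actually present. Returns True on success."""
    if not has_tensorrt_runtime():
        return False
    # ...call into the native logger here...
    return True
```

Caching the probe means the availability question is answered once per process, so call sites never need their own try/except import dance.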
```diff
-    not is_platform_supported_for_trtllm(),
-    "Skipped on Windows, Jetson and CUDA13: NCCL backend is not supported.",
+    not is_distributed_nccl_available(),
+    "Skipped: NCCL backend is not available (Windows/Jetson not supported).",
```
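The hunk above swaps a platform gate for an availability probe. As a sketch of how such a guard typically composes with `unittest` skip decorators (the probe body below is a hypothetical stand-in, not the PR's helper):

```python
import sys
import unittest

def is_distributed_nccl_available() -> bool:
    """Hypothetical stand-in for the PR's helper: NCCL only ships in
    Linux builds, and a real probe asks torch itself as well."""
    if not sys.platform.startswith("linux"):
        return False
    try:
        import torch.distributed as dist
    except ImportError:
        return False
    return dist.is_nccl_available()

@unittest.skipIf(
    not is_distributed_nccl_available(),
    "Skipped: NCCL backend is not available (Windows/Jetson Orin not supported).",
)
class TestNCCLOps(unittest.TestCase):
    def test_smoke(self):
        self.assertTrue(True)
```

An availability check keeps the skip message truthful on any future platform where NCCL is absent, rather than hard-coding a platform list.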
Is it jetson or just Orin?
Yeah, Orin. Changed to Jetson Orin.
There are some changes that do not conform to Python style guidelines:
```diff
--- /home/runner/work/TensorRT/TensorRT/.github/scripts/filter-matrix.py	2025-12-02 00:37:46.920408+00:00
+++ /home/runner/work/TensorRT/TensorRT/.github/scripts/filter-matrix.py	2025-12-02 00:38:18.669710+00:00
@@ -148,11 +148,11 @@
         item,
         options.jetpack == "true",
         options.limit_pr_builds == "true",
     ):
         print(f"[DEBUG] passed filter - adding to build matrix", file=sys.stderr)
-        filtered_includes.append(item)
+        filtered_includes.append(item)
         distributed_includes.append(create_distributed_config(item))
     else:
         print(f"[DEBUG] FILTERED OUT", file=sys.stderr)

 # Debug: Show summary
```
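The `-`/`+` pair in this hunk looks identical because the only difference is trailing whitespace, which is exactly the class of change automatic formatters strip. A small sketch of detecting that kind of invisible violation:

```python
def trailing_ws_lines(text: str) -> list[int]:
    """Return 1-based numbers of lines carrying trailing whitespace,
    the invisible difference flagged in the style-bot diff."""
    return [
        i
        for i, line in enumerate(text.splitlines(), start=1)
        if line != line.rstrip()
    ]

sample = "filtered_includes.append(item)   \nok\n"
print(trailing_ws_lines(sample))  # [1]
```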
aa4183e to 2ea29e4 (force-push)
2ea29e4 to 6833fec (force-push)
6833fec to f8befae (force-push)
f8befae to f40e84b (force-push)
f40e84b to 99ded8c (force-push)
99ded8c to 6e91c4e (force-push)
6e91c4e to 3e42d12 (force-push): …ting TRT-LLM installation fallback cases
3e42d12 to 091c2e4 (force-push)

```python
else:
    logger.setLevel(level)
if has_torchscript_frontend():
```
If we have the frontend we necessarily have the runtime; I don't think we need to use these APIs.
```diff
 _LOGGER.setLevel(logging.CRITICAL)
-if ENABLED_FEATURES.torchscript_frontend:
+if has_torchscript_frontend():
```
Let's just remove the has_torchscript_frontend cases.
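The simplification asked for here rests on an implication: the TorchScript frontend cannot be importable without the C++ runtime, so a single runtime check covers both frontends. A hedged sketch of that collapsed guard (the names and the module-level flag below are assumptions, not the PR's code):

```python
_RUNTIME_AVAILABLE = True  # assumption: set once at import time by a real probe

def has_cpp_runtime() -> bool:
    """Single source of truth: if the TorchScript frontend exists,
    the C++ runtime necessarily does too, so frontend-specific
    checks are redundant."""
    return _RUNTIME_AVAILABLE

def set_native_log_level(level: int) -> bool:
    """Guard only on the runtime; this covers TorchScript and Dynamo alike."""
    if not has_cpp_runtime():
        return False
    # ...forward `level` to the native logger here...
    return True
```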
narendasan left a comment
Just remove the TS ones, since we should be able to handle both with the runtime, and then LGTM.