Adding rank based logging for torch distributed examples #3897


Open
apbose wants to merge 3 commits into abose/trt_llm_installation_dist from abose/trt_llm_installation_changes_debug

Conversation

@apbose (Collaborator):

This PR:

  1. Adds rank-based logging for the distributed examples (a general sketch of the pattern follows below).
  2. Corrects the fallback-to-PyTorch case for the NCCL converters.
  3. Together with "Changes to TRT-LLM download tool for multigpu distributed case" (#3830), provides utilities for running distributed tensor parallel examples using torch.distributed.
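For readers of the examples, here is a minimal sketch of the rank-based logging pattern referenced in item 1. This is the general idea only, not the exact utility added by this PR; the logger name, log directory, and file naming are illustrative.

```python
# Sketch: per-rank logging for a torch.distributed example (illustrative names).
import logging
import os

import torch.distributed as dist


def configure_rank_logger(log_dir: str = "logs") -> logging.Logger:
    rank = dist.get_rank() if dist.is_initialized() else 0
    os.makedirs(log_dir, exist_ok=True)

    logger = logging.getLogger(f"tensor_parallel.rank{rank}")
    logger.setLevel(logging.DEBUG)

    # One log file per rank keeps interleaved multi-process output readable.
    handler = logging.FileHandler(os.path.join(log_dir, f"rank_{rank}.log"))
    handler.setFormatter(
        logging.Formatter(f"[rank {rank}] %(asctime)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger
```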

@github-actions bot added labels component: tests (Issues re: Tests), component: conversion (Issues re: Conversion stage), component: api [Python] (Issues re: Python API), component: dynamo (Issues relating to the `torch.compile` or `torch._dynamo.export` paths) on Nov 14, 2025
apbose changed the title from "Adding rank based logging for torch distributed examples. Also correc…" to "Adding rank based logging for torch distributed examples" on Nov 14, 2025
apbose marked this pull request as draft on November 14, 2025 00:05
apbose changed the title from "Adding rank based logging for torch distributed examples" to "Adding rank based logging for torch distributed examples [WIP]" on Nov 14, 2025
apbose force-pushed the abose/trt_llm_installation_changes_debug branch from 31666e3 to 52ae92a on November 25, 2025 22:59
apbose changed the title from "Adding rank based logging for torch distributed examples [WIP]" to "Adding rank based logging for torch distributed examples" on Nov 26, 2025
apbose marked this pull request as ready for review on November 26, 2025 00:28
apbose changed the base branch from main to abose/trt_llm_installation_dist on November 26, 2025 00:28
return device_mesh, world_size, rank

# Set C++ TensorRT runtime log level based on most verbose handler
# this is similar to set_log_level()
cpp_level = min(file_level_int, console_level_int)
Collaborator: Don't we have an API that abstracts needing to detect if the C++ runtime is available? If not, we should add one.

@apbose (Collaborator, Author), Dec 2, 2025 (edited): I have added a function in _features.py for the above, and also moved all this to logging.py. Let me know if that function placement works.
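For context, a rough sketch of the shape such a helper could take. The function names (is_cpp_runtime_available, sync_cpp_log_level) and the binding module import are hypothetical stand-ins, not the actual API added in _features.py.

```python
# Hypothetical sketch: a C++ runtime availability check and its use from logging.py.
import logging

_LOGGER = logging.getLogger(__name__)


def is_cpp_runtime_available() -> bool:
    """True if the compiled Torch-TensorRT C++ bindings can be imported."""
    try:
        import torch_tensorrt._C  # noqa: F401  # assumed binding module name

        return True
    except ImportError:
        return False


def sync_cpp_log_level(file_level_int: int, console_level_int: int) -> None:
    """Mirror the most verbose Python handler level into the C++ runtime, if present."""
    if not is_cpp_runtime_available():
        return
    cpp_level = min(file_level_int, console_level_int)
    # The actual call into the C++ logger is project-specific and omitted here.
    _LOGGER.debug("Would set C++ TensorRT runtime log level to %d", cpp_level)
```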

not is_platform_supported_for_trtllm(),
"Skipped on Windows, Jetson and CUDA13: NCCL backend is not supported.",
not is_distributed_nccl_available(),
"Skipped: NCCL backend is not available (Windows/Jetson not supported).",
Collaborator: Is it Jetson or just Orin?

@apbose (Collaborator, Author): Yeah, Orin. Changed to "Jetson Orin".
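For context, a minimal sketch of how these guards are typically wired into the distributed tests, assuming unittest-style skip decorators. The import path and test name below are assumptions; the two helper functions are the ones shown in the excerpt above.

```python
import unittest

# Assumed import location; the real tests may expose these helpers elsewhere.
from distributed_utils import (
    is_distributed_nccl_available,
    is_platform_supported_for_trtllm,
)


class TestNCCLOps(unittest.TestCase):
    @unittest.skipIf(
        not is_platform_supported_for_trtllm(),
        "Skipped on Windows, Jetson Orin and CUDA13: NCCL backend is not supported.",
    )
    @unittest.skipIf(
        not is_distributed_nccl_available(),
        "Skipped: NCCL backend is not available (Windows/Jetson Orin not supported).",
    )
    def test_all_gather(self) -> None:
        ...  # distributed tensor parallel test body
```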

@github-actions bot left a comment:

There are some changes that do not conform to Python style guidelines:

--- /home/runner/work/TensorRT/TensorRT/.github/scripts/filter-matrix.py  2025-12-02 00:37:46.920408+00:00
+++ /home/runner/work/TensorRT/TensorRT/.github/scripts/filter-matrix.py  2025-12-02 00:38:18.669710+00:00
@@ -148,11 +148,11 @@
             item,
             options.jetpack == "true",
             options.limit_pr_builds == "true",
         ):
             print(f"[DEBUG] passed filter - adding to build matrix", file=sys.stderr)
-            filtered_includes.append(item)
+            filtered_includes.append(item)
             distributed_includes.append(create_distributed_config(item))
         else:
             print(f"[DEBUG] FILTERED OUT", file=sys.stderr)
     # Debug: Show summary

apbose force-pushed the abose/trt_llm_installation_changes_debug branch from aa4183e to 2ea29e4 on December 2, 2025 15:33
apbose force-pushed the abose/trt_llm_installation_changes_debug branch from 2ea29e4 to 6833fec on December 2, 2025 22:41
apbose force-pushed the abose/trt_llm_installation_changes_debug branch from 6833fec to f8befae on December 2, 2025 23:49
@github-actions bot left a comment:

There are some changes that do not conform to Python style guidelines: the same filter-matrix.py hunk as above, plus blank-line and comment formatting in tests/py/dynamo/distributed/test_nccl_ops.py around the /dev/shm file-size listing.
apbose force-pushed the abose/trt_llm_installation_changes_debug branch from f8befae to f40e84b on December 3, 2025 00:43
@github-actions bot left a comment:

There are some changes that do not conform to Python style guidelines: the same filter-matrix.py hunk as above, plus formatting in tests/py/dynamo/distributed/test_nccl_ops.py (subprocess.run argument wrapping, trailing commas, long f-string wrapping, and inline-comment spacing in the /dev/shm debug listing and cleanup patterns).
apbose force-pushed the abose/trt_llm_installation_changes_debug branch from f40e84b to 99ded8c on December 3, 2025 05:23
apbose force-pushed the abose/trt_llm_installation_changes_debug branch from 99ded8c to 6e91c4e on December 3, 2025 14:38
apbose force-pushed the abose/trt_llm_installation_changes_debug branch from 6e91c4e to 3e42d12 on December 3, 2025 14:39
apbose force-pushed the abose/trt_llm_installation_changes_debug branch from 3e42d12 to 091c2e4 on December 3, 2025 16:26
else:
    logger.setLevel(level)

if has_torchscript_frontend():
Collaborator: If we have the frontend we necessarily have the runtime; I don't think we need to use these APIs.

_LOGGER.setLevel(logging.CRITICAL)

-if ENABLED_FEATURES.torchscript_frontend:
+if has_torchscript_frontend():
Collaborator: Let's just remove the has_torchscript_frontend cases.

@narendasan (Collaborator) left a review: Just remove the TS ones, since we should be able to handle both with the runtime, and then LGTM.
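To illustrate the suggested simplification only (a sketch, not the merged code): since the TorchScript frontend implies the runtime, the frontend-specific branch collapses into a single runtime-gated call. The helper and the actual C++ call below are hypothetical; only set_log_level and the frontend/runtime reasoning come from the discussion above.

```python
# Sketch of the shape suggested in review: gate C++ log-level changes on the
# runtime only, with no separate has_torchscript_frontend() branch.
import logging

_LOGGER = logging.getLogger(__name__)


def _cpp_runtime_available() -> bool:
    try:
        import torch_tensorrt._C  # noqa: F401  # assumed binding module name

        return True
    except ImportError:
        return False


def set_log_level(level: int) -> None:
    # Python-side logger always receives the level.
    _LOGGER.setLevel(level)

    # If the TorchScript frontend is present, the runtime necessarily is,
    # so a single runtime check covers both cases.
    if _cpp_runtime_available():
        # The actual C++ logger call is project-specific and omitted here.
        _LOGGER.debug("Would also set the C++ TensorRT runtime log level to %d", level)
```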

Reviewers

@github-actions[bot] requested changes

@narendasan approved these changes

Assignees

No one assigned

Labels

cla signed, component: api [Python] (Issues re: Python API), component: conversion (Issues re: Conversion stage), component: dynamo (Issues relating to the `torch.compile` or `torch._dynamo.export` paths), component: tests (Issues re: Tests), component: torch_compile

Projects

None yet

Milestone

No milestone

Development

Successfully merging this pull request may close these issues.

3 participants

@apbose @narendasan
