TMA pointwise scheduler tests #5565


Draft

liqiangxl wants to merge 25 commits into main from llu/pt5_test

@liqiangxl (Collaborator)

Enable and run CI tests.

@liqiangxl (Collaborator, Author)

!test

@github-actions (bot) commented Nov 20, 2025, edited by xwang233

Review updated until commit a21ff37

Description

  • Implement TMA (Tensor Memory Accelerator) scheduler for pointwise operations

  • Add automatic TMA detection and fallback to non-TMA scheduler

  • Refactor common scheduling utilities into pointwise_utils

  • Add comprehensive TMA tests covering broadcast and vectorization scenarios

Changes walkthrough

Relevant files

Enhancement (9 files)

File                    Change                                                   Diff
pointwise.cpp           Add TMA auto-detection and scheduler integration         +62/-4
pointwise_non_tma.cpp   Refactor to use common scheduling utilities              +32/-131
pointwise_tma.cpp       Implement complete TMA scheduler with heuristics         +385/-4
pointwise_utils.cpp     Add common break point and block/grid utilities          +181/-0
utils.cpp               Add TMA utility functions and cacheable uses detection   +105/-46
pointwise_heuristic.h   Add TMA-specific parameters to PointwiseParams           +7/-0
pointwise_tma.h         Update TMA scheduler function signatures                 +2/-1
pointwise_utils.h       Add break point and block/grid configuration structures  +38/-0
utils.h                 Add TMA utility function declarations                    +31/-6

Tests (4 files)

File                    Change                                                   Diff
test_gpu2.cpp           Update vectorization test for TMA compatibility          +9/-1
test_pointwise.cpp      Add comprehensive TMA scheduler tests                    +432/-62
test_resize.cpp         Update resize test for TMA/non-TMA handling              +9/-7
test_vectorization.cpp  Update vectorization tests for TMA support               +3/-3

PR Reviewer Guide

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review
Debug Output

There are debug print statements (std::cout) at lines 294-295 that should be removed or properly guarded with debug flags before merging to main.

std::cout << "reference_tv:" << reference_tv->toString() << std::endl;
reference_tv->printTransforms();
Test Naming

Line 463 has a typo in the test name: "VIssue1567ectorizationFactorAnalysisCase3" should be "Issue1567VectorizationFactorAnalysisCase3"

at::Tensor t1 = at::randn({512, 1024, 2}, options);
// NOTE force pointwise scheduler here just for testing purpose
Commented Code

Lines 400-402 contain commented-out code that should be cleaned up or removed before merging.

// bool use_tma = mayUseTma(prop, runtime_info) &&
// isOptionEnabled(EnableOption::TmaPointwise); for CI testing, use tma always
// if possible
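The commented-out condition suggests the intended gating once the CI override is dropped: use TMA only when it is both possible and opted into, and otherwise fall back to the non-TMA scheduler. The sketch below illustrates that logic with self-contained stand-ins. SchedulerInputs, mayUseTmaSketch, and choosePointwiseScheduler are hypothetical names for this example, and the two conditions inside mayUseTmaSketch are assumptions about what such a check might cover. The real mayUseTma(prop, runtime_info) and isOptionEnabled(EnableOption::TmaPointwise) take nvFuser-specific arguments that are not reproduced here.

```cpp
// Hypothetical inputs standing in for the scheduler's real state.
struct SchedulerInputs {
  bool hardware_supports_tma;     // assumed result of a device-capability check
  bool fusion_is_tma_compatible;  // assumed result of analysing the fusion
  bool tma_option_enabled;        // assumed state of the opt-in option
};

enum class PointwiseScheduler { Tma, NonTma };

// Stand-in for the feasibility check named in the commented-out code.
bool mayUseTmaSketch(const SchedulerInputs& in) {
  return in.hardware_supports_tma && in.fusion_is_tma_compatible;
}

// TMA is chosen only when it is feasible AND the option is enabled;
// every other combination falls back to the non-TMA scheduler.
PointwiseScheduler choosePointwiseScheduler(const SchedulerInputs& in) {
  const bool use_tma = mayUseTmaSketch(in) && in.tma_option_enabled;
  return use_tma ? PointwiseScheduler::Tma : PointwiseScheduler::NonTma;
}
```

Keeping the opt-in flag in the condition means existing tests that expect the non-TMA heuristics are unaffected unless the option is switched on.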

Test failures

  • (Medium, 90) NVFuser internal assert (Unknown tensor map data type) in test_direct_ops opinfo suite

    Test name (platform columns: GB200, H100)
    tests.python.direct.test_repro.test_issue1277
    tests.python.opinfo.test_direct_ops.test_correctness_abs_complex128
    tests.python.opinfo.test_direct_ops.test_correctness_abs_complex64
    tests.python.opinfo.test_direct_ops.test_correctness_acos_complex128
    tests.python.opinfo.test_direct_ops.test_correctness_acos_complex64
    tests.python.opinfo.test_direct_ops.test_correctness_acosh_complex128
    tests.python.opinfo.test_direct_ops.test_correctness_acosh_complex64
    tests.python.opinfo.test_direct_ops.test_correctness_add_complex128
    tests.python.opinfo.test_direct_ops.test_correctness_add_complex64
    tests.python.opinfo.test_direct_ops.test_correctness_asin_complex128
    ... with 57 more test failures omitted. Check internal logs.
  • (Medium, 46) NVFuser internal assert: Unknown tensor map data type on complex dtype ops (opinfo direct & UnaryTests)

    Test name (platform columns: GB200, H100)
    UnaryTests/UnaryTest.Neg/std__complex_float_
    tests.python.opinfo.test_direct_ops.test_correctness_abs_complex128
    tests.python.opinfo.test_direct_ops.test_correctness_acos_complex128
    tests.python.opinfo.test_direct_ops.test_correctness_add_complex128
    tests.python.opinfo.test_direct_ops.test_correctness_add_complex64
    tests.python.opinfo.test_direct_ops.test_correctness_asin_complex128
    tests.python.opinfo.test_direct_ops.test_correctness_asin_complex64
    tests.python.opinfo.test_direct_ops.test_correctness_asinh_complex128
    tests.python.opinfo.test_direct_ops.test_correctness_asinh_complex64
    tests.python.opinfo.test_direct_ops.test_correctness_atan_complex128
    ... with 35 more test failures omitted. Check internal logs.
  • (Medium, 12) NVFuser internal assertion failures in BlockQuantizationSchedulingTestSuite and MatmulSchedulerTest

    Test name (platform column: GB200)
    BlockQuantizationSchedulingTestSuite/BlockQuantizationSchedulingTest.AutoScheduleSingleOp/__bfloat_1024x1024_WithGlobalScale_NoSwizzle
    BlockQuantizationSchedulingTestSuite/BlockQuantizationSchedulingTest.AutoScheduleSingleOp/__bfloat_128x64_NoGlobalScale_WithSwizzle
    BlockQuantizationSchedulingTestSuite/BlockQuantizationSchedulingTest.AutoScheduleSingleOp/__bfloat_2048x128_NoGlobalScale_NoSwizzle
    BlockQuantizationSchedulingTestSuite/BlockQuantizationSchedulingTest.AutoScheduleSingleOp/__bfloat_2048x128_WithGlobalScale_WithSwizzle
    BlockQuantizationSchedulingTestSuite/BlockQuantizationSchedulingTest.AutoScheduleSingleOp/__bfloat_2048x2048_WithGlobalScale_NoSwizzle
    BlockQuantizationSchedulingTestSuite/BlockQuantizationSchedulingTest.AutoScheduleSingleOp/float_1024x1024_NoGlobalScale_NoSwizzle
    BlockQuantizationSchedulingTestSuite/BlockQuantizationSchedulingTest.AutoScheduleSingleOp/float_1024x1024_WithGlobalScale_WithSwizzle
    BlockQuantizationSchedulingTestSuite/BlockQuantizationSchedulingTest.AutoScheduleSingleOp/float_128x64_WithGlobalScale_NoSwizzle
    BlockQuantizationSchedulingTestSuite/BlockQuantizationSchedulingTest.AutoScheduleSingleOp/float_2048x128_NoGlobalScale_WithSwizzle
    BlockQuantizationSchedulingTestSuite/BlockQuantizationSchedulingTest.AutoScheduleSingleOp/float_2048x2048_NoGlobalScale_NoSwizzle
    ... with 2 more test failures omitted. Check internal logs.
  • (Medium, 9) Multiple NVFuser internal assertion failures across grouped_mm, multidevice matmul/transformer, and thunderfx MoE tests

    Test name (platform columns: GB200, GB200 (dist.), H100, H100 (dist.))
    tests.python.direct.test_with_id_model_indexer.test_layout_op_and_cutlass_nvfp4_grouped_mm[out_dtype=torch.bfloat16-tokens_per_expert_neg_one=[115, 144, 8]-config=[1024, 128, 256]]
    tests.python.multidevice.test_matmul.test_linear_reduce_scatter
    tests.python.multidevice.test_matmul.test_sequence_parallel_linear
    tests.python.multidevice.test_transformer.test_grouped_mlp
    tests.python.test_moe.test_llama4_moe_thunderfx
  • (Medium, 8) NVFuser TMA analysis internal asserts (merge-discontiguous / extent divisibility) in PointwiseTest, ResizeTest, matmul_stride, and issue1953 suites

    Test name (platform columns: GB200, H100)
    PointwiseTest.VIssue1567ectorizationFactorAnalysisCase3
    ResizeTest.PadAndCacheUses
    tests.python.direct.test_matmul.test_matmul_stride
    tests.python.direct.test_repro.test_issue1953
  • (Medium, 2) nvFuser internal input-size assert in test_schedule_ops::TestScheduleOps.test_concretize_reshape_pointwise

    Test name (platform columns: GB200, H100)
    tests.python.test_schedule_ops.TestScheduleOps.test_concretize_reshape_pointwise
  • (Medium, 2) nvFuser split-after-parallelization assertion in multidevice transformer tests

    Test name (platform columns: GB200, H100)
    tests.python.multidevice.test_transformer.test_grouped_mlp
  • (Medium, 2) Heuristic string mismatch in test_tutorial_compute_heuristics_and_schedule

    Test name (platform columns: GB200, H100)
    tests.python.direct.test_tutorial.test_tutorial_compute_heuristics_and_schedule
  • (Medium, 1) nvFuser pointwise heuristic unroll factor mismatch in PointwiseTest

    Test name (platform column: GB200)
    PointwiseTest.Heuristicst1Compute2Unroll4
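
The two largest failure groups above report "Unknown tensor map data type" on complex64/complex128 ops. One possible reading, stated here as an assumption rather than a diagnosis, is that complex element types have no tensor-map counterpart, so the TMA path should be skipped for them. The sketch below shows such an eligibility check. ElemType, hasTensorMapType, and dtypesAllowTma are hypothetical names for this example and do not mirror nvFuser's DataType enum or any CUDA enum.

```cpp
#include <vector>

// Hypothetical element types for illustration only.
enum class ElemType {
  Half,
  BFloat16,
  Float,
  Double,
  Int32,
  Int64,
  ComplexFloat,
  ComplexDouble
};

// Assumption: real-valued types map to a tensor-map element format,
// complex types do not.
bool hasTensorMapType(ElemType t) {
  switch (t) {
    case ElemType::Half:
    case ElemType::BFloat16:
    case ElemType::Float:
    case ElemType::Double:
    case ElemType::Int32:
    case ElemType::Int64:
      return true;
    case ElemType::ComplexFloat:
    case ElemType::ComplexDouble:
      return false;
  }
  return false;  // unreachable for valid enum values
}

// TMA is allowed only if every tensor dtype in the fusion is supported;
// a single unsupported dtype sends the whole fusion to the fallback.
bool dtypesAllowTma(const std::vector<ElemType>& dtypes) {
  for (ElemType t : dtypes) {
    if (!hasTensorMapType(t)) {
      return false;
    }
  }
  return true;
}
```

A check of this shape would sit in the feasibility test that runs before the TMA scheduler is selected, so complex-dtype fusions reach the non-TMA scheduler instead of hitting the internal assert.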

@liqiangxl (Collaborator, Author)

!test

@liqiangxl (Collaborator, Author)

!test

@liqiangxl (Collaborator, Author)

!test

@liqiangxl (Collaborator, Author)

!test


