Explicit horizontal fusion with foreach_map and torch.compile#

Author: Michael Lazos

Horizontal fusion is a key optimization in ML compilers. In eager mode, this is typically expressed using the torch._foreach* ops, which parallelize operations across a list of tensors. However, supporting all possible permutations of arguments is quite difficult (e.g. mixtures of scalars and lists). foreach_map allows conversion of any pointwise op in torch to a horizontally fused foreach variant. In this tutorial, we will demonstrate how to implement the Adam optimizer with foreach_map to generate a fully fused kernel.
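To make the calling convention concrete, here is a minimal plain-Python sketch of what a foreach-style map computes. It assumes that list operands are indexed position by position and that scalar operands are passed unchanged to every call, which matches how foreach_map is called later in this tutorial. The names foreach_map_reference and lerp are hypothetical helpers written for illustration; they are not part of PyTorch, and the real operator hands the work to the compiler instead of looping in Python.

```python
from typing import Any, Callable, List


def foreach_map_reference(op: Callable[..., Any], *operands: Any, **kwargs: Any) -> List[Any]:
    """Illustrative model only: call `op` once per list position.

    List or tuple operands are indexed per position; every other operand
    (for example a scalar) is passed unchanged to each call.
    """
    lengths = {len(o) for o in operands if isinstance(o, (list, tuple))}
    if len(lengths) != 1:
        raise ValueError("expected at least one list operand, all of equal length")
    (n,) = lengths
    results = []
    for i in range(n):
        args = [o[i] if isinstance(o, (list, tuple)) else o for o in operands]
        results.append(op(*args, **kwargs))
    return results


def lerp(start: float, end: float, weight: float) -> float:
    # Scalar stand-in for torch.lerp: start + weight * (end - start)
    return start + weight * (end - start)


exp_avgs = [0.0, 1.0, 2.0]
grads = [1.0, 1.0, 1.0]
# Two list operands and one scalar operand, mixed in a single call
updated = foreach_map_reference(lerp, exp_avgs, grads, 0.5)
print(updated)  # [0.5, 1.0, 1.5]
```

The loop here runs one element at a time; horizontal fusion aims to perform the same per-position work for all list entries inside a single kernel launch.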

Note

This recipe describes a prototype feature. Prototype features are typically at an early stage for feedback and testing and are subject to change.

Prerequisites#

  • PyTorch v2.7.0 or later

Model Setup#

For this example, we’ll use a simple sequence of linear layers. We instantiate an independent copy to compare the two optimizer implementations.

    import torch

    # exit cleanly if we are on a device that doesn't support ``torch.compile``
    if torch.cuda.get_device_capability() < (7, 0):
        print("Exiting because torch.compile is not supported on this device.")
        import sys
        sys.exit(0)

    # Create simple model
    model = torch.nn.Sequential(
        *[torch.nn.Linear(1024, 1024, False, device="cuda") for _ in range(10)]
    )
    model_copy = torch.nn.Sequential(
        *[torch.nn.Linear(1024, 1024, False, device="cuda") for _ in range(10)]
    )
    input = torch.rand(1024, device="cuda")

    # run forward pass
    output = model(input)
    output_copy = model_copy(input)

    # run backward to populate the grads for our optimizer below
    output.sum().backward()
    output_copy.sum().backward()

Helper functions for foreach_map implementation#

In this section, we’ll begin our implementation of the Adam optimizer.

    from torch._higher_order_ops.foreach_map import foreach_map

    # Helper function to extract optimizer states from a torch.optim.Adam instance
    def get_inputs(optim):
        steps = []
        params = []
        grads = []
        exp_avgs = []
        exp_avg_sqs = []
        for group in optim.param_groups:
            for p in group["params"]:
                params.append(p)
                grads.append(p.grad)
                state = optim.state[p]
                exp_avgs.append(state["exp_avg"])
                exp_avg_sqs.append(state["exp_avg_sq"])
                steps.append(state["step"])

        return steps, params, exp_avgs, exp_avg_sqs


    # Functions to update the different optimizer states
    def update_exp_avg_sq(exp_avg_sq, grad, beta2):
        return exp_avg_sq.mul(beta2).addcmul(grad, grad, value=1 - beta2)


    def update_param(param, step, exp_avg, exp_avg_sq, beta1, beta2, lr, eps):
        bias_correction1 = 1 - torch.pow(beta1, step)
        bias_correction2 = (1 - torch.pow(beta2, step)).sqrt()
        step_size = (lr / bias_correction1).neg()
        denom = (exp_avg_sq.sqrt() / (bias_correction2 * step_size)).add(eps / step_size)
        return torch.add(param, torch.div(exp_avg, denom))


    # Our full Adam implementation
    def foreach_map_adam(
        steps,
        params,
        exp_avgs,
        exp_avg_sqs,
        weight_decay=0,
        beta1=0.9,
        beta2=0.999,
        lr=1e-3,
        eps=1e-8,
    ):
        with torch.no_grad():
            grads = [param.grad for param in params]
            # update step
            updated_steps = foreach_map(lambda x: x + 1, steps)
            torch._foreach_copy_(steps, updated_steps)

            if weight_decay != 0:
                foreach_map(torch.add, (grads,), alpha=weight_decay)

            # Higher-order operators (HOPs) cannot have multiple outputs at the moment
            # need to call foreach_map once for each output
            exp_avgs_updated = foreach_map(torch.lerp, exp_avgs, grads, 1 - beta1)
            exp_avgs_sq_updated = foreach_map(update_exp_avg_sq, exp_avg_sqs, grads, beta2)
            params_updated = foreach_map(
                update_param,
                params,
                steps,
                exp_avgs_updated,
                exp_avgs_sq_updated,
                beta1,
                beta2,
                lr,
                eps,
            )
            # Higher-order operators (HOPs) don't support input mutation today
            # so manually update the states in-place
            torch._foreach_copy_(exp_avgs, exp_avgs_updated)
            torch._foreach_copy_(exp_avg_sqs, exp_avgs_sq_updated)
            torch._foreach_copy_(params, params_updated)
        return
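The update_param helper above folds the negated step size into the denominator, so it can look unfamiliar next to the textbook Adam formula. Dividing exp_avg by (sqrt(exp_avg_sq) / bias_correction2 + eps) / step_size is the same as multiplying exp_avg by step_size and dividing by (sqrt(exp_avg_sq) / bias_correction2 + eps), and torch.lerp(exp_avg, grad, 1 - beta1) equals beta1 * exp_avg + (1 - beta1) * grad. The following scalar sketch in plain Python (no PyTorch needed) checks that equivalence numerically. The function names reference_adam_step and rearranged_adam_step are hypothetical, chosen for illustration, and step is assumed to be the already-incremented step count.

```python
import math


def reference_adam_step(param, grad, exp_avg, exp_avg_sq, step,
                        beta1=0.9, beta2=0.999, lr=1e-3, eps=1e-8):
    """Textbook Adam update for one scalar parameter (no weight decay)."""
    exp_avg = beta1 * exp_avg + (1 - beta1) * grad
    exp_avg_sq = beta2 * exp_avg_sq + (1 - beta2) * grad * grad
    bias_correction1 = 1 - beta1 ** step
    bias_correction2 = 1 - beta2 ** step
    denom = math.sqrt(exp_avg_sq) / math.sqrt(bias_correction2) + eps
    return param - (lr / bias_correction1) * exp_avg / denom


def rearranged_adam_step(param, grad, exp_avg, exp_avg_sq, step,
                         beta1=0.9, beta2=0.999, lr=1e-3, eps=1e-8):
    """Same update, arranged the way the helpers in this section arrange it."""
    # lerp form of the first-moment update
    exp_avg = exp_avg + (1 - beta1) * (grad - exp_avg)
    # mul followed by addcmul for the second-moment update
    exp_avg_sq = exp_avg_sq * beta2 + (1 - beta2) * grad * grad
    bias_correction1 = 1 - beta1 ** step
    bias_correction2 = math.sqrt(1 - beta2 ** step)
    step_size = -(lr / bias_correction1)
    # the negated step size is folded into the denominator
    denom = math.sqrt(exp_avg_sq) / (bias_correction2 * step_size) + eps / step_size
    return param + exp_avg / denom


a = reference_adam_step(1.0, 0.5, 0.1, 0.2, 3)
b = rearranged_adam_step(1.0, 0.5, 0.1, 0.2, 3)
print(math.isclose(a, b, rel_tol=1e-9))  # True
```

Writing the update as a single add of a single quotient keeps the whole parameter update expressible as one pointwise function, which is the form foreach_map accepts.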

Setting up and running the compiled kernel#

In this section, we’ll run our Adam optimizer and compare the results.

Note

torch.compile is only supported on CUDA devices that have a compute capability of 7.0 or higher.

    opt_eager = torch.optim.Adam(model.parameters(), lr=torch.tensor(0.01))
    opt_eager_copy = torch.optim.Adam(model_copy.parameters(), lr=torch.tensor(0.01))

    # warm up the optimizer state dict
    opt_eager.step()
    opt_eager_copy.step()

    inputs = get_inputs(opt_eager_copy)
    compiled_adam = torch.compile(foreach_map_adam)

    # optionally view the output code
    torch._logging.set_logs(output_code=True)

    # Warmup runs to compile the function
    for _ in range(5):
        opt_eager.step()
        compiled_adam(*inputs)

    for eager_p, compile_p in zip(
        opt_eager.param_groups[0]["params"], opt_eager_copy.param_groups[0]["params"]
    ):
        torch.allclose(eager_p, compile_p)

    # Benchmark performance

    # Let's define a helpful benchmarking function:
    import torch.utils.benchmark as benchmark


    def benchmark_torch_function_in_microseconds(f, *args, **kwargs):
        t0 = benchmark.Timer(
            stmt="f(*args, **kwargs)", globals={"args": args, "kwargs": kwargs, "f": f}
        )
        return t0.blocked_autorange().mean * 1e6


    eager_runtime = benchmark_torch_function_in_microseconds(opt_eager.step)
    compiled_runtime = benchmark_torch_function_in_microseconds(
        lambda: compiled_adam(*inputs)
    )

    assert eager_runtime > compiled_runtime

    print(f"eager runtime: {eager_runtime}us")
    print(f"compiled runtime: {compiled_runtime}us")
    V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code] Output code:
    # AOT ID: ['0_inference']
    from ctypes import c_void_p, c_long, c_int
    import torch
    import math
    import random
    import os
    import tempfile
    from math import inf, nan
    from cmath import nanj
    from torch._inductor.hooks import run_intermediate_hooks
    from torch._inductor.utils import maybe_profile
    from torch._inductor.codegen.memory_planning import _align as align
    from torch import device, empty_strided
    from torch._inductor.async_compile import AsyncCompile
    from torch._inductor.select_algorithm import extern_kernels
    from torch._C import _cuda_getCurrentRawStream as get_raw_stream
    import triton
    import triton.language as tl
    from torch._inductor.runtime.triton_heuristics import start_graph, end_graph
    from torch._C import _cuda_getCurrentRawStream as get_raw_stream

    aten = torch.ops.aten
    inductor_ops = torch.ops.inductor
    _quantized = torch.ops._quantized
    assert_size_stride = torch._C._dynamo.guards.assert_size_stride
    assert_alignment = torch._C._dynamo.guards.assert_alignment
    empty_strided_cpu = torch._C._dynamo.guards._empty_strided_cpu
    empty_strided_cpu_pinned = torch._C._dynamo.guards._empty_strided_cpu_pinned
    empty_strided_cuda = torch._C._dynamo.guards._empty_strided_cuda
    empty_strided_xpu = torch._C._dynamo.guards._empty_strided_xpu
    empty_strided_mtia = torch._C._dynamo.guards._empty_strided_mtia
    reinterpret_tensor = torch._C._dynamo.guards._reinterpret_tensor
    alloc_from_pool = torch.ops.inductor._alloc_from_pool
    async_compile = AsyncCompile()
    empty_strided_p2p = torch._C._distributed_c10d._SymmetricMemory.empty_strided_p2p


    cpp_fused__foreach_copy_0 = async_compile.cpp_pybinding(['const float*', 'const float*', 'const float*', 'const float*', 'const float*', 'const float*', 'const float*', 'const float*', 'const float*', 'const float*', 'float*', 'float*', 'float*', 'float*', 'float*', 'float*', 'float*', 'float*', 'float*', 'float*'], r'''
    #include <torch/csrc/inductor/cpp_prefix.h>
    extern "C"  void  kernel(const float* in_ptr0,
                           const float* in_ptr1,
                           const float* in_ptr2,
                           const float* in_ptr3,
                           const float* in_ptr4,
                           const float* in_ptr5,
                           const float* in_ptr6,
                           const float* in_ptr7,
                           const float* in_ptr8,
                           const float* in_ptr9,
                           float* out_ptr0,
                           float* out_ptr1,
                           float* out_ptr2,
                           float* out_ptr3,
                           float* out_ptr4,
                           float* out_ptr5,
                           float* out_ptr6,
                           float* out_ptr7,
                           float* out_ptr8,
                           float* out_ptr9)
    {
        {
            {
                {
                    auto tmp0 = in_ptr0[static_cast<int64_t>(0L)];
                    auto tmp1 = static_cast<float>(1.0);
                    auto tmp2 = float(tmp0 + tmp1);
                    out_ptr0[static_cast<int64_t>(0L)] = tmp2;
                }
            }
        }
        {
            {
                {
                    auto tmp0 = in_ptr1[static_cast<int64_t>(0L)];
                    auto tmp1 = static_cast<float>(1.0);
                    auto tmp2 = float(tmp0 + tmp1);
                    out_ptr1[static_cast<int64_t>(0L)] = tmp2;
                }
            }
        }
        {
            {
                {
                    auto tmp0 = in_ptr2[static_cast<int64_t>(0L)];
                    auto tmp1 = static_cast<float>(1.0);
                    auto tmp2 = float(tmp0 + tmp1);
                    out_ptr2[static_cast<int64_t>(0L)] = tmp2;
                }
            }
        }
        {
            {
                {
                    auto tmp0 = in_ptr3[static_cast<int64_t>(0L)];
                    auto tmp1 = static_cast<float>(1.0);
                    auto tmp2 = float(tmp0 + tmp1);
                    out_ptr3[static_cast<int64_t>(0L)] = tmp2;
                }
            }
        }
        {
            {
                {
                    auto tmp0 = in_ptr4[static_cast<int64_t>(0L)];
                    auto tmp1 = static_cast<float>(1.0);
                    auto tmp2 = float(tmp0 + tmp1);
                    out_ptr4[static_cast<int64_t>(0L)] = tmp2;
                }
            }
        }
        {
            {
                {
                    auto tmp0 = in_ptr5[static_cast<int64_t>(0L)];
                    auto tmp1 = static_cast<float>(1.0);
                    auto tmp2 = float(tmp0 + tmp1);
                    out_ptr5[static_cast<int64_t>(0L)] = tmp2;
                }
            }
        }
        {
            {
                {
                    auto tmp0 = in_ptr6[static_cast<int64_t>(0L)];
                    auto tmp1 = static_cast<float>(1.0);
                    auto tmp2 = float(tmp0 + tmp1);
                    out_ptr6[static_cast<int64_t>(0L)] = tmp2;
                }
            }
        }
        {
            {
                {
                    auto tmp0 = in_ptr7[static_cast<int64_t>(0L)];
                    auto tmp1 = static_cast<float>(1.0);
                    auto tmp2 = float(tmp0 + tmp1);
                    out_ptr7[static_cast<int64_t>(0L)] = tmp2;
                }
            }
        }
        {
            {
                {
                    auto tmp0 = in_ptr8[static_cast<int64_t>(0L)];
                    auto tmp1 = static_cast<float>(1.0);
                    auto tmp2 = float(tmp0 + tmp1);
                    out_ptr8[static_cast<int64_t>(0L)] = tmp2;
                }
            }
        }
        {
            {
                {
                    auto tmp0 = in_ptr9[static_cast<int64_t>(0L)];
                    auto tmp1 = static_cast<float>(1.0);
                    auto tmp2 = float(tmp0 + tmp1);
                    out_ptr9[static_cast<int64_t>(0L)] = tmp2;
                }
            }
        }
    }
    ''')


    # kernel path: /tmp/torchinductor_ci-user/gq/cgqhtbf2gwgsfmgfs2f4ajlpqjhrdhscqtxcif6oteglzjt3a3rq.py
    # Unsorted Source Nodes: [], Original ATen: []
    # Source node to ATen node mapping:
    triton_for_fused_1 = async_compile.triton('triton_for_fused_1', '''
    import triton
    import triton.language as tl

    from torch._inductor.runtime import triton_helpers, triton_heuristics
    from torch._inductor.runtime.triton_helpers import libdevice, math as tl_math
    from torch._inductor.runtime.hints import AutotuneHint, ReductionHint, TileHint, DeviceProperties

    @triton_heuristics.foreach(
        filename=__file__,
        triton_meta={'signature': {'in_ptr0': '*fp32', 'in_ptr1': '*fp32', 'in_ptr2': '*fp32', 'in_ptr3': '*fp32', 'in_ptr4': 'fp32', 'in_ptr5': '*fp32', 'in_ptr6': '*fp32', 'in_ptr7': '*fp32', 'in_ptr8': '*fp32', 'in_ptr9': 'fp32', 'in_ptr10': '*fp32', 'in_ptr11': '*fp32', 'in_ptr12': '*fp32', 'in_ptr13': '*fp32', 'in_ptr14': 'fp32', 'in_ptr15': '*fp32', 'in_ptr16': '*fp32', 'in_ptr17': '*fp32', 'in_ptr18': '*fp32', 'in_ptr19': 'fp32', 'in_ptr20': '*fp32', 'in_ptr21': '*fp32', 'in_ptr22': '*fp32', 'in_ptr23': '*fp32', 'in_ptr24': 'fp32', 'in_ptr25': '*fp32', 'in_ptr26': '*fp32', 'in_ptr27': '*fp32', 'in_ptr28': '*fp32', 'in_ptr29': 'fp32', 'in_ptr30': '*fp32', 'in_ptr31': '*fp32', 'in_ptr32': '*fp32', 'in_ptr33': '*fp32', 'in_ptr34': 'fp32', 'in_ptr35': '*fp32', 'in_ptr36': '*fp32', 'in_ptr37': '*fp32', 'in_ptr38': '*fp32', 'in_ptr39': 'fp32', 'in_ptr40': '*fp32', 'in_ptr41': '*fp32', 'in_ptr42': '*fp32', 'in_ptr43': '*fp32', 'in_ptr44': 'fp32', 'in_ptr45': '*fp32', 'in_ptr46': '*fp32', 'in_ptr47': '*fp32', 'in_ptr48': '*fp32', 'in_ptr49': 'fp32', 'out_ptr3': '*fp32', 'out_ptr4': '*fp32', 'out_ptr5': '*fp32', 'out_ptr9': '*fp32', 'out_ptr10': '*fp32', 'out_ptr11': '*fp32', 'out_ptr15': '*fp32', 'out_ptr16': '*fp32', 'out_ptr17': '*fp32', 'out_ptr21': '*fp32', 'out_ptr22': '*fp32', 'out_ptr23': '*fp32', 'out_ptr27': '*fp32', 'out_ptr28': '*fp32', 'out_ptr29': '*fp32', 'out_ptr33': '*fp32', 'out_ptr34': '*fp32', 'out_ptr35': '*fp32', 'out_ptr39': '*fp32', 'out_ptr40': '*fp32', 'out_ptr41': '*fp32', 'out_ptr45': '*fp32', 'out_ptr46': '*fp32', 'out_ptr47': '*fp32', 'out_ptr51': '*fp32', 'out_ptr52': '*fp32', 'out_ptr53': '*fp32', 'out_ptr57': '*fp32', 'out_ptr58': '*fp32', 'out_ptr59': '*fp32'}, 'device': DeviceProperties(type='cuda', index=0, multi_processor_count=80, cc=86, major=8, regs_per_multiprocessor=65536, max_threads_per_multi_processor=1536, max_threads_per_block=1024, warp_size=32), 'constants': {}, 'configs': [{(0,): [['tt.divisibility', 16]], (1,): [['tt.divisibility', 16]], (2,): [['tt.divisibility', 16]], (3,): [['tt.divisibility', 16]], (5,): [['tt.divisibility', 16]], (6,): [['tt.divisibility', 16]], (7,): [['tt.divisibility', 16]], (8,): [['tt.divisibility', 16]], (10,): [['tt.divisibility', 16]], (11,): [['tt.divisibility', 16]], (12,): [['tt.divisibility', 16]], (13,): [['tt.divisibility', 16]], (15,): [['tt.divisibility', 16]], (16,): [['tt.divisibility', 16]], (17,): [['tt.divisibility', 16]], (18,): [['tt.divisibility', 16]], (20,): [['tt.divisibility', 16]], (21,): [['tt.divisibility', 16]], (22,): [['tt.divisibility', 16]], (23,): [['tt.divisibility', 16]], (25,): [['tt.divisibility', 16]], (26,): [['tt.divisibility', 16]], (27,): [['tt.divisibility', 16]], (28,): [['tt.divisibility', 16]], (30,): [['tt.divisibility', 16]], (31,): [['tt.divisibility', 16]], (32,): [['tt.divisibility', 16]], (33,): [['tt.divisibility', 16]], (35,): [['tt.divisibility', 16]], (36,): [['tt.divisibility', 16]], (37,): [['tt.divisibility', 16]], (38,): [['tt.divisibility', 16]], (40,): [['tt.divisibility', 16]], (41,): [['tt.divisibility', 16]], (42,): [['tt.divisibility', 16]], (43,): [['tt.divisibility', 16]], (45,): [['tt.divisibility', 16]], (46,): [['tt.divisibility', 16]], (47,): [['tt.divisibility', 16]], (48,): [['tt.divisibility', 16]], (50,): [['tt.divisibility', 16]], (51,): [['tt.divisibility', 16]], (52,): [['tt.divisibility', 16]], (53,): [['tt.divisibility', 16]], (54,): [['tt.divisibility', 16]], (55,): [['tt.divisibility', 16]], (56,): [['tt.divisibility', 16]], (57,): [['tt.divisibility', 16]], (58,): [['tt.divisibility', 16]], (59,): [['tt.divisibility', 16]], (60,): [['tt.divisibility', 16]], (61,): [['tt.divisibility', 16]], (62,): [['tt.divisibility', 16]], (63,): [['tt.divisibility', 16]], (64,): [['tt.divisibility', 16]], (65,): [['tt.divisibility', 16]], (66,): [['tt.divisibility', 16]], (67,): [['tt.divisibility', 16]], (68,): [['tt.divisibility', 16]], (69,): [['tt.divisibility', 16]], (70,): [['tt.divisibility', 16]], (71,): [['tt.divisibility', 16]], (72,): [['tt.divisibility', 16]], (73,): [['tt.divisibility', 16]], (74,): [['tt.divisibility', 16]], (75,): [['tt.divisibility', 16]], (76,): [['tt.divisibility', 16]], (77,): [['tt.divisibility', 16]], (78,): [['tt.divisibility', 16]], (79,): [['tt.divisibility', 16]]}]},
        inductor_meta={'grid_type': 'SequentialComboKernelGrid', 'combo_grid_meta': {'num_kernels': 10, 'min_blocks': 0, 'default_config': {'XBLOCK': 1024}, 'no_x_dim_0': False, 'xnumel_0': 1048576, 'no_x_dim_1': False, 'xnumel_1': 1048576, 'no_x_dim_2': False, 'xnumel_2': 1048576, 'no_x_dim_3': False, 'xnumel_3': 1048576, 'no_x_dim_4': False, 'xnumel_4': 1048576, 'no_x_dim_5': False, 'xnumel_5': 1048576, 'no_x_dim_6': False, 'xnumel_6': 1048576, 'no_x_dim_7': False, 'xnumel_7': 1048576, 'no_x_dim_8': False, 'xnumel_8': 1048576, 'no_x_dim_9': False, 'xnumel_9': 1048576}, 'kernel_name': 'triton_for_fused_1', 'mutated_arg_names': ['in_ptr1', 'in_ptr11', 'in_ptr12', 'in_ptr13', 'in_ptr16', 'in_ptr17', 'in_ptr18', 'in_ptr2', 'in_ptr21', 'in_ptr22', 'in_ptr23', 'in_ptr26', 'in_ptr27', 'in_ptr28', 'in_ptr3', 'in_ptr31', 'in_ptr32', 'in_ptr33', 'in_ptr36', 'in_ptr37', 'in_ptr38', 'in_ptr41', 'in_ptr42', 'in_ptr43', 'in_ptr46', 'in_ptr47', 'in_ptr48', 'in_ptr6', 'in_ptr7', 'in_ptr8', 'out_ptr10', 'out_ptr11', 'out_ptr15', 'out_ptr16', 'out_ptr17', 'out_ptr21', 'out_ptr22', 'out_ptr23', 'out_ptr27', 'out_ptr28', 'out_ptr29', 'out_ptr3', 'out_ptr33', 'out_ptr34', 'out_ptr35', 'out_ptr39', 'out_ptr4', 'out_ptr40', 'out_ptr41', 'out_ptr45', 'out_ptr46', 'out_ptr47', 'out_ptr5', 'out_ptr51', 'out_ptr52', 'out_ptr53', 'out_ptr57', 'out_ptr58', 'out_ptr59', 'out_ptr9'], 'backend_hash': '130560DF8C676AFCBC44717C6A9B3C6A2EC6174C11ECC01A816D2F75FFBF9BD0', 'assert_indirect_indexing': True, 'autotune_local_cache': True, 'autotune_pointwise': True, 'autotune_remote_cache': None, 'force_disable_caches': False, 'dynamic_scale_rblock': True, 'max_autotune': False, 'max_autotune_pointwise': False, 'min_split_scan_rblock': 256, 'spill_threshold': 16, 'store_cubin': False, 'deterministic': False, 'force_filter_reduction_configs': False, 'are_deterministic_algorithms_enabled': False},
    )
    @triton.jit
    def triton_for_fused_1(in_ptr0, in_ptr1, in_ptr2, in_ptr3, in_ptr4, in_ptr5, in_ptr6, in_ptr7, in_ptr8, in_ptr9, in_ptr10, in_ptr11, in_ptr12, in_ptr13, in_ptr14, in_ptr15, in_ptr16, in_ptr17, in_ptr18, in_ptr19, in_ptr20, in_ptr21, in_ptr22, in_ptr23, in_ptr24, in_ptr25, in_ptr26, in_ptr27, in_ptr28, in_ptr29, in_ptr30, in_ptr31, in_ptr32, in_ptr33, in_ptr34, in_ptr35, in_ptr36, in_ptr37, in_ptr38, in_ptr39, in_ptr40, in_ptr41, in_ptr42, in_ptr43, in_ptr44, in_ptr45, in_ptr46, in_ptr47, in_ptr48, in_ptr49, out_ptr3, out_ptr4, out_ptr5, out_ptr9, out_ptr10, out_ptr11, out_ptr15, out_ptr16, out_ptr17, out_ptr21, out_ptr22, out_ptr23, out_ptr27, out_ptr28, out_ptr29, out_ptr33, out_ptr34, out_ptr35, out_ptr39, out_ptr40, out_ptr41, out_ptr45, out_ptr46, out_ptr47, out_ptr51, out_ptr52, out_ptr53, out_ptr57, out_ptr58, out_ptr59):
        pid = tl.program_id(0)
        XBLOCK: tl.constexpr = 1024
        num_xblocks_0 = tl.cdiv(1048576, XBLOCK)
        num_xblocks_1 = num_xblocks_0 + tl.cdiv(1048576, XBLOCK)
        num_xblocks_2 = num_xblocks_1 + tl.cdiv(1048576, XBLOCK)
        num_xblocks_3 = num_xblocks_2 + tl.cdiv(1048576, XBLOCK)
        num_xblocks_4 = num_xblocks_3 + tl.cdiv(1048576, XBLOCK)
        num_xblocks_5 = num_xblocks_4 + tl.cdiv(1048576, XBLOCK)
        num_xblocks_6 = num_xblocks_5 + tl.cdiv(1048576, XBLOCK)
        num_xblocks_7 = num_xblocks_6 + tl.cdiv(1048576, XBLOCK)
        num_xblocks_8 = num_xblocks_7 + tl.cdiv(1048576, XBLOCK)
        num_xblocks_9 = num_xblocks_8 + tl.cdiv(1048576, XBLOCK)
        if pid < num_xblocks_0:
            pid_offset = pid
            xnumel = 1048576
            r0_numel = 1
            xoffset = pid_offset * XBLOCK
            xindex = xoffset + tl.arange(0, XBLOCK)[:]
            xmask = tl.full([XBLOCK], True, tl.int1)[:]
            x0 = xindex
            tmp0 = tl.load(in_ptr0 + (x0), None)
            tmp1 = tl.load(in_ptr1 + (x0), None)
            tmp8 = tl.load(in_ptr2 + (x0), None)
            tmp15 = tl.load(in_ptr3 + (x0), None)
            tmp17 = in_ptr4
            tmp2 = tmp0 - tmp1
            tmp3 = 0.10000000149011612
            tmp4 = tmp3 * tmp2
            tmp5 = tl.full([1], False, tl.int1)
            tmp6 = tl.where(tmp5, tmp0, tmp1)
            tmp7 = tmp4 + tmp6
            tmp9 = 0.999
    V0219 17:08:20.673000 23788 
torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp10 = tmp8 * tmp9V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp11 = 0.0010000000000000009V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp12 = tmp0 * tmp11V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp13 = tmp12 * tmp0V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp14 = tmp10 + tmp13V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp16 = tl.sqrt_rn(tmp14)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp18 = libdevice.pow(tmp9, tmp17)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp19 = 1.0V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp20 = tmp19 - tmp18V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp21 = tl.sqrt_rn(tmp20)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp22 = 0.9V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp23 = libdevice.pow(tmp22, tmp17)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp24 = tmp19 - tmp23V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp25 = tl.full([1], 1, tl.int32)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp26 = (tmp25 / tmp24)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp27 = 0.001V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp28 = tmp26 * tmp27V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp29 = -tmp28V0219 17:08:20.673000 23788 
torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp30 = tmp21 * tmp29V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp31 = (tmp16 / tmp30)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp32 = (tmp25 / tmp29)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp33 = 1e-08V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp34 = tmp32 * tmp33V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp35 = tmp31 + tmp34V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp36 = (tmp7 / tmp35)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp37 = tmp15 + tmp36V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tl.store(out_ptr3 + (x0), tmp7, None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tl.store(out_ptr4 + (x0), tmp14, None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tl.store(out_ptr5 + (x0), tmp37, None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     elif pid < num_xblocks_1:V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         pid_offset = pid - num_xblocks_0V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         xnumel = 1048576V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         r0_numel = 1V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         xoffset = pid_offset * XBLOCKV0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         xindex = xoffset + tl.arange(0, XBLOCK)[:]V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     
    xmask = tl.full([XBLOCK], True, tl.int1)[:]V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         x1 = xindexV0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp38 = tl.load(in_ptr5 + (x1), None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp39 = tl.load(in_ptr6 + (x1), None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp46 = tl.load(in_ptr7 + (x1), None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp53 = tl.load(in_ptr8 + (x1), None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp55 = in_ptr9V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp40 = tmp38 - tmp39V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp41 = 0.10000000149011612V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp42 = tmp41 * tmp40V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp43 = tl.full([1], False, tl.int1)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp44 = tl.where(tmp43, tmp38, tmp39)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp45 = tmp42 + tmp44V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp47 = 0.999V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp48 = tmp46 * tmp47V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp49 = 0.0010000000000000009V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp50 = tmp38 * tmp49V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp51 = tmp50 * 
tmp38V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp52 = tmp48 + tmp51V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp54 = tl.sqrt_rn(tmp52)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp56 = libdevice.pow(tmp47, tmp55)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp57 = 1.0V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp58 = tmp57 - tmp56V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp59 = tl.sqrt_rn(tmp58)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp60 = 0.9V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp61 = libdevice.pow(tmp60, tmp55)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp62 = tmp57 - tmp61V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp63 = tl.full([1], 1, tl.int32)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp64 = (tmp63 / tmp62)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp65 = 0.001V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp66 = tmp64 * tmp65V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp67 = -tmp66V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp68 = tmp59 * tmp67V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp69 = (tmp54 / tmp68)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp70 = (tmp63 / tmp67)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp71 = 
1e-08V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp72 = tmp70 * tmp71V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp73 = tmp69 + tmp72V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp74 = (tmp45 / tmp73)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp75 = tmp53 + tmp74V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tl.store(out_ptr9 + (x1), tmp45, None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tl.store(out_ptr10 + (x1), tmp52, None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tl.store(out_ptr11 + (x1), tmp75, None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     elif pid < num_xblocks_2:V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         pid_offset = pid - num_xblocks_1V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         xnumel = 1048576V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         r0_numel = 1V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         xoffset = pid_offset * XBLOCKV0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         xindex = xoffset + tl.arange(0, XBLOCK)[:]V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         xmask = tl.full([XBLOCK], True, tl.int1)[:]V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         x2 = xindexV0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp76 = tl.load(in_ptr10 + (x2), None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp77 = tl.load(in_ptr11 + (x2), 
None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp84 = tl.load(in_ptr12 + (x2), None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp91 = tl.load(in_ptr13 + (x2), None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp93 = in_ptr14V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp78 = tmp76 - tmp77V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp79 = 0.10000000149011612V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp80 = tmp79 * tmp78V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp81 = tl.full([1], False, tl.int1)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp82 = tl.where(tmp81, tmp76, tmp77)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp83 = tmp80 + tmp82V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp85 = 0.999V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp86 = tmp84 * tmp85V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp87 = 0.0010000000000000009V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp88 = tmp76 * tmp87V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp89 = tmp88 * tmp76V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp90 = tmp86 + tmp89V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp92 = tl.sqrt_rn(tmp90)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp94 = libdevice.pow(tmp85, tmp93)V0219 17:08:20.673000 23788 
torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp95 = 1.0V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp96 = tmp95 - tmp94V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp97 = tl.sqrt_rn(tmp96)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp98 = 0.9V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp99 = libdevice.pow(tmp98, tmp93)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp100 = tmp95 - tmp99V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp101 = tl.full([1], 1, tl.int32)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp102 = (tmp101 / tmp100)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp103 = 0.001V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp104 = tmp102 * tmp103V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp105 = -tmp104V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp106 = tmp97 * tmp105V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp107 = (tmp92 / tmp106)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp108 = (tmp101 / tmp105)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp109 = 1e-08V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp110 = tmp108 * tmp109V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp111 = tmp107 + tmp110V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp112 = (tmp83 / tmp111)V0219 17:08:20.673000 
23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp113 = tmp91 + tmp112V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tl.store(out_ptr15 + (x2), tmp83, None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tl.store(out_ptr16 + (x2), tmp90, None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tl.store(out_ptr17 + (x2), tmp113, None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     elif pid < num_xblocks_3:V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         pid_offset = pid - num_xblocks_2V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         xnumel = 1048576V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         r0_numel = 1V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         xoffset = pid_offset * XBLOCKV0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         xindex = xoffset + tl.arange(0, XBLOCK)[:]V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         xmask = tl.full([XBLOCK], True, tl.int1)[:]V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         x3 = xindexV0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp114 = tl.load(in_ptr15 + (x3), None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp115 = tl.load(in_ptr16 + (x3), None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp122 = tl.load(in_ptr17 + (x3), None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp129 = tl.load(in_ptr18 + (x3), None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         
tmp131 = in_ptr19V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp116 = tmp114 - tmp115V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp117 = 0.10000000149011612V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp118 = tmp117 * tmp116V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp119 = tl.full([1], False, tl.int1)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp120 = tl.where(tmp119, tmp114, tmp115)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp121 = tmp118 + tmp120V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp123 = 0.999V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp124 = tmp122 * tmp123V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp125 = 0.0010000000000000009V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp126 = tmp114 * tmp125V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp127 = tmp126 * tmp114V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp128 = tmp124 + tmp127V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp130 = tl.sqrt_rn(tmp128)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp132 = libdevice.pow(tmp123, tmp131)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp133 = 1.0V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp134 = tmp133 - tmp132V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp135 = tl.sqrt_rn(tmp134)V0219 17:08:20.673000 
23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp136 = 0.9V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp137 = libdevice.pow(tmp136, tmp131)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp138 = tmp133 - tmp137V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp139 = tl.full([1], 1, tl.int32)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp140 = (tmp139 / tmp138)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp141 = 0.001V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp142 = tmp140 * tmp141V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp143 = -tmp142V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp144 = tmp135 * tmp143V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp145 = (tmp130 / tmp144)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp146 = (tmp139 / tmp143)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp147 = 1e-08V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp148 = tmp146 * tmp147V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp149 = tmp145 + tmp148V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp150 = (tmp121 / tmp149)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp151 = tmp129 + tmp150V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tl.store(out_ptr21 + (x3), tmp121, None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         
tl.store(out_ptr22 + (x3), tmp128, None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tl.store(out_ptr23 + (x3), tmp151, None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     elif pid < num_xblocks_4:V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         pid_offset = pid - num_xblocks_3V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         xnumel = 1048576V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         r0_numel = 1V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         xoffset = pid_offset * XBLOCKV0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         xindex = xoffset + tl.arange(0, XBLOCK)[:]V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         xmask = tl.full([XBLOCK], True, tl.int1)[:]V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         x4 = xindexV0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp152 = tl.load(in_ptr20 + (x4), None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp153 = tl.load(in_ptr21 + (x4), None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp160 = tl.load(in_ptr22 + (x4), None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp167 = tl.load(in_ptr23 + (x4), None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp169 = in_ptr24V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp154 = tmp152 - tmp153V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp155 = 0.10000000149011612V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] 
[__output_code]         tmp156 = tmp155 * tmp154V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp157 = tl.full([1], False, tl.int1)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp158 = tl.where(tmp157, tmp152, tmp153)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp159 = tmp156 + tmp158V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp161 = 0.999V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp162 = tmp160 * tmp161V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp163 = 0.0010000000000000009V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp164 = tmp152 * tmp163V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp165 = tmp164 * tmp152V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp166 = tmp162 + tmp165V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp168 = tl.sqrt_rn(tmp166)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp170 = libdevice.pow(tmp161, tmp169)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp171 = 1.0V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp172 = tmp171 - tmp170V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp173 = tl.sqrt_rn(tmp172)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp174 = 0.9V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp175 = libdevice.pow(tmp174, tmp169)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp176 = tmp171 - 
tmp175V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp177 = tl.full([1], 1, tl.int32)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp178 = (tmp177 / tmp176)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp179 = 0.001V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp180 = tmp178 * tmp179V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp181 = -tmp180V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp182 = tmp173 * tmp181V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp183 = (tmp168 / tmp182)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp184 = (tmp177 / tmp181)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp185 = 1e-08V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp186 = tmp184 * tmp185V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp187 = tmp183 + tmp186V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp188 = (tmp159 / tmp187)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp189 = tmp167 + tmp188V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tl.store(out_ptr27 + (x4), tmp159, None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tl.store(out_ptr28 + (x4), tmp166, None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tl.store(out_ptr29 + (x4), tmp189, None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     elif pid < num_xblocks_5:V0219 17:08:20.673000 23788 
torch/_inductor/graph.py:2469] [0/0] [__output_code]         pid_offset = pid - num_xblocks_4V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         xnumel = 1048576V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         r0_numel = 1V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         xoffset = pid_offset * XBLOCKV0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         xindex = xoffset + tl.arange(0, XBLOCK)[:]V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         xmask = tl.full([XBLOCK], True, tl.int1)[:]V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         x5 = xindexV0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp190 = tl.load(in_ptr25 + (x5), None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp191 = tl.load(in_ptr26 + (x5), None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp198 = tl.load(in_ptr27 + (x5), None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp205 = tl.load(in_ptr28 + (x5), None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp207 = in_ptr29V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp192 = tmp190 - tmp191V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp193 = 0.10000000149011612V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp194 = tmp193 * tmp192V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp195 = tl.full([1], False, tl.int1)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp196 = tl.where(tmp195, tmp190, 
tmp191)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp197 = tmp194 + tmp196V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp199 = 0.999V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp200 = tmp198 * tmp199V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp201 = 0.0010000000000000009V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp202 = tmp190 * tmp201V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp203 = tmp202 * tmp190V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp204 = tmp200 + tmp203V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp206 = tl.sqrt_rn(tmp204)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp208 = libdevice.pow(tmp199, tmp207)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp209 = 1.0V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp210 = tmp209 - tmp208V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp211 = tl.sqrt_rn(tmp210)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp212 = 0.9V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp213 = libdevice.pow(tmp212, tmp207)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp214 = tmp209 - tmp213V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp215 = tl.full([1], 1, tl.int32)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp216 = (tmp215 / tmp214)V0219 17:08:20.673000 23788 
torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp217 = 0.001V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp218 = tmp216 * tmp217V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp219 = -tmp218V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp220 = tmp211 * tmp219V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp221 = (tmp206 / tmp220)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp222 = (tmp215 / tmp219)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp223 = 1e-08V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp224 = tmp222 * tmp223V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp225 = tmp221 + tmp224V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp226 = (tmp197 / tmp225)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp227 = tmp205 + tmp226V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tl.store(out_ptr33 + (x5), tmp197, None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tl.store(out_ptr34 + (x5), tmp204, None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tl.store(out_ptr35 + (x5), tmp227, None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     elif pid < num_xblocks_6:V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         pid_offset = pid - num_xblocks_5V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         xnumel = 1048576V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]    
     r0_numel = 1V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         xoffset = pid_offset * XBLOCKV0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         xindex = xoffset + tl.arange(0, XBLOCK)[:]V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         xmask = tl.full([XBLOCK], True, tl.int1)[:]V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         x6 = xindexV0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp228 = tl.load(in_ptr30 + (x6), None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp229 = tl.load(in_ptr31 + (x6), None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp236 = tl.load(in_ptr32 + (x6), None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp243 = tl.load(in_ptr33 + (x6), None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp245 = in_ptr34V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp230 = tmp228 - tmp229V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp231 = 0.10000000149011612V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp232 = tmp231 * tmp230V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp233 = tl.full([1], False, tl.int1)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp234 = tl.where(tmp233, tmp228, tmp229)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp235 = tmp232 + tmp234V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp237 = 0.999V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] 
[__output_code]         tmp238 = tmp236 * tmp237V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp239 = 0.0010000000000000009V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp240 = tmp228 * tmp239V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp241 = tmp240 * tmp228V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp242 = tmp238 + tmp241V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp244 = tl.sqrt_rn(tmp242)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp246 = libdevice.pow(tmp237, tmp245)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp247 = 1.0V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp248 = tmp247 - tmp246V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp249 = tl.sqrt_rn(tmp248)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp250 = 0.9V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp251 = libdevice.pow(tmp250, tmp245)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp252 = tmp247 - tmp251V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp253 = tl.full([1], 1, tl.int32)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp254 = (tmp253 / tmp252)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp255 = 0.001V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp256 = tmp254 * tmp255V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp257 = -tmp256V0219 
17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp258 = tmp249 * tmp257V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp259 = (tmp244 / tmp258)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp260 = (tmp253 / tmp257)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp261 = 1e-08V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp262 = tmp260 * tmp261V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp263 = tmp259 + tmp262V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp264 = (tmp235 / tmp263)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp265 = tmp243 + tmp264V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tl.store(out_ptr39 + (x6), tmp235, None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tl.store(out_ptr40 + (x6), tmp242, None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tl.store(out_ptr41 + (x6), tmp265, None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     elif pid < num_xblocks_7:V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         pid_offset = pid - num_xblocks_6V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         xnumel = 1048576V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         r0_numel = 1V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         xoffset = pid_offset * XBLOCKV0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         xindex = xoffset + tl.arange(0, XBLOCK)[:]V0219 17:08:20.673000 23788 
torch/_inductor/graph.py:2469] [0/0] [__output_code]         xmask = tl.full([XBLOCK], True, tl.int1)[:]V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         x7 = xindexV0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp266 = tl.load(in_ptr35 + (x7), None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp267 = tl.load(in_ptr36 + (x7), None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp274 = tl.load(in_ptr37 + (x7), None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp281 = tl.load(in_ptr38 + (x7), None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp283 = in_ptr39V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp268 = tmp266 - tmp267V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp269 = 0.10000000149011612V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp270 = tmp269 * tmp268V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp271 = tl.full([1], False, tl.int1)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp272 = tl.where(tmp271, tmp266, tmp267)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp273 = tmp270 + tmp272V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp275 = 0.999V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp276 = tmp274 * tmp275V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp277 = 0.0010000000000000009V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp278 = tmp266 * tmp277V0219 
17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp279 = tmp278 * tmp266V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp280 = tmp276 + tmp279V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp282 = tl.sqrt_rn(tmp280)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp284 = libdevice.pow(tmp275, tmp283)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp285 = 1.0V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp286 = tmp285 - tmp284V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp287 = tl.sqrt_rn(tmp286)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp288 = 0.9V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp289 = libdevice.pow(tmp288, tmp283)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp290 = tmp285 - tmp289V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp291 = tl.full([1], 1, tl.int32)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp292 = (tmp291 / tmp290)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp293 = 0.001V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp294 = tmp292 * tmp293V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp295 = -tmp294V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp296 = tmp287 * tmp295V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp297 = (tmp282 / tmp296)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] 
[__output_code]         tmp298 = (tmp291 / tmp295)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp299 = 1e-08V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp300 = tmp298 * tmp299V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp301 = tmp297 + tmp300V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp302 = (tmp273 / tmp301)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp303 = tmp281 + tmp302V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tl.store(out_ptr45 + (x7), tmp273, None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tl.store(out_ptr46 + (x7), tmp280, None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tl.store(out_ptr47 + (x7), tmp303, None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     elif pid < num_xblocks_8:V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         pid_offset = pid - num_xblocks_7V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         xnumel = 1048576V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         r0_numel = 1V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         xoffset = pid_offset * XBLOCKV0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         xindex = xoffset + tl.arange(0, XBLOCK)[:]V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         xmask = tl.full([XBLOCK], True, tl.int1)[:]V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         x8 = xindexV0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]        
 tmp304 = tl.load(in_ptr40 + (x8), None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp305 = tl.load(in_ptr41 + (x8), None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp312 = tl.load(in_ptr42 + (x8), None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp319 = tl.load(in_ptr43 + (x8), None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp321 = in_ptr44V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp306 = tmp304 - tmp305V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp307 = 0.10000000149011612V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp308 = tmp307 * tmp306V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp309 = tl.full([1], False, tl.int1)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp310 = tl.where(tmp309, tmp304, tmp305)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp311 = tmp308 + tmp310V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp313 = 0.999V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp314 = tmp312 * tmp313V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp315 = 0.0010000000000000009V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp316 = tmp304 * tmp315V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp317 = tmp316 * tmp304V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp318 = tmp314 + tmp317V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]    
     tmp320 = tl.sqrt_rn(tmp318)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp322 = libdevice.pow(tmp313, tmp321)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp323 = 1.0V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp324 = tmp323 - tmp322V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp325 = tl.sqrt_rn(tmp324)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp326 = 0.9V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp327 = libdevice.pow(tmp326, tmp321)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp328 = tmp323 - tmp327V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp329 = tl.full([1], 1, tl.int32)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp330 = (tmp329 / tmp328)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp331 = 0.001V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp332 = tmp330 * tmp331V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp333 = -tmp332V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp334 = tmp325 * tmp333V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp335 = (tmp320 / tmp334)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp336 = (tmp329 / tmp333)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp337 = 1e-08V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp338 = tmp336 * tmp337V0219 17:08:20.673000 23788 
torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp339 = tmp335 + tmp338V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp340 = (tmp311 / tmp339)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp341 = tmp319 + tmp340V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tl.store(out_ptr51 + (x8), tmp311, None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tl.store(out_ptr52 + (x8), tmp318, None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tl.store(out_ptr53 + (x8), tmp341, None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     elif pid < num_xblocks_9:V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         pid_offset = pid - num_xblocks_8V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         xnumel = 1048576V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         r0_numel = 1V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         xoffset = pid_offset * XBLOCKV0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         xindex = xoffset + tl.arange(0, XBLOCK)[:]V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         xmask = tl.full([XBLOCK], True, tl.int1)[:]V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         x9 = xindexV0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp342 = tl.load(in_ptr45 + (x9), None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp343 = tl.load(in_ptr46 + (x9), None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp350 = tl.load(in_ptr47 + (x9), 
None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp357 = tl.load(in_ptr48 + (x9), None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp359 = in_ptr49V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp344 = tmp342 - tmp343V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp345 = 0.10000000149011612V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp346 = tmp345 * tmp344V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp347 = tl.full([1], False, tl.int1)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp348 = tl.where(tmp347, tmp342, tmp343)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp349 = tmp346 + tmp348V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp351 = 0.999V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp352 = tmp350 * tmp351V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp353 = 0.0010000000000000009V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp354 = tmp342 * tmp353V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp355 = tmp354 * tmp342V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp356 = tmp352 + tmp355V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp358 = tl.sqrt_rn(tmp356)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp360 = libdevice.pow(tmp351, tmp359)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp361 = 1.0V0219 17:08:20.673000 23788 
torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp362 = tmp361 - tmp360V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp363 = tl.sqrt_rn(tmp362)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp364 = 0.9V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp365 = libdevice.pow(tmp364, tmp359)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp366 = tmp361 - tmp365V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp367 = tl.full([1], 1, tl.int32)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp368 = (tmp367 / tmp366)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp369 = 0.001V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp370 = tmp368 * tmp369V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp371 = -tmp370V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp372 = tmp363 * tmp371V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp373 = (tmp358 / tmp372)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp374 = (tmp367 / tmp371)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp375 = 1e-08V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp376 = tmp374 * tmp375V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp377 = tmp373 + tmp376V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp378 = (tmp349 / tmp377)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tmp379 = tmp357 + 
tmp378V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tl.store(out_ptr57 + (x9), tmp349, None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tl.store(out_ptr58 + (x9), tmp356, None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         tl.store(out_ptr59 + (x9), tmp379, None)V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     else:V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         passV0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code] ''', device_str='cuda')V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code] async_compile.wait(globals())V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code] del async_compileV0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code] class Runner:V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     def __init__(self, partitions):V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         self.partitions = partitionsV0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     def recursively_apply_fns(self, fns):V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         new_callables = []V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         for fn, c in zip(fns, self.partitions):V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]             
new_callables.append(fn(c))V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         self.partitions = new_callablesV0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     def call(self, args):V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         arg0_1, arg1_1, arg2_1, arg3_1, arg4_1, arg5_1, arg6_1, arg7_1, arg8_1, arg9_1, arg10_1, arg11_1, arg12_1, arg13_1, arg14_1, arg15_1, arg16_1, arg17_1, arg18_1, arg19_1, arg20_1, arg21_1, arg22_1, arg23_1, arg24_1, arg25_1, arg26_1, arg27_1, arg28_1, arg29_1, arg30_1, arg31_1, arg32_1, arg33_1, arg34_1, arg35_1, arg36_1, arg37_1, arg38_1, arg39_1, arg40_1, arg41_1, arg42_1, arg43_1, arg44_1, arg45_1, arg46_1, arg47_1, arg48_1, arg49_1 = argsV0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         args.clear()V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         assert_size_stride(arg0_1, (1024, 1024), (1024, 1))V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         assert_size_stride(arg1_1, (1024, 1024), (1024, 1))V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         assert_size_stride(arg2_1, (1024, 1024), (1024, 1))V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         assert_size_stride(arg3_1, (1024, 1024), (1024, 1))V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         assert_size_stride(arg4_1, (1024, 1024), (1024, 1))V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         assert_size_stride(arg5_1, (1024, 1024), (1024, 1))V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         assert_size_stride(arg6_1, (1024, 1024), (1024, 1))V0219 17:08:20.673000 23788 
torch/_inductor/graph.py:2469] [0/0] [__output_code]         assert_size_stride(arg7_1, (1024, 1024), (1024, 1))V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         assert_size_stride(arg8_1, (1024, 1024), (1024, 1))V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         assert_size_stride(arg9_1, (1024, 1024), (1024, 1))V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         assert_size_stride(arg10_1, (), ())V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         assert_size_stride(arg11_1, (), ())V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         assert_size_stride(arg12_1, (), ())V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         assert_size_stride(arg13_1, (), ())V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         assert_size_stride(arg14_1, (), ())V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         assert_size_stride(arg15_1, (), ())V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         assert_size_stride(arg16_1, (), ())V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         assert_size_stride(arg17_1, (), ())V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         assert_size_stride(arg18_1, (), ())V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         assert_size_stride(arg19_1, (), ())V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         assert_size_stride(arg20_1, (1024, 1024), (1024, 1))V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         assert_size_stride(arg21_1, (1024, 1024), (1024, 1))V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]      
   assert_size_stride(arg22_1, (1024, 1024), (1024, 1))
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         assert_size_stride(arg23_1, (1024, 1024), (1024, 1))
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         assert_size_stride(arg24_1, (1024, 1024), (1024, 1))
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         assert_size_stride(arg25_1, (1024, 1024), (1024, 1))
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         assert_size_stride(arg26_1, (1024, 1024), (1024, 1))
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         assert_size_stride(arg27_1, (1024, 1024), (1024, 1))
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         assert_size_stride(arg28_1, (1024, 1024), (1024, 1))
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         assert_size_stride(arg29_1, (1024, 1024), (1024, 1))
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         assert_size_stride(arg30_1, (1024, 1024), (1024, 1))
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         assert_size_stride(arg31_1, (1024, 1024), (1024, 1))
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         assert_size_stride(arg32_1, (1024, 1024), (1024, 1))
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         assert_size_stride(arg33_1, (1024, 1024), (1024, 1))
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         assert_size_stride(arg34_1, (1024, 1024), (1024, 1))
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         assert_size_stride(arg35_1, (1024, 1024), (1024, 1))
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         assert_size_stride(arg36_1, (1024, 1024), (1024, 1))
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         assert_size_stride(arg37_1, (1024, 1024), (1024, 1))
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         assert_size_stride(arg38_1, (1024, 1024), (1024, 1))
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         assert_size_stride(arg39_1, (1024, 1024), (1024, 1))
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         assert_size_stride(arg40_1, (1024, 1024), (1024, 1))
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         assert_size_stride(arg41_1, (1024, 1024), (1024, 1))
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         assert_size_stride(arg42_1, (1024, 1024), (1024, 1))
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         assert_size_stride(arg43_1, (1024, 1024), (1024, 1))
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         assert_size_stride(arg44_1, (1024, 1024), (1024, 1))
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         assert_size_stride(arg45_1, (1024, 1024), (1024, 1))
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         assert_size_stride(arg46_1, (1024, 1024), (1024, 1))
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         assert_size_stride(arg47_1, (1024, 1024), (1024, 1))
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         assert_size_stride(arg48_1, (1024, 1024), (1024, 1))
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         assert_size_stride(arg49_1, (1024, 1024), (1024, 1))
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         cpp_fused__foreach_copy_0(arg10_1, arg11_1, arg12_1, arg13_1, arg14_1, arg15_1, arg16_1, arg17_1, arg18_1, arg19_1, arg10_1, arg11_1, arg12_1, arg13_1, arg14_1, arg15_1, arg16_1, arg17_1, arg18_1, arg19_1)
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         with torch.cuda._DeviceGuard(0):
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]             torch.cuda.set_device(0)
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]             # Unsorted Source Nodes: [], Original ATen: []
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]             stream0 = get_raw_stream(0)
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]             triton_for_fused_1.run(arg30_1, arg20_1, arg40_1, arg0_1, arg10_1.item(), arg31_1, arg21_1, arg41_1, arg1_1, arg11_1.item(), arg32_1, arg22_1, arg42_1, arg2_1, arg12_1.item(), arg33_1, arg23_1, arg43_1, arg3_1, arg13_1.item(), arg34_1, arg24_1, arg44_1, arg4_1, arg14_1.item(), arg35_1, arg25_1, arg45_1, arg5_1, arg15_1.item(), arg36_1, arg26_1, arg46_1, arg6_1, arg16_1.item(), arg37_1, arg27_1, arg47_1, arg7_1, arg17_1.item(), arg38_1, arg28_1, arg48_1, arg8_1, arg18_1.item(), arg39_1, arg29_1, arg49_1, arg9_1, arg19_1.item(), arg20_1, arg40_1, arg0_1, arg21_1, arg41_1, arg1_1, arg22_1, arg42_1, arg2_1, arg23_1, arg43_1, arg3_1, arg24_1, arg44_1, arg4_1, arg25_1, arg45_1, arg5_1, arg26_1, arg46_1, arg6_1, arg27_1, arg47_1, arg7_1, arg28_1, arg48_1, arg8_1, arg29_1, arg49_1, arg9_1, stream=stream0)
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]             del arg0_1
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]             del arg10_1
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]             del arg11_1
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]             del arg12_1
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]             del arg13_1
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]             del arg14_1
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]             del arg15_1
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]             del arg16_1
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]             del arg17_1
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]             del arg18_1
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]             del arg19_1
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]             del arg1_1
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]             del arg20_1
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]             del arg21_1
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]             del arg22_1
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]             del arg23_1
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]             del arg24_1
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]             del arg25_1
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]             del arg26_1
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]             del arg27_1
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]             del arg28_1
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]             del arg29_1
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]             del arg2_1
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]             del arg30_1
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]             del arg31_1
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]             del arg32_1
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]             del arg33_1
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]             del arg34_1
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]             del arg35_1
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]             del arg36_1
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]             del arg37_1
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]             del arg38_1
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]             del arg39_1
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]             del arg3_1
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]             del arg40_1
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]             del arg41_1
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]             del arg42_1
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]             del arg43_1
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]             del arg44_1
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]             del arg45_1
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]             del arg46_1
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]             del arg47_1
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]             del arg48_1
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]             del arg49_1
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]             del arg4_1
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]             del arg5_1
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]             del arg6_1
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]             del arg7_1
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]             del arg8_1
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]             del arg9_1
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]         return ()
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code] runner = Runner(partitions=[])
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code] call = runner.call
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code] recursively_apply_fns = runner.recursively_apply_fns
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code] def benchmark_compiled_module(times=10, repeat=10):
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     from torch._dynamo.testing import rand_strided
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     from torch._inductor.utils import print_performance
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     arg0_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     arg1_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     arg2_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     arg3_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     arg4_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     arg5_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     arg6_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     arg7_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     arg8_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     arg9_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     arg10_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     arg11_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     arg12_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     arg13_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     arg14_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     arg15_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     arg16_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     arg17_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     arg18_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     arg19_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     arg20_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     arg21_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     arg22_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     arg23_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     arg24_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     arg25_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     arg26_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     arg27_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     arg28_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     arg29_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     arg30_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     arg31_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     arg32_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     arg33_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     arg34_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     arg35_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     arg36_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     arg37_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     arg38_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     arg39_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     arg40_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     arg41_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     arg42_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     arg43_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     arg44_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     arg45_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     arg46_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     arg47_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     arg48_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     arg49_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     fn = lambda: call([arg0_1, arg1_1, arg2_1, arg3_1, arg4_1, arg5_1, arg6_1, arg7_1, arg8_1, arg9_1, arg10_1, arg11_1, arg12_1, arg13_1, arg14_1, arg15_1, arg16_1, arg17_1, arg18_1, arg19_1, arg20_1, arg21_1, arg22_1, arg23_1, arg24_1, arg25_1, arg26_1, arg27_1, arg28_1, arg29_1, arg30_1, arg31_1, arg32_1, arg33_1, arg34_1, arg35_1, arg36_1, arg37_1, arg38_1, arg39_1, arg40_1, arg41_1, arg42_1, arg43_1, arg44_1, arg45_1, arg46_1, arg47_1, arg48_1, arg49_1])
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     return print_performance(fn, times=times, repeat=repeat)
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code] if __name__ == "__main__":
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     from torch._inductor.wrapper_benchmark import compiled_module_main
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]     compiled_module_main('None', benchmark_compiled_module)
V0219 17:08:20.673000 23788 torch/_inductor/graph.py:2469] [0/0] [__output_code]
V0219 17:08:20.724000 23788 torch/_inductor/graph.py:2480] [0/0] [__output_code] Output code written to: /tmp/torchinductor_ci-user/vt/cvtdpmeiofjorzqp4lqo47274rqf4zsrel53eevwsmtsjymolnx3.py
I0219 17:08:21.972000 23788 torch/_inductor/graph.py:2440] [0/0] [__output_code] Output code written to: /tmp/torchinductor_ci-user/vt/cvtdpmeiofjorzqp4lqo47274rqf4zsrel53eevwsmtsjymolnx3.py
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code] Output code:
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code] # AOT ID: ['1_inference']
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code] from ctypes import c_void_p, c_long, c_int
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code] import torch
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code] import math
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code] import random
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code] import os
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code] import tempfile
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code] from math import inf, nan
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code] from cmath import nanj
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code] from torch._inductor.hooks import run_intermediate_hooks
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code] from torch._inductor.utils import maybe_profile
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code] from torch._inductor.codegen.memory_planning import _align as align
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code] from torch import device, empty_strided
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code] from torch._inductor.async_compile import AsyncCompile
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code] from torch._inductor.select_algorithm import extern_kernels
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code] from torch._C import _cuda_getCurrentRawStream as get_raw_stream
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code] import triton
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code] import triton.language as tl
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code] from torch._inductor.runtime.triton_heuristics import start_graph, end_graph
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code] from torch._C import _cuda_getCurrentRawStream as get_raw_stream
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code] aten = torch.ops.aten
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code] inductor_ops = torch.ops.inductor
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code] _quantized = torch.ops._quantized
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code] assert_size_stride = torch._C._dynamo.guards.assert_size_stride
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code] assert_alignment = torch._C._dynamo.guards.assert_alignment
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code] empty_strided_cpu = torch._C._dynamo.guards._empty_strided_cpu
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code] empty_strided_cpu_pinned = torch._C._dynamo.guards._empty_strided_cpu_pinned
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code] empty_strided_cuda = torch._C._dynamo.guards._empty_strided_cuda
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code] empty_strided_xpu = torch._C._dynamo.guards._empty_strided_xpu
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code] empty_strided_mtia = torch._C._dynamo.guards._empty_strided_mtia
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code] reinterpret_tensor = torch._C._dynamo.guards._reinterpret_tensor
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code] alloc_from_pool = torch.ops.inductor._alloc_from_pool
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code] async_compile = AsyncCompile()
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code] empty_strided_p2p = torch._C._distributed_c10d._SymmetricMemory.empty_strided_p2p
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code] cpp_fused__foreach_copy_0 = async_compile.cpp_pybinding(['const float*', 'const float*', 'const float*', 'const float*', 'const float*', 'const float*', 'const float*', 'const float*', 'const float*', 'const float*', 'float*', 'float*', 'float*', 'float*', 'float*', 'float*', 'float*', 'float*', 'float*', 'float*'], r'''
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code] #include <torch/csrc/inductor/cpp_prefix.h>
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code] extern "C"  void  kernel(const float* in_ptr0,
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]                        const float* in_ptr1,
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]                        const float* in_ptr2,
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]                        const float* in_ptr3,
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]                        const float* in_ptr4,
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]                        const float* in_ptr5,
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]                        const float* in_ptr6,
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]                        const float* in_ptr7,
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]                        const float* in_ptr8,
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]                        const float* in_ptr9,
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]                        float* out_ptr0,
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]                        float* out_ptr1,
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]                        float* out_ptr2,
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]                        float* out_ptr3,
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]                        float* out_ptr4,
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]                        float* out_ptr5,
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]                        float* out_ptr6,
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]                        float* out_ptr7,
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]                        float* out_ptr8,
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]                        float* out_ptr9)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code] {
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     {
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         {
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             {
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]                 auto tmp0 = in_ptr0[static_cast<int64_t>(0L)];
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]                 auto tmp1 = static_cast<float>(1.0);
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]                 auto tmp2 = float(tmp0 + tmp1);
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]                 out_ptr0[static_cast<int64_t>(0L)] = tmp2;
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             }
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         }
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     }
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     {
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         {
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             {
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]                 auto tmp0 = in_ptr1[static_cast<int64_t>(0L)];
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]                 auto tmp1 = static_cast<float>(1.0);
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]                 auto tmp2 = float(tmp0 + tmp1);
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]                 out_ptr1[static_cast<int64_t>(0L)] = tmp2;
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             }
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         }
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     }
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     {
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         {
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             {
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]                 auto tmp0 = in_ptr2[static_cast<int64_t>(0L)];
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]                 auto tmp1 = static_cast<float>(1.0);
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]                 auto tmp2 = float(tmp0 + tmp1);
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]                 out_ptr2[static_cast<int64_t>(0L)] = tmp2;
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             }
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         }
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     }
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     {
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         {
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             {
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]                 auto tmp0 = in_ptr3[static_cast<int64_t>(0L)];
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]                 auto tmp1 = static_cast<float>(1.0);
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]                 auto tmp2 = float(tmp0 + tmp1);
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]                 out_ptr3[static_cast<int64_t>(0L)] = tmp2;
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             }
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         }
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     }
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     {
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         {
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             {
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]                 auto tmp0 = in_ptr4[static_cast<int64_t>(0L)];
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]                 auto tmp1 = static_cast<float>(1.0);
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]                 auto tmp2 = float(tmp0 + tmp1);
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]                 out_ptr4[static_cast<int64_t>(0L)] = tmp2;
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             }
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         }
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     }
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     {
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         {
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             {
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]                 auto tmp0 = in_ptr5[static_cast<int64_t>(0L)];
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]                 auto tmp1 = static_cast<float>(1.0);
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]                 auto tmp2 = float(tmp0 + tmp1);
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]                 out_ptr5[static_cast<int64_t>(0L)] = tmp2;
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             }
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         }
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     }
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     {
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         {
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             {
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]                 auto tmp0 = in_ptr6[static_cast<int64_t>(0L)];
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]                 auto tmp1 = static_cast<float>(1.0);
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]                 auto tmp2 = float(tmp0 + tmp1);
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]                 out_ptr6[static_cast<int64_t>(0L)] = tmp2;
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             }
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         }
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     }
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     {
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         {
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             {
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]                 auto tmp0 = in_ptr7[static_cast<int64_t>(0L)];
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]                 auto tmp1 = static_cast<float>(1.0);
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]                 auto tmp2 = float(tmp0 + tmp1);
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]                 out_ptr7[static_cast<int64_t>(0L)] = tmp2;
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             }
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         }
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     }
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     {
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         {
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             {
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]                 auto tmp0 = in_ptr8[static_cast<int64_t>(0L)];
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]                 auto tmp1 = static_cast<float>(1.0);
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]                 auto tmp2 = float(tmp0 + tmp1);
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]                 out_ptr8[static_cast<int64_t>(0L)] = tmp2;
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             }
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         }
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     }
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     {
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         {
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             {
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]                 auto tmp0 = in_ptr9[static_cast<int64_t>(0L)];
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]                 auto tmp1 = static_cast<float>(1.0);
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]                 auto tmp2 = float(tmp0 + tmp1);
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]                 out_ptr9[static_cast<int64_t>(0L)] = tmp2;
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             }
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         }
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     }
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code] }
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code] ''')
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code] # kernel path: /tmp/torchinductor_ci-user/gq/cgqhtbf2gwgsfmgfs2f4ajlpqjhrdhscqtxcif6oteglzjt3a3rq.py
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code] # Unsorted Source Nodes: [], Original ATen: []
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code] # Source node to ATen node mapping:
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code] triton_for_fused_1 = async_compile.triton('triton_for_fused_1', '''
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code] import triton
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code] import triton.language as tl
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code] from torch._inductor.runtime import triton_helpers, triton_heuristics
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code] from torch._inductor.runtime.triton_helpers import libdevice, math as 
tl_mathV0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code] from torch._inductor.runtime.hints import AutotuneHint, ReductionHint, TileHint, DevicePropertiesV0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code] @triton_heuristics.foreach(V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     filename=__file__,V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     triton_meta={'signature': {'in_ptr0': '*fp32', 'in_ptr1': '*fp32', 'in_ptr2': '*fp32', 'in_ptr3': '*fp32', 'in_ptr4': 'fp32', 'in_ptr5': '*fp32', 'in_ptr6': '*fp32', 'in_ptr7': '*fp32', 'in_ptr8': '*fp32', 'in_ptr9': 'fp32', 'in_ptr10': '*fp32', 'in_ptr11': '*fp32', 'in_ptr12': '*fp32', 'in_ptr13': '*fp32', 'in_ptr14': 'fp32', 'in_ptr15': '*fp32', 'in_ptr16': '*fp32', 'in_ptr17': '*fp32', 'in_ptr18': '*fp32', 'in_ptr19': 'fp32', 'in_ptr20': '*fp32', 'in_ptr21': '*fp32', 'in_ptr22': '*fp32', 'in_ptr23': '*fp32', 'in_ptr24': 'fp32', 'in_ptr25': '*fp32', 'in_ptr26': '*fp32', 'in_ptr27': '*fp32', 'in_ptr28': '*fp32', 'in_ptr29': 'fp32', 'in_ptr30': '*fp32', 'in_ptr31': '*fp32', 'in_ptr32': '*fp32', 'in_ptr33': '*fp32', 'in_ptr34': 'fp32', 'in_ptr35': '*fp32', 'in_ptr36': '*fp32', 'in_ptr37': '*fp32', 'in_ptr38': '*fp32', 'in_ptr39': 'fp32', 'in_ptr40': '*fp32', 'in_ptr41': '*fp32', 'in_ptr42': '*fp32', 'in_ptr43': '*fp32', 'in_ptr44': 'fp32', 'in_ptr45': '*fp32', 'in_ptr46': '*fp32', 'in_ptr47': '*fp32', 'in_ptr48': '*fp32', 'in_ptr49': 'fp32', 'out_ptr3': '*fp32', 'out_ptr4': '*fp32', 'out_ptr5': '*fp32', 'out_ptr9': '*fp32', 'out_ptr10': '*fp32', 'out_ptr11': '*fp32', 'out_ptr15': '*fp32', 'out_ptr16': '*fp32', 'out_ptr17': '*fp32', 'out_ptr21': '*fp32', 'out_ptr22': '*fp32', 'out_ptr23': '*fp32', 'out_ptr27': '*fp32', 'out_ptr28': '*fp32', 'out_ptr29': '*fp32', 'out_ptr33': '*fp32', 'out_ptr34': '*fp32', 
'out_ptr35': '*fp32', 'out_ptr39': '*fp32', 'out_ptr40': '*fp32', 'out_ptr41': '*fp32', 'out_ptr45': '*fp32', 'out_ptr46': '*fp32', 'out_ptr47': '*fp32', 'out_ptr51': '*fp32', 'out_ptr52': '*fp32', 'out_ptr53': '*fp32', 'out_ptr57': '*fp32', 'out_ptr58': '*fp32', 'out_ptr59': '*fp32'}, 'device': DeviceProperties(type='cuda', index=0, multi_processor_count=80, cc=86, major=8, regs_per_multiprocessor=65536, max_threads_per_multi_processor=1536, max_threads_per_block=1024, warp_size=32), 'constants': {}, 'configs': [{(0,): [['tt.divisibility', 16]], (1,): [['tt.divisibility', 16]], (2,): [['tt.divisibility', 16]], (3,): [['tt.divisibility', 16]], (5,): [['tt.divisibility', 16]], (6,): [['tt.divisibility', 16]], (7,): [['tt.divisibility', 16]], (8,): [['tt.divisibility', 16]], (10,): [['tt.divisibility', 16]], (11,): [['tt.divisibility', 16]], (12,): [['tt.divisibility', 16]], (13,): [['tt.divisibility', 16]], (15,): [['tt.divisibility', 16]], (16,): [['tt.divisibility', 16]], (17,): [['tt.divisibility', 16]], (18,): [['tt.divisibility', 16]], (20,): [['tt.divisibility', 16]], (21,): [['tt.divisibility', 16]], (22,): [['tt.divisibility', 16]], (23,): [['tt.divisibility', 16]], (25,): [['tt.divisibility', 16]], (26,): [['tt.divisibility', 16]], (27,): [['tt.divisibility', 16]], (28,): [['tt.divisibility', 16]], (30,): [['tt.divisibility', 16]], (31,): [['tt.divisibility', 16]], (32,): [['tt.divisibility', 16]], (33,): [['tt.divisibility', 16]], (35,): [['tt.divisibility', 16]], (36,): [['tt.divisibility', 16]], (37,): [['tt.divisibility', 16]], (38,): [['tt.divisibility', 16]], (40,): [['tt.divisibility', 16]], (41,): [['tt.divisibility', 16]], (42,): [['tt.divisibility', 16]], (43,): [['tt.divisibility', 16]], (45,): [['tt.divisibility', 16]], (46,): [['tt.divisibility', 16]], (47,): [['tt.divisibility', 16]], (48,): [['tt.divisibility', 16]], (50,): [['tt.divisibility', 16]], (51,): [['tt.divisibility', 16]], (52,): [['tt.divisibility', 16]], (53,): 
[['tt.divisibility', 16]], (54,): [['tt.divisibility', 16]], (55,): [['tt.divisibility', 16]], (56,): [['tt.divisibility', 16]], (57,): [['tt.divisibility', 16]], (58,): [['tt.divisibility', 16]], (59,): [['tt.divisibility', 16]], (60,): [['tt.divisibility', 16]], (61,): [['tt.divisibility', 16]], (62,): [['tt.divisibility', 16]], (63,): [['tt.divisibility', 16]], (64,): [['tt.divisibility', 16]], (65,): [['tt.divisibility', 16]], (66,): [['tt.divisibility', 16]], (67,): [['tt.divisibility', 16]], (68,): [['tt.divisibility', 16]], (69,): [['tt.divisibility', 16]], (70,): [['tt.divisibility', 16]], (71,): [['tt.divisibility', 16]], (72,): [['tt.divisibility', 16]], (73,): [['tt.divisibility', 16]], (74,): [['tt.divisibility', 16]], (75,): [['tt.divisibility', 16]], (76,): [['tt.divisibility', 16]], (77,): [['tt.divisibility', 16]], (78,): [['tt.divisibility', 16]], (79,): [['tt.divisibility', 16]]}]},V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     inductor_meta={'grid_type': 'SequentialComboKernelGrid', 'combo_grid_meta': {'num_kernels': 10, 'min_blocks': 0, 'default_config': {'XBLOCK': 1024}, 'no_x_dim_0': False, 'xnumel_0': 1048576, 'no_x_dim_1': False, 'xnumel_1': 1048576, 'no_x_dim_2': False, 'xnumel_2': 1048576, 'no_x_dim_3': False, 'xnumel_3': 1048576, 'no_x_dim_4': False, 'xnumel_4': 1048576, 'no_x_dim_5': False, 'xnumel_5': 1048576, 'no_x_dim_6': False, 'xnumel_6': 1048576, 'no_x_dim_7': False, 'xnumel_7': 1048576, 'no_x_dim_8': False, 'xnumel_8': 1048576, 'no_x_dim_9': False, 'xnumel_9': 1048576}, 'kernel_name': 'triton_for_fused_1', 'mutated_arg_names': ['in_ptr1', 'in_ptr11', 'in_ptr12', 'in_ptr13', 'in_ptr16', 'in_ptr17', 'in_ptr18', 'in_ptr2', 'in_ptr21', 'in_ptr22', 'in_ptr23', 'in_ptr26', 'in_ptr27', 'in_ptr28', 'in_ptr3', 'in_ptr31', 'in_ptr32', 'in_ptr33', 'in_ptr36', 'in_ptr37', 'in_ptr38', 'in_ptr41', 'in_ptr42', 'in_ptr43', 'in_ptr46', 'in_ptr47', 'in_ptr48', 'in_ptr6', 'in_ptr7', 'in_ptr8', 'out_ptr10', 
'out_ptr11', 'out_ptr15', 'out_ptr16', 'out_ptr17', 'out_ptr21', 'out_ptr22', 'out_ptr23', 'out_ptr27', 'out_ptr28', 'out_ptr29', 'out_ptr3', 'out_ptr33', 'out_ptr34', 'out_ptr35', 'out_ptr39', 'out_ptr4', 'out_ptr40', 'out_ptr41', 'out_ptr45', 'out_ptr46', 'out_ptr47', 'out_ptr5', 'out_ptr51', 'out_ptr52', 'out_ptr53', 'out_ptr57', 'out_ptr58', 'out_ptr59', 'out_ptr9'], 'backend_hash': '130560DF8C676AFCBC44717C6A9B3C6A2EC6174C11ECC01A816D2F75FFBF9BD0', 'assert_indirect_indexing': True, 'autotune_local_cache': True, 'autotune_pointwise': True, 'autotune_remote_cache': None, 'force_disable_caches': False, 'dynamic_scale_rblock': True, 'max_autotune': False, 'max_autotune_pointwise': False, 'min_split_scan_rblock': 256, 'spill_threshold': 16, 'store_cubin': False, 'deterministic': False, 'force_filter_reduction_configs': False, 'are_deterministic_algorithms_enabled': False},V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code] )V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code] @triton.jitV0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code] def triton_for_fused_1(in_ptr0, in_ptr1, in_ptr2, in_ptr3, in_ptr4, in_ptr5, in_ptr6, in_ptr7, in_ptr8, in_ptr9, in_ptr10, in_ptr11, in_ptr12, in_ptr13, in_ptr14, in_ptr15, in_ptr16, in_ptr17, in_ptr18, in_ptr19, in_ptr20, in_ptr21, in_ptr22, in_ptr23, in_ptr24, in_ptr25, in_ptr26, in_ptr27, in_ptr28, in_ptr29, in_ptr30, in_ptr31, in_ptr32, in_ptr33, in_ptr34, in_ptr35, in_ptr36, in_ptr37, in_ptr38, in_ptr39, in_ptr40, in_ptr41, in_ptr42, in_ptr43, in_ptr44, in_ptr45, in_ptr46, in_ptr47, in_ptr48, in_ptr49, out_ptr3, out_ptr4, out_ptr5, out_ptr9, out_ptr10, out_ptr11, out_ptr15, out_ptr16, out_ptr17, out_ptr21, out_ptr22, out_ptr23, out_ptr27, out_ptr28, out_ptr29, out_ptr33, out_ptr34, out_ptr35, out_ptr39, out_ptr40, out_ptr41, out_ptr45, out_ptr46, out_ptr47, out_ptr51, out_ptr52, out_ptr53, out_ptr57, out_ptr58, out_ptr59):V0219 
17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     pid = tl.program_id(0)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     XBLOCK: tl.constexpr = 1024V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     num_xblocks_0 = tl.cdiv(1048576, XBLOCK)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     num_xblocks_1 = num_xblocks_0 + tl.cdiv(1048576, XBLOCK)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     num_xblocks_2 = num_xblocks_1 + tl.cdiv(1048576, XBLOCK)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     num_xblocks_3 = num_xblocks_2 + tl.cdiv(1048576, XBLOCK)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     num_xblocks_4 = num_xblocks_3 + tl.cdiv(1048576, XBLOCK)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     num_xblocks_5 = num_xblocks_4 + tl.cdiv(1048576, XBLOCK)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     num_xblocks_6 = num_xblocks_5 + tl.cdiv(1048576, XBLOCK)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     num_xblocks_7 = num_xblocks_6 + tl.cdiv(1048576, XBLOCK)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     num_xblocks_8 = num_xblocks_7 + tl.cdiv(1048576, XBLOCK)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     num_xblocks_9 = num_xblocks_8 + tl.cdiv(1048576, XBLOCK)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     if pid < num_xblocks_0:V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         pid_offset = pidV0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         xnumel = 1048576V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] 
[__output_code]         r0_numel = 1V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         xoffset = pid_offset * XBLOCKV0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         xindex = xoffset + tl.arange(0, XBLOCK)[:]V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         xmask = tl.full([XBLOCK], True, tl.int1)[:]V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         x0 = xindexV0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp0 = tl.load(in_ptr0 + (x0), None)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp1 = tl.load(in_ptr1 + (x0), None)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp8 = tl.load(in_ptr2 + (x0), None)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp15 = tl.load(in_ptr3 + (x0), None)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp17 = in_ptr4V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp2 = tmp0 - tmp1V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp3 = 0.10000000149011612V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp4 = tmp3 * tmp2V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp5 = tl.full([1], False, tl.int1)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp6 = tl.where(tmp5, tmp0, tmp1)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp7 = tmp4 + tmp6V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp9 = 0.999V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp10 
= tmp8 * tmp9V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp11 = 0.0010000000000000009V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp12 = tmp0 * tmp11V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp13 = tmp12 * tmp0V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp14 = tmp10 + tmp13V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp16 = tl.sqrt_rn(tmp14)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp18 = libdevice.pow(tmp9, tmp17)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp19 = 1.0V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp20 = tmp19 - tmp18V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp21 = tl.sqrt_rn(tmp20)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp22 = 0.9V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp23 = libdevice.pow(tmp22, tmp17)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp24 = tmp19 - tmp23V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp25 = tl.full([1], 1, tl.int32)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp26 = (tmp25 / tmp24)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp27 = 0.001V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp28 = tmp26 * tmp27V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp29 = -tmp28V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp30 = 
tmp21 * tmp29V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp31 = (tmp16 / tmp30)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp32 = (tmp25 / tmp29)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp33 = 1e-08V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp34 = tmp32 * tmp33V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp35 = tmp31 + tmp34V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp36 = (tmp7 / tmp35)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp37 = tmp15 + tmp36V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tl.store(out_ptr3 + (x0), tmp7, None)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tl.store(out_ptr4 + (x0), tmp14, None)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tl.store(out_ptr5 + (x0), tmp37, None)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     elif pid < num_xblocks_1:V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         pid_offset = pid - num_xblocks_0V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         xnumel = 1048576V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         r0_numel = 1V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         xoffset = pid_offset * XBLOCKV0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         xindex = xoffset + tl.arange(0, XBLOCK)[:]V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         xmask = tl.full([XBLOCK], True, tl.int1)[:]V0219 17:08:24.508000 
23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         x1 = xindexV0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp38 = tl.load(in_ptr5 + (x1), None)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp39 = tl.load(in_ptr6 + (x1), None)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp46 = tl.load(in_ptr7 + (x1), None)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp53 = tl.load(in_ptr8 + (x1), None)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp55 = in_ptr9V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp40 = tmp38 - tmp39V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp41 = 0.10000000149011612V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp42 = tmp41 * tmp40V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp43 = tl.full([1], False, tl.int1)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp44 = tl.where(tmp43, tmp38, tmp39)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp45 = tmp42 + tmp44V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp47 = 0.999V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp48 = tmp46 * tmp47V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp49 = 0.0010000000000000009V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp50 = tmp38 * tmp49V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp51 = tmp50 * tmp38V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] 
[__output_code]         tmp52 = tmp48 + tmp51V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp54 = tl.sqrt_rn(tmp52)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp56 = libdevice.pow(tmp47, tmp55)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp57 = 1.0V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp58 = tmp57 - tmp56V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp59 = tl.sqrt_rn(tmp58)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp60 = 0.9V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp61 = libdevice.pow(tmp60, tmp55)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp62 = tmp57 - tmp61V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp63 = tl.full([1], 1, tl.int32)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp64 = (tmp63 / tmp62)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp65 = 0.001V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp66 = tmp64 * tmp65V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp67 = -tmp66V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp68 = tmp59 * tmp67V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp69 = (tmp54 / tmp68)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp70 = (tmp63 / tmp67)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp71 = 1e-08V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] 
[__output_code]         tmp72 = tmp70 * tmp71V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp73 = tmp69 + tmp72V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp74 = (tmp45 / tmp73)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp75 = tmp53 + tmp74V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tl.store(out_ptr9 + (x1), tmp45, None)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tl.store(out_ptr10 + (x1), tmp52, None)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tl.store(out_ptr11 + (x1), tmp75, None)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     elif pid < num_xblocks_2:V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         pid_offset = pid - num_xblocks_1V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         xnumel = 1048576V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         r0_numel = 1V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         xoffset = pid_offset * XBLOCKV0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         xindex = xoffset + tl.arange(0, XBLOCK)[:]V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         xmask = tl.full([XBLOCK], True, tl.int1)[:]V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         x2 = xindexV0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp76 = tl.load(in_ptr10 + (x2), None)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp77 = tl.load(in_ptr11 + (x2), None)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] 
[__output_code]         tmp84 = tl.load(in_ptr12 + (x2), None)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp91 = tl.load(in_ptr13 + (x2), None)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp93 = in_ptr14V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp78 = tmp76 - tmp77V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp79 = 0.10000000149011612V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp80 = tmp79 * tmp78V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp81 = tl.full([1], False, tl.int1)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp82 = tl.where(tmp81, tmp76, tmp77)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp83 = tmp80 + tmp82V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp85 = 0.999V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp86 = tmp84 * tmp85V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp87 = 0.0010000000000000009V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp88 = tmp76 * tmp87V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp89 = tmp88 * tmp76V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp90 = tmp86 + tmp89V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp92 = tl.sqrt_rn(tmp90)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp94 = libdevice.pow(tmp85, tmp93)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp95 = 1.0V0219 
17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp96 = tmp95 - tmp94V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp97 = tl.sqrt_rn(tmp96)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp98 = 0.9V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp99 = libdevice.pow(tmp98, tmp93)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp100 = tmp95 - tmp99V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp101 = tl.full([1], 1, tl.int32)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp102 = (tmp101 / tmp100)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp103 = 0.001V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp104 = tmp102 * tmp103V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp105 = -tmp104V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp106 = tmp97 * tmp105V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp107 = (tmp92 / tmp106)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp108 = (tmp101 / tmp105)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp109 = 1e-08V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp110 = tmp108 * tmp109V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp111 = tmp107 + tmp110V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp112 = (tmp83 / tmp111)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp113 = tmp91 
+ tmp112
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tl.store(out_ptr15 + (x2), tmp83, None)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tl.store(out_ptr16 + (x2), tmp90, None)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tl.store(out_ptr17 + (x2), tmp113, None)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     elif pid < num_xblocks_3:
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         pid_offset = pid - num_xblocks_2
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         xnumel = 1048576
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         r0_numel = 1
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         xoffset = pid_offset * XBLOCK
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         xmask = tl.full([XBLOCK], True, tl.int1)[:]
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         x3 = xindex
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp114 = tl.load(in_ptr15 + (x3), None)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp115 = tl.load(in_ptr16 + (x3), None)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp122 = tl.load(in_ptr17 + (x3), None)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp129 = tl.load(in_ptr18 + (x3), None)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp131 = in_ptr19
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp116 = tmp114 - tmp115
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp117 = 0.10000000149011612
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp118 = tmp117 * tmp116
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp119 = tl.full([1], False, tl.int1)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp120 = tl.where(tmp119, tmp114, tmp115)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp121 = tmp118 + tmp120
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp123 = 0.999
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp124 = tmp122 * tmp123
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp125 = 0.0010000000000000009
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp126 = tmp114 * tmp125
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp127 = tmp126 * tmp114
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp128 = tmp124 + tmp127
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp130 = tl.sqrt_rn(tmp128)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp132 = libdevice.pow(tmp123, tmp131)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp133 = 1.0
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp134 = tmp133 - tmp132
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp135 = tl.sqrt_rn(tmp134)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp136 = 0.9
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp137 = libdevice.pow(tmp136, tmp131)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp138 = tmp133 - tmp137
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp139 = tl.full([1], 1, tl.int32)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp140 = (tmp139 / tmp138)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp141 = 0.001
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp142 = tmp140 * tmp141
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp143 = -tmp142
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp144 = tmp135 * tmp143
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp145 = (tmp130 / tmp144)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp146 = (tmp139 / tmp143)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp147 = 1e-08
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp148 = tmp146 * tmp147
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp149 = tmp145 + tmp148
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp150 = (tmp121 / tmp149)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp151 = tmp129 + tmp150
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tl.store(out_ptr21 + (x3), tmp121, None)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tl.store(out_ptr22 + (x3), tmp128, None)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tl.store(out_ptr23 + (x3), tmp151, None)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     elif pid < num_xblocks_4:
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         pid_offset = pid - num_xblocks_3
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         xnumel = 1048576
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         r0_numel = 1
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         xoffset = pid_offset * XBLOCK
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         xmask = tl.full([XBLOCK], True, tl.int1)[:]
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         x4 = xindex
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp152 = tl.load(in_ptr20 + (x4), None)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp153 = tl.load(in_ptr21 + (x4), None)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp160 = tl.load(in_ptr22 + (x4), None)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp167 = tl.load(in_ptr23 + (x4), None)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp169 = in_ptr24
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp154 = tmp152 - tmp153
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp155 = 0.10000000149011612
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp156 = tmp155 * tmp154
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp157 = tl.full([1], False, tl.int1)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp158 = tl.where(tmp157, tmp152, tmp153)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp159 = tmp156 + tmp158
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp161 = 0.999
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp162 = tmp160 * tmp161
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp163 = 0.0010000000000000009
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp164 = tmp152 * tmp163
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp165 = tmp164 * tmp152
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp166 = tmp162 + tmp165
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp168 = tl.sqrt_rn(tmp166)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp170 = libdevice.pow(tmp161, tmp169)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp171 = 1.0
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp172 = tmp171 - tmp170
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp173 = tl.sqrt_rn(tmp172)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp174 = 0.9
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp175 = libdevice.pow(tmp174, tmp169)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp176 = tmp171 - tmp175
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp177 = tl.full([1], 1, tl.int32)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp178 = (tmp177 / tmp176)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp179 = 0.001
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp180 = tmp178 * tmp179
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp181 = -tmp180
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp182 = tmp173 * tmp181
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp183 = (tmp168 / tmp182)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp184 = (tmp177 / tmp181)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp185 = 1e-08
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp186 = tmp184 * tmp185
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp187 = tmp183 + tmp186
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp188 = (tmp159 / tmp187)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp189 = tmp167 + tmp188
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tl.store(out_ptr27 + (x4), tmp159, None)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tl.store(out_ptr28 + (x4), tmp166, None)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tl.store(out_ptr29 + (x4), tmp189, None)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     elif pid < num_xblocks_5:
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         pid_offset = pid - num_xblocks_4
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         xnumel = 1048576
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         r0_numel = 1
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         xoffset = pid_offset * XBLOCK
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         xmask = tl.full([XBLOCK], True, tl.int1)[:]
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         x5 = xindex
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp190 = tl.load(in_ptr25 + (x5), None)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp191 = tl.load(in_ptr26 + (x5), None)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp198 = tl.load(in_ptr27 + (x5), None)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp205 = tl.load(in_ptr28 + (x5), None)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp207 = in_ptr29
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp192 = tmp190 - tmp191
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp193 = 0.10000000149011612
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp194 = tmp193 * tmp192
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp195 = tl.full([1], False, tl.int1)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp196 = tl.where(tmp195, tmp190, tmp191)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp197 = tmp194 + tmp196
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp199 = 0.999
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp200 = tmp198 * tmp199
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp201 = 0.0010000000000000009
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp202 = tmp190 * tmp201
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp203 = tmp202 * tmp190
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp204 = tmp200 + tmp203
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp206 = tl.sqrt_rn(tmp204)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp208 = libdevice.pow(tmp199, tmp207)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp209 = 1.0
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp210 = tmp209 - tmp208
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp211 = tl.sqrt_rn(tmp210)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp212 = 0.9
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp213 = libdevice.pow(tmp212, tmp207)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp214 = tmp209 - tmp213
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp215 = tl.full([1], 1, tl.int32)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp216 = (tmp215 / tmp214)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp217 = 0.001
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp218 = tmp216 * tmp217
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp219 = -tmp218
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp220 = tmp211 * tmp219
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp221 = (tmp206 / tmp220)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp222 = (tmp215 / tmp219)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp223 = 1e-08
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp224 = tmp222 * tmp223
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp225 = tmp221 + tmp224
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp226 = (tmp197 / tmp225)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp227 = tmp205 + tmp226
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tl.store(out_ptr33 + (x5), tmp197, None)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tl.store(out_ptr34 + (x5), tmp204, None)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tl.store(out_ptr35 + (x5), tmp227, None)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     elif pid < num_xblocks_6:
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         pid_offset = pid - num_xblocks_5
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         xnumel = 1048576
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         r0_numel = 1
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         xoffset = pid_offset * XBLOCK
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         xmask = tl.full([XBLOCK], True, tl.int1)[:]
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         x6 = xindex
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp228 = tl.load(in_ptr30 + (x6), None)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp229 = tl.load(in_ptr31 + (x6), None)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp236 = tl.load(in_ptr32 + (x6), None)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp243 = tl.load(in_ptr33 + (x6), None)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp245 = in_ptr34
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp230 = tmp228 - tmp229
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp231 = 0.10000000149011612
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp232 = tmp231 * tmp230
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp233 = tl.full([1], False, tl.int1)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp234 = tl.where(tmp233, tmp228, tmp229)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp235 = tmp232 + tmp234
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp237 = 0.999
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp238 = tmp236 * tmp237
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp239 = 0.0010000000000000009
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp240 = tmp228 * tmp239
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp241 = tmp240 * tmp228
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp242 = tmp238 + tmp241
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp244 = tl.sqrt_rn(tmp242)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp246 = libdevice.pow(tmp237, tmp245)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp247 = 1.0
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp248 = tmp247 - tmp246
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp249 = tl.sqrt_rn(tmp248)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp250 = 0.9
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp251 = libdevice.pow(tmp250, tmp245)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp252 = tmp247 - tmp251
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp253 = tl.full([1], 1, tl.int32)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp254 = (tmp253 / tmp252)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp255 = 0.001
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp256 = tmp254 * tmp255
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp257 = -tmp256
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp258 = tmp249 * tmp257
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp259 = (tmp244 / tmp258)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp260 = (tmp253 / tmp257)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp261 = 1e-08
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp262 = tmp260 * tmp261
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp263 = tmp259 + tmp262
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp264 = (tmp235 / tmp263)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp265 = tmp243 + tmp264
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tl.store(out_ptr39 + (x6), tmp235, None)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tl.store(out_ptr40 + (x6), tmp242, None)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tl.store(out_ptr41 + (x6), tmp265, None)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     elif pid < num_xblocks_7:
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         pid_offset = pid - num_xblocks_6
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         xnumel = 1048576
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         r0_numel = 1
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         xoffset = pid_offset * XBLOCK
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         xmask = tl.full([XBLOCK], True, tl.int1)[:]
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         x7 = xindex
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp266 = tl.load(in_ptr35 + (x7), None)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp267 = tl.load(in_ptr36 + (x7), None)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp274 = tl.load(in_ptr37 + (x7), None)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp281 = tl.load(in_ptr38 + (x7), None)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp283 = in_ptr39
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp268 = tmp266 - tmp267
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp269 = 0.10000000149011612
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp270 = tmp269 * tmp268
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp271 = tl.full([1], False, tl.int1)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp272 = tl.where(tmp271, tmp266, tmp267)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp273 = tmp270 + tmp272
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp275 = 0.999
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp276 = tmp274 * tmp275
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp277 = 0.0010000000000000009
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp278 = tmp266 * tmp277
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp279 = tmp278 * tmp266
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp280 = tmp276 + tmp279
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp282 = tl.sqrt_rn(tmp280)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp284 = libdevice.pow(tmp275, tmp283)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp285 = 1.0
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp286 = tmp285 - tmp284
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp287 = tl.sqrt_rn(tmp286)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp288 = 0.9
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp289 = libdevice.pow(tmp288, tmp283)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp290 = tmp285 - tmp289
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp291 = tl.full([1], 1, tl.int32)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp292 = (tmp291 / tmp290)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp293 = 0.001
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp294 = tmp292 * tmp293
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp295 = -tmp294
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp296 = tmp287 * tmp295
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp297 = (tmp282 / tmp296)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp298 = (tmp291 / tmp295)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp299 = 1e-08
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp300 = tmp298 * tmp299
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp301 = tmp297 + tmp300
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp302 = (tmp273 / tmp301)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp303 = tmp281 + tmp302
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tl.store(out_ptr45 + (x7), tmp273, None)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tl.store(out_ptr46 + (x7), tmp280, None)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tl.store(out_ptr47 + (x7), tmp303, None)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     elif pid < num_xblocks_8:
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         pid_offset = pid - num_xblocks_7
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         xnumel = 1048576
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         r0_numel = 1
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         xoffset = pid_offset * XBLOCK
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         xmask = tl.full([XBLOCK], True, tl.int1)[:]
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         x8 = xindex
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp304 = tl.load(in_ptr40 + (x8), None)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp305 = tl.load(in_ptr41 + (x8), None)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp312 = tl.load(in_ptr42 + (x8), None)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp319 = tl.load(in_ptr43 + (x8), None)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp321 = in_ptr44
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp306 = tmp304 - tmp305
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp307 = 0.10000000149011612
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp308 = tmp307 * tmp306
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp309 = tl.full([1], False, tl.int1)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp310 = tl.where(tmp309, tmp304, tmp305)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp311 = tmp308 + tmp310
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp313 = 0.999
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp314 = tmp312 * tmp313
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp315 = 0.0010000000000000009
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp316 = tmp304 * tmp315
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp317 = tmp316 * tmp304
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp318 = tmp314 + tmp317
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp320 = tl.sqrt_rn(tmp318)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp322 = libdevice.pow(tmp313, tmp321)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp323 = 1.0
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp324 = tmp323 - tmp322
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp325 = tl.sqrt_rn(tmp324)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp326 = 0.9
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp327 = libdevice.pow(tmp326, tmp321)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp328 = tmp323 - tmp327
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp329 = tl.full([1], 1, tl.int32)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp330 = (tmp329 / tmp328)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp331 = 0.001
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp332 = tmp330 * tmp331
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp333 = -tmp332
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp334 = tmp325 * tmp333
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp335 = (tmp320 / tmp334)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp336 = (tmp329 / tmp333)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp337 = 1e-08
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp338 = tmp336 * tmp337
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp339 = tmp335 + tmp338
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp340 = (tmp311 / tmp339)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp341 = tmp319 + tmp340
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tl.store(out_ptr51 + (x8), tmp311, None)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tl.store(out_ptr52 + (x8), tmp318, None)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tl.store(out_ptr53 + (x8), tmp341, None)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     elif pid < num_xblocks_9:
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         pid_offset = pid - num_xblocks_8
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         xnumel = 1048576
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         r0_numel = 1
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         xoffset = pid_offset * XBLOCK
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         xmask = tl.full([XBLOCK], True, tl.int1)[:]
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         x9 = xindex
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp342 = tl.load(in_ptr45 + (x9), None)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp343 = tl.load(in_ptr46 + (x9), None)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp350 = tl.load(in_ptr47 + (x9), None)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp357 = tl.load(in_ptr48 + (x9), None)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp359 = in_ptr49
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp344 = tmp342 - tmp343
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp345 = 0.10000000149011612
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp346 = tmp345 * tmp344
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp347 = tl.full([1], False, tl.int1)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp348 = tl.where(tmp347, tmp342, tmp343)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp349 = tmp346 + tmp348
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp351 = 0.999
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp352 = tmp350 * tmp351
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp353 = 0.0010000000000000009
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp354 = tmp342 * tmp353
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp355 = tmp354 * tmp342
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp356 = tmp352 + tmp355
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp358 = tl.sqrt_rn(tmp356)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp360 = libdevice.pow(tmp351, tmp359)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp361 = 1.0
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp362 = tmp361 - tmp360
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp363 = tl.sqrt_rn(tmp362)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp364 = 0.9
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp365 = libdevice.pow(tmp364, tmp359)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp366 = tmp361 - tmp365
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp367 = tl.full([1], 1, tl.int32)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp368 = (tmp367 / tmp366)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp369 = 0.001
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp370 = tmp368 * tmp369
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp371 = -tmp370
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp372 = tmp363 * tmp371
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp373 = (tmp358 / tmp372)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp374 = (tmp367 / tmp371)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp375 = 1e-08
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp376 = tmp374 * tmp375
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp377 = tmp373 + tmp376
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp378 = (tmp349 / tmp377)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tmp379 = tmp357 + tmp378
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tl.store(out_ptr57 + (x9), tmp349, None)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tl.store(out_ptr58 + (x9), tmp356, None)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         tl.store(out_ptr59 + (x9), tmp379, None)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     else:
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         pass
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code] ''', device_str='cuda')
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code] async_compile.wait(globals())
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code] del async_compile
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code] class Runner:
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     def __init__(self, partitions):
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         self.partitions = partitions
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     def recursively_apply_fns(self, fns):
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         new_callables = []
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         for fn, c in zip(fns, self.partitions):
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             new_callables.append(fn(c))
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] 
[0/1] [__output_code]         self.partitions = new_callablesV0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     def call(self, args):V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         arg0_1, arg1_1, arg2_1, arg3_1, arg4_1, arg5_1, arg6_1, arg7_1, arg8_1, arg9_1, arg10_1, arg11_1, arg12_1, arg13_1, arg14_1, arg15_1, arg16_1, arg17_1, arg18_1, arg19_1, arg20_1, arg21_1, arg22_1, arg23_1, arg24_1, arg25_1, arg26_1, arg27_1, arg28_1, arg29_1, arg30_1, arg31_1, arg32_1, arg33_1, arg34_1, arg35_1, arg36_1, arg37_1, arg38_1, arg39_1, arg40_1, arg41_1, arg42_1, arg43_1, arg44_1, arg45_1, arg46_1, arg47_1, arg48_1, arg49_1 = argsV0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         args.clear()V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         assert_size_stride(arg0_1, (1024, 1024), (1024, 1))V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         assert_size_stride(arg1_1, (1024, 1024), (1024, 1))V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         assert_size_stride(arg2_1, (1024, 1024), (1024, 1))V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         assert_size_stride(arg3_1, (1024, 1024), (1024, 1))V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         assert_size_stride(arg4_1, (1024, 1024), (1024, 1))V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         assert_size_stride(arg5_1, (1024, 1024), (1024, 1))V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         assert_size_stride(arg6_1, (1024, 1024), (1024, 1))V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         assert_size_stride(arg7_1, (1024, 1024), (1024, 
1))V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         assert_size_stride(arg8_1, (1024, 1024), (1024, 1))V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         assert_size_stride(arg9_1, (1024, 1024), (1024, 1))V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         assert_size_stride(arg10_1, (), ())V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         assert_size_stride(arg11_1, (), ())V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         assert_size_stride(arg12_1, (), ())V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         assert_size_stride(arg13_1, (), ())V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         assert_size_stride(arg14_1, (), ())V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         assert_size_stride(arg15_1, (), ())V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         assert_size_stride(arg16_1, (), ())V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         assert_size_stride(arg17_1, (), ())V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         assert_size_stride(arg18_1, (), ())V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         assert_size_stride(arg19_1, (), ())V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         assert_size_stride(arg20_1, (1024, 1024), (1024, 1))V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         assert_size_stride(arg21_1, (1024, 1024), (1024, 1))V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         assert_size_stride(arg22_1, (1024, 1024), (1024, 1))V0219 17:08:24.508000 23788 
torch/_inductor/graph.py:2469] [0/1] [__output_code]         assert_size_stride(arg23_1, (1024, 1024), (1024, 1))V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         assert_size_stride(arg24_1, (1024, 1024), (1024, 1))V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         assert_size_stride(arg25_1, (1024, 1024), (1024, 1))V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         assert_size_stride(arg26_1, (1024, 1024), (1024, 1))V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         assert_size_stride(arg27_1, (1024, 1024), (1024, 1))V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         assert_size_stride(arg28_1, (1024, 1024), (1024, 1))V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         assert_size_stride(arg29_1, (1024, 1024), (1024, 1))V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         assert_size_stride(arg30_1, (1024, 1024), (1024, 1))V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         assert_size_stride(arg31_1, (1024, 1024), (1024, 1))V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         assert_size_stride(arg32_1, (1024, 1024), (1024, 1))V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         assert_size_stride(arg33_1, (1024, 1024), (1024, 1))V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         assert_size_stride(arg34_1, (1024, 1024), (1024, 1))V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         assert_size_stride(arg35_1, (1024, 1024), (1024, 1))V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         assert_size_stride(arg36_1, (1024, 1024), (1024, 1))V0219 17:08:24.508000 23788 
torch/_inductor/graph.py:2469] [0/1] [__output_code]         assert_size_stride(arg37_1, (1024, 1024), (1024, 1))V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         assert_size_stride(arg38_1, (1024, 1024), (1024, 1))V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         assert_size_stride(arg39_1, (1024, 1024), (1024, 1))V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         assert_size_stride(arg40_1, (1024, 1024), (1024, 1))V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         assert_size_stride(arg41_1, (1024, 1024), (1024, 1))V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         assert_size_stride(arg42_1, (1024, 1024), (1024, 1))V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         assert_size_stride(arg43_1, (1024, 1024), (1024, 1))V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         assert_size_stride(arg44_1, (1024, 1024), (1024, 1))V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         assert_size_stride(arg45_1, (1024, 1024), (1024, 1))V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         assert_size_stride(arg46_1, (1024, 1024), (1024, 1))V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         assert_size_stride(arg47_1, (1024, 1024), (1024, 1))V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         assert_size_stride(arg48_1, (1024, 1024), (1024, 1))V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         assert_size_stride(arg49_1, (1024, 1024), (1024, 1))V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         cpp_fused__foreach_copy_0(arg10_1, arg11_1, arg12_1, arg13_1, arg14_1, arg15_1, arg16_1, arg17_1, 
arg18_1, arg19_1, arg10_1, arg11_1, arg12_1, arg13_1, arg14_1, arg15_1, arg16_1, arg17_1, arg18_1, arg19_1)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         with torch.cuda._DeviceGuard(0):V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             torch.cuda.set_device(0)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             # Unsorted Source Nodes: [], Original ATen: []V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             stream0 = get_raw_stream(0)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             triton_for_fused_1.run(arg30_1, arg20_1, arg40_1, arg0_1, arg10_1.item(), arg31_1, arg21_1, arg41_1, arg1_1, arg11_1.item(), arg32_1, arg22_1, arg42_1, arg2_1, arg12_1.item(), arg33_1, arg23_1, arg43_1, arg3_1, arg13_1.item(), arg34_1, arg24_1, arg44_1, arg4_1, arg14_1.item(), arg35_1, arg25_1, arg45_1, arg5_1, arg15_1.item(), arg36_1, arg26_1, arg46_1, arg6_1, arg16_1.item(), arg37_1, arg27_1, arg47_1, arg7_1, arg17_1.item(), arg38_1, arg28_1, arg48_1, arg8_1, arg18_1.item(), arg39_1, arg29_1, arg49_1, arg9_1, arg19_1.item(), arg20_1, arg40_1, arg0_1, arg21_1, arg41_1, arg1_1, arg22_1, arg42_1, arg2_1, arg23_1, arg43_1, arg3_1, arg24_1, arg44_1, arg4_1, arg25_1, arg45_1, arg5_1, arg26_1, arg46_1, arg6_1, arg27_1, arg47_1, arg7_1, arg28_1, arg48_1, arg8_1, arg29_1, arg49_1, arg9_1, stream=stream0)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             del arg0_1V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             del arg10_1V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             del arg11_1V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             del arg12_1V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] 
[__output_code]             del arg13_1V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             del arg14_1V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             del arg15_1V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             del arg16_1V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             del arg17_1V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             del arg18_1V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             del arg19_1V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             del arg1_1V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             del arg20_1V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             del arg21_1V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             del arg22_1V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             del arg23_1V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             del arg24_1V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             del arg25_1V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             del arg26_1V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             del arg27_1V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             del arg28_1V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             del arg29_1V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             del arg2_1V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]           
  del arg30_1V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             del arg31_1V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             del arg32_1V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             del arg33_1V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             del arg34_1V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             del arg35_1V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             del arg36_1V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             del arg37_1V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             del arg38_1V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             del arg39_1V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             del arg3_1V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             del arg40_1V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             del arg41_1V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             del arg42_1V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             del arg43_1V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             del arg44_1V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             del arg45_1V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             del arg46_1V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             del arg47_1V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             del arg48_1V0219 
17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             del arg49_1V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             del arg4_1V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             del arg5_1V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             del arg6_1V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             del arg7_1V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             del arg8_1V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]             del arg9_1V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]         return ()V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code] runner = Runner(partitions=[])V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code] call = runner.callV0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code] recursively_apply_fns = runner.recursively_apply_fnsV0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code] def benchmark_compiled_module(times=10, repeat=10):V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     from torch._dynamo.testing import rand_stridedV0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     from torch._inductor.utils import print_performanceV0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     arg0_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)V0219 17:08:24.508000 23788 
torch/_inductor/graph.py:2469] [0/1] [__output_code]     arg1_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     arg2_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     arg3_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     arg4_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     arg5_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     arg6_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     arg7_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     arg8_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     arg9_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     arg10_1 = rand_strided((), (), device='cpu', dtype=torch.float32)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     arg11_1 = rand_strided((), (), device='cpu', dtype=torch.float32)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     arg12_1 = rand_strided((), (), device='cpu', dtype=torch.float32)V0219 17:08:24.508000 23788 
torch/_inductor/graph.py:2469] [0/1] [__output_code]     arg13_1 = rand_strided((), (), device='cpu', dtype=torch.float32)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     arg14_1 = rand_strided((), (), device='cpu', dtype=torch.float32)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     arg15_1 = rand_strided((), (), device='cpu', dtype=torch.float32)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     arg16_1 = rand_strided((), (), device='cpu', dtype=torch.float32)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     arg17_1 = rand_strided((), (), device='cpu', dtype=torch.float32)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     arg18_1 = rand_strided((), (), device='cpu', dtype=torch.float32)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     arg19_1 = rand_strided((), (), device='cpu', dtype=torch.float32)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     arg20_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     arg21_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     arg22_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     arg23_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     arg24_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     arg25_1 = rand_strided((1024, 1024), 
(1024, 1), device='cuda:0', dtype=torch.float32)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     arg26_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     arg27_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     arg28_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     arg29_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     arg30_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     arg31_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     arg32_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     arg33_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     arg34_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     arg35_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     arg36_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]  
   arg37_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     arg38_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     arg39_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     arg40_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     arg41_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     arg42_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     arg43_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     arg44_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     arg45_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     arg46_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     arg47_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     arg48_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)V0219 17:08:24.508000 23788 
torch/_inductor/graph.py:2469] [0/1] [__output_code]     arg49_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     fn = lambda: call([arg0_1, arg1_1, arg2_1, arg3_1, arg4_1, arg5_1, arg6_1, arg7_1, arg8_1, arg9_1, arg10_1, arg11_1, arg12_1, arg13_1, arg14_1, arg15_1, arg16_1, arg17_1, arg18_1, arg19_1, arg20_1, arg21_1, arg22_1, arg23_1, arg24_1, arg25_1, arg26_1, arg27_1, arg28_1, arg29_1, arg30_1, arg31_1, arg32_1, arg33_1, arg34_1, arg35_1, arg36_1, arg37_1, arg38_1, arg39_1, arg40_1, arg41_1, arg42_1, arg43_1, arg44_1, arg45_1, arg46_1, arg47_1, arg48_1, arg49_1])
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     return print_performance(fn, times=times, repeat=repeat)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code] if __name__ == "__main__":
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     from torch._inductor.wrapper_benchmark import compiled_module_main
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]     compiled_module_main('None', benchmark_compiled_module)
V0219 17:08:24.508000 23788 torch/_inductor/graph.py:2469] [0/1] [__output_code]
V0219 17:08:24.555000 23788 torch/_inductor/graph.py:2480] [0/1] [__output_code] Output code written to: /tmp/torchinductor_ci-user/ux/cuxhxe2uod67dudxwleao7vhfdxcbknp46g57v4y2g2humuvkdsg.py
I0219 17:08:24.695000 23788 torch/_inductor/graph.py:2440] [0/1] [__output_code] Output code written to: /tmp/torchinductor_ci-user/ux/cuxhxe2uod67dudxwleao7vhfdxcbknp46g57v4y2g2humuvkdsg.py
eager runtime: 1201.9012949997432us
compiled runtime: 767.0238122540524us
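Each branch of the generated kernel above performs the same per-element arithmetic: a lerp for exp_avg, a scaled update for exp_avg_sq, and the bias-corrected parameter update. The following is a minimal sketch of that arithmetic on plain Python floats, checked against the textbook Adam formula. The function names kernel_style_adam and textbook_adam are hypothetical and chosen for illustration only; the sketch assumes weight_decay=0 and that step has already been incremented.

```python
import math


def kernel_style_adam(param, grad, exp_avg, exp_avg_sq, step,
                      beta1=0.9, beta2=0.999, lr=1e-3, eps=1e-8):
    """Scalar version of the arithmetic seen in one branch of the fused kernel.

    Hypothetical helper for illustration; it is not part of PyTorch.
    """
    # lerp(exp_avg, grad, 1 - beta1)
    new_exp_avg = exp_avg + (1 - beta1) * (grad - exp_avg)
    # exp_avg_sq * beta2 + (1 - beta2) * grad * grad
    new_exp_avg_sq = exp_avg_sq * beta2 + (1 - beta2) * grad * grad
    bias_correction1 = 1 - beta1 ** step
    bias_correction2 = math.sqrt(1 - beta2 ** step)
    # The step size is negated, so the final update is an addition.
    step_size = -(lr / bias_correction1)
    denom = (math.sqrt(new_exp_avg_sq) / (bias_correction2 * step_size)
             + eps / step_size)
    new_param = param + new_exp_avg / denom
    return new_param, new_exp_avg, new_exp_avg_sq


def textbook_adam(param, grad, exp_avg, exp_avg_sq, step,
                  beta1=0.9, beta2=0.999, lr=1e-3, eps=1e-8):
    """Standard Adam update with bias-corrected moment estimates."""
    m = beta1 * exp_avg + (1 - beta1) * grad
    v = beta2 * exp_avg_sq + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** step)
    v_hat = v / (1 - beta2 ** step)
    new_param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return new_param, m, v


if __name__ == "__main__":
    args = (0.5, 0.2, 0.0, 0.0, 1)  # param, grad, exp_avg, exp_avg_sq, step
    print("kernel-style:", kernel_style_adam(*args))
    print("textbook:    ", textbook_adam(*args))
```

The two forms agree up to floating-point rounding. Folding the negated step size into the denominator lets the parameter update be written as a single add of a quotient, which matches the final `tmp357 + tmp378` in the kernel.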

Conclusion#

In this tutorial, we implemented a custom, fully fused Adam optimizer using foreach_map. Expressing each state update as a pointwise function and compiling the result with torch.compile lets Inductor emit a single horizontally fused Triton kernel (triton_for_fused_1 in the output above) that updates the state of every parameter. In the run shown above, the compiled runtime was roughly 767 us versus roughly 1202 us for eager, about a 1.57x speedup on that machine. The same pattern applies to other pointwise, per-tensor updates that can benefit from horizontal fusion.


Total running time of the script: (0 minutes 10.493 seconds)