Part 2: CUDA Implementation

Fig. 24: Inline and lookaside acceleration in the gNB DU processing pipeline. The figure is from [Kundu2023B]. Note that the DGX Spark and Jetson AGX Thor platforms use a unified memory architecture, which can be seen as a hybrid of inline and lookaside memory management.
The second part of this tutorial focuses on the implementation of the LDPC decoder using CUDA and explains common pitfalls when offloading compute-intensive functions to the GPU. Further, we show how the GPU-accelerated LDPC decoder can be integrated into the OAI stack.
Function offloading to dedicated accelerators requires careful consideration of the data flow with respect to the underlying memory architecture. This is especially critical for wireless communications applications, where strict latency requirements in the receiver processing pipeline necessitate efficient memory management when utilizing dedicated signal processing accelerators.
As shown in Fig. 24, one needs to distinguish between inline and lookaside acceleration. While lookaside acceleration moves data between the CPU and the hardware accelerator, inline acceleration avoids such data movement by running the entire processing pipeline on the hardware accelerator.
Strictly speaking, DGX Spark still moves data between the CPU and GPU. However, the overhead is significantly reduced compared to traditional split-memory architectures as DGX Spark shares the same physical memory between CPU and GPU. This has implications for the caching behavior and requires a careful implementation of the CUDA kernels to avoid performance degradation. Nevertheless, we will consider it as inline acceleration for the following discussion.
For further details on CUDA, we refer to the NVIDIA CUDA Programming Guide [CUDA2024] and, for the Jetson platform, to the CUDA for Tegra Memory Model [Tegra2024].
Overview

Fig. 25: Overview of the CUDA implementation of the LDPC BP decoding algorithm.
The CUDA implementation can be found in plugins/ldpc_cuda/src/runtime/ldpc_decoder.cu. The core decoding algorithm is implemented in the update_cn_kernel(.) and update_vn_kernel(.) functions. Both kernels are called iteratively and perform the check node (CN) and variable node (VN) updates, respectively. The decoder stops when the maximum number of iterations is reached. An additional early stopping condition could also be applied to reduce the average number of iterations.
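As a hedged sketch (not part of the current implementation), such an early stopping condition could reuse the compute_syndrome_kernel(.) that is already launched for the optional syndrome check (variable names as in the listing shown later in this tutorial) and terminate the iteration loop once all parity checks are satisfied:

// Hedged sketch of an early stopping condition based on the existing
// compute_syndrome_kernel. The per-iteration synchronization adds some latency
// but reduces the average number of iterations.
for (uint32_t i = 0; i < num_iter; ++i) {
    // ... update_cn_kernel / update_vn_kernel launches as in the main loop ...

    dim3 threads(512);
    dim3 blocks_cn(blocks_for(num_cn, threads.x));
    compute_syndrome_kernel<<<blocks_cn, threads, 0, stream>>>(
        context.llr_total_buffer, context.syndrome_buffer,
        Z, bg.cn, bg.cn_degree, bg.cn_stride, bg.num_rows);
    cudaStreamSynchronize(stream);

    // assumes the mapped (unified-memory) syndrome buffer, readable on the host after the sync
    bool all_checks_satisfied = true;
    for (uint32_t w = 0; w < num_cn / 32; ++w)
        all_checks_satisfied &= (context.syndrome_buffer[w] == 0);
    if (all_checks_satisfied)
        break;  // stop early: a valid codeword has been found
}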
The pack_bits_kernel(.) kernel maps the soft values to hard-decided bits and packs them into the compact byte representation required by the OAI processing pipeline.
CUDA Integration in OAI
The LDPC decoder is implemented as a shared library that can be loaded using the OAI shared library loader. The implementation is in plugins/ldpc_cuda/src/runtime/ldpc_decoder.cu. After modifying it, rebuild the Docker images:
./scripts/build-oai-images.sh
Running the Decoder
The CUDA-based decoder can be used as a drop-in replacement for the existing decoder implementations. It can be loaded when running the gNB via the following GNB_EXTRA_OPTIONS in the .env file of the config folder.
GNB_EXTRA_OPTIONS=--loader.ldpc.shlibversion _cuda
We strongly recommend additionally assigning dedicated CPU cores to PHY-layer processing via the thread-pool option, as shown below. This assigns cores 5-9 to the PHY-layer thread pool. Note that the lower CPU cores handle USRP-related tasks such as time synchronization.
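A possible setting in the .env file could look as follows (a sketch: the exact syntax of OAI's thread-pool option may differ between OAI versions; adjust the core list to your platform):

GNB_EXTRA_OPTIONS=--loader.ldpc.shlibversion _cuda --thread-pool 5,6,7,8,9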
You can now start the gNB with the CUDA-based decoder by running
scripts/start_system.sh rfsim   # replace with b200 or other config folder
The GPU load can be monitored via
nvidia-smi   # or jtop on Jetson devices
Congratulations! You have now successfully accelerated the LDPC decoder using CUDA.
Check the gNB logs for the CUDA decoder:
docker logs oai-gnb
You should see the CUDA decoder being used (requires iperf3 traffic to trigger the decoder):
...
[NR_PHY]   I {L1_rx_thread} CUDA LDPC decoder: 123.55 us (96.54 us/seg)
...
Implementation Aspects
The following sections focus on various technical aspects of the CUDA implementation and the performance implications of different memory transfer patterns.
For debugging and profiling, please refer to the tutorial on Debugging & Troubleshooting.
Memory Management
In order to offload compute-intensive processing to the GPU, data needs to be shared between the host (CPU) and the device (GPU). We can leverage the shared system memory architecture of DGX Spark and Jetson platforms to avoid the bottleneck of costly memory transfers on traditional split-memory platforms.
In fact, we can avoid the overhead of any complex API calls and memory transition operations by allocating page-locked memory on the host using cudaHostAlloc. To make input LLRs visible to the GPU, we then use a simple memcpy() operation to copy inputs from the 5G stack into such a page-locked buffer, where unified memory addressing allows direct reads and writes both on the host and the device. For output bits, we first synchronize the parallel CUDA command stream and then simply use memcpy() to copy packed bits from the shared page-locked buffer into the 5G stack output buffer.
This simplicity is enabled by the unified memory architecture. On DGX Spark (Grace Blackwell), hardware cache coherency via NVLink-C2C ensures efficient data sharing. On Jetson platforms, the cache semantics [Tegra2024] require host caches to be active while device memory caching is disabled on page-locked buffers. For optimal performance on both platforms, we ensure maximally coalesced read-once and write-once memory access patterns (addressing consecutive memory in consecutive threads).
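A minimal sketch of this pattern (buffer and variable names are illustrative; the actual code wraps the allocation in small helpers) could look as follows:

// Hedged sketch of the page-locked (pinned) buffer pattern described above.
int8_t* host_llr_in = nullptr;
CHECK_CUDA(cudaHostAlloc((void**)&host_llr_in,
                         MAX_BG_COLS * MAX_Z * sizeof(int8_t),
                         cudaHostAllocMapped | cudaHostAllocWriteCombined));

// With unified addressing the host pointer can be used directly in kernels;
// otherwise the device alias is obtained explicitly.
int8_t* dev_llr_in = nullptr;
CHECK_CUDA(cudaHostGetDevicePointer((void**)&dev_llr_in, host_llr_in, 0));

// host side: copy the LLRs from the 5G stack into the page-locked buffer
memcpy(host_llr_in, llr_in, num_llrs * sizeof(int8_t));
// device side: kernels read dev_llr_in (equal to host_llr_in with unified addressing)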
Traditional memory transfer patterns designed for split-memory architectures can also be used on DGX Spark, but can lead to higher overheads depending on the exact architecture and generation. Explicit memory transfer calls such as cudaMemcpyAsync(inputs..., cudaMemcpyHostToDevice, stream) and cudaMemcpyAsync(outputs..., cudaMemcpyDeviceToHost, stream) incur additional API overheads and may depend on the availability of additional copy engines; explicit transitioning of managed pageable memory allocated by cudaMallocManaged() via cudaStreamAttachMemAsync() may also require excessive API overhead for small buffer sizes. Therefore, we instead use the patterns described above to optimize for shared system memory and low latency, particularly since input and output buffers are typically small in the real-time 5G stack.
Note
On the DGX Spark architecture, cudaMallocManaged() can be even more efficient for some use cases. For an example implementation, see the 5G NR PUSCH Neural Receiver tutorial.
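A minimal sketch of this managed-memory alternative (illustrative buffer name and size; see the referenced tutorial for a complete implementation):

// Hedged sketch: a managed buffer is allocated once and used from both host
// and device without explicit copies; the same pointer is valid on both sides.
int8_t* llr_buffer = nullptr;
CHECK_CUDA(cudaMallocManaged((void**)&llr_buffer, MAX_BG_COLS * MAX_Z * sizeof(int8_t)));
// host: write input LLRs to llr_buffer ...
// device: launch kernels that read/write llr_buffer on the stream ...
CHECK_CUDA(cudaFree(llr_buffer));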
For comparison, we show both variants side-by-side in the following code from the ldpc_decoder.cu file, where the latency-optimized code path is the one with USE_UNIFIED_MEMORY defined. The listing covers the complete decode path, including initialization, shutdown, and the Python binding used for testing:
// copy input data to device-visible memory
#ifdef USE_UNIFIED_MEMORY
memcpy(const_cast<int8_t*>(mapped_llr_in), llr_in, num_llrs * sizeof(*llr_in));
#else
CHECK_CUDA(cudaMemcpyAsync(const_cast<int8_t*>(mapped_llr_in), llr_in, num_llrs * sizeof(*llr_in), cudaMemcpyHostToDevice, stream));
#endif
// END marker-copy-input

#ifdef PRINT_TIMES
clock_gettime(TIMESTAMP_CLOCK_SOURCE, &ts_cursor);

time_ns = ts_cursor.tv_nsec - ts_begin.tv_nsec + 1000000000ll * (ts_cursor.tv_sec - ts_begin.tv_sec);
message_count = add_measurement(&input_copy_time, time_ns, 500);
if (message_count % 500 == 0) {
    time_ns = input_copy_time.avg_ns;
    printf("Input copy time: %llu us %llu ns\n", time_ns / 1000, time_ns - time_ns / 1000 * 1000);
    fflush(stdout);
}
#endif

int8_t const* llr_total = mapped_llr_in;

for (uint32_t i = 0; i < num_iter; ++i) {
    dim3 threads(NODE_KERNEL_BLOCK, UNROLL_NODES);

    // check node update
    dim3 blocks_cn(blocks_for(bg.num_rows * Z, threads.x));
    // note: llr_msg is not read, only written to, in the first iteration; it will be filled with the outputs of this function
    update_cn_kernel<<<blocks_cn, threads, 0, stream>>>(
        llr_total, context.llr_msg_buffer,
        Z, bg.cn, bg.cn_degree, bg.cn_stride, bg.num_rows, i == 0);

    // variable node update
    dim3 blocks_vn(blocks_for(bg.num_cols * Z, threads.x));
    // note: llr_total only written to
    update_vn_kernel<<<blocks_vn, threads, 0, stream>>>(
        context.llr_msg_buffer, mapped_llr_in, context.llr_total_buffer,
        Z, bg.vn, bg.vn_degree, bg.vn_stride, bg.num_cols, bg.num_rows);
    llr_total = context.llr_total_buffer;
}

uint8_t* mapped_llr_bits_out = context.llr_bits_out_buffer;

// pack bits
dim3 threads_pack(PACK_BITS_KERNEL_THREADS);
dim3 blocks_pack(blocks_for(block_length, threads_pack.x));
pack_bits_kernel<<<blocks_pack, threads_pack, 0, stream>>>(
    llr_total, mapped_llr_bits_out, block_length);
#ifndef USE_UNIFIED_MEMORY
CHECK_CUDA(cudaMemcpyAsync(llr_bits, mapped_llr_bits_out, num_out_bytes, cudaMemcpyDeviceToHost, stream));
#endif

// allow CPU access of output bits while computing syndrome
#if defined(USE_UNIFIED_MEMORY) && !defined(USE_GRAPHS)
cudaStreamSynchronize(stream);
#endif

#ifdef PRINT_TIMES
clock_gettime(TIMESTAMP_CLOCK_SOURCE, &ts_cursor);
#endif

// check syndrome if additional testing is requested
if (perform_syndrome_check) {
    dim3 threads(512);
    dim3 blocks_cn(blocks_for(num_cn, threads.x));
    compute_syndrome_kernel<<<blocks_cn, threads, 0, stream>>>(
        context.llr_total_buffer, context.syndrome_buffer,
        Z, bg.cn, bg.cn_degree, bg.cn_stride, bg.num_rows);
#ifndef USE_UNIFIED_MEMORY
    CHECK_CUDA(cudaMemcpyAsync(context.host_syndrome_buffer, context.syndrome_buffer, num_cn * sizeof(*context.syndrome_buffer) / 32, cudaMemcpyDeviceToHost, stream));
#endif
}

#ifdef USE_GRAPHS
cudaGraph_t graphUpdate = {};
CHECK_CUDA(cudaStreamEndCapture(stream, &graphUpdate));
if (context.graphCtx) {
    cudaGraphNode_t errorNode;
    cudaGraphExecUpdateResult updateResult;
    CHECK_CUDA(cudaGraphExecUpdate(context.graphCtx, graphUpdate, &errorNode, &updateResult));
}
else
    CHECK_CUDA(cudaGraphInstantiate(&context.graphCtx, graphUpdate, 0));
cudaGraphDestroy(graphUpdate);
CHECK_CUDA(cudaGraphLaunch(context.graphCtx, stream));
#endif

#if !defined(USE_UNIFIED_MEMORY) || defined(USE_GRAPHS)
// allow CPU access of output bits and syndrome
cudaStreamSynchronize(stream);
#endif

#ifdef USE_UNIFIED_MEMORY
// note: GPU synchronized before async syndrome check
memcpy(llr_bits, mapped_llr_bits_out, num_out_bytes);
#endif

#ifdef PRINT_TIMES
clock_gettime(TIMESTAMP_CLOCK_SOURCE, &ts_end);

time_ns = ts_end.tv_nsec - ts_cursor.tv_nsec + 1000000000ll * (ts_end.tv_sec - ts_cursor.tv_sec);
message_count = add_measurement(&output_copy_time, time_ns, 500);
if (message_count % 500 == 0) {
    time_ns = output_copy_time.avg_ns;
    printf("Output copy time: %llu us %llu ns\n", time_ns / 1000, time_ns - time_ns / 1000 * 1000);
    fflush(stdout);
}
#endif

if (perform_syndrome_check) {
    uint32_t* p_syndrome;
#ifdef USE_UNIFIED_MEMORY
#ifndef USE_GRAPHS
    // allow reading syndrome
    cudaStreamSynchronize(stream);
#endif
    p_syndrome = context.syndrome_buffer;
#else
    // note: already synchronized above
    p_syndrome = context.host_syndrome_buffer;
#endif

    // check any errors indicated by syndrome
    for (uint32_t i = 0; i < num_cn / 32; i++) {
        if (p_syndrome[i] != 0) {
            return num_iter + 1;
        }
    }
}

#ifdef PRINT_TIMES
clock_gettime(TIMESTAMP_CLOCK_SOURCE, &ts_end);

time_ns = ts_end.tv_nsec - ts_begin.tv_nsec + 1000000000ll * (ts_end.tv_sec - ts_begin.tv_sec);
message_count = add_measurement(&decoding_time, time_ns, 500);
if (message_count % 500 == 0) {
    time_ns = decoding_time.avg_ns;
    printf("CUDA sync runtime: %llu us %llu ns\n", time_ns / 1000, time_ns - time_ns / 1000 * 1000);
    fflush(stdout);
}
#endif

return num_iter - 1;  // note: now index of successful iteration
}

ThreadContext& ldpc_decoder_init_context(int make_stream) {
    auto& context = thread_context;
    if (context.llr_in_buffer)  // lazy
        return context;

    printf("Initializing LDPC context (TID %d)\n", (int)gettid());

    if (make_stream) {
        int highPriority = 0;
        if (cudaDeviceGetStreamPriorityRange(NULL, &highPriority))
            printf("CUDA stream priorities unsupported, %s:%d", __FILE__, __LINE__);
        CHECK_CUDA(cudaStreamCreateWithPriority(&context.stream, cudaStreamNonBlocking, highPriority));

        cudaStreamAttrValue attr = {};
        attr.syncPolicy = cudaSyncPolicyYield;
        cudaStreamSetAttribute(context.stream, cudaStreamAttributeSynchronizationPolicy, &attr);
    }

    CHECK_CUDA(cudaMallocStaging(&context.llr_in_buffer, MAX_BG_COLS * MAX_Z * sizeof(int8_t), cudaHostAllocMapped | cudaHostAllocWriteCombined));
    CHECK_CUDA(cudaMallocStaging(&context.llr_bits_out_buffer, (MAX_BLOCK_LENGTH + 7) / 8 * sizeof(uint8_t), cudaHostAllocMapped));
    CHECK_CUDA(cudaMallocStaging(&context.syndrome_buffer, MAX_BG_ROWS * MAX_Z * sizeof(uint32_t) / 32, cudaHostAllocMapped));
#ifndef USE_UNIFIED_MEMORY
    context.host_syndrome_buffer = (uint32_t*)malloc(MAX_BG_ROWS * MAX_Z * sizeof(uint8_t));
#endif
    CHECK_CUDA(cudaMalloc(&context.llr_msg_buffer, MAX_BG_ROWS * MAX_BG_COLS * MAX_Z * sizeof(llr_msg_t)));
    CHECK_CUDA(cudaMalloc(&context.llr_total_buffer, MAX_BG_COLS * MAX_Z * sizeof(llr_accumulator_t)));

    // keep track of active thread contexts for shutdown
    ThreadContext* self = &context;
    __atomic_exchange(&initialized_thread_contexts, &self, &self->next_initialized_context, __ATOMIC_ACQ_REL);

    return context;
}

extern "C" ThreadContext* ldpc_decoder_init(int make_stream) {
    if (bg_cn[0][0])  // lazy, global
        return &ldpc_decoder_init_context(make_stream);

    printf("Initializing LDPC runtime %d\n", (int)gettid());

    const uint32_t* table_bg_cn_degree[2][8] = {{BG1_CN_DEGREE_TABLE()}, {BG2_CN_DEGREE_TABLE()}};
    const uint32_t* table_bg_vn_degree[2][8] = {{BG1_VN_DEGREE_TABLE()}, {BG2_VN_DEGREE_TABLE()}};
    const uint32_t table_bg_cn_degree_size[2][8] = {{BG1_CN_DEGREE_TABLE(sizeof)}, {BG2_CN_DEGREE_TABLE(sizeof)}};
    const uint32_t table_bg_vn_degree_size[2][8] = {{BG1_VN_DEGREE_TABLE(sizeof)}, {BG2_VN_DEGREE_TABLE(sizeof)}};
    const void* table_bg_cn[2][8] = {{BG1_CN_TABLE()}, {BG2_CN_TABLE()}};
    const void* table_bg_vn[2][8] = {{BG1_VN_TABLE()}, {BG2_VN_TABLE()}};
    const uint32_t table_bg_cn_size[2][8] = {{BG1_CN_TABLE(sizeof)}, {BG2_CN_TABLE(sizeof)}};
    const uint32_t table_bg_vn_size[2][8] = {{BG1_VN_TABLE(sizeof)}, {BG2_VN_TABLE(sizeof)}};

    for (int b = 0; b < 2; ++b) {
        for (int ils = 0; ils < 8; ++ils) {
            CHECK_CUDA(cudaMalloc(&bg_cn_degree[b][ils], table_bg_cn_degree_size[b][ils]));
            CHECK_CUDA(cudaMemcpy(const_cast<uint32_t*>(bg_cn_degree[b][ils]), table_bg_cn_degree[b][ils], table_bg_cn_degree_size[b][ils], cudaMemcpyHostToDevice));
            CHECK_CUDA(cudaMalloc(&bg_vn_degree[b][ils], table_bg_vn_degree_size[b][ils]));
            CHECK_CUDA(cudaMemcpy(const_cast<uint32_t*>(bg_vn_degree[b][ils]), table_bg_vn_degree[b][ils], table_bg_vn_degree_size[b][ils], cudaMemcpyHostToDevice));

            CHECK_CUDA(cudaMalloc(&bg_cn[b][ils], table_bg_cn_size[b][ils]));
            CHECK_CUDA(cudaMemcpy(const_cast<uint32_t*>(bg_cn[b][ils]), table_bg_cn[b][ils], table_bg_cn_size[b][ils], cudaMemcpyHostToDevice));
            CHECK_CUDA(cudaMalloc(&bg_vn[b][ils], table_bg_vn_size[b][ils]));
            CHECK_CUDA(cudaMemcpy(const_cast<uint32_t*>(bg_vn[b][ils]), table_bg_vn[b][ils], table_bg_vn_size[b][ils], cudaMemcpyHostToDevice));

            bg_cn_size[b][ils] = table_bg_cn_size[b][ils];
            bg_vn_size[b][ils] = table_bg_vn_size[b][ils];
        }
    }

    return &ldpc_decoder_init_context(make_stream);
}

extern "C" void ldpc_decoder_shutdown() {
    cudaDeviceSynchronize();

    ThreadContext* active_context = nullptr;
    __atomic_exchange(&initialized_thread_contexts, &active_context, &active_context, __ATOMIC_ACQ_REL);
    while (active_context) {
        cudaFreeStaging(active_context->llr_in_buffer);
        cudaFree(active_context->llr_msg_buffer);
        cudaFreeStaging(active_context->llr_bits_out_buffer);
        cudaFree(active_context->llr_total_buffer);
        cudaFreeStaging(active_context->syndrome_buffer);
#ifndef USE_UNIFIED_MEMORY
        free(active_context->host_syndrome_buffer);
#endif
        if (active_context->stream)
            cudaStreamDestroy(active_context->stream);

#ifdef USE_GRAPHS
        cudaGraphExecDestroy(active_context->graphCtx);
#endif

        active_context = active_context->next_initialized_context;
    }

    for (int b = 0; b < 2; ++b) {
        for (int ils = 0; ils < 8; ++ils) {
            cudaFree(&bg_cn_degree[b][ils]);
            cudaFree(&bg_vn_degree[b][ils]);
            cudaFree(&bg_cn[b][ils]);
            cudaFree(&bg_vn[b][ils]);
        }
    }
}

#ifdef ENABLE_NANOBIND

#include <nanobind/nanobind.h>
#include <nanobind/ndarray.h>

namespace nb = nanobind;

NB_MODULE(ldpc_decoder, m) {
    m.def("decode", [](uint32_t BG, uint32_t Z,
                       const nb::ndarray<int8_t, nb::shape<-1>, nb::device::cpu>& llrs,
                       uint32_t block_length, uint32_t num_iter) {
        auto* context = ldpc_decoder_init(1);  // lazy

        size_t num_bytes = (block_length + 7) / 8 * 8;
        uint8_t* data = new uint8_t[num_bytes];
        memset(data, 0, num_bytes);
        nb::capsule owner(data, [](void* p) noexcept { delete[] (uint8_t*)p; });

        ldpc_decode(context, 0, BG, Z, llrs.data(),
                    block_length, data,
                    num_iter, true);

        return nb::ndarray<nb::numpy, uint8_t>(data, {num_bytes}, owner);
    });
}

#endif
We copy the input LLRs from the host to the device memory; in the split-memory (lookaside) variant, this is an explicit asynchronous transfer:

// copy input LLRs from the host to the device memory
cudaMemcpyAsync(context.llr_in_buffer, p_in, memorySize_llr_in, cudaMemcpyHostToDevice, context.stream);
After decoding, we make the output bits available via a host-side copy in parallel to an asynchronous syndrome check on the device:
// pack LDPC output bits on the device
int pack_thread_blocks = (block_length + 127) / 128;
pack_decoded_bit_kernel<<<pack_thread_blocks, 128, 0, context.stream>>>(context.dev_llr_out, context.dev_bits_out, block_length);

// allow CPU access of output bits while computing syndrome
#ifdef USE_UNIFIED_MEMORY
cudaStreamSynchronize(context.stream);

// ... schedule syndrome or CRC check ...

// while syndrome computations are running on the device, copy output bits to 5G stack output buffer
memcpy(p_out, context.dev_bits_out, memorySize_bits_out);
#else
cudaCheck(cudaMemcpyAsync(p_out, context.dev_bits_out, memorySize_bits_out, cudaMemcpyDeviceToHost, context.stream));

// ... schedule syndrome or CRC check ...

// allow CPU access of output bits and syndrome
cudaStreamSynchronize(context.stream);
#endif
Kernel Optimization
The core idea of CUDA programming is to parallelize the execution of the kernel on the GPU. The kernel is executed by a grid of blocks, each containing a number of independent threads. This level of parallelism allows for significant speedups compared to the serial execution on the CPU.
For a detailed introduction to CUDA programming, we refer to the CUDA Programming Guide. Conceptually, a CUDA kernel is defined as follows:
// Kernel definition: __global__ indicates that the function is a CUDA kernel
__global__ void my_kernel(float* data, int N) {
    // Each thread processes one element of the array
    // We calculate the global thread index from the block and thread indices
    // idx is unique for each thread
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < N) {
        // process the idx-th element of the array
        data[idx] = ...;
    }
}
This kernel can now be launched by specifying its grid and block dimensions via
// Launch the kernel
// <<<1, 32>>> specifies the grid and block sizes
my_kernel<<<1, 32>>>(d_data, N);
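Since kernel launches are asynchronous and do not report errors directly, a common practice (general CUDA usage, not specific to this decoder) is to check for launch and execution errors explicitly, e.g., using the CHECK_CUDA macro from the plugin or any equivalent error-checking helper:

// Check for kernel launch and execution errors (general CUDA practice).
my_kernel<<<1, 32>>>(d_data, N);
CHECK_CUDA(cudaGetLastError());        // reports invalid launch configurations
CHECK_CUDA(cudaDeviceSynchronize());   // waits for completion and reports execution errors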
For the case of LDPC decoding, the decoder can be parallelized over the variable node (VN) and check node (CN) updates, respectively. An overview of the CUDA implementation is given in Fig. 25. The VN update kernel is given as
__launch_bounds__(UNROLL_NODES * NODE_KERNEL_BLOCK, 3)
static __global__ void update_vn_kernel(llr_msg_t const* __restrict__ llr_msg, int8_t const* __restrict__ llr_ch, llr_accumulator_t* __restrict__ llr_total,
                                        uint32_t Z, uint32_t const* __restrict__ bg_vn, uint32_t const* __restrict__ bg_vn_degree, uint32_t max_degree, uint32_t num_cols, uint32_t num_rows) {
    uint32_t tid = blockIdx.x * blockDim.x + threadIdx.x;

    uint32_t i = tid % Z;        // for i in range(Z)
    uint32_t idx_col = tid / Z;  // for idx_col in range(num_cols)

    uint32_t vn_degree = bg_vn_degree[idx_col];

    // list of tuples (idx_row = index_cn, s)
    // idx_col = idx_vn and msg_offset omitted,
    // msg spread out to idx_cn + idx_vn * num_cn
    uint32_t const* variable_nodes = &bg_vn[idx_col * max_degree];  // len(vn) = vn_degree

#if UNROLL_NODES > 1
    __shared__ int msg_sums[UNROLL_NODES][NODE_KERNEL_BLOCK + 1];
    msg_sums[threadIdx.y][threadIdx.x] = 0;
#endif

    __syncwarp();

    int msg_sum = 0;
    // accumulate all incoming LLRs
    if (idx_col < num_cols) {
        for (uint32_t j = threadIdx.y; j < vn_degree; j += UNROLL_NODES) {
            uint32_t vn = variable_nodes[j];

            // see packing layout above
            uint32_t idx_row = vn & 0xffffu;  // note: little endian
            uint32_t s = vn >> 16;            // ...
            uint32_t msg_offset = idx_row + idx_col * num_rows;

            // index of the msg in the LLR array
            // it is the idx_col-th variable node, and the j-th message from the idx_row-th check node
            uint32_t msg_idx = msg_offset * Z + (i - s + (Z << 8)) % Z;

            // accumulate all incoming LLRs
            msg_sum += llr_msg[msg_idx];
        }
    }

    // add the channel LLRs
    __syncwarp();
    if (threadIdx.y == 0) {
        if (idx_col < num_cols)
            msg_sum += llr_ch[idx_col * Z + i];
    }

    msg_sum = min(max(msg_sum, -MAX_LLR_ACCUMULATOR_VALUE), MAX_LLR_ACCUMULATOR_VALUE);

#if UNROLL_NODES > 1
    msg_sums[threadIdx.y][threadIdx.x] = msg_sum;

    __syncthreads();

    if (threadIdx.y == 0) {
        int msg_sum = 0;
        for (int i = 0; i < UNROLL_NODES; ++i) {
            msg_sum += msg_sums[i][threadIdx.x];
        }
        msg_sums[0][threadIdx.x] = msg_sum;
    }

    __syncthreads();

    msg_sum = msg_sums[0][threadIdx.x];
#endif

    msg_sum = min(max(msg_sum, -MAX_LLR_ACCUMULATOR_VALUE), MAX_LLR_ACCUMULATOR_VALUE);

    __syncwarp();
    if (idx_col < num_cols)
        llr_total[idx_col * Z + i] = llr_accumulator_t(msg_sum);
}
As the OAI processing pipeline uses multiple threads, we need to ensure proper multi-threading support in the CUDA implementation. This is done via a thread-specific context. It ensures that each thread operates on its own CUDA stream and can thus run in parallel without interference.
struct ThreadContext {
    cudaStream_t stream = 0;

    // Device memory declarations - use raw pointers instead of device symbols
    int8_t* llr_in_buffer = nullptr;
    uint8_t* llr_bits_out_buffer = nullptr;
    uint32_t* syndrome_buffer = nullptr;
#ifndef USE_UNIFIED_MEMORY
    uint32_t* host_syndrome_buffer = nullptr;
#endif
    llr_msg_t* llr_msg_buffer = nullptr;
    llr_accumulator_t* llr_total_buffer = nullptr;

#ifdef USE_GRAPHS
    cudaGraphExec_t graphCtx = nullptr;
#endif

    // list of thread contexts for shutdown
    ThreadContext* next_initialized_context = nullptr;
};
static __thread ThreadContext thread_context = {};
Further, the decoder uses clipping values for the extrinsic messages and the VN accumulator.
// add the channel LLRs
__syncwarp();
if (threadIdx.y == 0) {
    if (idx_col < num_cols)
        msg_sum += llr_ch[idx_col * Z + i];
}

msg_sum = min(max(msg_sum, -MAX_LLR_ACCUMULATOR_VALUE), MAX_LLR_ACCUMULATOR_VALUE);

#if UNROLL_NODES > 1
msg_sums[threadIdx.y][threadIdx.x] = msg_sum;

__syncthreads();

if (threadIdx.y == 0) {
    int msg_sum = 0;
    for (int i = 0; i < UNROLL_NODES; ++i) {
        msg_sum += msg_sums[i][threadIdx.x];
    }
    msg_sums[0][threadIdx.x] = msg_sum;
}

__syncthreads();

msg_sum = msg_sums[0][threadIdx.x];
#endif

msg_sum = min(max(msg_sum, -MAX_LLR_ACCUMULATOR_VALUE), MAX_LLR_ACCUMULATOR_VALUE);

__syncwarp();
if (idx_col < num_cols)
    llr_total[idx_col * Z + i] = llr_accumulator_t(msg_sum);
// clip msg magnitudes to MAX_LLR_VALUE
min_1 = min(max(min_1, -MAX_LLR_MSG_VALUE), MAX_LLR_MSG_VALUE);
min_2 = min(max(min_2, -MAX_LLR_MSG_VALUE), MAX_LLR_MSG_VALUE);
Following the same principles, the CN update kernel is given as
__launch_bounds__(UNROLL_NODES * NODE_KERNEL_BLOCK, 3)
static __global__ void update_cn_kernel(llr_accumulator_t const* __restrict__ llr_total, llr_msg_t* __restrict__ llr_msg,
                                        uint32_t Z, uint32_t const* __restrict__ bg_cn, uint32_t const* __restrict__ bg_cn_degree, uint32_t max_degree, uint32_t num_rows,
                                        bool first_iter) {
    uint32_t tid = blockIdx.x * blockDim.x + threadIdx.x;

    uint32_t i = tid % Z;        // for i in range(Z)
    uint32_t idx_row = tid / Z;  // for idx_row in range(num_rows)

    uint32_t cn_degree = bg_cn_degree[idx_row];

    // list of tuples (idx_col = idx_vn, s),
    // idx_row = idx_cn and msg_offset omitted,
    // msg spread out to idx_cn + idx_vn * num_cn
    uint32_t const* check_nodes = &bg_cn[idx_row * max_degree];  // len(cn) = cn_degree

    // search the "extrinsic" min of all incoming LLRs
    // this means we need to find the min and the second min of all incoming LLRs
    int min_1 = INT_MAX;
    int min_2 = INT_MAX;
    int idx_min = -1;
    int node_sign = 1;
    uint32_t msg_signs = 0;  // bitset, 0 == positive; max degree is 19

#if UNROLL_NODES > 1
    __shared__ int mins1[UNROLL_NODES][NODE_KERNEL_BLOCK + 1];
    __shared__ int mins2[UNROLL_NODES][NODE_KERNEL_BLOCK + 1];
    __shared__ int idx_mins[UNROLL_NODES][NODE_KERNEL_BLOCK + 1];
    __shared__ uint32_t signs[UNROLL_NODES][NODE_KERNEL_BLOCK + 1];
    mins1[threadIdx.y][threadIdx.x] = INT_MAX;
    mins2[threadIdx.y][threadIdx.x] = INT_MAX;
    signs[threadIdx.y][threadIdx.x] = 0;
#endif

    __syncwarp();

    if (idx_row < num_rows) {
        for (uint32_t ii = threadIdx.y; ii < cn_degree; ii += UNROLL_NODES) {
            uint32_t cn = check_nodes[ii];

            // see packing layout above
            uint32_t idx_col = cn & 0xffffu;  // note: little endian
            uint32_t s = cn >> 16;            // ...
            uint32_t msg_offset = idx_row + idx_col * num_rows;

            uint32_t msg_idx = msg_offset * Z + i;

            // total VN message
            int t = llr_total[idx_col * Z + (i + s) % Z];

            // make extrinsic by subtracting the previous msg
            if (!first_iter)
                t -= __ldg(&llr_msg[msg_idx]);

            // store sign for 2nd recursion
            // note: could be also used for syndrome-based check or early termination
            int sign = (t >= 0 ? 1 : -1);
            node_sign *= sign;
            msg_signs |= (t < 0) << ii;  // for later sign calculation

            // find min and second min
            int t_abs = abs(t);
            if (t_abs < min_1) {
                min_2 = min_1;
                min_1 = t_abs;
                idx_min = msg_idx;
            } else if (t_abs < min_2)
                min_2 = t_abs;
        }
    }

#if UNROLL_NODES > 1
    mins1[threadIdx.y][threadIdx.x] = min_1;
    mins2[threadIdx.y][threadIdx.x] = min_2;
    idx_mins[threadIdx.y][threadIdx.x] = idx_min;
    signs[threadIdx.y][threadIdx.x] = msg_signs;

    __syncthreads();

    if (threadIdx.y == 0) {
        int min_1 = INT_MAX;
        int min_2 = INT_MAX;
        int idx_min = -1;
        uint32_t msg_signs = 0;  // bitset, 0 == positive; max degree is 19
        for (int i = 0; i < UNROLL_NODES; ++i) {
            int t_abs = mins1[i][threadIdx.x];
            if (t_abs < min_1) {
                min_2 = min_1;
                min_1 = t_abs;
                idx_min = idx_mins[i][threadIdx.x];
            } else if (t_abs < min_2)
                min_2 = t_abs;
            min_2 = min(min_2, mins2[i][threadIdx.x]);
            msg_signs |= signs[i][threadIdx.x];
        }
        mins1[0][threadIdx.x] = min_1;
        mins2[0][threadIdx.x] = min_2;
        idx_mins[0][threadIdx.x] = idx_min;
        signs[0][threadIdx.x] = msg_signs;
    }

    __syncthreads();

    min_1 = mins1[0][threadIdx.x];
    min_2 = mins2[0][threadIdx.x];
    idx_min = idx_mins[0][threadIdx.x];
    msg_signs = signs[0][threadIdx.x];

    node_sign = (__popc(msg_signs) & 1) ? -1 : 1;
#endif

    // START marker-cnp-damping
    // apply damping factor
    min_1 = APPLY_DAMPING_INT(min_1);  // min_1 * DAMPING_FACTOR, e.g. *3/4
    min_2 = APPLY_DAMPING_INT(min_2);  // min_2 * DAMPING_FACTOR, e.g. *3/4
    // END marker-cnp-damping

    // clip msg magnitudes to MAX_LLR_VALUE
    min_1 = min(max(min_1, -MAX_LLR_MSG_VALUE), MAX_LLR_MSG_VALUE);
    min_2 = min(max(min_2, -MAX_LLR_MSG_VALUE), MAX_LLR_MSG_VALUE);
    // END marker-vnp-clipping

    __syncwarp();

    // apply min and second min to the outgoing LLR
    if (idx_row < num_rows) {
        for (uint32_t ii = threadIdx.y; ii < cn_degree; ii += UNROLL_NODES) {
            uint32_t cn = check_nodes[ii];

            // see packing layout above
            uint32_t idx_col = cn & 0xffffu;  // note: little endian
            uint32_t msg_offset = idx_row + idx_col * num_rows;

            uint32_t msg_idx = msg_offset * Z + i;
            int min_val;
            if (msg_idx == idx_min)
                min_val = min_2;
            else
                min_val = min_1;

            int msg_sign = (msg_signs >> ii) & 0x1 ? -1 : 1;

            // and update outgoing msg including sign
            llr_msg[msg_idx] = llr_msg_t(min_val * node_sign * msg_sign);
        }
    }
}
Note that the choice of the clipping values is critical for the performance of the decoder, in particular as the decoder uses mostly signed char variables. This keeps the memory footprint low and can be configured via the following macros:
static const int MAX_LLR_ACCUMULATOR_VALUE = 127;
typedef int8_t llr_accumulator_t;
static const int MAX_LLR_MSG_VALUE = 127;
typedef int8_t llr_msg_t;
For a given lifting factor Z and the numbers of rows num_rows and columns num_cols given by the BG selection, the following grid and block dimensions are used:
dim3 threads(256);
// check node update
dim3 blocks_cn(blocks_for(bg.num_rows * Z, threads.x));
// VN update
dim3 blocks_vn(blocks_for(bg.num_cols * Z, threads.x));
inline __host__ __device__ uint32_t blocks_for(uint32_t elements, int block_size) {
    return int(uint32_t(elements + (block_size - 1)) / uint32_t(block_size));
}
Note that the grid and block dimensions are architecture dependent and should be tuned for the specific GPU.
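As a hedged starting point for such tuning (not part of the current implementation), the CUDA occupancy API can suggest a block size for a given kernel:

// Hedged sketch: query an occupancy-based block size suggestion for the VN
// update kernel instead of hard-coding it. The suggestion is only a starting
// point; the final value should still be benchmarked on the target GPU.
int min_grid_size = 0;
int block_size = 0;
CHECK_CUDA(cudaOccupancyMaxPotentialBlockSize(
    &min_grid_size, &block_size, update_vn_kernel,
    0 /* dynamic shared memory */, 0 /* no block size limit */));
printf("Suggested block size for update_vn_kernel: %d\n", block_size);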
After decoding, the output is hard-decided and packed into a byte-array.
__launch_bounds__(PACK_BITS_KERNEL_THREADS, 6)
static __global__ void pack_bits_kernel(llr_accumulator_t const* __restrict__ llr_total, uint8_t* __restrict__ bits, uint32_t block_length) {
    uint32_t tid = blockIdx.x * blockDim.x + threadIdx.x;

    uint32_t coop_byte = 0;
    // 1 bit per thread
    if (tid < block_length)
        coop_byte = (llr_total[tid] < 0) << (7 - (threadIdx.x & 7));  // note: highest to lowest bit

    // use fast lane shuffles to assemble one byte per group of 8 adjacent threads
    coop_byte += __shfl_xor_sync(0xffffffff, coop_byte, 1);  // xxyyzzww
    coop_byte += __shfl_xor_sync(0xffffffff, coop_byte, 2);  // xxxxyyyy
    coop_byte += __shfl_xor_sync(0xffffffff, coop_byte, 4);  // xxxxxxxx

    // share bytes across thread group to allow one coalesced write by first N threads
    __shared__ uint32_t bit_block_shared[PACK_BITS_KERNEL_THREADS / 8];
    if ((threadIdx.x & 0x7) == 0)
        bit_block_shared[threadIdx.x / 8] = coop_byte;

    __syncthreads();

    // the first (PACK_BITS_KERNEL_THREADS / 8) threads pack 8 bits each
    if (threadIdx.x < PACK_BITS_KERNEL_THREADS / 8 && blockIdx.x * PACK_BITS_KERNEL_THREADS + threadIdx.x * 8 < block_length) {
        bits[blockIdx.x * PACK_BITS_KERNEL_THREADS / 8 + threadIdx.x] = bit_block_shared[threadIdx.x];
    }
}
Note
The decoder returns the number of iterations and declares a decoding failure by returning max_num_iter if the CRC or the syndrome check fails.
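As a hedged usage sketch (the argument order follows the nanobind binding in the full listing above; variable names such as llr_in, bits_out, and max_num_iter are illustrative), a caller could check for a decoding failure as follows:

// Hedged usage sketch: interpreting the decoder's return value as described in the note above.
ThreadContext* context = ldpc_decoder_init(1);
int num_iter_used = ldpc_decode(context, 0, BG, Z, llr_in,
                                block_length, bits_out,
                                max_num_iter, /*perform_syndrome_check=*/true);
if (num_iter_used >= max_num_iter) {
    // decoding failure: the CRC / syndrome check did not pass within max_num_iter iterations
}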
Testing
Unit tests verify the CUDA decoder against Sionna’s reference implementation using nanobind for the Python–CUDA binding. The tests cover multiple code rates for both base graphs (BG1 and BG2).
cd plugins/ldpc_cuda
pytest tests/unit/
The integration test runs the full 5G stack with the CUDA decoder and verifies end-to-end operation via iperf3 traffic.
cd plugins/ldpc_cuda/tests/integration
./run_integration.sh
See plugins/ldpc_cuda/tests/ for the full test suite.
Outlook - Weighted Belief Propagation
An extension of classical belief propagation is weighted belief propagation [Nachmani2016], which uses gradient descent to optimize trainable parameters during message passing. An example implementation can be found in the Sionna tutorial on Weighted Belief Propagation.
When looking at the current implementation of the CN update, we can already see a damping factor of 3/4 applied to each outgoing CN message [Pretti2005].
#define APPLY_DAMPING_INT(x) (x * 3 / 4)
// apply damping factor
min_1 = APPLY_DAMPING_INT(min_1);  // min_1 * DAMPING_FACTOR, e.g. *3/4
min_2 = APPLY_DAMPING_INT(min_2);  // min_2 * DAMPING_FACTOR, e.g. *3/4
One could now implement a weighted CN update by simply replacing the damping factor with a trainable parameter from the weighted BP tutorial.
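As a hedged sketch (not part of the shipped code), the integer damping macro could be replaced by a per-iteration weight, trained offline (e.g., with the Sionna weighted BP tutorial) and passed to the kernel as an additional argument:

// Hedged sketch: a learned check node weight replacing the fixed 3/4 damping.
// Assumption: cn_weight[] holds one trained weight per decoding iteration and
// is passed to update_cn_kernel as an extra argument together with the
// iteration index.
__device__ __forceinline__ int apply_cn_weight(int magnitude, float weight) {
    // scale the CN message magnitude by the learned weight and round to int
    return __float2int_rn(weight * (float)magnitude);
}

// inside update_cn_kernel, replacing APPLY_DAMPING_INT(...):
//   min_1 = apply_cn_weight(min_1, cn_weight[iteration]);
//   min_2 = apply_cn_weight(min_2, cn_weight[iteration]);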
We hope you enjoyed this tutorial. You are now well equipped to accelerate other compute-intensive parts of the 5G stack by following the same principles.