Background: Low-density Parity-check Codes
We start with a Python version of the 5G NR LDPC decoder architecture, which is then used as a blueprint for the later CUDA implementation. If you are already familiar with the concept of belief propagation (BP) decoding, you can skip this part of the tutorial.

Fig. 1: Parity-check matrix and decoding graph of the (7,4) Hamming code.
Background: Channel Coding in 5G
In 5G NR, there are two major channel coding schemes: Low-density Parity-check (LDPC) and Polar codes. LDPC codes are used for the data channels while Polar codes are used for the control channels. This tutorial focuses only on the LDPC decoder, as it is one of the most compute-intensive components in the 5G stack.
LDPC codes were invented in 1963 by Robert G. Gallager but were long forgotten due to their high decoding complexity. After their rediscovery by MacKay in 1996, they have become the workhorse of many modern communication standards, including 5G NR.
The core idea of LDPC decoding is an iterative algorithm based on belief propagation. These iterations can be easily parallelized as each processing node is independent of the others, making them an ideal candidate for GPU acceleration. For further details on LDPC codes in OAI, we refer to [Romani2020].
A machine learning enhanced version of the LDPC decoder has been proposed in [Nachmani2016] and is available as the Sionna Tutorial on Weighted Belief Propagation Decoding.
Overview of the Decoder Implementation

Fig. 2: Overview of the LDPC BP decoding algorithm.
An overview of the LDPC BP decoding algorithm is shown in Fig. 2 above. The core decoding algorithm is implemented in the update_cn_kernel(.) and update_vn_kernel(.) functions. Both kernels are executed iteratively and perform the check node (CN) and variable node (VN) updates, respectively. The decoder stops when the maximum number of iterations is reached. An additional early stopping condition could also be applied to reduce the average number of iterations.
The pack_bits_kernel(.) function maps the soft values to hard-decided bits and packs them into a more compact byte representation, which is required for the OAI processing pipeline.
Note that the decoder is implemented using int datatypes instead of float. This ensures compatibility with OAI.
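For illustration only, a float LLR could be mapped to such an int representation by scaling and clipping before it is handed to the decoder. The sketch below uses a hypothetical scale factor; it is not an OAI convention.

import numpy as np

# Hypothetical float-to-int8 LLR quantization (illustrative scale factor)
llr_float = np.array([-3.2, 0.1, 12.7, -0.8])
scale = 8  # assumed quantization step; not an OAI-defined value
llr_int8 = np.clip(np.round(llr_float * scale), -127, 127).astype(np.int8)
print(llr_int8)  # [-26   1 102  -6]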
The Python code of this tutorial can also be found in plugins/ldpc_cuda/src/python/numpy_decoder.py
Python Imports
Let us now import the relevant libraries. We use the LDPC5GEncoder from Sionna to simplify the code construction. Sionna is not required to run the final CUDA decoder.
[ ]:
# Import the required libraries
import os

# Configure which GPU
if os.getenv("CUDA_VISIBLE_DEVICES") is None:
    gpu_num = 0  # Use "" to use the CPU
    os.environ["CUDA_VISIBLE_DEVICES"] = f"{gpu_num}"

# Set the TF log level to only show errors
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

import numpy as np
import matplotlib.pyplot as plt

# Import Sionna LDPC encoder/decoder and LDPC code definitions
from sionna.phy.fec.ldpc import LDPC5GEncoder, LDPC5GDecoder
Basegraph selection
In 5G NR, the LDPC code is defined by two basegraphs (bg), each with 8 different realizations (selected by the lifting set index ils) depending on the code length and rate. The basegraph selection must be done for each decoding execution as the code parameters can change dynamically.
The LDPC code construction itself is a quasi-cyclic (QC) LDPC code with lifting factor \(Z\), which takes values in 2,…,384. For details see [Richardson2018].
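To make the lifting concrete, the following sketch (with example values for \(Z\) and the shift \(s\)) expands a single basegraph entry into its \(Z \times Z\) circulant block, i.e., an identity matrix whose columns are cyclically shifted by \(s\). Row \(i\) then has its 1 in column \((i+s) \bmod Z\), which matches the (i+s)%z indexing used in the decoder below.

import numpy as np

# Example values: small lifting factor and cyclic shift for illustration
Z = 4  # example lifting factor
s = 1  # example cyclic shift taken from a basegraph entry

# Circulant block: identity matrix with columns cyclically shifted by s
block = np.roll(np.eye(Z, dtype=int), shift=s, axis=1)
print(block)
# [[0 1 0 0]
#  [0 0 1 0]
#  [0 0 0 1]
#  [1 0 0 0]]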
[2]:
def get_bg(ils, bg, verbose=False):
    """Pre-generates basegraph description for the given lifting set index
    and basegraph number.

    This can be precomputed and stored before the decoding process.

    Parameters
    ----------
    ils : int | 0,...,7
        Lifting set index of bg as defined in 38.212 Tab 5.3.2-1.

    bg : int | 1,2
        Basegraph number

    verbose : bool
        If True, additional information is printed

    Returns
    -------
    bg_vn : list of tuples
        Each tuple contains the variable node index, the check node index,
        the cyclic shift and the offset for the memory access of the message.

    bg_cn : list of tuples
        Each tuple contains the check node index, the variable node index,
        the cyclic shift and the offset for the memory access of the message.

    bg_vn_degree : list
        List of variable node degrees.

    bg_cn_degree : list
        List of check node degrees.
    """

    if bg == 1:
        bg = "bg1"
    else:
        bg = "bg2"

    # Use sionna to load the basegraph
    enc = LDPC5GEncoder(12, 24)  # dummy encoder
    mat_ref = enc._load_basegraph(ils, bg)

    #########################################################
    # Generate bg compact description
    #########################################################

    # From VN perspective
    bg_vn = []
    msg_offset = 0  # Counter how many message blocks have been passed already
    for idx_vn in range(mat_ref.shape[1]):
        t = []
        for idx_cn in range(mat_ref.shape[0]):
            if mat_ref[idx_cn, idx_vn] != -1:
                t.append((idx_vn, idx_cn, int(mat_ref[idx_cn, idx_vn]), msg_offset))
                msg_offset += 1
        bg_vn.append(t)
    if verbose:
        print(bg_vn)

    # From CN perspective
    bg_cn = []
    for idx_cn in range(mat_ref.shape[0]):
        t = []
        for idx_vn in range(mat_ref.shape[1]):
            if mat_ref[idx_cn, idx_vn] != -1:
                # Find message offset from VN perspective
                # Find matching entry in bg_vn to get message offset
                for vn_entry in bg_vn[idx_vn]:
                    if vn_entry[1] == idx_cn:
                        msg_offset = vn_entry[3]
                        break
                t.append((idx_cn, idx_vn, int(mat_ref[idx_cn, idx_vn]), msg_offset))
        bg_cn.append(t)
    if verbose:
        print(bg_cn)

    bg_vn_degree = [len(bg_vn[i]) for i in range(len(bg_vn))]
    bg_cn_degree = [len(bg_cn[i]) for i in range(len(bg_cn))]

    if verbose:
        print(bg_vn_degree)
        print(bg_cn_degree)

    return bg_vn, bg_cn, bg_vn_degree, bg_cn_degree
This can be precomputed for \(bg=1\) and \(bg=2\) and for all possible values of \(ils=0,...,7\), leading to 16 different code definitions. For the sake of readability of this Python implementation, we re-run the above function during each call of the decoding function.
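In a deployment, these descriptions could be pre-computed once and cached, for example as sketched below (bg_cache is a hypothetical lookup structure; this tutorial instead re-runs get_bg(.) on every decode for readability):

# Hypothetical pre-computation of all basegraph descriptions
bg_cache = {}
for bg in (1, 2):
    for ils in range(8):
        bg_cache[(bg, ils)] = get_bg(ils, bg)

# During decoding, the description is then a simple dictionary lookup
bg_vn, bg_cn, bg_vn_degree, bg_cn_degree = bg_cache[(1, 0)]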
Before decoding, the decoder needs to load the exact code configuration for the given basegraph\(bg\) and lifting factor\(z\).
[3]:
def init_basegraph(bg, z):
    """Initializes the basegraph, its dimensions and number of edges/messages.

    Parameters
    ----------
    bg : int | 1,2
        Basegraph number

    z : int | 2,...,384
        Lifting factor

    Returns
    -------
    bg_vn : list of tuples
        Each tuple contains the variable node index, the check node index,
        the cyclic shift and the offset for the memory access of the message.

    bg_cn : list of tuples
        Each tuple contains the check node index, the variable node index,
        the cyclic shift and the offset for the memory access of the message.

    bg_vn_degree : list
        List of variable node degrees.

    bg_cn_degree : list
        List of check node degrees.

    num_cols : int
        Number of variable nodes.

    num_rows : int
        Number of check nodes.

    num_edges : int
        Number of edges/messages in the graph.
    """

    # These values are hard-coded for the basegraphs bg1 and bg2
    # They are defined in 38.212.
    if bg == 1:
        num_rows = 46
        num_cols = 68
        num_nnz = 316  # Num non-zero elements in bg
    else:  # bg2
        num_rows = 42
        num_cols = 52
        num_nnz = 197  # Num non-zero elements in bg

    # Number of variable nodes
    num_vns = num_cols * z
    # Number of check nodes
    num_cns = num_rows * z
    # Number of edges/messages in the graph
    num_edges = num_nnz * z

    # Lifting set according to 38.212 Tab 5.3.2-1
    s_val = [[2, 4, 8, 16, 32, 64, 128, 256],
             [3, 6, 12, 24, 48, 96, 192, 384],
             [5, 10, 20, 40, 80, 160, 320],
             [7, 14, 28, 56, 112, 224],
             [9, 18, 36, 72, 144, 288],
             [11, 22, 44, 88, 176, 352],
             [13, 26, 52, 104, 208],
             [15, 30, 60, 120, 240]]

    # Find lifting set index
    ils = -1
    for i in range(len(s_val)):
        for j in range(len(s_val[i])):
            if z == s_val[i][j]:
                ils = i
                break

    # This case should not happen
    assert ils != -1, "Lifting factor not found in lifting set"

    # Load base graph; this will become a lookup table in CUDA
    bg_vn, bg_cn, bg_vn_degree, bg_cn_degree = get_bg(ils, bg)

    return bg_vn, bg_cn, bg_vn_degree, bg_cn_degree, num_cols, num_rows, num_edges
Memory Layout
The resulting parity-check matrix after lifting can be visualized using the code below. Each blue dot represents a connection from a check node to a variable node and, thus, an outgoing message from a check node to a variable node which needs to be stored in the message buffer.
[4]:
# Example code construction
k = 8448
n = 368*68
enc = LDPC5GEncoder(k, n)
dec = LDPC5GDecoder(enc)

plt.spy(dec._pcm, markersize=0.3)
plt.xlabel('Variable Nodes')
plt.ylabel('Check Nodes')
plt.title(f'Basegraph {enc._bg}, Lifting factor: {enc._z}')
plt.show()

During decoding, three different message buffers need to be updated (an allocation sketch follows after the list):
llr_channel: This buffer stores the LLRs from the channel. This is of size \(Z*68\) for \(bg=1\) and \(Z*52\) for \(bg=2\).
llr_total: This buffer accumulates the LLRs from the variable nodes. This is of size \(Z*68\) for \(bg=1\) and \(Z*52\) for \(bg=2\).
llr_msg: This buffer stores the messages from the check nodes to the variable nodes. This is of size \(Z*316\) for \(bg=1\) and \(Z*197\) for \(bg=2\).
In practice, many VNs are punctured due to rate-matching, allowing decoders to skip updates for these unused nodes and operate more efficiently.
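For concreteness, the three buffers could be allocated as sketched below for example parameters (the dtypes mirror the ones used in decode_ldpc(.) further down; the decoder itself allocates llr_msg and llr_total internally):

import numpy as np

# Example parameters (illustrative): bg1 with the maximum lifting factor
bg, Z = 1, 384
num_cols = 68 if bg == 1 else 52   # VN blocks per basegraph
num_nnz = 316 if bg == 1 else 197  # edges (messages) per basegraph

llr_channel = np.zeros(num_cols * Z, dtype=np.int8)   # channel LLRs
llr_total   = np.zeros(num_cols * Z, dtype=np.int16)  # accumulated VN LLRs (higher precision)
llr_msg     = np.zeros(num_nnz * Z, dtype=np.int8)    # CN-to-VN messages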
CN Update Function
We are using the min-sum approximation for the check node update, given as

\[ y_{j \to i} = \beta \cdot \prod_{i' \in \mathcal{N}(j) \setminus i} \operatorname{sign}\left(x_{i' \to j}\right) \cdot \min_{i' \in \mathcal{N}(j) \setminus i} \left|x_{i' \to j}\right| \]

where \(y_{j \to i}\) denotes the message from check node (CN) \(j\) to variable node (VN) \(i\) and \(x_{i \to j}\) the message from VN \(i\) to CN \(j\), respectively. Further, \(\mathcal{N}(j)\) denotes all indices of VNs connected to CN \(j\) and

\[ \prod_{i' \in \mathcal{N}(j) \setminus i} \operatorname{sign}\left(x_{i' \to j}\right) \]

is the sign of the outgoing message. The parameter \(\beta\) defines the damping factor which is used to scale the outgoing CN messages. For further details we refer to [Chen2005].
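Note that both the sign product and the minimum exclude the edge that is currently updated. Instead of recomputing this extrinsic minimum per edge, it suffices to track the smallest and second-smallest magnitude plus the index of the smallest: every edge is assigned the overall minimum except the one that provided it, which receives the second minimum. The following minimal NumPy sketch (with arbitrary example messages, ignoring damping and clipping) verifies this equivalence and mirrors the loop structure of update_cn(.) below.

import numpy as np

# Arbitrary example: incoming VN-to-CN messages of a single check node
x = np.array([-9, 4, -2, 7], dtype=np.int16)

# Naive extrinsic min-sum: exclude each edge explicitly
y_naive = np.array([np.prod(np.sign(np.delete(x, ii))) * np.min(np.abs(np.delete(x, ii)))
                    for ii in range(len(x))])

# Two-minima shortcut as used in update_cn(.)
abs_x = np.abs(x)
idx_min = np.argmin(abs_x)
min_1 = abs_x[idx_min]                     # smallest magnitude
min_2 = np.min(np.delete(abs_x, idx_min))  # second-smallest magnitude
node_sign = np.prod(np.sign(x))            # product of all signs
# Multiplying by sign(x) again removes each edge's own sign from the product
y_trick = np.where(np.arange(len(x)) == idx_min, min_2, min_1) * node_sign * np.sign(x)

assert np.array_equal(y_naive, y_trick)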
[5]:
# In the later CUDA implementation, we use int8 dtypes for the LLRs.
# Thus, we need to clip the LLRs to the range [-127, 127]
MAX_LLR_VALUE = 127  # depends on dtype

# Damping(=scaling) factor of the outgoing LLRs to improve the decoding performance.
# This is a heuristic value and typically a good choice.
DAMPING_FACTOR = 0.75

def update_cn(llr_msg, llr_total, z, bg_cn, bg_cn_degree, num_rows, first_iter):
    """
    Inputs
    ------
    llr_msg: np.array [num_edges]
        Incoming LLRs from variable nodes
    llr_total: np.array [num_vns]
        Accumulated LLRs from variable nodes
    z: int | 2,...,384
        Lifting factor
    bg_cn: list of tuples
        Check node configuration
    bg_cn_degree: list
        Check node degree
    num_rows: int
        Number of rows in the base graph
    first_iter: bool
        Whether this is the first iteration

    Returns
    -------
    llr_msg: np.array [num_edges]
        Updated LLRs from check nodes
    """

    # Check node update function
    # These two loops can be parallelized in CUDA
    for i in range(z):  # between 2 and 384
        for idx_row in range(num_rows):  # Either 46 or 42

            cn_degree = bg_cn_degree[idx_row]  # Check node degree
            # List of tuples (idx_row, idx_col, s, msg_offset)
            cn = bg_cn[idx_row]  # len(cn) = cn_degree

            # Search the "extrinsic" min of all incoming LLRs
            # This means we need to find the min and the second min of all incoming LLRs
            min_1 = MAX_LLR_VALUE
            min_2 = MAX_LLR_VALUE
            idx_min = -1
            node_sign = 1

            # Temp buffer for signs
            msg_sign = np.ones((19,))  # Max CN degree is 19

            for ii in range(cn_degree):
                # Calculate the index of the message in the LLR array
                msg_offset = cn[ii][3]
                s = cn[ii][2]
                idx_col = cn[ii][1]
                msg_idx = msg_offset*z + i

                # Total VN message
                t = llr_total[idx_col*z + (i+s) % z]

                # Make extrinsic by subtracting the previous msg
                if not first_iter:  # Ignore in first iteration
                    t -= llr_msg[msg_idx]

                # Store sign for 2nd recursion
                sign = 1 if np.abs(t) == 0 else np.sign(t)

                # Could be also used for syndrome-based check or early termination
                node_sign *= sign
                msg_sign[ii] = sign  # For later sign calculation

                # Find min and second min
                t_abs = np.abs(t)
                if t_abs < min_1:
                    min_2 = min_1
                    min_1 = t_abs
                    idx_min = msg_idx
                elif t_abs < min_2:
                    min_2 = t_abs

            # Apply damping factor
            min_1 *= DAMPING_FACTOR
            min_2 *= DAMPING_FACTOR

            # Clip min_val to MAX_LLR_VALUE
            min_1 = np.clip(min_1, -MAX_LLR_VALUE, MAX_LLR_VALUE)
            min_2 = np.clip(min_2, -MAX_LLR_VALUE, MAX_LLR_VALUE)

            # Apply min and second min to the outgoing LLR
            for ii in range(cn_degree):
                msg_offset = cn[ii][3]
                msg_idx = msg_offset*z + i
                if msg_idx == idx_min:
                    min_val = min_2
                else:
                    min_val = min_1

                # And update outgoing msg including sign
                llr_msg[msg_idx] = min_val * node_sign * msg_sign[ii]

    return llr_msg
VN Update Function
The variable node update function sums all incoming messages from the check nodes and adds the channel LLRs

\[ x_{i \to j} = \lambda_i + \sum_{j' \in \mathcal{M}(i) \setminus j} y_{j' \to i} \]

where \(\lambda_i\) denotes the channel LLR associated with the \(i\)-th VN and \(\mathcal{M}(i) \setminus j\) denotes the set of check nodes connected to VN \(i\), excluding the check node \(j\) itself.
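A small implementation detail: the decoder below never stores the extrinsic VN-to-CN messages \(x_{i \to j}\) explicitly. Instead, update_vn(.) accumulates the total LLR over all connected CNs, and update_cn(.) recovers the extrinsic value by subtracting the previously sent CN message (the t -= llr_msg[msg_idx] step above), i.e.,

\[ x_{i \to j} = \Big( \lambda_i + \sum_{j' \in \mathcal{M}(i)} y_{j' \to i} \Big) - y_{j \to i}, \]

where the bracketed term is exactly the entry of llr_total computed by update_vn(.).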
[6]:
def update_vn(llr_msg, llr_ch, llr_total, z, bg_vn, bg_vn_degree, num_cols):
    """
    Inputs
    ------
    llr_msg: np.array [num_edges]
        Incoming LLRs from check nodes
    llr_ch: np.array [num_vns]
        Channel LLRs
    llr_total: np.array [num_vns]
        Accumulated LLRs from variable nodes
    z: int | 2,...,384
        Lifting factor
    bg_vn: list of tuples
        Variable node configuration
    bg_vn_degree: list
        Variable node degree
    num_cols: int
        Number of variable nodes

    Returns
    -------
    llr_total: np.array [num_vns]
        Updated LLRs from variable nodes
    """

    # This can be parallelized in CUDA
    for i in range(z):  # Between 2 and 384
        for idx_col in range(num_cols):  # Either 52 or 68

            vn_degree = bg_vn_degree[idx_col]  # Variable node degree
            # List of tuples (idx_col, idx_row, s, msg_offset)
            vn = bg_vn[idx_col]  # len(vn) = vn_degree

            # Accumulate all incoming LLRs
            msg_sum = 0  # Should be int16
            for j in range(vn_degree):
                msg_offset = vn[j][3]
                s = vn[j][2]

                # Index of the msg in the LLR array
                # It is the idx_col-th variable node, and the j-th message from the idx_row-th check node
                msg_idx = msg_offset*z + (i-s) % z

                # Accumulate all incoming LLRs
                msg_sum += llr_msg[msg_idx].astype(np.int16)

            # Add the channel LLRs
            msg_sum += llr_ch[idx_col*z + i].astype(np.int16)

            llr_total[idx_col*z + i] = msg_sum

    return llr_total
Hard-decision
The final LLRs can now be hard-decided to bits. Further, OAI requires the decoder to return bytes instead of individual bit values.
[7]:
def pack_bits(llr_total, block_length, pack_bits=False):
    """
    Inputs
    ------
    llr_total: np.array [num_vns]
        LLRs from variable nodes
    block_length: int
        Number of payload bits that are returned after decoding
    pack_bits: bool
        If True, the bits are packed into a byte array

    Returns
    -------
    bits: np.array [block_length]
        Decoded bits
    """

    # OAI wants the bits in a byte array
    if pack_bits:
        # Round length to the nearest multiple of 8
        block_length = int(np.ceil(block_length / 8) * 8)
        bits = (llr_total[:block_length] < 0).astype(np.uint8)
        bits = np.packbits(bits)
    else:
        bits = (llr_total[:block_length] < 0).astype(np.uint8)
    return bits
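As a quick sanity check with toy values (not taken from the decoder), negative LLRs map to bit 1, and np.packbits groups eight bits into one byte:

import numpy as np

# Toy total LLRs for 8 VNs: negative LLR -> bit 1
llr_demo = np.array([-5, 7, -3, 9, 1, -2, 4, -8], dtype=np.int16)

print(pack_bits(llr_demo, block_length=8))                  # [1 0 1 0 0 1 0 1]
print(pack_bits(llr_demo, block_length=8, pack_bits=True))  # [165]  (0b10100101)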
Main Decoding Function
We can now run the full LDPC decoder by iteratively executing the CN and VN update functions as depicted in Fig. 2.
[8]:
def decode_ldpc(bg, z, block_length, num_iter, llr_ch):
    """
    Inputs
    ------
    bg: int | 1,2
        Basegraph used for decoding
    z: int | 2,...,384
        Lifting factor
    block_length: int
        Number of payload bits that are returned after decoding
    num_iter: int
        Max number of decoding iterations
    llr_ch: np.array [68*z] or [52*z] for bg1 and bg2, respectively
        Received channel LLRs
    """

    ############################
    # Initialize Variables
    ############################

    bg_vn, bg_cn, bg_vn_degree, bg_cn_degree, num_cols, num_rows, num_edges \
        = init_basegraph(bg, z)

    # Temporary message buffer
    # Max size is 316*384
    llr_msg = np.zeros((num_edges,), dtype=np.int8)  # No need to initialize to 0

    # VN accumulator
    # We always init the max size of the LLR array given as 68*384
    # The accumulator needs higher precision than the message buffer
    llr_ch = llr_ch.astype(np.int8)

    ############################
    # Main Decoding Loop
    ############################

    llr_total = np.copy(llr_ch).astype(np.int16)  # llr_total will be updated in the VN update

    for i in range(num_iter):
        # CN update
        # llr_msg not read, only written to in first iteration;
        # will be filled with outputs of this function
        llr_msg = update_cn(llr_msg, llr_total, z, bg_cn, bg_cn_degree, num_rows, i == 0)

        # VN update
        llr_total = update_vn(llr_msg, llr_ch, llr_total, z, bg_vn, bg_vn_degree, num_cols)

    # Pack bits
    bits = pack_bits(llr_total, block_length, pack_bits=True)

    # [optional] apply syndrome check and, if not successful, run additional iterations

    return bits, llr_total
Run and Test the Decoder
We can now run an end-to-end test of the decoder using the Sionna LDPC5GEncoder. Note that de-rate-matching at the receiver must be done manually before decoding. This will be handled by the OAI stack in the future.
[9]:
# Parameters
k = 1280  # Payload size
n = 2560  # Codeword size after rate matching
num_iter = 8  # Number of decoding iterations
llr_mag = 10  # LLR magnitude

# Use Sionna for BG selection
enc = LDPC5GEncoder(k, n)
bg = 1 if enc._bg == "bg1" else 2
z = enc._z

# Draw random bits
u = np.random.randint(0, 2, k)

# Encode using Sionna
c = enc(np.expand_dims(u.astype(np.float32), 0))[0].numpy()

# Map to LLRs
x = 1 - 2*c

# Clip to max LLR value due to int8 representation
llr_ch = np.clip((llr_mag*x), -127, 127).astype(np.int8)

# Get code parameters
num_vn = 68*z if bg == 1 else 52*z
parity_start = 22*z if bg == 1 else 10*z

# Apply rate matching
llr_input = np.zeros(num_vn, dtype=np.int8)  # Takes care of the punctured bits (initialized to 0 LLR)
llr_input[2*z:k] = llr_ch[:k-2*z]  # Unpunctured message bits
llr_input[k:parity_start] = 127  # Shortened bits
llr_input[parity_start:parity_start+n-k+2*z] = llr_ch[k-2*z:]  # Parity bits

# Run decoder
decoder_outputs, _ = decode_ldpc(bg, z, k, num_iter, llr_input)

# Evaluate results
u_hat = np.unpackbits(decoder_outputs.astype(np.uint8))[:k]  # There may be a couple of extra bits if k is not a multiple of 8

if np.all(u == u_hat):
    print("All payload bits decoded correctly!")
else:
    print("Decoding failed!")

print(f"Shape of LLR input: {llr_input.shape}")
print(f"Shape of decoded bytes: {decoder_outputs.shape}")
All payload bits decoded correctly!
Shape of LLR input: (6656,)
Shape of decoded bytes: (160,)
We are now ready to start with the CUDA implementation!