CN119996539A - Efficient parsing and forwarding method of RDMA protocol based on RISC-V architecture - Google Patents

Efficient parsing and forwarding method of RDMA protocol based on RISC-V architecture

Info

Publication number
CN119996539A
Authority
CN
China
Prior art keywords
processing unit; data packet; load; RISC; value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202510053378.6A
Other languages
Chinese (zh)
Inventor
陈柏龄
贺冠博
潘俊冰
黄安妮
廖邓彬
符嘉成
莫晓盈
冯露葶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangxi Power Grid Co Ltd
Original Assignee
Guangxi Power Grid Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangxi Power Grid Co Ltd
Priority to CN202510053378.6A (CN119996539A)
Publication of CN119996539A
Legal status: Pending

Abstract

The application belongs to the technical field of power systems and relates to an efficient RDMA protocol parsing and forwarding method based on the RISC-V architecture. By introducing a hardware acceleration module and a customized RISC-V instruction set, the method accelerates protocol parsing and verification in hardware, greatly improving parsing speed and reducing CPU load. During protocol parsing of a data packet, the flexibility and customizability of the RISC-V architecture are used to optimize the instruction set for the specific requirements of the RDMA protocol, so that packet header information can be extracted efficiently and CRC checks can be performed. Through dynamic traffic scheduling and load balancing, a hardware scheduling module monitors the load of each computing unit in real time and intelligently adjusts traffic distribution, avoiding overload of any single computing node and ensuring efficient packet forwarding. In addition, based on the customized processing capability of the RISC-V architecture, the forwarding path of a data packet can be flexibly adjusted according to real-time network topology and routing information, improving the adaptability and processing capacity of the whole network.

Description

Efficient parsing and forwarding method of RDMA protocol based on RISC-V architecture
Technical Field
The application belongs to the technical field of power systems, and particularly relates to an RDMA protocol efficient analysis and forwarding method based on a RISC-V architecture.
Background
With the continuous development of smart grids and modern energy management systems, the grids gradually move to digitization, informatization and intellectualization. The introduction of the technologies enables the power grid to have the capabilities of real-time monitoring, remote control and optimal scheduling. However, the complexity of the power grid has increased, and how to efficiently parse and forward the power grid data stream has become an important issue, especially when dealing with large amounts of real-time data and high frequency communications.
Communication systems in smart grids generally adopt a distributed architecture, and various smart devices and control units (such as a transformer substation, a power distribution network, a load control device and the like) need to be interconnected and intercommunicated through a high-speed network. In order to improve communication efficiency and reduce latency, many modern grid communication systems employ Remote Direct Memory Access (RDMA) technology that is capable of providing high bandwidth, low latency data transfer, reducing processor intervention during data transfer, and improving data processing efficiency.
Data transmission in the power grid mostly depends on network protocols such as TCP/IP, UDP or customized grid protocols.
It is conventional practice to parse and process data packets by running complex protocol stacks on the CPU. The protocol stacks are responsible for receiving data packets from the network, performing protocol analysis, checksum data processing, and determining forwarding paths of the data packets. This approach, while popular, is limited by CPU performance, especially when dealing with high frequency, high traffic data, where the performance bottleneck is significant.
Disclosure of Invention
The invention provides an efficient RDMA protocol parsing and forwarding method based on the RISC-V architecture, which aims to solve the significant performance bottleneck caused by limited CPU performance, especially when processing high-frequency, high-volume data.
The efficient RDMA protocol parsing and forwarding method based on the RISC-V architecture includes the following steps:
Step 1: a data packet acquired by an edge node reaches a local computing node through an RDMA network interface;
Step 2: the RDMA protocol of the data packet is parsed and verified by a hardware acceleration module, which extracts the header information and performs a CRC check using a customized RISC-V instruction set;
Step 3: the parsed and verified data packet is passed to a flow module according to its target information, and the local computing node determines the forwarding path of the data packet according to the routing table and topology information;
Step 4: based on the determined forwarding path, the data packet is forwarded to a target processing unit or network interface; if the forwarding target is the local computing node, the data packet is stored in a cache;
Step 5: during packet forwarding, a hardware scheduling module dynamically adjusts traffic distribution according to the current load and controls the processing load of each processing unit to avoid overload.
The efficient RDMA protocol parsing and forwarding method based on the RISC-V architecture provides an effective solution to the performance bottleneck caused by limited CPU performance in current high-frequency, high-volume data processing. By introducing a hardware acceleration module and a customized RISC-V instruction set, the method accelerates protocol parsing and verification in hardware, greatly improving parsing speed and reducing CPU load. During protocol parsing of a data packet, the scheme uses the flexibility and customizability of the RISC-V architecture to optimize the instruction set for the specific requirements of the RDMA protocol, efficiently extracting packet header information and performing CRC (cyclic redundancy check). Meanwhile, through dynamic traffic scheduling and load balancing, the hardware scheduling module monitors the load of each computing unit in real time and intelligently adjusts traffic distribution, avoiding overload of any single computing node and ensuring efficient packet forwarding. In addition, based on the customized processing capability of the RISC-V architecture, the forwarding path of a data packet can be flexibly adjusted according to real-time network topology and routing information, improving the adaptability and processing capacity of the whole network and effectively solving the problem that traditional processing schemes cannot meet performance requirements under high traffic and high concurrency.
Preferably, the format of the data packet includes a protocol header, a data payload and a trailer;
The protocol header includes a source address, a destination address, a protocol type, and a data length;
The data payload comprises data content carrying a transmission;
The trailer contains CRC check information for the data packet.
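As an illustrative sketch of this packet layout, the following Python code builds and parses a packet with a hypothetical fixed header (4-byte IPv4-style addresses, a 2-byte protocol type, and a 4-byte length field) and a CRC-32 trailer; the field widths and function names are assumptions, not part of the claims.

```python
import struct
import zlib

# Assumed header layout: source addr, target addr, protocol type, payload length.
HDR = struct.Struct("!4s4sHI")

def build_packet(src: bytes, dst: bytes, proto: int, payload: bytes) -> bytes:
    header = HDR.pack(src, dst, proto, len(payload))
    crc = zlib.crc32(header + payload)          # trailer: CRC over header + payload
    return header + payload + struct.pack("!I", crc)

def parse_packet(pkt: bytes):
    src, dst, proto, length = HDR.unpack_from(pkt, 0)
    payload = pkt[HDR.size:HDR.size + length]
    (crc,) = struct.unpack_from("!I", pkt, HDR.size + length)
    ok = zlib.crc32(pkt[:HDR.size + length]) == crc   # verify the trailer
    return src, dst, proto, payload, ok
```

A corrupted packet fails the trailer check, which is exactly the condition that triggers the discard step described later.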
Preferably, the step 2 includes the steps of:
After the hardware interface RDMA NIC receives the data packet, the data is stored into a temporary DMA buffer;
Customized RISC-V instruction set and protocol parsing: the source address, target address, protocol type, and data length of the data packet are parsed based on the customized RISC-V instruction set;
A set of instructions is defined based on the RISC-V instruction set to quickly calculate the CRC check value:
wherein D = d_0, d_1, …, d_{n-1} represents the content of the data packet and the CRC is calculated bit by bit; InitialCRC represents the CRC initial value;
The CRC value is initialized to 0xFFFFFFFF; each byte of the data packet is XORed in, and the value is shifted and updated by table lookup; after the calculation the CRC value is inverted and compared with the check value in the CRC check information: if they match, the check passes; otherwise, it fails;
Information extraction and error handling: the protocol header fields and the CRC check result are extracted, and the data packet is discarded if the check fails.
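The table-lookup CRC procedure described above (initial value 0xFFFFFFFF, per-byte XOR, shift-and-update via a lookup table, final inversion) can be sketched in Python. This is the standard reflected CRC-32 and is shown only to illustrate the computation the custom instruction would accelerate in hardware.

```python
def make_crc32_table():
    poly = 0xEDB88320                 # reflected CRC-32 polynomial
    table = []
    for i in range(256):
        c = i
        for _ in range(8):
            c = (c >> 1) ^ poly if c & 1 else c >> 1
        table.append(c)
    return table

_TABLE = make_crc32_table()

def crc32(data: bytes) -> int:
    crc = 0xFFFFFFFF                  # initial value from the text
    for b in data:
        # XOR in one byte, then shift and update via table lookup
        crc = _TABLE[(crc ^ b) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF           # final inversion
```

The well-known check value CRC-32(b"123456789") = 0xCBF43926 can be used to validate any implementation of this algorithm.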
Preferably, the step 3 includes the steps of:
Routing table lookup and path calculation: the target IP address is matched against each entry in the routing table, the entry with the longest matching prefix is selected, the next-hop information of the matched entry is obtained, and the forwarding path of the data packet is determined from the matching result;
The routing information is stored in a hardware Trie, enabling fast lookup.
Preferably, the step of determining the forwarding path based on the hardware Trie tree is as follows:
transmitting the input target IP address to the Trie bit by bit;
starting from the root node of the Trie, matching is performed according to each bit of the target IP address:
If the current bit of the target IP is 0, the search continues along the left subtree;
If the current bit of the target IP is 1, the search continues along the right subtree;
recording the routing information of a node when the node is queried;
and after all the bits are queried, returning the routing information corresponding to the longest matching prefix.
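A software model of the bit-by-bit Trie lookup above, written in Python as an illustration; a hardware implementation would store the nodes in fixed-size memory, but the matching logic is the same.

```python
class TrieNode:
    __slots__ = ("children", "route")
    def __init__(self):
        self.children = [None, None]   # bit 0 -> left subtree, bit 1 -> right subtree
        self.route = None              # routing info if this node ends a prefix

def trie_insert(root, prefix: int, plen: int, route):
    node = root
    for i in range(plen):              # walk the top plen bits of the prefix
        bit = (prefix >> (31 - i)) & 1
        if node.children[bit] is None:
            node.children[bit] = TrieNode()
        node = node.children[bit]
    node.route = route

def lpm_lookup(root, addr: int):
    node, best = root, None
    for i in range(32):
        if node.route is not None:
            best = node.route          # remember the longest match seen so far
        bit = (addr >> (31 - i)) & 1
        node = node.children[bit]
        if node is None:
            break                      # no deeper match possible
    else:
        if node.route is not None:
            best = node.route
    return best                        # routing info of the longest matching prefix
```

For example, with 10.0.0.0/8 and 10.1.0.0/16 installed, the address 10.1.2.3 matches the /16 entry because the longer prefix wins.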
Preferably, the step 5 includes the steps of:
The hardware scheduling module acquires real-time load data of each processing unit by periodically monitoring the current load condition of each processing unit, wherein the real-time load data comprises the current queue length, the CPU utilization rate, the memory utilization rate and the network bandwidth utilization rate;
Threshold judgment: if the load of a processing unit exceeds a preset upper threshold, the unit is considered overloaded and its traffic needs to be reassigned;
If the load is below a preset lower threshold, the processing unit is considered idle;
Task migration: data packets are migrated from overloaded processing units to idle processing units via a hash algorithm;
Flow segmentation, namely, a hardware scheduling module distributes flow according to the load condition of each processing unit to realize load balancing:
Where P_in denotes the number of packets to be forwarded, P_out,i denotes the number of packets allocated to the i-th processing unit, L_i denotes the load value of the i-th processing unit, L_max denotes the maximum load value, and N denotes the number of processing units.
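Since the allocation formula itself is not reproduced in this text, the following Python sketch shows one plausible split consistent with the variables named: each unit receives packets in proportion to its spare capacity L_max − L_i, and the integer remainder is distributed so the allocations sum to P_in. The proportional rule is an assumption, not the claimed formula.

```python
def split_traffic(p_in: int, loads: list, l_max: float) -> list:
    # Spare capacity of each unit; a lightly loaded unit gets more packets.
    spare = [max(l_max - l, 0) for l in loads]
    total = sum(spare)
    if total == 0:                           # all units saturated: spread evenly
        base = [p_in // len(loads)] * len(loads)
    else:
        base = [p_in * s // total for s in spare]
    # Hand out the integer remainder so the allocations sum exactly to p_in.
    for i in range(p_in - sum(base)):
        base[i % len(loads)] += 1
    return base
```

With P_in = 100 and loads [20, 50, 80] against L_max = 100, the least loaded unit receives the largest share.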
Preferably, the step of implementing task migration by the hash algorithm is as follows:
Load sensing value calculation: a load sensing value is calculated for each processing unit:
LAHV_i = w_1 × CPU_i + w_2 × QueueLength_i + w_3 × Memory_i + w_4 × Bandwidth_i;
Wherein LAHV_i denotes the load sensing value of the i-th processing unit, CPU_i its CPU utilization, QueueLength_i its queue length, Memory_i its memory utilization, Bandwidth_i its bandwidth utilization, and w_1, w_2, w_3, w_4 the respective weight coefficients;
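A minimal Python sketch of the weighted load-sensing value; the concrete weight values are illustrative assumptions (the text does not fix them).

```python
def lahv(cpu: float, queue_len: float, mem: float, bw: float,
         w=(0.4, 0.2, 0.2, 0.2)) -> float:
    # LAHV_i = w1*CPU_i + w2*QueueLength_i + w3*Memory_i + w4*Bandwidth_i
    # The weights here are placeholders; a deployment would tune them.
    w1, w2, w3, w4 = w
    return w1 * cpu + w2 * queue_len + w3 * mem + w4 * bw
```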
Hash value calculation: a hash function is applied to the load sensing value of each processing unit, mapping it to a location on the hash ring:
Hash_i = HashFunction(LAHV_i);
Wherein Hash_i represents the hash value of the i-th processing unit;
Ring construction, mapping each processing unit to a location of the hash ring;
Packet hash calculation: the hash value of a data packet is calculated from one of its identifiers:
TaskHash = HashFunction(PacketIdentifier);
Wherein TaskHash denotes the data packet hash value and PacketIdentifier denotes the identifier of the data packet;
Processing unit lookup on the hash ring: starting from the packet's hash value, the ring is searched clockwise for the next processing unit; according to the hash values, the processing unit with the lightest load and a matching hash value is selected for task migration;
After the data packet is migrated to the target processing unit, the load sensing value of the target processing unit needs to be updated, and the node position on the hash ring is dynamically adjusted by periodically calculating the load sensing value of each processing unit.
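The ring construction, packet hashing, and clockwise lookup steps above can be modeled in Python as follows. The use of SHA-256 truncated to 32 bits as HashFunction is an assumption; any uniform hash would do, and re-adding a unit with a fresh LAHV models the periodic repositioning described above.

```python
import bisect
import hashlib

def h32(key: str) -> int:
    # 32-bit ring position derived from a SHA-256 digest (assumed HashFunction).
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:4], "big")

class HashRing:
    def __init__(self):
        self._ring = []                      # sorted (position, unit) pairs

    def add_unit(self, unit: str, lahv: float):
        # The position mixes in the unit's load-sensing value, so periodically
        # recomputing LAHV moves the node on the ring.
        bisect.insort(self._ring, (h32(f"{unit}:{lahv:.2f}"), unit))

    def route(self, packet_id: str) -> str:
        pos = h32(packet_id)                 # TaskHash = HashFunction(PacketIdentifier)
        keys = [p for p, _ in self._ring]
        i = bisect.bisect_right(keys, pos) % len(self._ring)  # clockwise walk
        return self._ring[i][1]
```

Routing is deterministic for a fixed ring: the same packet identifier always lands on the same unit until the ring is rebuilt.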
The beneficial effects of the invention include:
The efficient RDMA protocol parsing and forwarding method based on the RISC-V architecture provides an effective solution to the performance bottleneck caused by limited CPU performance in current high-frequency, high-volume data processing. By introducing a hardware acceleration module and a customized RISC-V instruction set, the method accelerates protocol parsing and verification in hardware, greatly improving parsing speed and reducing CPU load. During protocol parsing of a data packet, the scheme uses the flexibility and customizability of the RISC-V architecture to optimize the instruction set for the specific requirements of the RDMA protocol, efficiently extracting packet header information and performing CRC (cyclic redundancy check). Meanwhile, through dynamic traffic scheduling and load balancing, the hardware scheduling module monitors the load of each computing unit in real time and intelligently adjusts traffic distribution, avoiding overload of any single computing node and ensuring efficient packet forwarding. In addition, based on the customized processing capability of the RISC-V architecture, the forwarding path of a data packet can be flexibly adjusted according to real-time network topology and routing information, improving the adaptability and processing capacity of the whole network and effectively solving the problem that traditional processing schemes cannot meet performance requirements under high traffic and high concurrency.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a block diagram of overall steps provided in an embodiment of the present invention.
Detailed Description
In order to make the technical problems, technical schemes and beneficial effects to be solved more clear, the application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
Referring to fig. 1, a preferred embodiment of the present invention will be further described;
The efficient RDMA protocol parsing and forwarding method based on the RISC-V architecture includes the following steps:
In step 1, a data packet acquired by an edge node reaches a local computing node through an RDMA network interface, for example, a large number of sensors are installed in an intelligent substation to monitor parameters such as current, voltage, power and the like. These data are transmitted to the central control system via the RDMA protocol. The control system monitors abnormal current rise in a certain area in real time, immediately sends a control command to adjust the load of the area, and starts the standby power supply. At the same time, the system will also transmit real-time data to the dispatcher via RDMA protocol for further analysis and processing.
Step 2, analyzing and checking RDMA protocol of the data packet through a hardware acceleration module, extracting head information through a customized RISC-V instruction set and performing CRC check;
Data packet receiving and preliminary analysis:
After the packet reaches the local compute node over the RDMA network interface, it is passed directly to a hardware acceleration module (e.g., dedicated RDMA network interface card, DMA controller, etc.) for preliminary processing.
Packet format: the basic format of the packet includes a protocol header, a data payload, and a trailer (typically containing verification information).
The protocol header contains key information such as the source address, target address, protocol type, and data length.
Data payload: carries the transmitted data content.
Trailer: contains the CRC check information of the data packet.
DMA buffering: after a data packet is received by a hardware interface (e.g., an RDMA NIC), the data is first stored in a temporary DMA buffer for further processing.
Customized RISC-V instruction set and protocol resolution:
With the customized RISC-V instruction set, the hardware acceleration module is able to efficiently extract protocol header information in RDMA packets and perform CRC checks. The specific implementation process is as follows:
Protocol header parsing: the customized RISC-V instruction set can directly access the memory location of the data packet and quickly parse each field of the protocol header. A typical parsing sequence is:
Source address resolution: extract the source address field in the packet, typically a 16-byte IPv6 address or a 4-byte IPv4 address.
Target address resolution: extract the target address field.
Protocol type parsing: extract the protocol type field (e.g., RDMA write, RDMA read, RDMA send).
Data length parsing: extract the length information of the data payload.
Custom instructions: for example, in RISC-V, a special instruction may be designed to load the source address directly from the packet buffer:
ld r1, [r0]  // load 8 bytes of the source address from memory address r0 into register r1 (a 16-byte IPv6 address requires two such loads);
CRC check-CRC check of a data packet is an important step to ensure data integrity and correctness. The RISC-V instruction set can customize a group of instructions to quickly calculate CRC check values, reducing CPU processing overhead.
Assuming that the CRC check is based on a standard CRC-32 (a common 32-bit CRC algorithm), the calculation formula is as follows:
wherein D = d_0, d_1, …, d_{n-1} represents the content of the data packet and the CRC is calculated bit by bit; InitialCRC represents the CRC initial value;
The CRC value is initialized to 0xFFFFFFFF; each byte of the data packet is XORed in, and the value is shifted and updated by table lookup; after the calculation the CRC value is inverted and compared with the check value in the CRC check information: if they match, the check passes; otherwise, it fails;
Custom instructions may accomplish this by specialized hardware modules. Using hardware-accelerated CRC computation, efficiency can be significantly improved. For example, there may be the following CRC registers and look-up table modules in the hardware module:
CRC32 r2, r1, 0xFFFFFFFF  // calculate the CRC32 check value using the hardware acceleration module and return it in register r2 for further processing after packet parsing
After protocol parsing and CRC checking are completed, the information of the data packet may be used for further stream processing. The key of this process is to extract and transfer the parsed information to the flow module, so as to further determine the forwarding path of the data packet according to the target information.
Extracted information:
Protocol header field (Source Address, destination Address, protocol type, etc.)
CRC check result (whether or not the check is passed)
Error handling-if the CRC check fails (i.e., the integrity of the packet is not passed), the packet is discarded, preventing erroneous data from entering the subsequent processing chain.
RISC-V custom instruction example:
Data load instruction:
ld r1, [r0]  // load 8 bytes of the source address from memory into register r1
CRC calculation instruction:
CRC32 r2, r1, 0xFFFFFFFF  // calculate the CRC32 check of the data in register r1, with initial value 0xFFFFFFFF;
The design concept of the customized RISC-V instruction set is as follows:
the customized instruction set typically includes hardware level optimizations for specific operations, such as data fetches, check computations, etc., which are critical to the processing of the RDMA protocol. We will design the custom instruction set by:
data extraction instructions-efficient extraction of protocol header fields from RDMA packets.
CRC check instruction-speed up CRC calculation to ensure data integrity.
Byte order and memory operation instruction, simplified byte order conversion and memory access.
Data extraction instruction design:
In RDMA packets, extraction of header information is a key step in the parsing process. The header of a packet typically contains fields for source address, destination address, protocol type, etc. The purpose of the custom instruction is to access the memory directly and efficiently fetch the fields in a hardware-accelerated manner.
Source address fetch instruction:
In the RDMA protocol, the source address is typically a 16-byte (IPv6) or 4-byte (IPv4) field. A custom instruction is needed to quickly extract the address information.
For example, assuming that the source address is stored in a location in the packet buffer, the instruction set may include the following:
ld64 r1, [r0]  // from address r0, load 64 bits (8 bytes) of data into register r1;
Here, r0 holds the starting memory location of the source address. With the ld64 instruction, 64 bits of data can be loaded into register r1 at a time. If a complete source address needs to be extracted, the load can be performed in multiple steps; for an IPv6 address, two ld64 operations are required to extract the full 16-byte source address.
ld128: if the source address is IPv6, a wider load instruction (e.g., ld128) can be used to load 128 bits of data at a time.
Target address fetch instruction:
The extraction of the destination address is similar to the source address, and the customized instruction can also improve efficiency by optimizing memory loading, assuming that the destination address is also stored in a location in the packet:
ld64 r2, [r0+16]  // load 64 bits of data into r2 (target address) from a position offset 16 bytes from the r0 address;
This instruction begins loading the first 8 bytes of the target address at a 16-byte offset past the source address. If the target address is 4 bytes (an IPv4 address), only 32 bits of data need to be loaded, and the instruction can be optimized accordingly.
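In software, the two 64-bit loads per address can be mimicked with struct reads at offsets 0 and 16, as a sketch; the offsets follow the text, while the function name and buffer layout are illustrative assumptions.

```python
import struct

def extract_ipv6_addresses(buf: bytes):
    # Two 64-bit reads per address mirror the paired ld64 loads described above;
    # the target address sits at a 16-byte offset from the source address.
    src_hi, src_lo = struct.unpack_from(">QQ", buf, 0)
    dst_hi, dst_lo = struct.unpack_from(">QQ", buf, 16)
    src = (src_hi << 64) | src_lo
    dst = (dst_hi << 64) | dst_lo
    return src, dst
```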
CRC check instruction design:
the CRC check is an indispensable part of the packet check, especially in network protocols, where it is important to guarantee the integrity of the data. To speed up the CRC check, special hardware acceleration instructions may be designed.
CRC-32 calculation:
CRC-32 is a commonly used 32-bit check algorithm, widely used in network transmission protocols. To speed up CRC computation, a hardware-accelerated CRC computation instruction may be designed:
wherein D = d_0, d_1, …, d_{n-1} represents the content of the data packet and the CRC is calculated bit by bit; InitialCRC represents the CRC initial value;
In the case of hardware acceleration, the computation of the CRC may be performed quickly by a table look-up method. We assume that a hardware CRC module is designed that handles CRC-32 computation exclusively and calls the module directly through instructions.
The instruction format is: CRC32 r2, r1, 0xFFFFFFFF  // calculate the CRC32 check of the data in register r1, with initial value 0xFFFFFFFF;
The working mode of the instruction is as follows:
r1 holds the data to be verified (the payload or header of the data packet).
0xFFFFFFFF is the initial value of the CRC check, typically all 1s.
r2 holds the final CRC check value.
The hardware acceleration module performs CRC calculation based on a table look-up method, and the table look-up method greatly improves the speed of CRC calculation. After each CRC calculation, the result may be stored in a register for use in subsequent processing.
CRC instruction design concept:
and calculating a check value, namely directly performing CRC calculation in hardware, and accelerating the CRC process by adopting a quick table look-up method.
And the data processing pipeline is used for transmitting the data stream through the pipeline, so that a plurality of data blocks can calculate CRC (cyclic redundancy check) in parallel, and the throughput is improved.
Endian and memory access optimization:
in network protocol processing, the problem of byte order is a common challenge. To optimize both the endian translation and memory access, we can design specific endian translation instructions.
A byte order conversion instruction:
Network protocols typically use big-endian byte order (network order), but many processors use little-endian. To address the endianness conversion problem, a hardware-supported byte-order conversion instruction is designed:
bswap r1, r2  // reverse the byte order of the data in r2 and store the result in r1;
The instruction reverses the byte order of the data stored in r2 and stores it in r1, for example converting 32-bit data from little-endian to big-endian.
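A Python model of the bswap operation, for illustration of what the instruction computes:

```python
def bswap32(x: int) -> int:
    # Reverse the byte order of a 32-bit word (little-endian <-> big-endian),
    # matching the effect of the bswap instruction described above.
    return int.from_bytes((x & 0xFFFFFFFF).to_bytes(4, "little"), "big")
```

Byte swapping is its own inverse, so applying it twice returns the original word.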
Step 3, the parsed and verified data packet is transmitted to a circulation module according to the target information, and the local computing node determines a forwarding path of the data packet according to the routing table and the topology information;
In step 2, the header of the data packet has been extracted and CRC checked by the hardware acceleration module. The destination information (e.g., destination IP address or destination MAC address) of the packet is extracted and stored in a register or memory. At this point, the destination information of the packet will be the input of the routing decision.
Suppose that the target address is stored in the r1 register and has already been resolved. The target information may include:
A target IP address (IPv 4 or IPv 6);
Destination MAC address (if data link layer forwarding);
Type of service (e.g., QoS, priority), etc.
Routing table lookup and path computation:
in order to achieve efficient forwarding of data packets, the system first needs to look up a local routing table. The routing table stores the mapping relation between each target address and the next hop route. The lookup of the routing table is typically accomplished by a longest prefix match (Longest Prefix Match, LPM) algorithm.
The process of routing table lookup is as follows:
And matching the target IP address with each item in the routing table, and selecting the item matched with the longest prefix.
Next hop information (e.g., IP address, egress interface, etc. of the next hop router) of the matching entry is obtained.
And determining a forwarding path of the data packet according to the matching result.
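The three lookup steps above can be sketched in Python using the standard ipaddress module; the routing-table shape (prefix string mapped to a next hop) is an illustrative assumption, and a real data plane would use the Trie described later rather than a linear scan.

```python
import ipaddress

def lpm(routing_table: dict, dst: str):
    # Match the target address against every entry and keep the entry
    # with the longest matching prefix (LPM).
    addr = ipaddress.ip_address(dst)
    best = None
    for prefix, next_hop in routing_table.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0]):
            best = (net.prefixlen, next_hop)
    return best[1] if best else None   # next-hop info of the matched entry
```

A default route (0.0.0.0/0) matches everything but loses to any more specific prefix.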
To accelerate this process, hardware acceleration may be performed using a custom RISC-V instruction set.
Custom RISC-V instruction acceleration routing table lookup
In the RISC-V architecture, to accelerate the routing table lookup process, specific instructions or hardware acceleration modules may be designed to optimize the LPM lookup. For example, a hardware Trie or hash table may be used to store routing information to enable fast lookups.
Custom instructions:
LPM lookup instruction assume we use a hardware-supported LPM lookup module to accelerate the prefix matching process by customizing the instruction:
lpm_lookup r2, r1, routing_table  // find the target address in r1 in routing_table and store the result in r2;
Through the hardware module, this instruction performs longest prefix matching in routing_table, quickly obtains the routing information corresponding to the target address (such as the next-hop router's IP address and outbound interface), and stores the lookup result in register r2.
Routing information is stored by a hardware-supported Trie tree (prefix tree) and prefix matching (LPM) is performed in conjunction with custom RISC-V instructions. The Trie is essentially an efficient lookup based on prefix matching, and is suitable for use in the Longest Prefix Matching (LPM) algorithm in the IP routing table. We will realize efficient prefix matching based on the hardware-accelerated Trie structure and further optimize the LPM lookup process.
Basic structure of Trie:
Trie is a dictionary tree based data structure with each node representing a possible prefix and representing a different path through branches. To accommodate the Longest Prefix Match (LPM) of an IP address, each node of the Trie may hold a binary bit, representing a 0 or 1 of a certain bit.
For an IPv4 address, a 32-bit IP address may be stored by a Trie. Each node represents a binary bit (0 or 1) of the IP address. The binary value consisting of 0 and 1 on the path from the root node to a certain leaf node represents an IP prefix.
In a hardware implementation, the Trie may be represented as a fixed-size array of memory, with each node corresponding to a location in the array, and the branches (children) of the node may be implemented by hardware pointers or indexes. Each node contains the following information:
branch pointers (i.e., pointers to the left child node or right child node, representing 0 or 1);
the result of the match is whether the node is the endpoint of a certain prefix.
Routing information, such as next hop address, interface information, etc.
Since the Trie is a data structure arranged in a prefix hierarchy, the query procedure is a procedure of sequentially searching each binary bit. Hardware may speed up this lookup process by parallelizing instructions and speeding up storage mechanisms.
Customizing RISC-V instructions and a hardware acceleration module:
Assuming that the custom instruction lpm_lookup supported by the RISC-V architecture can directly call the hardware accelerated Trie module, the LPM lookup is accelerated by the instruction, which is a technical scheme combining hardware and the instruction:
custom instruction lpm_lookup
Through the lpm_lookup instruction provided by the hardware acceleration module, the RISC-V processor can directly interact with the hardware Trie to perform prefix matching search;
The instruction format is: lpm_lookup r2, r1, routing_trie, where r1 holds the input target IP address, routing_trie is the Trie structure storing the routing table, and r2 receives the matched routing information.
the execution flow of the instruction is as follows:
The input destination IP address (e.g., the 32 bits of an IPv4 address) is stored in register r1.
The hardware Trie module reads the IP address from r1 and, starting from the root node of the Trie, searches bit by bit.
After the longest matching prefix of the target address is found, the associated routing information (such as the next-hop address and outgoing interface) is output and stored in register r2.
At the hardware level, the Trie search process may be parallelized to increase the search speed. For example, multiple binary bits are scanned simultaneously using a multi-channel parallel scanning technique.
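The execution flow above can be modelled in software as follows; the nested-dict Trie is an illustrative stand-in for the hardware Trie storage unit, and the function arguments mirror the register semantics (r1_ip plays the role of r1, the return value that of r2).

```python
# Software model of what the lpm_lookup instruction computes: walk the Trie
# from the most significant bit, remember the deepest valid prefix seen, and
# return its routing information.

def insert_route(trie, addr, prefix_len, route):
    """Install a prefix into the nested-dict Trie (keys 0/1 are branches)."""
    node = trie
    for i in range(prefix_len):
        node = node.setdefault((addr >> (31 - i)) & 1, {})
    node['route'] = route

def lpm_lookup(r1_ip, routing_trie):
    """r1_ip plays the role of register r1; the return value, of r2."""
    node, best = routing_trie, None
    for i in range(32):
        if 'route' in node:            # remember the deepest prefix endpoint
            best = node['route']
        bit = (r1_ip >> (31 - i)) & 1
        if bit not in node:            # no further branch: stop the walk
            return best
        node = node[bit]
    return node.get('route', best)     # full 32-bit match
```

Because the deepest valid endpoint seen so far is retained while walking, the value returned is always the routing information of the longest matching prefix.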
And (3) designing a hardware module:
The hardware Trie module may consist of the following key parts:
an address input interface, which receives the target IP address and decomposes it into a series of binary bits;
a Trie storage unit, which stores the node information of the Trie in hardware, including the child-node pointers (the 0 and 1 branches) of each node;
a parallel search engine, which accelerates the lookup with parallel scanning, speeding up longest prefix matching through multi-bit parallel search;
an output interface, which returns the matched routing information (such as the next-hop address, outgoing interface, and service type) through the r2 register.
The query process comprises the following steps:
in the query process, the hardware performs the LPM lookup by:
the input target IP address (e.g., 32 bits of IPv 4) is transferred to the Trie bit by bit.
Starting from the root node of the Trie, matching is performed for each bit (from the most significant bit to the least significant bit) of the target IP address:
if the current bit of the target IP is 0, continuing searching along the left subtree (the pointer is 0);
if the current bit of the target IP is 1, the lookup continues along the right subtree (pointer 1).
Each time a node is queried, the routing information for that node is recorded (if the node is a valid prefix termination point).
And after all the bits are queried, returning the routing information corresponding to the longest matching prefix.
Traditional LPM lookup proceeds bit by bit, but in hardware multiple bits can be examined simultaneously using parallel techniques, greatly increasing lookup speed. For example, at each level of the Trie several bits can be searched at once, and because the Trie lookup is carried out by a dedicated hardware circuit, the repeated memory-access delays of a software implementation are eliminated, substantially improving performance. The hardware acceleration module provides low-latency lookup and supports highly concurrent requests; moreover, the hardware Trie has a fixed storage layout, avoiding the dynamic memory allocation and complex data-structure operations of a traditional software implementation.
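The multi-bit idea can be illustrated in software with a stride-based Trie, in which each level consumes several bits at once so that a 32-bit lookup takes 32/k steps instead of 32 (the hardware scans the k bits of a level in parallel). The stride of 4, the prefix-expansion scheme, and all names below are assumptions for illustration, not the patented circuit.

```python
# Stride-4 multibit Trie: each level consumes 4 bits, so an IPv4 lookup takes
# 8 steps instead of 32. Prefixes that do not end on a stride boundary are
# expanded into all matching 4-bit chunks (a standard multibit-trie technique).

STRIDE = 4

def insert_expanded(trie, addr, plen, route):
    full_levels, rem = divmod(plen, STRIDE)
    node = trie
    for lvl in range(full_levels):
        chunk = (addr >> (32 - STRIDE * (lvl + 1))) & (2 ** STRIDE - 1)
        node = node.setdefault(chunk, {})
    if rem == 0:
        node['route'] = (plen, route)       # ends exactly on a boundary
        return
    # expand: every chunk sharing the top `rem` bits matches this prefix
    base = (addr >> (32 - STRIDE * (full_levels + 1))) & (2 ** STRIDE - 1)
    base &= ~(2 ** (STRIDE - rem) - 1)
    for fill in range(2 ** (STRIDE - rem)):
        child = node.setdefault(base | fill, {})
        old = child.get('route')
        if old is None or old[0] < plen:    # keep the longest prefix
            child['route'] = (plen, route)

def lookup_stride(trie, ip):
    node, best = trie, None
    for lvl in range(32 // STRIDE):         # 8 steps instead of 32
        if 'route' in node:
            best = node['route'][1]
        chunk = (ip >> (32 - STRIDE * (lvl + 1))) & (2 ** STRIDE - 1)
        if chunk not in node:
            return best
        node = node[chunk]
    return node['route'][1] if 'route' in node else best
```

The trade-off is classic for multibit tries: fewer lookup steps in exchange for expanded storage at non-boundary prefix lengths.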
Step 4, based on the determined forwarding path, the data packet is forwarded to a target processing unit or a network interface, wherein if the forwarding target is a local computing node, the data packet is stored in a cache;
Step 5, during data packet forwarding, the hardware scheduling module dynamically adjusts traffic distribution according to the current load condition and controls the processing load of each processing unit to avoid overload.
Said step 5 comprises the steps of:
The hardware scheduling module acquires real-time load data of each processing unit by periodically monitoring the current load condition of each processing unit, wherein the real-time load data comprises the current queue length, the CPU utilization rate, the memory utilization rate and the network bandwidth utilization rate;
Threshold judgment: if the load of a processing unit exceeds a preset upper threshold, the unit is regarded as overloaded and its traffic needs to be reassigned;
if the load is below a preset lower threshold, the processing unit is regarded as idle;
Task migration: data packets are migrated from overloaded processing units to idle processing units via a hash algorithm;
Flow segmentation: the hardware scheduling module distributes traffic according to the load condition of each processing unit to achieve load balancing:
Pout,i = Pin × (Lmax - Li) / Σj=1..N (Lmax - Lj);
where Pin denotes the number of packets to be forwarded, Pout,i denotes the number of packets allocated to the ith processing unit, Li denotes the load value of the ith processing unit, Lmax denotes the maximum load value, and N denotes the number of processing units.
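Consistent with the variable definitions given here, the flow-segmentation rule allocates each unit a share of the incoming packets in proportion to its spare capacity (Lmax - Li). The following sketch assumes that rule; the handling of fully saturated units and the rounding of fractional shares to whole packets are our own assumptions.

```python
# Load-proportional traffic split: unit i receives Pin * (Lmax - Li) / sum of
# spare capacities. Fractional shares are rounded down and the leftover
# packets go to the units with the largest fractional parts.

def split_traffic(p_in, loads, l_max):
    spare = [l_max - l for l in loads]
    total = sum(spare)
    if total == 0:                      # all units saturated: split evenly
        return [p_in // len(loads)] * len(loads)
    shares = [p_in * s / total for s in spare]
    out = [int(s) for s in shares]
    leftover = p_in - sum(out)
    # hand leftover packets to the units with the largest fractional parts
    for i in sorted(range(len(out)),
                    key=lambda i: shares[i] - out[i],
                    reverse=True)[:leftover]:
        out[i] += 1
    return out
```

Lightly loaded units thus absorb more of the incoming traffic, steering the system away from the overload condition described in the threshold-judgment step.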
Preferably, the step of implementing task migration by the hash algorithm is as follows:
Load sensing value calculation: a load sensing value is calculated for each processing unit:
LAHVi = w1×CPUi + w2×QueueLengthi + w3×Memoryi + w4×Bandwidthi;
where LAHVi denotes the load sensing value of the ith processing unit, CPUi denotes its CPU utilization, QueueLengthi denotes its queue length, Memoryi denotes its memory utilization, Bandwidthi denotes its bandwidth utilization, and w1, w2, w3, w4 denote the respective weight coefficients;
Hash value calculation: a hash function is applied to the load sensing value of each processing unit, mapping it to a position on the hash ring:
Hashi=HashFunction(LAHVi);
Wherein Hashi represents the Hash value of the ith processing unit;
Ring construction, mapping each processing unit to a location of the hash ring;
Calculating a hash value of the data packet by using a certain identifier of the data packet:
TaskHash=HashFunction(PacketIdentifier);
where TaskHash denotes the hash value of the data packet and PacketIdentifier denotes the identifier of the data packet;
Searching for a processing unit on the hash ring: starting from the packet's hash value, the ring is searched clockwise for the next processing unit; according to the hash values, the matching processing unit with the lightest load is selected from the ring for task migration;
Updating: after the data packet is migrated to the target processing unit, the load sensing value of the target unit is updated, and node positions on the hash ring are dynamically adjusted by periodically recalculating the load sensing value of each processing unit.
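The migration steps above can be sketched as a load-aware hash ring. The MD5 hash (the patent does not specify HashFunction), the weight values, and all function names below are illustrative assumptions.

```python
# Load-aware hash ring: each unit's ring position is derived from its load
# sensing value (LAHV), and a packet is assigned to the first unit clockwise
# from the packet's hash. Rebuilding the ring as LAHVs change realizes the
# periodic dynamic adjustment described in the text.

import hashlib
from bisect import bisect_right

W = (0.4, 0.3, 0.2, 0.1)   # weight coefficients w1..w4 (assumed values)
RING = 2 ** 32             # size of the hash ring

def hash_fn(value):
    """Stand-in for HashFunction: MD5 reduced to a ring position."""
    return int(hashlib.md5(str(value).encode()).hexdigest(), 16) % RING

def lahv(cpu, queue_len, memory, bandwidth):
    """LAHV = w1*CPU + w2*QueueLength + w3*Memory + w4*Bandwidth."""
    return W[0] * cpu + W[1] * queue_len + W[2] * memory + W[3] * bandwidth

def build_ring(units):
    """units: {name: (cpu, queue_len, memory, bandwidth)} -> sorted ring."""
    return sorted((hash_fn((name, lahv(*m))), name) for name, m in units.items())

def assign(ring, packet_identifier):
    """First processing unit clockwise from TaskHash (wrapping to the start)."""
    task_hash = hash_fn(packet_identifier)
    idx = bisect_right([pos for pos, _ in ring], task_hash) % len(ring)
    return ring[idx][1]
```

Because a unit's ring position depends on its LAHV, recomputing the values and rebuilding the ring periodically shifts traffic away from units whose load has grown.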
The flow distribution of the data packet is dynamically adjusted by monitoring the load condition of each processing unit in real time, so that the system can still stably operate under the high load condition. By adopting a load balancing algorithm, task migration and flow segmentation strategy, overload phenomenon can be effectively avoided, and the performance and stability of the system are maintained.
The foregoing description of the preferred embodiments of the application is not intended to be limiting, but rather is intended to cover all modifications, equivalents, and alternatives falling within the spirit and principles of the application.

Claims (7)

CN202510053378.6A, filed 2025-01-14: Efficient parsing and forwarding method of RDMA protocol based on RISC-V architecture (pending), CN119996539A (en)

Publication number CN119996539A, published 2025-05-13.


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
