CN120179572B - Small IO copy-free data communication method based on RDMA - Google Patents

Small IO copy-free data communication method based on RDMA

Info

Publication number
CN120179572B
Authority
CN
China
Prior art keywords
layer
storage system
client
address
user data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202510650536.6A
Other languages
Chinese (zh)
Other versions
CN120179572A (en)
Inventor
汤鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongdian Cloud Computing Technology Co ltd
Original Assignee
Zhongdian Cloud Computing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongdian Cloud Computing Technology Co ltd
Priority to CN202510650536.6A
Publication of CN120179572A
Application granted
Publication of CN120179572B
Legal status: Active
Anticipated expiration

Abstract

The application provides an RDMA-based small IO copy-free data communication method. The source address of a small IO write service and the target address of a small IO read service are designated from a memory pool managed by the client upper-layer application, with enough free space reserved in front of each address. For a write, the storage system client fills the message header and control field of the write request message directly into the reserved space in front of the source address. For a read, the client temporarily stores matching information in a preset data structure before sending the read request message, and sends address information and a search key for the matching information to the server along with the read request message; the server, by means of RDMA WRITE, fills the message header and control field of the read reply message into the reserved space in front of the target address and the user data to be read behind the target address, and the client learns from the search key that the user data is ready. The method avoids repeated copying of user data in small IO data communication and can significantly improve its performance.

Description

RDMA-based small IO copy-free data communication method
Technical Field
The application relates to the technical field of distributed storage, in particular to a small IO copy-free data communication method based on RDMA.
Background
In today's high-performance distributed storage systems, RDMA (Remote Direct Memory Access) is increasingly used for its three performance advantages: zero copy, kernel bypass, and CPU offload. RDMA provides four communication modes: send, recv, read, and write. Send/recv is typically used in pairs for small IO communication, while read and write are used for large IO communication, and the three advantages show up mainly as performance gains of read and write on large IO. In small IO scenarios, however, some current mainstream RDMA-based distributed storage systems fail to reach the full performance RDMA can offer because of inefficient designs on the data path. These inefficiencies mainly take two forms:
(1) Because the IO data flow is designed without a global view, the IO data path contains multiple user-space memory copies. These copies consume a non-negligible amount of CPU and reduce small IO throughput;
(2) Small IO data communication is limited to send/recv, so the three technical advantages of RDMA are not fully exploited and small IO performance falls short of the theoretical ideal.
Disclosure of Invention
The application provides a small IO copy-free data communication method based on RDMA, which can solve the technical problem of poor performance of small IO data communication in the prior art.
In a first aspect, an embodiment of the present application provides a small IO copy-free data communication method based on RDMA, where the small IO copy-free data communication method includes:
The upper layer application of the client side issues a small IO write service to the client side of the storage system and designates a source address, wherein the source address is positioned in a memory pool managed by the upper layer application of the client side, enough free space is reserved in front of the source address, and first user data to be written into a server side of the storage system is stored in the rear of the source address;
the storage system client fills a first message header and a first control field between a first starting address and a source address, and sends the first message header, the first control field and first user data to the storage system server as a write request message, wherein the offset between the first starting address and the source address is equal to the length of the first message header and the first control field;
and the storage system server side executes the writing operation of the first user data according to the writing request message.
Further, in an embodiment, the storage system client includes a storage adaptation layer, an RPC layer, and a communication layer, and the first message header includes a communication layer header, an RPC layer header, and a storage adaptation layer header;
For the storage system client, the step of filling the first message header and the first control field between the first start address and the source address, and sending the first message header, the first control field and the first user data as a write request message to the storage system server includes:
the storage adaptation layer calls a Forward interface of the RPC layer;
the RPC layer performs serialization through a callback function provided by the storage adaptation layer, fills the RPC layer header, the storage adaptation layer header and the first control field starting from a second start address, takes the RPC layer header, the storage adaptation layer header, the first control field and the first user data as an RPC message body, and calls a sending interface of the communication layer, wherein the offset between the second start address and the source address is equal to the total length of the RPC layer header, the storage adaptation layer header and the first control field;
the communication layer fills the communication layer header starting from the first start address, takes the communication layer header and the RPC message body as the write request message, and calls the underlying network interface to send the write request message to the storage system server, wherein the offset between the first start address and the second start address is equal to the length of the communication layer header.
Further, in an embodiment, the small IO copy-free data communication method further includes:
the upper layer application of the client side issues a small IO read service to the client side of the storage system and designates a target address, wherein the target address is positioned in a memory pool managed by the upper layer application of the client side, enough free space is reserved in front of the target address, and the rear of the target address is used for storing second user data read from the server side of the storage system;
The storage system client adds matching information in a preset data structure, and sends address information and a search key of the matching information to the storage system server along with a read request message;
the storage system server obtains the second user data according to the read request message; if the read request message contains the address information and the search key, the server determines a third start address according to the address information, takes a second message header, a second control field and the second user data as a read reply message, writes the read reply message into the memory pool managed by the client upper-layer application starting from the third start address by means of RDMA WRITE, and sends the search key to the storage system client, wherein the offset between the third start address and the target address is equal to the total length of the second message header and the second control field;
the storage system client looks up the matching information in the preset data structure according to the search key and notifies the client upper-layer application that the second user data is ready;
the storage system client reads the second user data from the target address.
Further, in an embodiment, the source address of the small IO write service and the destination address of the small IO read service share the same memory pool.
In a second aspect, an embodiment of the present application further provides a small IO copy-free data communication method based on RDMA, where the small IO copy-free data communication method includes:
the upper layer application of the client side issues a small IO read service to the client side of the storage system and designates a target address, wherein the target address is positioned in a memory pool managed by the upper layer application of the client side, enough free space is reserved in front of the target address, and the rear of the target address is used for storing second user data read from the server side of the storage system;
The storage system client adds matching information in a preset data structure, and sends address information and a search key of the matching information to the storage system server along with a read request message;
the storage system server obtains the second user data according to the read request message; if the read request message contains the address information and the search key, the server determines a third start address according to the address information, takes a second message header, a second control field and the second user data as a read reply message, writes the read reply message into the memory pool managed by the client upper-layer application starting from the third start address by means of RDMA WRITE, and sends the search key to the storage system client, wherein the offset between the third start address and the target address is equal to the total length of the second message header and the second control field;
the storage system client looks up the matching information in the preset data structure according to the search key and notifies the client upper-layer application that the second user data is ready;
the storage system client reads the second user data from the target address.
Further, in an embodiment, for the storage system server, the step of writing the read reply message from the third start address to the memory pool managed by the upper layer application of the client by means of RDMA WRITE, and sending the search key to the storage system client includes:
the search key is used as immediate data, and the read reply message is written into the memory pool managed by the client upper-layer application starting from the third start address by means of RDMA WRITE with immediate.
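As an illustration of carrying the search key as immediate data, the following minimal Python sketch encodes a key into four bytes and decodes it again. Immediate data in RDMA verbs is a 32-bit value, so the key has to fit in that range. The helper names `key_to_imm` and `imm_to_key` and the choice of network byte order are assumptions made for this sketch, not details given in the application.

```python
import struct

IMM_MAX = 0xFFFFFFFF  # immediate data in RDMA verbs is a 32-bit value


def key_to_imm(search_key):
    """Encode a search key as 4 bytes in network byte order (assumed convention)."""
    if not 0 <= search_key <= IMM_MAX:
        raise ValueError("search key does not fit in a 32-bit immediate")
    return struct.pack("!I", search_key)


def imm_to_key(imm_bytes):
    """Decode the 4-byte immediate back into the search key on the client."""
    (search_key,) = struct.unpack("!I", imm_bytes)
    return search_key


imm = key_to_imm(0x0102)
```

The 32-bit limit suggests why an index or a compact key is a natural choice for the search key.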
Further, in an embodiment, the storage system client includes a storage adaptation layer, an RPC layer, and a communication layer, the second message header includes an RPC layer header and a storage adaptation layer header, and the preset data structure is located in the RPC layer;
For the client of the storage system, the step of searching the matching information from the preset data structure according to the search key and informing the upper layer of the client that the second user data is ready comprises the following steps:
the communication layer reports the search key to the RPC layer;
the RPC layer looks up the matching information in the preset data structure according to the search key, performs deserialization through a callback function provided by the storage adaptation layer, and notifies the storage adaptation layer that the second user data is ready;
the storage adaptation layer notifies the client upper-layer application that the second user data is ready.
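The notification path of this variant, in which the preset data structure lives in the RPC layer, can be sketched as three small Python classes. All class names, method names and the shape of the matching information are hypothetical and chosen only to mirror the three steps above.

```python
class StorageAdaptationLayer:
    def __init__(self, app_callback):
        self.app_callback = app_callback

    def deserialize(self, match):
        """Callback handed to the RPC layer; here it only extracts the target address."""
        return match["target_addr"]

    def on_data_ready(self, target_addr):
        # Notify the client upper-layer application that the data is ready.
        self.app_callback(target_addr)


class RpcLayer:
    def __init__(self, adapt):
        self.adapt = adapt
        self.pending = {}  # preset data structure, located in the RPC layer

    def on_search_key(self, key):
        match = self.pending.pop(key)  # look up and remove the matching information
        self.adapt.on_data_ready(self.adapt.deserialize(match))


class CommunicationLayer:
    def __init__(self, rpc):
        self.rpc = rpc

    def on_write_with_imm(self, key):
        # Report the search key upward; no payload is copied at any layer.
        self.rpc.on_search_key(key)


notified = []
rpc = RpcLayer(StorageAdaptationLayer(notified.append))
rpc.pending[9] = {"target_addr": 2048}
CommunicationLayer(rpc).on_write_with_imm(9)
```

Only the key and the small matching record travel up the layers; the user data itself is already at the target address.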
Further, in an embodiment, the storage system client includes a storage adaptation layer, an RPC layer, and a communication layer, the second message header includes an RPC layer header and a storage adaptation layer header, and the preset data structure is located in the communication layer;
For the client of the storage system, the step of searching the matching information from the preset data structure according to the search key and informing the upper layer of the client that the second user data is ready comprises the following steps:
The communication layer searches matching information from a preset data structure according to the search key, and reports the matching information to the RPC layer;
The RPC layer performs deserialization through a callback function provided by the storage adaptation layer, and informs the storage adaptation layer that second user data is ready;
the storage adaptation layer notifies the client upper-layer application that the second user data is ready.
Further, in an embodiment, when the preset data structure is a stack, the search key is an index;
when the preset data structure is a hash table, the search key is a key.
Further, in an embodiment, the small IO copy-free data communication method further includes:
The upper layer application of the client side issues a small IO write service to the client side of the storage system and designates a source address, wherein the source address is positioned in a memory pool managed by the upper layer application of the client side, enough free space is reserved in front of the source address, and first user data to be written into a server side of the storage system is stored in the rear of the source address;
the storage system client fills a first message header and a first control field between a first starting address and a source address, and sends the first message header, the first control field and first user data to the storage system server as a write request message, wherein the offset between the first starting address and the source address is equal to the length of the first message header and the first control field;
and the storage system server side executes the writing operation of the first user data according to the writing request message.
In the application, for small IO write service, a source address is designated from a memory pool managed by an upper application of a client, enough free space is reserved in front of the source address, a message head and a control field of a write request message are directly filled in the reserved space in front of the source address by a storage system client, and the storage system client does not need to copy user data to be written.
For small IO read service, a target address is designated from a memory pool managed by an upper layer application of a client, enough free space is reserved in front of the target address, a storage system client temporarily stores matching information by using a preset data structure before sending a read request message, address information and a search key of the matching information are sent to a storage system server along with the read request message, the storage system server fills a message head and a control field of a read reply message in the reserved space in front of the target address in a RDMA WRITE mode, user data to be read are filled behind the target address, the storage system client knows that the user data is ready by means of the search key, and the storage system client does not need to copy the user data to be read.
By the method and the device, repeated copying of user data is avoided in small IO data communication, and performance of the small IO data communication can be obviously improved.
Drawings
FIG. 1 is a flow chart of a small IO copy-free data communication method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a message structure of a prior-art storage system;
FIG. 3 is a schematic diagram illustrating the operation of a client of a storage system for a small IO write service in the prior art;
FIG. 4 is a schematic diagram of an assembly sending write request message according to an embodiment of the present application;
FIG. 5 is a flow chart of a method for small IO copy-free data communication in accordance with another embodiment of the present application;
FIG. 6 is a schematic diagram illustrating the operation of a prior-art storage system for a small IO read service;
FIG. 7 is a schematic diagram illustrating the operation of the storage system for a small IO read service according to an embodiment of the present application.
Detailed Description
In order that those skilled in the art will better understand the present application, a technical solution in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.
In the prior art, the general flow of a small IO write service is as follows. The storage system client copies the user data to be written to the storage system server (the first user data for short) from the source address designated by the client upper-layer application, assembles a write request message from it, and sends the write request message to the storage system server. After receiving the write request message, the storage system server persists the first user data to disk, assembles a write reply message, and sends it to the storage system client to indicate that the first user data has been written.
The general flow of a small IO read service is as follows. The storage system client assembles a read request message that specifies the user data to be read from the storage system server (the second user data for short) and sends it to the storage system server. After receiving the read request message, the storage system server reads the disk to obtain the second user data, assembles a read reply message carrying the second user data, and sends it to the storage system client. After receiving the read reply message, the storage system client copies the second user data to the target address designated by the client upper-layer application.
In a first aspect, the embodiment of the application performs flow optimization around a small IO write service, and provides a small IO copy-free data communication method based on RDMA.
FIG. 1 is a flow chart of a small IO copy-free data communication method in accordance with an embodiment of the present application.
Referring to fig. 1, in one embodiment, the small IO copy-free data communication method includes the steps of:
s11, the upper application of the client side issues a small IO write service to the client side of the storage system and designates a source address, wherein the source address is positioned in a memory pool managed by the upper application of the client side, enough free space is reserved in front of the source address, and first user data to be written into the server side of the storage system is stored in the rear of the source address.
In both this embodiment and the prior art, the client upper-layer application first writes the first user data behind the source address, then issues the small IO write service to the storage system client and designates the source address. The difference is that in the prior art the source address is a temporarily allocated memory address with no requirement to reserve free space in front of it, whereas in this embodiment the source address uses a memory block in the memory pool managed by the client upper-layer application, which guarantees that enough free space is reserved in front of the source address to hold the data that precedes the first user data in the write request message, namely the first message header and the first control field.
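A memory pool of this kind can be sketched in Python as one contiguous buffer divided into fixed-size blocks, each starting with reserved headroom. The class name `HeadroomPool`, the 1024-byte headroom and the 16384-byte payload limit are assumptions for illustration; the application does not specify the pool layout or the sizes.

```python
HEADROOM = 1024       # assumed upper bound for headers plus control field
PAYLOAD_MAX = 16384   # assumed maximum small IO payload


class HeadroomPool:
    """Fixed-size blocks carved from one contiguous buffer (hypothetical helper)."""

    def __init__(self, block_count):
        self.block_size = HEADROOM + PAYLOAD_MAX
        self.buf = bytearray(block_count * self.block_size)
        self.free = list(range(block_count))

    def alloc(self):
        """Return a source/target address: block start plus the reserved headroom."""
        index = self.free.pop()
        return index * self.block_size + HEADROOM

    def release(self, addr):
        """Return the block that contains addr to the free list."""
        self.free.append((addr - HEADROOM) // self.block_size)


pool = HeadroomPool(block_count=4)
src = pool.alloc()
# The application writes user data behind the source address,
# while HEADROOM bytes in front of it stay free for headers.
pool.buf[src:src + 5] = b"hello"
```

Because the same allocator serves both directions, a source address for a write and a target address for a read can come from the same pool.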
S12, the storage system client fills a first message header and a first control field between a first starting address and a source address, and sends the first message header, the first control field and first user data to the storage system server as a write request message, wherein the offset between the first starting address and the source address is equal to the length of the first message header and the first control field.
Specifically, when assembling the write request message, the prior art writes the first message header and the first control field at another location and then copies the first user data from the source address to behind the first control field. This embodiment instead fills the first message header and the first control field directly into the reserved space in front of the source address, so the first user data stays in place.
It should be noted that the write request message finally assembled in this embodiment is identical to that of the prior art, and the first message header and the first control field still have to be assigned or copied. However, these parts are generally within 1K in length while the first user data can be up to 19K, so copying the first user data is the most expensive part, and this embodiment eliminates that copy on the storage system client.
S13, the storage system server side executes the writing operation of the first user data according to the writing request message.
In this embodiment, the write request message received by the storage system server is the same as the prior art, so that the write operation performed based on the write request message is unchanged.
Therefore, in this embodiment, for the small IO write service, a source address is specified from a memory pool managed by an upper layer application of the client, a sufficient free space is reserved in front of the source address, and the storage system client directly fills a message header and a control field of the write request message in the reserved space in front of the source address, so that the storage system client does not need to copy user data to be written, thereby improving performance of the small IO write service.
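The write path summarized above can be illustrated with a short Python sketch that fills the header and control field into the headroom and then takes a zero-copy view over the whole message. The 64-byte headroom, the field layouts and the checksum are hypothetical placeholders; in a real client the resulting view would correspond to the registered memory range posted to the network interface.

```python
import struct

HEADROOM = 64
buf = bytearray(HEADROOM + 4096)   # one block from the application's pool
view = memoryview(buf)
src = HEADROOM                     # source address designated by the application

payload = b"user-data"
buf[src:src + len(payload)] = payload   # written once by the application

# Hypothetical wire layout: header = (magic, payload length),
# control field = (data pool id, checksum).
header = struct.pack("<IH", 0x52444D41, len(payload))
control = struct.pack("<HI", 7, sum(payload) & 0xFFFFFFFF)

# The first start address lies exactly header + control bytes in front of src.
first_start = src - len(header) - len(control)
buf[first_start:src] = header + control   # fill the reserved space in place

# The message is a slice of the pool itself; the user data is never copied.
message = view[first_start:src + len(payload)]
```

Only the few header and control bytes are written by the client; the payload bytes in `message` are the same bytes the application wrote.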
In particular, since the write reply message does not include user data, the copy operation performance consumption involved in the assembly process is very small, so the embodiment is not limited to how the write reply message is specifically assembled, and the assembly mode in the prior art can be adopted, or the assembly mode similar to the assembly mode of the write request message in the embodiment can be adopted.
Fig. 2 shows a message structure diagram of a storage system in the prior art.
Referring to fig. 2, in the prior art the storage system client and server have similar architectures, each comprising a storage adaptation layer, an RPC (Remote Procedure Call) layer and a communication layer. A message is processed by the storage adaptation layer, the RPC layer and the communication layer in turn during assembly, and by the communication layer, the RPC layer and the storage adaptation layer in turn during parsing. Accordingly, the message structure comprises a communication layer header, an RPC layer header, a storage adaptation layer header, a control field and payload data. The three headers control the related operations of the communication layer, the RPC layer and the storage adaptation layer respectively, and the control field generally contains information such as a data pool id and a checksum.
It should be noted that fig. 2 is merely a general schematic diagram of a message structure, and an actual message structure is more complex, and detailed components thereof are not explained herein.
Specifically, in the write request message, the payload data is the first user data, and in the write reply message, the payload data is used to indicate that the writing is completed.
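To make the layering concrete, the sketch below builds and parses a message with the five parts named above. Every field and size is a hypothetical stand-in, since the actual layouts are not disclosed, and `build` concatenates buffers in the conventional way purely to show the layout, not the copy-free assembly.

```python
import struct

# Hypothetical fixed-size layouts; real headers are more complex.
COMM_HDR = struct.Struct("<IH")   # total length, message type
RPC_HDR = struct.Struct("<QH")    # request id, method id
ADAPT_HDR = struct.Struct("<H")   # operation code
CONTROL = struct.Struct("<II")    # data pool id, checksum


def build(msg_type, req_id, method, op, pool_id, payload):
    checksum = sum(payload) & 0xFFFFFFFF  # placeholder checksum, not a real CRC
    body = (RPC_HDR.pack(req_id, method) + ADAPT_HDR.pack(op)
            + CONTROL.pack(pool_id, checksum) + payload)
    return COMM_HDR.pack(COMM_HDR.size + len(body), msg_type) + body


def parse(message):
    """Peel the layers in the order communication, RPC, storage adaptation."""
    off = 0
    total, msg_type = COMM_HDR.unpack_from(message, off)
    off += COMM_HDR.size
    req_id, method = RPC_HDR.unpack_from(message, off)
    off += RPC_HDR.size
    (op,) = ADAPT_HDR.unpack_from(message, off)
    off += ADAPT_HDR.size
    pool_id, checksum = CONTROL.unpack_from(message, off)
    off += CONTROL.size
    return {"total": total, "type": msg_type, "req_id": req_id,
            "method": method, "op": op, "pool_id": pool_id,
            "checksum": checksum, "payload": message[off:total]}


msg = build(msg_type=1, req_id=42, method=3, op=2, pool_id=7, payload=b"abc")
fields = parse(msg)
```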
FIG. 3 illustrates a schematic diagram of the operation of a prior art storage system client for small IO write traffic.
Referring to fig. 3, when the client upper-layer application issues a small IO write service, it designates a source address to the storage adaptation layer, and the storage adaptation layer calls the Forward interface of the RPC layer. The RPC layer performs serialization through a callback function provided by the storage adaptation layer: it fills the RPC layer header, the storage adaptation layer header and the first control field into the RPC layer buffer by assignment or memory copy, then copies the first user data from the source address to behind the control field to form the RPC message body, and calls the sending interface of the communication layer. The communication layer fills the communication layer header into the communication layer buffer by assignment, copies the RPC message body from the RPC layer buffer to behind the communication layer header to form the write request message, and finally calls the underlying network interface to send the write request message.
In the above flow of assembling the write request message, there are two user-space memory copies, one from the source address to the RPC layer buffer and one from the RPC layer buffer to the communication layer buffer. Both copies include the first user data, which causes a noticeable loss of IO performance.
Fig. 4 is a schematic diagram illustrating the principle of assembling and transmitting a write request message in an embodiment of the present application.
Referring to fig. 4, further, in an embodiment, a storage system client includes a storage adaptation layer, an RPC layer, and a communication layer, and a first message header includes a communication layer header, an RPC layer header, and a storage adaptation layer header;
For the storage system client, the step of filling the first message header and the first control field between the first start address and the source address, and sending the first message header, the first control field and the first user data as a write request message to the storage system server includes:
the storage adaptation layer calls a Forward interface of the RPC layer;
the RPC layer performs serialization through a callback function provided by the storage adaptation layer, fills the RPC layer header, the storage adaptation layer header and the first control field starting from a second start address, takes the RPC layer header, the storage adaptation layer header, the first control field and the first user data as an RPC message body, and calls a sending interface of the communication layer, wherein the offset between the second start address and the source address is equal to the total length of the RPC layer header, the storage adaptation layer header and the first control field;
the communication layer fills the communication layer header starting from the first start address, takes the communication layer header and the RPC message body as the write request message, and calls the underlying network interface to send the write request message to the storage system server, wherein the offset between the first start address and the second start address is equal to the length of the communication layer header.
In this embodiment, a memory registration interface is exposed to the client upper-layer application and memory copy operations are replaced with pointer operations, which avoids the two user-space memory copies in the scheme shown in fig. 3 and significantly improves the performance of the small IO write service.
In fig. 4, the offset between the second start address and the source address is denoted as the first offset, and the offset between the first start address and the second start address is denoted as the second offset. The first offset may vary with factors such as whether checksum verification is enabled and the message size, so a function that computes the first offset at run time may be defined by reference to the serialization process.
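A run-time offset calculation of this kind might look like the following Python sketch. The header layouts and the rule that the control field grows from 4 to 8 bytes when a checksum is carried are invented for illustration; only the relationship between the two offsets and the start addresses follows the description.

```python
import struct

COMM_HDR = struct.Struct("<IH")   # hypothetical communication layer header
RPC_HDR = struct.Struct("<QH")    # hypothetical RPC layer header
ADAPT_HDR = struct.Struct("<H")   # hypothetical storage adaptation layer header


def control_len(checksum_enabled):
    """Assumed rule: the control field is longer when a checksum is carried."""
    return 8 if checksum_enabled else 4


def first_offset(checksum_enabled):
    """Second start address to source address: RPC + adaptation headers + control."""
    return RPC_HDR.size + ADAPT_HDR.size + control_len(checksum_enabled)


SECOND_OFFSET = COMM_HDR.size     # first start address to second start address


def start_addresses(src, checksum_enabled):
    second_start = src - first_offset(checksum_enabled)
    first_start = second_start - SECOND_OFFSET
    return first_start, second_start


first_start, second_start = start_addresses(src=1024, checksum_enabled=True)
```

Each layer then writes its own header at its start address by pointer arithmetic alone, and the reserved headroom only has to cover the largest possible sum of the two offsets.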
In a second aspect, the embodiment of the application performs flow optimization around a small IO read service, and provides a small IO copy-free data communication method based on RDMA.
FIG. 5 is a flow chart of a small IO copy-free data communication method in accordance with another embodiment of the present application.
Referring to fig. 5, in one embodiment, the small IO copy-free data communication method includes the steps of:
s21, the upper application of the client issues a small IO read service to the client of the storage system and designates a target address, wherein the target address is positioned in a memory pool managed by the upper application of the client, enough free space is reserved in front of the target address, and the rear of the target address is used for storing second user data read from the server of the storage system.
In both this embodiment and the prior art, the client upper-layer application issues the small IO read service to the storage system client and designates a target address, with free space for storing the second user data reserved behind the target address. The difference is that in the prior art the target address is a temporarily allocated memory address with no requirement to reserve free space in front of it, whereas in this embodiment the target address uses a memory block in the memory pool managed by the client upper-layer application, which guarantees that enough free space is reserved in front of the target address to hold the data that precedes the second user data in the read reply message, namely the second message header and the second control field.
S22, the storage system client adds matching information in a preset data structure, and sends the address information and the search key of the matching information to the storage system server along with the read request message.
Before the storage system client sends the read request message, the prior art adds the matching information to a receive queue, whereas this embodiment adds it to the preset data structure. In addition, the read request message finally assembled in this embodiment carries, on top of the prior-art content, the address information and the search key of the matching information.
Optionally, the preset data structure is a stack, a hash table or other data structure, the search key is an index when the preset data structure is a stack, and the search key is a key when the preset data structure is a hash table.
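The two options named above can be sketched as follows. This is a minimal illustration, not the patented implementation: the class names `SlotTable` and `KeyTable` are hypothetical, the "stack" option is interpreted here as an array of slots whose free indices are kept on a stack, and the matching information is modeled as an arbitrary Python object.

```python
class SlotTable:
    """Stack-style structure (an assumed interpretation): free slot
    indices are kept on a stack, and the slot index is the search key."""

    def __init__(self, capacity):
        self._slots = [None] * capacity
        # Stack of free indices; pop() hands out the lowest index first.
        self._free = list(range(capacity - 1, -1, -1))

    def add(self, match_info):
        if not self._free:
            raise RuntimeError("no free slot for a new read request")
        index = self._free.pop()
        self._slots[index] = match_info
        return index  # search key carried in the read request message

    def take(self, index):
        match_info = self._slots[index]
        if match_info is None:
            raise KeyError(index)
        self._slots[index] = None
        self._free.append(index)  # the slot can be reused by a later request
        return match_info


class KeyTable:
    """Hash-table variant: the search key is the dictionary key."""

    def __init__(self):
        self._entries = {}
        self._next_key = 0

    def add(self, match_info):
        key = self._next_key
        self._next_key += 1
        self._entries[key] = match_info
        return key

    def take(self, key):
        # pop() both finds and removes the matching information.
        return self._entries.pop(key)
```

In both variants the value returned by `add` is what would travel with the read request message, and `take` is what the client would call once the search key comes back.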
S23, the storage system server acquires the second user data according to the read request message; if address information and a search key exist in the read request message, the storage system server determines a third starting address according to the address information, takes the second message header, the second control field and the second user data as a read reply message, writes the read reply message into the memory pool managed by the upper layer application of the client, starting from the third starting address, by means of RDMA WRITE, and sends the search key to the storage system client, wherein the offset between the third starting address and the target address is equal to the length of the second message header and the second control field.
In both this embodiment and the prior art, after receiving the read request message, the storage system server obtains the second user data and assembles a read reply message.
Unlike in the prior art, the read request message received by the storage system server in this embodiment further includes address information and a search key, so the operations performed on the basis of the read request message also differ. The prior art transfers the read reply message by RDMA send, whereas this embodiment transfers it by RDMA WRITE, and the address information is used to determine the location at which writing of the read reply message starts, that is, the third starting address, so as to ensure that the second user data is written exactly behind the target address. Since an RDMA WRITE completes without the storage system client being aware of it, the storage system server also sends the search key to the storage system client, so that the storage system client knows that the read reply message has been written to the memory pool managed by the upper layer application of the client.
As an alternative implementation manner, the storage system client calculates the third starting address according to the target address and the corresponding offset, adds the third starting address as address information to the read request message, and the storage system server directly knows the third starting address without performing additional calculation.
As another alternative embodiment, the storage system client adds the target address as address information to the read request message, and the storage system server calculates the third start address according to the target address and the corresponding offset.
It will be appreciated that the offset is equal to the length of the second message header and the second control field, which information can be known to both the storage system client and the storage system server, and that the difference between the two embodiments is whether the third start address is calculated at the client or at the server. In theory, both embodiments are feasible, and under the condition that the calculation accuracy can be ensured, the former mode distributes the load through the client, so that the calculation pressure of the server can be reduced.
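The address arithmetic in the two variants can be sketched as below. The header and control field lengths are hypothetical placeholders (the source only states that the offset equals the length of the second message header plus the second control field), and the dictionary-based `address_info` encoding is an assumption made for illustration.

```python
# Hypothetical lengths in bytes; the source does not give concrete values.
RPC_HEADER_LEN = 24
ADAPT_HEADER_LEN = 16
CONTROL_FIELD_LEN = 8

# Offset between the third starting address and the target address.
THIRD_OFFSET = RPC_HEADER_LEN + ADAPT_HEADER_LEN + CONTROL_FIELD_LEN


def third_start_address(target_address):
    """Address at which the read reply message starts to be written."""
    if target_address < THIRD_OFFSET:
        raise ValueError("not enough reserved space in front of the target address")
    return target_address - THIRD_OFFSET


def address_info_from_client(target_address, precompute):
    """Build the address information carried in the read request message.

    precompute=True : the client computes the third starting address.
    precompute=False: the client sends the target address unchanged.
    """
    if precompute:
        return {"kind": "third_start", "address": third_start_address(target_address)}
    return {"kind": "target", "address": target_address}


def resolve_on_server(address_info):
    """Server side: obtain the third starting address from the address information."""
    if address_info["kind"] == "third_start":
        return address_info["address"]  # no extra calculation needed
    return third_start_address(address_info["address"])
```

Both variants resolve to the same address on the server; they differ only in which side performs the subtraction.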
S24, the storage system client searches the preset data structure for the matching information according to the search key and notifies the upper layer application of the client that the second user data is ready.
In the prior art, after receiving the read reply message, the storage system client matches the read reply message with the receiving queue to determine which read request message the read reply message matches, and copies the second user data in the read reply message to the corresponding target address.
In this embodiment, after receiving the search key, the storage system client searches the preset data structure for the matching information according to the search key to determine which read request message the newly written second user data belongs to, and then notifies the upper layer application of the client that the second user data is ready. The present embodiment therefore omits the copy of the second user data by the storage system client.
S25, the storage system client starts to read the second user data from the target address.
Therefore, in this embodiment, for the small IO read service, a target address is specified from a memory pool managed by an upper layer application of the client, a sufficient free space is reserved in front of the target address, before the storage system client sends a read request message, a preset data structure is used to temporarily store matching information, a search key of the address information and the matching information is sent to the storage system server along with the read request message, the storage system server fills a message header and a control field of a read reply message in the reserved space in front of the target address in a RDMA WRITE manner, user data to be read is filled behind the target address, the storage system client knows that the user data is ready by means of the search key, and the storage system client does not need to copy the user data to be read, thereby improving performance of the small IO read service.
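The read flow summarized above can be simulated in a single process as follows. This is only a sketch under stated assumptions: a `bytearray` stands in for the registered memory pool, a slice assignment stands in for the RDMA WRITE, the lengths `HEADER_LEN` and `CONTROL_LEN` are hypothetical, and the class and function names are illustrative rather than taken from the source.

```python
HEADER_LEN = 40   # hypothetical length of the second message header
CONTROL_LEN = 8   # hypothetical length of the second control field
OFFSET = HEADER_LEN + CONTROL_LEN


class Client:
    def __init__(self, pool_size):
        self.pool = bytearray(pool_size)  # stands in for the application's memory pool
        self.pending = {}                 # preset data structure (hash table variant)
        self.next_key = 0

    def build_read_request(self, target, length, object_id):
        if target < OFFSET or target + length > len(self.pool):
            raise ValueError("target address lacks reserved space or room for data")
        key = self.next_key
        self.next_key += 1
        self.pending[key] = (target, length)  # matching information
        return {"object_id": object_id, "length": length,
                "address": target - OFFSET,    # third starting address
                "key": key}

    def on_search_key(self, key):
        target, length = self.pending.pop(key)
        # A memoryview exposes the data in place; nothing is copied here.
        return memoryview(self.pool)[target:target + length]


def serve_read(store, request, client_pool):
    """Server side: assemble the read reply and 'write' it into the client's pool."""
    data = store[request["object_id"]][:request["length"]]
    reply = bytes(HEADER_LEN) + bytes(CONTROL_LEN) + data  # zero-filled header and control field
    start = request["address"]
    client_pool[start:start + len(reply)] = reply  # stand-in for RDMA WRITE
    return request["key"]                          # sent back to the client
```

A short usage example: `req = client.build_read_request(target=64, length=5, object_id="blk")`, then `key = serve_read(store, req, client.pool)`, then `view = client.on_search_key(key)`. The header and control field land in the 48 bytes in front of offset 64, and `view` refers directly to the user data behind it.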
In particular, since the read request message contains no user data, the copy operations involved in assembling it consume very little performance. This embodiment therefore does not limit how the read request message is assembled, other than requiring it to additionally carry the address information and the search key of the matching information; the prior-art assembly mode may be adopted, or an assembly mode similar to that of the write request message in the foregoing embodiment may be adopted.
Further, in an embodiment, for the storage system server, the step of writing the read reply message from the third start address to the memory pool managed by the upper layer application of the client by means of RDMA WRITE, and sending the search key to the storage system client includes:
And taking the search key as an immediate, and writing the read reply message into a memory pool managed by the upper application of the client from the third starting address in a RDMA WRITE mode with the immediate.
In this embodiment, by means of RDMA WRITE with an immediate, the search key is delivered to the storage system client as the immediate at the same time as the writing of the read reply message completes, which improves the efficiency of information transfer. It will be appreciated that the length of the search key must be within the length limit of the immediate.
Of course, in other embodiments, a message may also be assembled with the search key as payload data and sent to the storage system client by RDMA send, but this approach is less efficient in information transfer.
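The length limit mentioned above can be made concrete: in the RDMA verbs interface the immediate is a 32-bit value. The sketch below checks whether a search key fits and packs it accordingly; the function names are hypothetical, and the use of network byte order is an assumption made for illustration.

```python
import struct

IMM_MAX = 0xFFFFFFFF  # the immediate carried by RDMA WRITE with immediate is 32 bits wide


def key_to_immediate(search_key):
    """Pack a search key into 4 bytes (network byte order assumed)."""
    if not 0 <= search_key <= IMM_MAX:
        raise ValueError("search key does not fit in a 32-bit immediate")
    return struct.pack("!I", search_key)


def immediate_to_key(imm_bytes):
    """Recover the search key on the client from the received immediate."""
    (search_key,) = struct.unpack("!I", imm_bytes)
    return search_key


def choose_notification(search_key):
    """Pick how the server delivers the search key to the client."""
    if 0 <= search_key <= IMM_MAX:
        return "write_with_imm"  # key rides along with the RDMA WRITE
    return "send"                # fall back to a separate RDMA send message
```

An index into a slot table or a small integer hash key fits comfortably in this range; a wider key would force the less efficient RDMA send fallback.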
With continued reference to fig. 2, in the read request message, the payload data is used for the storage system server to find the second user data, and in the read reply message, the payload data is the second user data.
Fig. 6 shows a schematic diagram of the operation of a prior art memory system for small IO read traffic.
Referring to FIG. 6, when the upper layer application of the client issues the small IO read service, it specifies a target address to the storage adaptation layer. Before the RPC layer invokes the sending interface of the communication layer, it invokes the receiving interface of the communication layer to prepare to receive the reply, which adds the matching information to the receive queue of the communication layer. After the communication layer receives the read reply message, the message is temporarily stored in the buffer of the communication layer; once matching against the waiting queue completes, the RPC message body is copied to the buffer of the RPC layer. The RPC layer performs deserialization through a callback function provided by the storage adaptation layer, copies the second user data to the target address, and notifies the storage adaptation layer that the second user data is ready; the storage adaptation layer then notifies the upper layer application of the client that the second user data is ready.
In the above flow of receiving the read reply message, there are two user-mode memory copies: from the communication layer buffer to the RPC layer buffer, and from the RPC layer buffer to the target address. Both copies include the second user data, which causes a noticeable IO performance loss.
FIG. 7 is a schematic diagram illustrating the operation of a memory system for small IO read traffic in an embodiment of the present application.
Referring to FIG. 7, as an alternative embodiment, the storage system client includes a storage adaptation layer, an RPC layer, and a communication layer, the second message header includes an RPC layer header and a storage adaptation layer header, and the preset data structure is located in the RPC layer;
For the storage system client, the step of searching the preset data structure for the matching information according to the search key and notifying the upper layer application of the client that the second user data is ready comprises the following steps:
the communication layer reports the search key to the RPC layer;
the RPC layer searches the preset data structure for the matching information according to the search key, performs deserialization through a callback function provided by the storage adaptation layer, and notifies the storage adaptation layer that the second user data is ready;
the storage adaptation layer notifies the upper layer application of the client that the second user data is ready.
As another alternative embodiment, the storage system client includes a storage adaptation layer, an RPC layer and a communication layer, the second message header includes an RPC layer header and a storage adaptation layer header, and the preset data structure is located in the communication layer;
For the storage system client, the step of searching the preset data structure for the matching information according to the search key and notifying the upper layer application of the client that the second user data is ready comprises the following steps:
the communication layer searches the preset data structure for the matching information according to the search key and reports the matching information to the RPC layer;
the RPC layer performs deserialization through a callback function provided by the storage adaptation layer and notifies the storage adaptation layer that the second user data is ready;
the storage adaptation layer notifies the upper layer application of the client that the second user data is ready.
The two embodiments are different in that the preset data structure is different in position, the object for searching the matching information from the preset data structure is different, and the information reported to the RPC layer by the communication layer is different.
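The difference between the two embodiments can be sketched with three small classes. All class and method names here are hypothetical, the preset data structure is modeled as a plain dictionary, and deserialization is reduced to copying two fields; the sketch only shows which layer performs the lookup and what the communication layer reports upward.

```python
class AdaptationLayer:
    def __init__(self):
        self.notified = []  # what the upper layer application would be told

    def deserialize(self, match_info):
        # Callback handed to the RPC layer; here it only extracts two fields.
        return {"target": match_info["target"], "length": match_info["length"]}

    def on_ready(self, result):
        self.notified.append(result)


class RpcLayer:
    def __init__(self, adapt, table=None):
        self.adapt = adapt
        self.table = table  # set only when the structure lives in the RPC layer

    def on_search_key(self, key):
        # First embodiment: the RPC layer performs the lookup itself.
        self.on_match_info(self.table.pop(key))

    def on_match_info(self, match_info):
        # Common tail: deserialize, then notify the storage adaptation layer.
        self.adapt.on_ready(self.adapt.deserialize(match_info))


class CommLayer:
    def __init__(self, rpc, table=None):
        self.rpc = rpc
        self.table = table  # set only when the structure lives in the communication layer

    def on_completion(self, key):
        if self.table is None:
            self.rpc.on_search_key(key)                   # report the search key
        else:
            self.rpc.on_match_info(self.table.pop(key))   # report the matching information
```

Constructing `CommLayer(RpcLayer(adapt, table=...))` models the first embodiment, while `CommLayer(RpcLayer(adapt), table=...)` models the second; the storage adaptation layer receives the same notification in both cases.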
In both of the above embodiments, a memory registration interface is opened to the upper layer application of the client, so that the upper layer application can directly pass in a registered network communication buffer for the communication layer, and the read reply message no longer needs to be matched through the receive queue of the communication layer. The read reply message contains no communication layer header, and the second user data is written exactly behind the target address, which avoids the two user-mode memory copies in the scheme shown in FIG. 6 and significantly improves the performance of the small IO read service.
In fig. 7, the offset between the third start address and the target address is denoted as a third offset, which is typically a fixed value, and may be calculated in advance to be used as a constant.
It can be appreciated that the embodiments of the present application optimize the flows of the small IO write service and the small IO read service separately and propose two sets of embodiments, which can be used independently or in combination. That is, when the small IO write service uses the optimized flow shown in FIG. 1, the small IO read service may use either the existing flow or the optimized flow shown in FIG. 5, and when the small IO read service uses the optimized flow shown in FIG. 5, the small IO write service may use either the existing flow or the optimized flow shown in FIG. 1.
Optionally, when the small IO write service uses the optimization flow shown in fig. 1 and the small IO read service uses the optimization flow shown in fig. 5, the source address of the small IO write service and the target address of the small IO read service may share the same memory pool, or may respectively use different memory pools.
It should be noted that the sequence numbers of the foregoing embodiments of the present application are used merely for description and do not indicate the relative merits of the embodiments.
The terms "comprising" and "having" and any variations thereof in the description and claims of the application and in the foregoing drawings are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, system, article, or apparatus. The terms "first," "second," and "third," etc. are used for distinguishing between different objects and not necessarily for describing a sequential or chronological order, nor do they require that the "first," "second," and "third" objects be different.
In describing embodiments of the present application, "exemplary," "such as," or "for example," etc., are used to indicate by way of example, illustration, or description. Any embodiment or design described herein as "exemplary," "such as" or "for example" is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary," "such as" or "for example," etc., is intended to present related concepts in a concrete fashion.
In the description of the embodiments of the present application, "/" means "or"; for example, A/B may mean A or B. The term "and/or" herein merely describes an association between associated objects and indicates that three relationships may exist; for example, "A and/or B" covers three cases: A exists alone, A and B exist together, and B exists alone. In addition, in the description of the embodiments of the present application, "a plurality" means two or more.
In some of the processes described in the embodiments of the present application, a plurality of operations or steps occurring in a particular order are included, but it should be understood that the operations or steps may be performed out of the order in which they occur in the embodiments of the present application or in parallel, the sequence numbers of the operations merely serve to distinguish between the various operations, and the sequence numbers themselves do not represent any order of execution. In addition, the processes may include more or fewer operations, and the operations or steps may be performed in sequence or in parallel, and the operations or steps may be combined.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) as described above, comprising several instructions for causing a terminal device to perform the method according to the embodiments of the present application.
The foregoing description covers only the preferred embodiments of the present application and is not intended to limit the scope of the application. Any equivalent structure or equivalent process transformation made on the basis of the description and drawings of the present application, or any direct or indirect use in other related technical fields, likewise falls within the scope of patent protection of the present application.

Claims (9)

Translated from Chinese
1. A small IO copy-free data communication method based on RDMA, characterized in that the small IO copy-free data communication method comprises:
the client upper-layer application issues a small IO write service to the storage system client and specifies a source address, wherein the source address is located in a memory pool managed by the client upper-layer application, sufficient free space to store a first message header and a first control field is reserved in front of the source address, first user data to be written to the storage system server is stored behind the source address, and the first message header comprises a communication layer header, an RPC layer header and a storage adaptation layer header;
in the storage system client:
the storage adaptation layer calls the Forward interface of the RPC layer;
the RPC layer performs serialization through a callback function provided by the storage adaptation layer, fills in the RPC layer header, the storage adaptation layer header and the first control field starting from a second starting address, takes the RPC layer header, the storage adaptation layer header, the first control field and the first user data as an RPC message body, and calls the sending interface of the communication layer, wherein the offset between the second starting address and the source address is equal to the length of the RPC layer header, the storage adaptation layer header and the first control field;
the communication layer fills in the communication layer header starting from a first starting address, takes the communication layer header and the RPC message body as a write request message, and calls the underlying network interface to send the write request message to the storage system server, wherein the offset between the first starting address and the second starting address is equal to the length of the communication layer header;
the storage system server performs the write operation of the first user data according to the write request message.

2. The small IO copy-free data communication method according to claim 1, characterized in that the small IO copy-free data communication method further comprises:
the client upper-layer application issues a small IO read service to the storage system client and specifies a target address, wherein the target address is located in the memory pool managed by the client upper-layer application, sufficient free space to store a second message header and a second control field is reserved in front of the target address, and the space behind the target address is used to store second user data read from the storage system server;
the storage system client adds matching information to a preset data structure and sends address information and a search key of the matching information to the storage system server along with a read request message;
the storage system server obtains the second user data according to the read request message; if the address information and the search key exist in the read request message, the storage system server determines a third starting address according to the address information, takes the second message header, the second control field and the second user data as a read reply message, writes the read reply message into the memory pool managed by the client upper-layer application, starting from the third starting address, by means of RDMA write, and sends the search key to the storage system client, wherein the offset between the third starting address and the target address is equal to the length of the second message header and the second control field;
the storage system client searches the preset data structure for the matching information according to the search key and notifies the client upper-layer application that the second user data is ready;
the storage system client reads the second user data starting from the target address.

3. The small IO copy-free data communication method according to claim 2, characterized in that the source address of the small IO write service and the target address of the small IO read service share the same memory pool.

4. A small IO copy-free data communication method based on RDMA, characterized in that the small IO copy-free data communication method comprises:
the client upper-layer application issues a small IO read service to the storage system client and specifies a target address, wherein the target address is located in a memory pool managed by the client upper-layer application, sufficient free space to store a second message header and a second control field is reserved in front of the target address, and the space behind the target address is used to store second user data read from the storage system server;
the storage system client adds matching information to a preset data structure and sends address information and a search key of the matching information to the storage system server along with a read request message;
the storage system server obtains the second user data according to the read request message; if the address information and the search key exist in the read request message, the storage system server determines a third starting address according to the address information, takes the second message header, the second control field and the second user data as a read reply message, writes the read reply message into the memory pool managed by the client upper-layer application, starting from the third starting address, by means of RDMA write, and sends the search key to the storage system client, wherein the offset between the third starting address and the target address is equal to the length of the second message header and the second control field;
the storage system client searches the preset data structure for the matching information according to the search key and notifies the client upper-layer application that the second user data is ready;
the storage system client reads the second user data starting from the target address.

5. The small IO copy-free data communication method according to claim 4, characterized in that, for the storage system server, the step of writing the read reply message into the memory pool managed by the client upper-layer application, starting from the third starting address, by means of RDMA write and sending the search key to the storage system client comprises:
taking the search key as an immediate, and writing the read reply message into the memory pool managed by the client upper-layer application, starting from the third starting address, by means of RDMA write with an immediate.

6. The small IO copy-free data communication method according to claim 4, characterized in that the storage system client comprises a storage adaptation layer, an RPC layer and a communication layer, the second message header comprises an RPC layer header and a storage adaptation layer header, and the preset data structure is located in the RPC layer;
for the storage system client, the step of searching the preset data structure for the matching information according to the search key and notifying the client upper-layer application that the second user data is ready comprises:
the communication layer reports the search key to the RPC layer;
the RPC layer searches the preset data structure for the matching information according to the search key, performs deserialization through a callback function provided by the storage adaptation layer, and notifies the storage adaptation layer that the second user data is ready;
the storage adaptation layer notifies the client upper-layer application that the second user data is ready.

7. The small IO copy-free data communication method according to claim 4, characterized in that the storage system client comprises a storage adaptation layer, an RPC layer and a communication layer, the second message header comprises an RPC layer header and a storage adaptation layer header, and the preset data structure is located in the notification layer;
for the storage system client, the step of searching the preset data structure for the matching information according to the search key and notifying the client upper-layer application that the second user data is ready comprises:
the communication layer searches the preset data structure for the matching information according to the search key and reports the matching information to the RPC layer;
the RPC layer performs deserialization through a callback function provided by the storage adaptation layer and notifies the storage adaptation layer that the second user data is ready;
the storage adaptation layer notifies the client upper-layer application that the second user data is ready.

8. The small IO copy-free data communication method according to claim 4, characterized in that, when the preset data structure is a stack, the search key is an index;
when the preset data structure is a hash table, the search key is a key.

9. The small IO copy-free data communication method according to any one of claims 4 to 8, characterized in that the small IO copy-free data communication method further comprises:
the client upper-layer application issues a small IO write service to the storage system client and specifies a source address, wherein the source address is located in the memory pool managed by the client upper-layer application, sufficient free space to store a first message header and a first control field is reserved in front of the source address, and first user data to be written to the storage system server is stored behind the source address;
the storage system client fills in the first message header and the first control field between a first starting address and the source address, and sends the first message header, the first control field and the first user data as a write request message to the storage system server, wherein the offset between the first starting address and the source address is equal to the length of the first message header and the first control field;
the storage system server performs the write operation of the first user data according to the write request message.

Priority Applications (1)

Application Number | Publication | Priority Date | Filing Date | Title
CN202510650536.6A | CN120179572B (en) | 2025-05-20 | 2025-05-20 | Small IO copy-free data communication method based on RDMA

Publications (2)

Publication Number | Publication Date
CN120179572A (en) | 2025-06-20
CN120179572B (en) | 2025-09-02

Family

ID=96041766

Family Applications (1)

Application Number | Status | Publication | Priority Date | Filing Date | Title
CN202510650536.6A | Active | CN120179572B (en) | 2025-05-20 | 2025-05-20 | Small IO copy-free data communication method based on RDMA

Country Status (1)

Country | Link
CN (1) | CN120179572B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN109428861A (en) * | 2017-08-29 | 2019-03-05 | 阿里巴巴集团控股有限公司 | Network communication method and equipment

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US10403335B1 (en) * | 2018-06-04 | 2019-09-03 | Micron Technology, Inc. | Systems and methods for a centralized command address input buffer
FR3092218B1 (en) * | 2019-01-28 | 2023-12-22 | Ateme | Data communication method, and system for implementing the method
WO2020158347A1 (en) * | 2019-01-30 | 2020-08-06 | 日本電信電話株式会社 | Information processing device, method, and program
CN116431070A (en) * | 2023-04-13 | 2023-07-14 | 济南浪潮数据技术有限公司 | RDMA-based copy reading method, device, equipment and medium

Also Published As

Publication number | Publication date
CN120179572A (en) | 2025-06-20

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
