CN118535356A - A dynamic shared memory multiplexing method and device suitable for GPGPU - Google Patents

A dynamic shared memory multiplexing method and device suitable for GPGPU

Publication number
CN118535356A
Authority
CN
China
Prior art keywords
shared memory
memory
read
address space
gpgpu
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202411000879.XA
Other languages
Chinese (zh)
Other versions
CN118535356B (en)
Inventor
颜佳宁
王帅
赵鑫鑫
姜凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yuanqixin Shandong Semiconductor Technology Co ltd
Original Assignee
Shandong Inspur Science Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Inspur Science Research Institute Co Ltd
Priority to CN202411000879.XA
Publication of CN118535356A
Application granted
Publication of CN118535356B
Active
Anticipated expiration

Abstract

Translated from Chinese

The present invention discloses a dynamic shared memory multiplexing method and device suitable for GPGPU, and relates to the technical field of memory resource allocation. The method comprises: Step 1: based on the GPGPU framework, analyzing the memory requirements of the thread blocks according to the memory task requests issued by each thread block; Step 11: allowing thread blocks with the same data access requirements to share a reusable shared memory partition in the shared memory pool; Step 12: for thread blocks that do not have the same data access requirements, allocating a read-write lock that locks access to a non-reused shared memory partition in the shared memory pool at a given time; Step 2: when a thread block completes its memory task request, releasing the shared memory and making it available to other thread blocks. The present invention improves the computing efficiency of the GPGPU and the utilization of memory resources, increases system throughput, reduces the waiting time of thread blocks, and improves the concurrent access capability of the memory.

Description

A dynamic shared memory multiplexing method and device suitable for GPGPU

Technical Field

The present invention discloses a dynamic shared memory multiplexing method and device suitable for GPGPU, and relates to the technical field of memory resource allocation.

Background Art

As an efficient parallel computing platform, the general-purpose graphics processing unit (GPGPU) is widely used in scientific computing, graphics processing, big data analysis, and other fields. However, shared memory in GPGPU architectures is usually managed with a static allocation strategy, which is inefficient under concurrent access by many threads and limits GPGPU performance. In existing GPGPU designs, a shared memory allocation is bound to the life cycle of its thread block: once allocated, the memory is held for the entire execution period, even if it is actually used for far less time. As a result, shared memory utilization is low and it is difficult to execute more thread blocks simultaneously, which reduces the thread-level parallelism and overall computing throughput of the GPGPU.

Summary of the Invention

To address the problems of the prior art, the present invention provides a dynamic shared memory multiplexing method and device suitable for GPGPU, which can significantly improve shared memory utilization and GPGPU computing throughput without adding hardware resources. With the present invention, shared memory resources can be managed and scheduled more flexibly, allowing multiple thread blocks to efficiently share and reuse the limited shared memory space and thereby break through performance bottlenecks.

The specific scheme proposed by the present invention is as follows.

The present invention provides a dynamic shared memory multiplexing method suitable for GPGPU, comprising:

Step 1: based on the GPGPU framework, analyze the memory requirements of the thread blocks according to the memory task requests issued by each thread block:

Step 11: allow thread blocks with the same data access requirements to share a reusable shared memory partition in the shared memory pool;

Step 12: for thread blocks that do not have the same data access requirements, allocate a read-write lock that locks access to a non-reused shared memory partition in the shared memory pool at a given time, wherein the read-write locks are allocated to thread blocks using a read-write lock index table. When the signal for the corresponding shared memory partition in the read-write lock index table is pulled high, a thread block has already been allocated the read-write lock for that partition, and other thread blocks must wait for the lock to be released. When the signal is pulled low, the read-write lock for that partition has been released;

Step 2: when a thread block completes its memory task request, release the shared memory and make it available to other thread blocks.

Further, step 1 of the method also includes step 10: according to the memory requirements of the thread block, determine whether the address space corresponding to the memory task request holds valid data and whether the shared memory pool has enough space to allocate; if so, proceed to steps 11 and 12.

Further, in step 10 of the method, a shared memory scoreboard records a valid signal for each address space in the shared memory. When data is written to an address space, its valid signal is pulled high and the address space is regarded as holding valid data; otherwise, it is regarded as holding no valid data.

At the same time, a register records the size of the remaining shared memory space, which is used to determine whether the remaining space is sufficient for allocation.

Further, in step 11 of the method, whether thread blocks have the same data access requirements is determined from the shared memory addresses they access in their memory task requests: thread blocks that access the same shared memory address have the same data access requirements; otherwise, they do not.

Further, in step 2 of the method, the scattered address space of the released shared memory is remapped so that it is concentrated in contiguous shared memory partitions, allowing the shared memory pool to subsequently allocate, and read from, contiguous address space.

The present invention also provides a dynamic shared memory multiplexing device suitable for GPGPU, comprising a demand allocation module and a resource adjustment module.

Based on the GPGPU framework, the demand allocation module analyzes the memory requirements of the thread blocks according to the memory task requests issued by each thread block:

Step 11: allow thread blocks with the same data access requirements to share a reusable shared memory partition in the shared memory pool;

Step 12: for thread blocks that do not have the same data access requirements, allocate a read-write lock that locks access to a non-reused shared memory partition in the shared memory pool at a given time, wherein the read-write locks are allocated to thread blocks using a read-write lock index table. When the signal for the corresponding shared memory partition in the read-write lock index table is pulled high, a thread block has already been allocated the read-write lock for that partition, and other thread blocks must wait for the lock to be released. When the signal is pulled low, the read-write lock for that partition has been released.

When a thread block completes its memory task request, the resource adjustment module releases the shared memory and makes it available to other thread blocks.

Further, the demand allocation module of the device also executes step 10: according to the memory requirements of the thread block, determine whether the address space corresponding to the memory task request holds valid data and whether the shared memory pool has enough space to allocate; if so, execute steps 11 and 12.

Further, when the demand allocation module executes step 10, a shared memory scoreboard records a valid signal for each address space in the shared memory. When data is written to an address space, its valid signal is pulled high and the address space is regarded as holding valid data; otherwise, it is regarded as holding no valid data.

At the same time, a register records the size of the remaining shared memory space, which is used to determine whether the remaining space is sufficient for allocation.

Further, when the demand allocation module executes step 11, whether thread blocks have the same data access requirements is determined from the shared memory addresses they access in their memory task requests: thread blocks that access the same shared memory address have the same data access requirements; otherwise, they do not.

Further, the resource adjustment module of the device remaps the scattered address space of the released shared memory so that it is concentrated in contiguous shared memory partitions, allowing the shared memory pool to subsequently allocate, and read from, contiguous address space.

The benefits of the present invention are as follows.

The present invention reuses and dynamically allocates memory resources, so that memory is allocated only when it is actually needed. This avoids unnecessary occupation and waste of memory and keeps memory resources circulating efficiently. In addition, the fast release mechanism reclaims the memory occupied by a thread block as soon as the thread block completes its computing task, which reduces the delay thread blocks incur while waiting for memory resources and improves the responsiveness of thread scheduling.

The multiplexing strategy allows multiple thread blocks to share the same memory resource, reducing the idle time spent waiting for memory allocation and thereby speeding up thread block execution. This overlapping use in time reduces the idleness of memory resources, allows more computing tasks to run simultaneously, and improves overall computing throughput.

Combined with the adaptive memory adjustment capability, the memory allocation strategy can be adjusted according to the specific characteristics and runtime behavior of the application to improve resource utilization and performance. This flexibility and scalability allows the system to accommodate GPGPU applications of different scales and characteristics.

Overall, the present invention improves the computing efficiency of the GPGPU and the utilization of memory resources, increases system throughput, reduces the waiting time of thread blocks, and improves the concurrent access capability of the memory, providing users with a more efficient, stable, and reliable computing environment.

Brief Description of the Drawings

Fig. 1 is a schematic flow chart of the method of the present invention.

Fig. 2 is a schematic flow chart of thread blocks with the same data access requirements sharing a reusable shared memory partition according to the present invention.

Fig. 3 is a schematic flow chart of dynamic memory allocation according to the present invention.

Detailed Description

The present invention is further described below with reference to the accompanying drawings and specific embodiments, so that those skilled in the art can better understand and implement it. The embodiments given are not intended to limit the present invention.

Embodiment 1: The present invention provides a dynamic shared memory multiplexing method suitable for GPGPU, comprising:

Step 1: based on the GPGPU framework, analyze the memory requirements of the thread blocks according to the memory task requests issued by each thread block:

Step 11: allow thread blocks with the same data access requirements to share a reusable shared memory partition in the shared memory pool.

Whether thread blocks have the same data access requirements is determined from the shared memory addresses they access in their memory task requests: thread blocks that access the same shared memory address have the same data access requirements; otherwise, they do not.

After one of the thread blocks with the same data access requirements reads the data in the address space of the reusable shared memory partition, that data can be returned directly to the other thread blocks with the same data access requirements, which achieves the multiplexing effect.

Step 12: for thread blocks that do not have the same data access requirements, allocate a read-write lock that locks access to a non-reused shared memory partition in the shared memory pool at a given time, wherein the read-write locks are allocated to thread blocks using a read-write lock index table. When the signal for the corresponding shared memory partition in the read-write lock index table is pulled high, a thread block has already been allocated the read-write lock for that partition, and other thread blocks must wait for the lock to be released. When the signal is pulled low, the read-write lock for that partition has been released.

Step 2: when a thread block completes its memory task request, release the shared memory and make it available to other thread blocks.
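The read-write lock index table of step 12 can be illustrated with a small behavioral model in Python. This is only a sketch under stated assumptions, not the patented hardware: the class name `LockIndexTable`, its methods, and the owner tracking are hypothetical, and because the text describes a single high/low signal per partition, the lock is modeled here as exclusive.

```python
class LockIndexTable:
    """Behavioral model of a read-write lock index table: one bit per partition."""

    def __init__(self, num_partitions):
        self.bits = [0] * num_partitions      # 0 = low (released), 1 = high (locked)
        self.owner = [None] * num_partitions  # hypothetical: which block holds the lock

    def try_acquire(self, partition, block_id):
        """Pull the bit high if it is low; return False if the caller must wait."""
        if self.bits[partition]:
            return False
        self.bits[partition] = 1
        self.owner[partition] = block_id
        return True

    def release(self, partition, block_id):
        """Pull the bit low; only the current holder may release."""
        if self.owner[partition] != block_id:
            raise ValueError("block does not hold this lock")
        self.bits[partition] = 0
        self.owner[partition] = None


table = LockIndexTable(num_partitions=4)
first = table.try_acquire(2, block_id="TB0")   # bit goes high
second = table.try_acquire(2, block_id="TB1")  # bit already high: must wait
table.release(2, block_id="TB0")               # bit goes low
third = table.try_acquire(2, block_id="TB1")   # succeeds after the release
```

In this sketch a waiting thread block simply retries; how a real implementation queues or wakes waiting blocks is not specified in the text.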

For example, with reference to Fig. 2, thread blocks 1 to n request memory and their requirements are analyzed. Suppose thread block 1 and thread block n have the same data access requirements and share a reusable shared memory partition in the shared memory pool. When both thread block 1 and thread block n have completed their memory tasks, the partition is released, and it is determined whether any other thread block needs it. If thread block 2 needs this partition, it is allocated to thread block 2 to perform its memory task; after thread block 2 completes its memory task, the partition is released and returned to the shared memory pool.
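The sharing flow in this example can be sketched in Python as follows. The names `group_by_address` and `SharedPartition`, the block identifiers, and the addresses are hypothetical; the sketch assumes that requests to an identical address form one sharing group and that a partition is released only after its last sharer finishes.

```python
from collections import defaultdict


def group_by_address(requests):
    """Map each requested address to the thread blocks that asked for it."""
    groups = defaultdict(list)
    for block_id, address in requests:
        groups[address].append(block_id)
    return dict(groups)


class SharedPartition:
    """A reusable partition that returns to the pool once every sharer is done."""

    def __init__(self, address, sharers):
        self.address = address
        self.active = set(sharers)

    def finish(self, block_id):
        """Mark one sharer as done; return True when the partition is released."""
        self.active.discard(block_id)
        return not self.active


# Thread blocks 1 and n request the same address; thread block 2 does not.
requests = [("TB1", 0x40), ("TB2", 0x80), ("TBn", 0x40)]
groups = group_by_address(requests)
partition = SharedPartition(0x40, groups[0x40])
released_early = partition.finish("TB1")  # TBn still uses the partition
released_late = partition.finish("TBn")   # last sharer done: partition released
```

Once `finish` returns `True`, the partition could be handed to a waiting thread block or returned to the pool, as the example describes.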

Embodiment 2: Based on embodiment 1, step 1 of the method of the present invention also includes step 10: according to the memory requirements of the thread block, determine whether the address space corresponding to the memory task request holds valid data and whether the shared memory pool has enough space to allocate; if so, proceed to steps 11 and 12.

A shared memory scoreboard can further be used to record a valid signal for each address space in the shared memory. When data is written to an address space, its valid signal is pulled high and the address space is regarded as holding valid data; otherwise, it is regarded as holding no valid data. If the valid bit of the corresponding address is not set to 1, the address is regarded as holding no data and the allocation fails: an error is reported and the memory task request is returned, to wait until the request is sent again.
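A minimal Python sketch of the scoreboard check is given below. The class name `SharedMemoryScoreboard`, its method names, and the `"ok"`/`"error"` return values are assumptions made for illustration; the text only states that a valid signal is recorded per address space and that a request to an address without valid data is returned with an error.

```python
class SharedMemoryScoreboard:
    """Behavioral model: one valid bit per address space in shared memory."""

    def __init__(self, num_address_spaces):
        self.valid = [False] * num_address_spaces

    def record_write(self, index):
        """Pull the valid signal high when data is written to the address space."""
        self.valid[index] = True

    def check_request(self, index):
        """Return 'ok' if valid data exists, else 'error' so the request is returned."""
        return "ok" if self.valid[index] else "error"


scoreboard = SharedMemoryScoreboard(num_address_spaces=8)
before = scoreboard.check_request(3)  # nothing written yet
scoreboard.record_write(3)
after = scoreboard.check_request(3)   # valid bit is now set
```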

At the same time, a register records the size of the remaining shared memory space, which is used to determine whether the remaining space is sufficient for allocation. If the remaining size is smaller than the requested size, the space is regarded as insufficient and the allocation fails. The memory task request is suspended until the shared memory pool has enough memory to satisfy it, at which point the request is answered and memory is allocated to the thread block that issued it.
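The remaining-space check and the suspension of requests can be sketched in Python as follows. The class `SharedMemoryPool` and its fields are hypothetical, and serving suspended requests in first-in, first-out order is an assumption; the text does not state an ordering policy.

```python
from collections import deque


class SharedMemoryPool:
    """Behavioral model of the remaining-space register and suspended requests."""

    def __init__(self, capacity):
        self.free_bytes = capacity  # models the remaining-space register
        self.pending = deque()      # suspended requests, oldest first (assumed FIFO)
        self.granted = {}           # block_id -> allocated size

    def request(self, block_id, size):
        """Grant the request if it fits; otherwise suspend it and return False."""
        if size <= self.free_bytes:
            self.free_bytes -= size
            self.granted[block_id] = size
            return True
        self.pending.append((block_id, size))
        return False

    def release(self, block_id):
        """Return a block's memory, then grant suspended requests that now fit."""
        self.free_bytes += self.granted.pop(block_id)
        while self.pending and self.pending[0][1] <= self.free_bytes:
            waiting_id, size = self.pending.popleft()
            self.free_bytes -= size
            self.granted[waiting_id] = size


pool = SharedMemoryPool(capacity=64)
a = pool.request("TB0", 48)  # fits: 16 bytes remain
b = pool.request("TB1", 32)  # does not fit: suspended
pool.release("TB0")          # frees 48 bytes, so TB1 is granted 32
```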

For example, with reference to Fig. 3, when a thread block initiates a memory access request, it is first determined whether the address space corresponding to the shared memory to be accessed holds valid data. If so, the size of the allocatable space in the shared memory pool is checked. If the space is insufficient, the request waits; if the space meets the allocation requirement, memory is allocated for the thread block to execute its task. Once the task is complete, the memory is released quickly so that the resource is reclaimed, and it is checked whether other memory access requests exist. If there are none, the process ends; if there are, the memory is adaptively adjusted. For example, in step 2 the scattered address space of the released shared memory is remapped so that it is concentrated in contiguous shared memory partitions, allowing the shared memory pool to subsequently allocate, and read from, contiguous address space.
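The text does not specify how the remapping is carried out. One possible reading, sketched below in Python, is a logical-to-physical mapping table that presents scattered free chunks as a single contiguous logical range without moving any data. The functions `remap_free_chunks` and `translate` and the chunk values are hypothetical.

```python
def remap_free_chunks(free_chunks):
    """Build a logical-to-physical map presenting scattered free chunks as one range.

    free_chunks: list of (physical_start, size) tuples.
    Returns a list of (logical_start, physical_start, size) entries.
    """
    mapping = []
    logical = 0
    for physical_start, size in sorted(free_chunks):
        mapping.append((logical, physical_start, size))
        logical += size
    return mapping


def translate(mapping, logical_address):
    """Translate an address in the contiguous logical view to a physical address."""
    for logical_start, physical_start, size in mapping:
        if logical_start <= logical_address < logical_start + size:
            return physical_start + (logical_address - logical_start)
    raise IndexError("logical address outside the remapped range")


# Three scattered free chunks become one logical range of 56 addresses.
mapping = remap_free_chunks([(96, 16), (0, 32), (64, 8)])
physical = translate(mapping, 35)  # falls in the second chunk, offset 3
```

An alternative design would compact live allocations toward one end of the memory; that requires copying data, which this table-based sketch avoids.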

Embodiment 3: The present invention also provides a dynamic shared memory multiplexing device suitable for GPGPU, comprising a demand allocation module and a resource adjustment module.

Based on the GPGPU framework, the demand allocation module analyzes the memory requirements of the thread blocks according to the memory task requests issued by each thread block:

Step 11: allow thread blocks with the same data access requirements to share a reusable shared memory partition in the shared memory pool;

Step 12: for thread blocks that do not have the same data access requirements, allocate a read-write lock that locks access to a non-reused shared memory partition in the shared memory pool at a given time, wherein the read-write locks are allocated to thread blocks using a read-write lock index table. When the signal for the corresponding shared memory partition in the read-write lock index table is pulled high, a thread block has already been allocated the read-write lock for that partition, and other thread blocks must wait for the lock to be released. When the signal is pulled low, the read-write lock for that partition has been released.

When a thread block completes its memory task request, the resource adjustment module releases the shared memory and makes it available to other thread blocks.

The information exchange between the modules of the above device and their execution processes are based on the same concept as the method embodiments of the present invention. For details, refer to the description of the method embodiments; they are not repeated here.

Likewise, the device of the present invention reuses and dynamically allocates memory resources, so that memory is allocated only when it is actually needed. This avoids unnecessary occupation and waste of memory and keeps memory resources circulating efficiently. In addition, the fast release mechanism reclaims the memory occupied by a thread block as soon as the thread block completes its computing task, which reduces the delay thread blocks incur while waiting for memory resources and improves the responsiveness of thread scheduling.

The multiplexing strategy allows multiple thread blocks to share the same memory resource, reducing the idle time spent waiting for memory allocation and thereby speeding up thread block execution. This overlapping use in time reduces the idleness of memory resources, allows more computing tasks to run simultaneously, and improves overall computing throughput.

Combined with the adaptive memory adjustment capability, the memory allocation strategy can be adjusted according to the specific characteristics and runtime behavior of the application to improve resource utilization and performance. This flexibility and scalability allows the system to accommodate GPGPU applications of different scales and characteristics.

Overall, the device of the present invention improves the computing efficiency of the GPGPU and the utilization of memory resources, increases system throughput, reduces the waiting time of thread blocks, and improves the concurrent access capability of the memory, providing users with a more efficient, stable, and reliable computing environment.

It should be noted that not all steps and modules in the above processes and device structures are required; some steps or modules may be omitted according to actual needs. The execution order of the steps is not fixed and may be adjusted as needed. The system structure described in the above embodiments may be a physical structure or a logical structure: some modules may be implemented by the same physical entity, some modules may be implemented separately by multiple physical entities, or some modules may be implemented jointly by certain components of multiple independent devices.

The embodiments described above are only preferred embodiments given to fully illustrate the present invention, and the protection scope of the present invention is not limited to them. Equivalent substitutions or modifications made by those skilled in the art on the basis of the present invention fall within its protection scope. The protection scope of the present invention is defined by the claims.

Claims (10)

1. A dynamic shared memory multiplexing method suitable for GPGPU, characterized by comprising:

Step 1: based on the GPGPU framework, analyzing the memory requirements of the thread blocks according to the memory task requests issued by each thread block:

Step 11: allowing thread blocks with the same data access requirements to share a reusable shared memory partition in the shared memory pool;

Step 12: for thread blocks that do not have the same data access requirements, allocating a read-write lock that locks access to a non-reused shared memory partition in the shared memory pool at a given time, wherein the read-write locks are allocated to thread blocks using a read-write lock index table; when the signal for the corresponding shared memory partition in the read-write lock index table is pulled high, a thread block has already been allocated the read-write lock for that partition, and other thread blocks must wait for the lock to be released; when the signal is pulled low, the read-write lock for that partition has been released;

Step 2: when a thread block completes its memory task request, releasing the shared memory and making it available to other thread blocks.

2. The dynamic shared memory multiplexing method suitable for GPGPU according to claim 1, characterized in that step 1 further includes step 10: according to the memory requirements of the thread block, determining whether the address space corresponding to the memory task request holds valid data and whether the shared memory pool has enough space to allocate, and if so, proceeding to steps 11 and 12.

3. The dynamic shared memory multiplexing method suitable for GPGPU according to claim 2, characterized in that in step 10 a shared memory scoreboard records a valid signal for each address space in the shared memory; when data is written to an address space, its valid signal is pulled high and the address space is regarded as holding valid data, and otherwise it is regarded as holding no valid data; at the same time, a register records the size of the remaining shared memory space, which is used to determine whether the remaining space is sufficient for allocation.

4. The dynamic shared memory multiplexing method suitable for GPGPU according to claim 1, characterized in that in step 11 whether thread blocks have the same data access requirements is determined from the shared memory addresses they access in their memory task requests; thread blocks that access the same shared memory address have the same data access requirements, and otherwise they do not.

5. The dynamic shared memory multiplexing method suitable for GPGPU according to claim 1, characterized in that in step 2 the scattered address space of the released shared memory is remapped so that it is concentrated in contiguous shared memory partitions, allowing the shared memory pool to subsequently allocate, and read from, contiguous address space.

6. A dynamic shared memory multiplexing device suitable for GPGPU, characterized by comprising a demand allocation module and a resource adjustment module, wherein:

based on the GPGPU framework, the demand allocation module analyzes the memory requirements of the thread blocks according to the memory task requests issued by each thread block:

Step 11: allowing thread blocks with the same data access requirements to share a reusable shared memory partition in the shared memory pool;

Step 12: for thread blocks that do not have the same data access requirements, allocating a read-write lock that locks access to a non-reused shared memory partition in the shared memory pool at a given time, wherein the read-write locks are allocated to thread blocks using a read-write lock index table; when the signal for the corresponding shared memory partition in the read-write lock index table is pulled high, a thread block has already been allocated the read-write lock for that partition, and other thread blocks must wait for the lock to be released; when the signal is pulled low, the read-write lock for that partition has been released;

when a thread block completes its memory task request, the resource adjustment module releases the shared memory and makes it available to other thread blocks.

7. The dynamic shared memory multiplexing device suitable for GPGPU according to claim 6, characterized in that the demand allocation module further executes step 10: according to the memory requirements of the thread block, determining whether the address space corresponding to the memory task request holds valid data and whether the shared memory pool has enough space to allocate, and if so, executing steps 11 and 12.
A dynamic shared memory multiplexing device suitable for GPGPU according to claim 6, characterized in that the demand allocation module also executes step 10: based on the memory requirements of the thread block, it is determined whether there is valid data in the address space corresponding to the memory task request and whether the shared memory pool has sufficient allocation space, and if so, executes steps 11 and 12.8.根据权利要求7所述的一种适用GPGPU的动态共享内存多路复用装置,其特征是需求分配模块执行步骤10时,利用共享内存记分板记录共享内存中地址空间的有效信号,当地址空间被写入数据时,有效信号被拉高,视为地址空间存在有效数据,否则视为地址空间无有效数据,8. A dynamic shared memory multiplexing device suitable for GPGPU according to claim 7, characterized in that when the demand allocation module executes step 10, a shared memory scoreboard is used to record the valid signal of the address space in the shared memory, and when data is written into the address space, the valid signal is pulled high, and it is regarded that there is valid data in the address space, otherwise it is regarded that there is no valid data in the address space.同时,利用寄存器记录共享内存的剩余空间大小,用于判断共享内存剩余空间是否足够分配。At the same time, the register is used to record the size of the remaining space of the shared memory to determine whether the remaining space of the shared memory is sufficient for allocation.9.根据权利要求6所述的一种适用GPGPU的动态共享内存多路复用装置,其特征是需求分配模块执行步骤11时,根据内存任务请求中线程块访问共享内存的地址判断线程块是否具有相同数据访问需求,若线程块之间访问共享内存的地址相同,则具有相同数据访问需求,否则线程块之间没有相同数据访问需求。9. 
A dynamic shared memory multiplexing device suitable for GPGPU according to claim 6, characterized in that when the demand allocation module executes step 11, it determines whether the thread blocks have the same data access requirements based on the addresses of the thread blocks accessing the shared memory in the memory task request; if the addresses of the thread blocks accessing the shared memory are the same, they have the same data access requirements; otherwise, the thread blocks do not have the same data access requirements.10.根据权利要求6所述的一种适用GPGPU的动态共享内存多路复用装置,其特征是资源调整模块将释放的共享内存的零散地址空间重新映射,使地址空间集中在连续的共享内存分区中,以便后续共享内存池分配连续的地址空间,以及进行连续地址空间的读取。10. A dynamic shared memory multiplexing device suitable for GPGPU according to claim 6, characterized in that the resource adjustment module remaps the scattered address space of the released shared memory so that the address space is concentrated in a continuous shared memory partition, so that the subsequent shared memory pool allocates continuous address space and reads the continuous address space.
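
The behavior recited in steps 11, 12 and 2 can be illustrated with a small host-side model. The following Python sketch is only an illustration under stated assumptions, not the claimed hardware: the names `analyze_requests`, `LockIndexTable` and `compact` are hypothetical, one boolean per partition stands in for the high/low signal of the read-write lock index table, and the step 10 checks (scoreboard and remaining-space register) are left out.

```python
from collections import defaultdict


def analyze_requests(requests):
    """Step 1 (sketch): split thread blocks by data access requirement.

    `requests` maps a thread block id to the shared memory address it asks
    for. Blocks asking for the same address may share one partition
    (step 11); the others need a lock-protected partition (step 12).
    """
    by_address = defaultdict(list)
    for block_id, address in requests.items():
        by_address[address].append(block_id)
    sharing = {a: sorted(b) for a, b in by_address.items() if len(b) > 1}
    exclusive = sorted(b[0] for b in by_address.values() if len(b) == 1)
    return sharing, exclusive


class LockIndexTable:
    """One signal per partition: True (high) = lock held, False (low) = free."""

    def __init__(self, num_partitions):
        self.signal = [False] * num_partitions
        self.holder = [None] * num_partitions

    def try_acquire(self, partition, block_id):
        if self.signal[partition]:       # signal high: the caller must wait
            return False
        self.signal[partition] = True    # pull the signal high
        self.holder[partition] = block_id
        return True

    def release(self, partition, block_id):
        if self.holder[partition] != block_id:
            raise ValueError("only the lock holder may release")
        self.signal[partition] = False   # pull the signal low
        self.holder[partition] = None


def compact(in_use):
    """Step 2 remapping (sketch): pack in-use partitions to the lowest indices.

    Returns (remap, first_free). `remap` maps old partition index to new
    index; every partition from `first_free` upward is free and contiguous.
    """
    remap = {old: new for new, old in enumerate(sorted(in_use))}
    return remap, len(remap)


# Example: blocks 0 and 1 request the same address, block 2 does not.
sharing, exclusive = analyze_requests({0: 0x100, 1: 0x100, 2: 0x200})
table = LockIndexTable(num_partitions=4)
granted = table.try_acquire(partition=1, block_id=2)
```

Packing the in-use partitions toward index 0 is one possible reading of the remapping in claims 5 and 10; the claims only require that the freed address space ends up in contiguous partitions.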
CN202411000879.XA | 2024-07-25 | 2024-07-25 | Dynamic shared memory multiplexing method and device applicable to GPGPU | Active | CN118535356B (en)

Priority Applications (1)

Application Number | Publication | Priority Date | Filing Date | Title
CN202411000879.XA | CN118535356B (en) | 2024-07-25 | 2024-07-25 | Dynamic shared memory multiplexing method and device applicable to GPGPU

Applications Claiming Priority (1)

Application Number | Publication | Priority Date | Filing Date | Title
CN202411000879.XA | CN118535356B (en) | 2024-07-25 | 2024-07-25 | Dynamic shared memory multiplexing method and device applicable to GPGPU

Publications (2)

Publication Number | Publication Date
CN118535356A (en) | 2024-08-23
CN118535356B (en) | 2024-10-29

Family

ID=92386890

Family Applications (1)

Application Number | Status | Publication | Priority Date | Filing Date | Title
CN202411000879.XA | Active | CN118535356B (en) | 2024-07-25 | 2024-07-25 | Dynamic shared memory multiplexing method and device applicable to GPGPU

Country Status (1)

Country | Link
CN (1) | CN118535356B (en)

Citations (15)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
JPH0784851A (en) * | 1993-09-13 | 1995-03-31 | Toshiba Corp | Shared data management method
CN101551761A (en) * | 2009-04-30 | 2009-10-07 | 浪潮电子信息产业股份有限公司 | Method for sharing stream memory of heterogeneous multi-processor
JP2010061522A (en) * | 2008-09-05 | 2010-03-18 | Internatl Business Mach Corp <Ibm> | Computer system for permitting exclusive access to shared data, method for the computer system, and computer readable recording medium
WO2018119951A1 (en) * | 2016-12-29 | 2018-07-05 | 深圳前海达闼云端智能科技有限公司 | Gpu virtualization method, device, system, and electronic apparatus, and computer program product
KR20180099420A (en) * | 2017-02-27 | 2018-09-05 | 한국과학기술원 | Data processor proceeding of accelerated synchronization between central processing unit and graphics processing unit
CN109298935A (en) * | 2018-09-06 | 2019-02-01 | 华泰证券股份有限公司 | A kind of method and application of the multi-process single-write and multiple-read without lock shared drive
CN109413432A (en) * | 2018-07-03 | 2019-03-01 | 北京中科睿芯智能计算产业研究院有限公司 | Multi-process coding method, system and device based on event and shared drive mechanism
CN111736987A (en) * | 2020-05-29 | 2020-10-02 | 山东大学 | A task scheduling method based on GPU space resource sharing
CN112162855A (en) * | 2020-09-21 | 2021-01-01 | 南开大学 | GPU page miss processing method, system and medium based on page-locked memory
CN112463356A (en) * | 2020-10-27 | 2021-03-09 | 苏州浪潮智能科技有限公司 | GPU heap manager memory address allocation method, system, terminal and storage medium
CN114880138A (en) * | 2022-04-22 | 2022-08-09 | 烽火通信科技股份有限公司 | A high-performance data model access method and device based on shared memory pool
CN116188246A (en) * | 2023-02-01 | 2023-05-30 | 海元利亨(青岛)医疗器械有限公司 | Method for reading internal memory of 4K medical image
CN116302617A (en) * | 2023-05-12 | 2023-06-23 | 苏州浪潮智能科技有限公司 | Method of shared memory, communication method, embedded system and electronic device
CN118113471A (en) * | 2024-03-06 | 2024-05-31 | 华中科技大学 | A GPU sharing method and device for server-unaware inference load
CN118260099A (en) * | 2022-12-27 | 2024-06-28 | 华为技术有限公司 | CC-NUMA server, lock request processing method and related device

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
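Title
Xu Yandong; Hua Bei: "GPU-oriented memory management and applications", Electronic Technology (电子技术), no. 07, 25 July 2017 (2017-07-25) *
Li Tao; Dong Qiankun; Zhang Shuai; Kong Lingyan; Kang Hong; Yang Yulu: "Research on a thread-pool-based GPU task-parallel computing model", Chinese Journal of Computers (计算机学报), no. 10, 29 December 2017 (2017-12-29) *
Yang Jianqiang: "Implementation and analysis of read-write locks in Java", Computer Study (电脑学习), no. 02, 1 April 2006 (2006-04-01) *
Wang Lei; Liu Daofu; Chen Yunji; Chen Tianshi; Li Ling: "A survey of shared resource allocation and scheduling strategies for on-chip multi-core processors", Journal of Computer Research and Development (计算机研究与发展), no. 10, 21 March 2013 (2013-03-21) *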

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN119557115A (en) * | 2025-01-26 | 2025-03-04 | 山东浪潮科学研究院有限公司 | GPGPU thread block synchronization method, device and medium based on thread lock
CN119557115B (en) * | 2025-01-26 | 2025-04-18 | 山东浪潮科学研究院有限公司 | GPGPU thread block synchronization method, equipment and medium based on thread lock

Also Published As

Publication number | Publication date
CN118535356B (en) | 2024-10-29

Similar Documents

Publication | Title
CN108595258B (en) | A dynamic extension method of GPGPU register file
CN1238793C (en) | Distributed memory control and bandwidth optimization
CN105579961B (en) | Data processing system and method of operation, hardware unit for data processing system
CN113918101B (en) | A method, system, device and storage medium for writing data cache
US9086920B2 (en) | Device for managing data buffers in a memory space divided into a plurality of memory elements
CN118535356B (en) | Dynamic shared memory multiplexing method and device applicable to GPGPU
CN110750356A (en) | Multi-core interactive method, system and storage medium suitable for non-volatile memory
CN1963762A (en) | Stack management system and method
US7490223B2 (en) | Dynamic resource allocation among master processors that require service from a coprocessor
CN1650266A (en) | Supports time-division multiplexed speculative multithreading for single-threaded applications
US20090083496A1 (en) | Method for Improved Performance With New Buffers on NUMA Systems
TWI881835B (en) | Method and apparatus for configuring a relay register module, computing device, and computer-readable medium
CN1928811A (en) | Processing operations management systems and methods
CN111078394A (en) | GPU thread load balancing method and device
US11429299B2 (en) | System and method for managing conversion of low-locality data into high-locality data
CN116450298A (en) | GPU task fine granularity scheduling method and related device
CN118860921A (en) | Dynamic mapping method for multi-bank access, electronic device and storage medium
CN119003177A (en) | GPGPU-oriented on-chip resource management system
CN116483536B (en) | Data scheduling method, computing chip and electronic equipment
CN111338782A (en) | A Contention-Aware Node Allocation Method for Shared Burst Data Cache
CN1851651A (en) | Method for realizing process priority scheduling for embedded SRAM operating system
Daoud et al. | Processor allocation algorithm based on frame combing with memorization for 2d mesh cmps
WO2022242777A1 (en) | Scheduling method, apparatus and system, and computing device
Zhu et al. | EBIO: An Efficient Block I/O Stack for NVMe SSDs With Mixed Workloads
JP2002278778A (en) | Scheduling device in symmetric multiprocessor system

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
TR01 | Transfer of patent right

Effective date of registration: 20250716

Address after: 250000 Shandong Province, Jinan City, China (Shandong) Free Trade Pilot Zone, Shunhua Road Street, Inspur Road 1036, Building S01, 5th Floor

Patentee after: Yuanqixin (Shandong) Semiconductor Technology Co.,Ltd.

Country or region after: China

Address before: 250000 Building S02, No. 1036, Langchao Road, High-tech Zone, Jinan City, Shandong Province

Patentee before: Shandong Inspur Scientific Research Institute Co.,Ltd.

Country or region before: China

