CN103345429A - High-concurrency access and storage accelerating method and accelerator based on on-chip RAM, and CPU - Google Patents


Info

Publication number: CN103345429A (application CN2013102423985A / CN201310242398A)
Authority: CN (China)
Prior art keywords: memory access, CPU, read request, data, request
Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number: CN2013102423985A
Other languages: Chinese (zh)
Other versions: CN103345429B (en)
Inventors: 刘垚, 陈明扬, 陈明宇, 阮元
Current assignee: Institute of Computing Technology of CAS (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original assignee: Institute of Computing Technology of CAS
Application filed by Institute of Computing Technology of CAS
Priority to CN201310242398.5A
Publication of CN103345429A
Application granted; publication of CN103345429B
Legal status: Active


Abstract

The invention discloses an on-chip-RAM-based high-concurrency memory access accelerator and acceleration method, and a processor adopting the method. The memory access accelerator is independent of the on-chip Cache and MSHRs and is connected to the on-chip RAM and the memory controller; outstanding memory access requests are sent through the accelerator to the memory controller and on to the memory system. This removes the limit on the number of concurrent memory accesses that general-purpose processors face in Internet and cloud-computing applications, and accelerates highly concurrent memory access.

Description

Translated from Chinese
High-Concurrency Memory Access Acceleration Method, Accelerator, and CPU Based on On-Chip RAM

Technical Field

The invention belongs to the field of computers and concerns the internal structural design of a CPU, in particular an on-chip-RAM-based high-concurrency memory access acceleration method, accelerator, and CPU.

Background Art

With the development of the Internet and cloud computing, highly concurrent data-processing programs are becoming ever more common. Such programs typically handle large numbers of concurrent workloads submitted in the form of requests or jobs, and the core business of these workloads usually involves processing and analyzing massive amounts of data. Such programs are usually multi-threaded or multi-process, with little or no memory-access dependency between threads or processes.

Applications of this kind can therefore issue large numbers of concurrent memory access requests to the memory system, which challenges the concurrency of the memory subsystem. If the memory subsystem's concurrency is not high enough, it becomes the bottleneck limiting such applications' performance.

Figure 1 shows a typical CPU memory hierarchy. When the CPU needs to read data, it first looks in the Cache; if the required data is already there (a Cache hit), the data is returned directly to the CPU. If the CPU does not find the required data in the Cache (a Cache miss), it fetches the data from main memory into the Cache.

The Cache contains a set of registers called MSHRs (Miss Status Handling Registers), dedicated to recording information about read requests that missed in the cache and have been sent to memory. The information an MSHR records includes the Cache Line address, the read request's destination register, and so on. When main memory completes the read and returns the Cache Line's data, the recorded information is used to fill the corresponding Cache Line and deliver the data to the destination register. Each Cache-miss read request occupies one MSHR entry. When the MSHRs are full, new Cache-miss read requests stall and cannot be sent to main memory. The number of outstanding read requests the MSHRs support (requests that have been issued but whose data has not yet returned, and which the MSHRs must therefore keep recorded) is thus one of the key factors determining the concurrency of the memory subsystem.

At present, typical processors' MSHRs support only a small number of outstanding read requests. In the Cortex-A9 processor, for example, the L2 Cache's MSHRs support only 10 outstanding read requests. When an application issues many concurrent memory access requests with low locality (and hence many Cache misses), the MSHRs fill up quickly and become the bottleneck of the whole system.

Figure 2 shows the memory architecture of a certain processor, which proposed a completely new memory access scheme that can, in theory, support issuing large numbers of concurrent memory access requests.

That processor consists of one PPE (Power Processor Element), eight SPEs (Synergistic Processing Elements), one MIC (Memory Interface Controller), and one EIB (Element Interconnect Bus).

The following focuses on the memory access mechanism of the SPE.

Each SPE is a microprocessor whose programs run in a local 256 KB storage unit (RAM). When an SPE needs to obtain data from main memory, it must first initialize the DMAC (Direct Memory Access Controller), writing parameters such as the request's source address, destination address, and length into the DMAC's control queue. The DMAC then moves the data from main memory to local storage on its own, according to the parameters in the queue.

In theory, the number of concurrent requests this mechanism supports is limited only by the number of commands the DMA command queue can hold, or by the capacity of the on-chip RAM. The mechanism nevertheless has two drawbacks:

1. Every DMA operation must be preceded by entering several parameters, such as the source address, destination address, data size, TAG, and direction, which takes several instruction cycles. When an SPE needs to read a large volume of small-granularity data concurrently, DMA transfer is therefore rather inefficient.

2. DMA state management is inefficient. First, the program must reserve enough space for each read request's returned data, and the scheme lacks a free-space management mechanism, so the utilization of local storage degrades badly over long runs. Second, the processor learns of DMA completion by software polling of status bits, which becomes inefficient as the number of memory access requests grows.

Summary of the Invention

To solve the above technical problems, the object of the present invention is to propose an on-chip-RAM-based high-concurrency memory access accelerator, and a method of using on-chip RAM to manage large numbers of concurrent memory access requests, so as to remove the limit on the number of concurrent memory accesses that general-purpose processors face in Internet and cloud-computing applications and to accelerate highly concurrent memory access.

Specifically, the invention discloses an on-chip-RAM-based high-concurrency memory access accelerator. The accelerator is independent of the on-chip Cache and MSHRs and is connected to the on-chip RAM and the memory controller; outstanding memory access requests are sent through the accelerator to the memory controller and on to the memory system.

In this accelerator, the number of outstanding memory access requests supported depends only on the capacity of the on-chip RAM and is not limited by the number of MSHR entries.

In this accelerator, the addressable space contains a read request table used to store the information of read requests; each entry of the table corresponds to a fixed id number.

In this accelerator, each entry of the read request table has three fields, storing the read request's type, address, and data; the type and address fields are filled in by the CPU, and the data field is filled in by the accelerator.

In this accelerator, when the read request table's data field would be too large, the entry may store only a data pointer, which points to the address where the returned data is stored; that address is allocated by the CPU.

In this accelerator, each entry of the read request table is in one of three states: free, new read request, or finished read request. The initial state is free. When the CPU has a memory access request, it fills the entry in and the state becomes 'new read request'; the accelerator sends the request to the memory controller and, when the data returns, fills it into the data field, and the state becomes 'finished read request'; the CPU then takes the data from the data field and processes it, after which the state returns to free.

In this accelerator, each circular queue has a head pointer and a tail pointer. The head and tail pointers of the free queue and the head pointer of the finished queue are software variables maintained by the CPU. The head and tail pointers of the new-read-request queue and the tail pointer of the finished queue are hardware registers: the new-read-request queue's head pointer is maintained by the accelerator; its tail pointer is maintained jointly by the CPU and the accelerator (the CPU only writes it, the accelerator only reads it); and the finished queue's tail pointer is maintained jointly (the CPU only reads it, the accelerator only writes it).

The invention also discloses an on-chip-RAM-based high-concurrency memory access method, comprising providing a memory access accelerator independent of the on-chip Cache and MSHRs, connected to the on-chip RAM and the memory controller; outstanding memory access requests are sent through the accelerator to the memory controller and on to the memory system.

In this method, the on-chip CPU writes memory access requests into the addressable space of the on-chip RAM, and the accelerator reads and executes them. For a read request, once the data returns from the memory system, the accelerator places the data into that space and notifies the CPU, which then processes it.

In this method, the addressable space contains a read request table used to store the information of read requests; each entry of the table corresponds to a fixed id number.

In this method, each entry of the read request table has three fields, storing the read request's type, address, and data; the type and address fields are filled in by the CPU, and the data field is filled in by the accelerator.

In this method, when the read request table's data field would be too large, the entry may store only a data pointer, which points to the address where the returned data is stored; that address is allocated by the CPU.

In this method, each entry of the read request table is in one of three states: free, new read request, or finished read request. The initial state is free. When the CPU has a memory access request, it fills the entry in and the state becomes 'new read request'; the accelerator sends the request to the memory controller and, when the data returns, fills it into the data field, and the state becomes 'finished read request'; the CPU then takes the data from the data field and processes it, after which the state returns to free.

The invention also discloses an on-chip-RAM-based high-concurrency memory access method, comprising the following steps by which the CPU initiates a read request:

Step S701: the CPU queries the state of the free queue in the on-chip RAM's addressable space and checks whether it is empty (the CPU judges the free queue empty when its head pointer coincides with its tail pointer). If it is empty, return; otherwise go to step S702.

Step S702: the CPU takes an id from the head of the free queue;

Step S703: the CPU fills in the type and address fields of the read request table entry corresponding to that id;

Step S704: the CPU writes the id to the tail of the new-read-request queue;

Step S705: the CPU passes the updated tail pointer of the new-read-request queue to the memory access accelerator;

Step S706: the CPU decides whether to keep issuing read requests; if so, go to step S701; if not, return.
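The flow of steps S701–S706 can be sketched as a small C model. The queue and table sizes, the identifier names, and the `hw_new_tail` variable standing in for the hardware tail-pointer register of step S705 are illustrative assumptions, not details taken from the patent.

```c
#include <stdbool.h>
#include <stdint.h>

enum { N = 8 };  /* assumed Read table / queue size */

typedef struct { int ids[N]; int head, tail; } queue_t;
typedef struct { uint32_t type; uint64_t addr; } entry_t;

static queue_t free_q;        /* free entry queue                          */
static queue_t new_q;         /* new read request queue                    */
static entry_t read_table[N]; /* type/addr fields of the read request table */
static int     hw_new_tail;   /* models the tail register seen by the accelerator */

static bool q_is_empty(const queue_t *q) { return q->head == q->tail; }

/* S701-S706: returns false when the free queue is empty (S701). */
static bool cpu_issue_read(uint32_t type, uint64_t addr) {
    if (q_is_empty(&free_q)) return false;   /* S701: head == tail means empty */
    int id = free_q.ids[free_q.head];        /* S702: take id from queue head  */
    free_q.head = (free_q.head + 1) % N;
    read_table[id].type = type;              /* S703: fill type and addr       */
    read_table[id].addr = addr;
    new_q.ids[new_q.tail] = id;              /* S704: append id to new queue   */
    new_q.tail = (new_q.tail + 1) % N;
    hw_new_tail = new_q.tail;                /* S705: publish updated tail     */
    return true;                             /* S706: caller may loop          */
}
```

The only CPU-to-hardware communication in the fast path is the single tail-pointer update of step S705, which is what lets the accelerator detect new work without the CPU polling any status bits.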

The invention also discloses an on-chip-RAM-based high-concurrency memory access method, comprising the following steps by which the CPU processes data returned for a read request:

Step S801: the CPU queries the state of the finished queue and checks whether it is empty (the CPU judges the finished queue empty when its head pointer coincides with its tail pointer). If it is empty, return; otherwise go to step S802;

Step S802: the CPU takes an id from the head of the finished queue;

Step S803: the CPU operates on the data field of the read request table entry corresponding to that id;

Step S804: the CPU writes the id to the tail of the free queue;

Step S805: the CPU decides whether to continue; if so, go to step S801; if not, return.
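Steps S801–S805 admit a similar sketch. The sizes, names, and fixed-size data field are again assumptions made for illustration only.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

enum { N = 8, DATA_BYTES = 8 };  /* assumed sizes */

typedef struct { int ids[N]; int head, tail; } queue_t;

static queue_t finished_q;                     /* finished read queue        */
static queue_t free_q;                         /* free entry queue           */
static uint8_t read_table_data[N][DATA_BYTES]; /* data fields of the entries */

static bool q_is_empty(const queue_t *q) { return q->head == q->tail; }

/* S801-S805: copies one finished entry's data to `out` and recycles its id.
   Returns the id that was consumed, or -1 if nothing is ready (S801). */
static int cpu_consume_read(uint8_t out[DATA_BYTES]) {
    if (q_is_empty(&finished_q)) return -1;        /* S801 */
    int id = finished_q.ids[finished_q.head];      /* S802 */
    finished_q.head = (finished_q.head + 1) % N;
    memcpy(out, read_table_data[id], DATA_BYTES);  /* S803: process data field */
    free_q.ids[free_q.tail] = id;                  /* S804: id returns to free */
    free_q.tail = (free_q.tail + 1) % N;
    return id;                                     /* S805: caller may loop    */
}
```

Note that consuming an entry immediately recycles its id into the free queue (S804), so table slots circulate rather than being allocated and freed.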

The invention also discloses an on-chip-RAM-based high-concurrency memory access method, characterized in that it comprises the following steps by which the memory access accelerator processes a read request:

Step S901: the accelerator checks in real time whether the new-read-request queue is empty; if it is not, go to step S902; if it is, keep polling at this step;

Step S902: the accelerator takes an id from the head of the new-read-request queue;

Step S903: the accelerator reads out the type and address fields of the read request table entry corresponding to that id;

Step S904: the accelerator fetches the data from memory and writes it into the data field of the read request table entry corresponding to that id;

Step S905: the accelerator writes the id to the tail of the finished queue.
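A minimal model of the accelerator's side (S901–S905) follows, with the DRAM access of step S904 replaced by a stub that fabricates data from the address; all names and sizes are assumptions for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

enum { N = 8, DATA_BYTES = 8 };  /* assumed sizes */

typedef struct { int ids[N]; int head, tail; } queue_t;
typedef struct { uint32_t type; uint64_t addr; uint8_t data[DATA_BYTES]; } entry_t;

static queue_t new_q;        /* new read request queue */
static queue_t finished_q;   /* finished read queue    */
static entry_t read_table[N];

static bool q_is_empty(const queue_t *q) { return q->head == q->tail; }

/* Stand-in for the real DRAM access of S904: fills the buffer with a
   pattern derived from the address instead of reading actual memory. */
static void memory_read(uint64_t addr, uint8_t out[DATA_BYTES]) {
    for (int i = 0; i < DATA_BYTES; i++) out[i] = (uint8_t)(addr + (uint64_t)i);
}

/* One iteration of S901-S905; returns false when there was nothing to do. */
static bool accel_step(void) {
    if (q_is_empty(&new_q)) return false;    /* S901: poll the new queue     */
    int id = new_q.ids[new_q.head];          /* S902: take id from the head  */
    new_q.head = (new_q.head + 1) % N;
    uint64_t addr = read_table[id].addr;     /* S903: read type/addr fields  */
    memory_read(addr, read_table[id].data);  /* S904: fetch, fill data field */
    finished_q.ids[finished_q.tail] = id;    /* S905: announce completion    */
    finished_q.tail = (finished_q.tail + 1) % N;
    return true;
}
```

In hardware this loop runs continuously; because each request is identified by its id rather than its queue position, nothing here prevents completions from being reported out of order, which is the property the patent relies on.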

The invention also discloses an on-chip-RAM-based high-concurrency memory access method, comprising:

Step 1: when the CPU initiates a write request, it first checks whether the write circular queue is full; if it is not, it fills in the write request's type, address, and write data;

Step 2: when the accelerator detects that the write circular queue is non-empty, it automatically reads the write request's type, address, and data from the position of the write queue's head pointer;

Step 3: the accelerator sends the write request to the memory controller.
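The write path can be sketched the same way; the queue capacity and the `send_to_memory_controller` stub are our assumptions, standing in for the hardware handoff of step 3.

```c
#include <stdbool.h>
#include <stdint.h>

enum { WQSIZE = 8 };  /* assumed write queue capacity */

typedef struct { uint32_t type; uint64_t addr; uint64_t data; } write_req_t;

static write_req_t write_q[WQSIZE];
static int wq_head, wq_tail;

static bool wq_full(void)  { return (wq_tail + 1) % WQSIZE == wq_head; }
static bool wq_empty(void) { return wq_head == wq_tail; }

/* Step 1: the CPU enqueues a write request only if the queue is not full. */
static bool cpu_issue_write(uint32_t type, uint64_t addr, uint64_t data) {
    if (wq_full()) return false;
    write_q[wq_tail] = (write_req_t){ type, addr, data };
    wq_tail = (wq_tail + 1) % WQSIZE;
    return true;
}

/* Stub for handing a request to the memory controller (step 3). */
static uint64_t last_written_addr;
static void send_to_memory_controller(const write_req_t *w) {
    last_written_addr = w->addr;
}

/* Steps 2-3: the accelerator drains the queue from its head pointer. */
static void accel_drain_writes(void) {
    while (!wq_empty()) {
        send_to_memory_controller(&write_q[wq_head]);
        wq_head = (wq_head + 1) % WQSIZE;
    }
}
```

Unlike reads, writes need no id or completion queue in this scheme: once the accelerator has consumed an entry, the slot is implicitly free again.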

The invention also discloses a processor adopting the high-concurrency memory access method or memory access accelerator of any one of claims 1 to 17.

Technical effects of the invention:

1. The on-chip RAM holds one or more read request tables (Read tables), each of whose entries contains all the information a read request needs, including the request type field, target address field, and data field. Because the invention records all information about concurrent requests in on-chip RAM, the number of concurrent requests is limited only by the size of the on-chip RAM.

2. Each entry of the read request table is classified by request state into three categories: free, new request, and finished. The entry addresses of each category are stored in their own circular queue, making it easy to manage the state of read requests. Using circular queues to manage large numbers of read and write requests avoids polling the status bits of many individual requests; the number of queries falls sharply, giving a clear speedup for large numbers of concurrent, unrelated, small-granularity memory access requests.

3. Whether to initiate a memory access operation is decided from the non-empty state of the 'new request' circular queue, and the on-chip RAM address of the read request is obtained by reading that queue's contents. The accelerator can thus issue memory accesses out of order on its own, without software control, supporting out-of-order execution and out-of-order return of memory access requests and making targeted scheduling of large numbers of requests convenient.

4. CPU software decides whether any read request has completed from the non-empty state of the 'finished' circular queue, and obtains the on-chip RAM address of the returned data by reading that queue's contents, avoiding CPU polling of many requests' status bits and improving software lookup efficiency.

Brief Description of the Drawings

Figure 1 shows a typical existing CPU memory hierarchy;

Figure 2 shows the memory architecture of a certain processor;

Figure 3 shows the position of the invention's memory access accelerator within the processor;

Figure 4 shows the Read table in the addressable space of the invention;

Figure 5 shows the state transitions of each entry of the Read table;

Figure 6 shows the use of three circular queues in the invention to manage the states of read requests;

Figure 7 shows the steps by which the CPU initiates a read request in the invention;

Figure 8 shows the steps by which the CPU processes data returned for a read request in the invention;

Figure 9 shows the steps by which the memory access accelerator processes a read request in the invention;

Figure 10 shows the use of one circular queue in the invention to manage write requests.

Detailed Description of Embodiments

Addressing the problem that general-purpose processors support only a limited number of concurrent memory access requests, the invention proposes the concept of a "memory access accelerator": an additional path between the CPU and memory.

Figure 3 shows the position of the invention's memory access accelerator within the processor. It bypasses the Cache and the MSHRs and supports at least an order of magnitude more outstanding read requests than the MSHRs do. Through the accelerator, an application can therefore send more memory access requests to the memory system, improving memory-access concurrency. The processor comprises CPU 1, RAM 3, memory access accelerator 4, Cache 2, MSHR 3, memory controller 6, and memory 7 (reference numerals as used in the figure).

The accelerator requires the CPU to have an on-chip addressable RAM space. The CPU writes memory access requests into this space, and the accelerator reads and executes them. For a read request, once the data returns from memory, the accelerator places it into the space and notifies the CPU, which then processes the data.

Figure 4 shows the read request table in the addressable space; the space must contain a table that holds read requests, called the Read table.

Each entry of the Read table corresponds to a fixed id number and holds the information of one read request. An entry has three fields, type, addr, and data, which store the read request's type, address, and data respectively. The type field encodes any additional information required, such as the requested data length, the request's priority, or whether it is a scatter/gather read. With the type field plus auxiliary hardware, advanced memory access features not supported by current architectures can be implemented. The type and addr fields are filled in by the CPU; the data field is filled in by the accelerator.
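As a hypothetical illustration, a Read table entry with the three fields just described could be modeled in C as follows; the field widths and the 64-byte data field are assumptions, not values taken from the patent.

```c
#include <stdint.h>

/* Sketch of one Read table entry (fields per the description above).
   Widths and the 64-byte payload are illustrative assumptions. */
enum { DATA_BYTES = 64 };

typedef struct {
    uint32_t type;              /* request type + extra info; filled by the CPU */
    uint64_t addr;              /* target memory address; filled by the CPU     */
    uint8_t  data[DATA_BYTES];  /* returned data; filled by the accelerator     */
} read_entry_t;

/* The CPU fills only the type and addr fields when issuing a request;
   the data field belongs to the accelerator. */
static void cpu_fill_request(read_entry_t *e, uint32_t type, uint64_t addr) {
    e->type = type;
    e->addr = addr;
}
```

Because the entry lives in addressable on-chip RAM, both the CPU and the accelerator can reach it by the entry's id alone.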

Each entry of the Read table is in one of three states: free; new read request (not yet sent to the memory controller); or finished read request (the memory controller has returned the requested data, and the data has been filled into the data field).

Figure 5 shows the state transitions of a Read table entry. An entry starts in the free state. When the CPU has a memory access request, it fills the entry in, and the entry's state becomes 'new read'. The accelerator sends the request to the memory controller and, when the data returns, fills it into the entry's data field; the state becomes 'finished read'. The CPU then takes the data from the data field and processes it, after which the entry returns to free.
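The life cycle in Figure 5 amounts to a three-state machine; a sketch follows, with enum names of our own choosing that mirror the free / new read / finished read states.

```c
/* Entry states from Figure 5: free -> new read -> finished read -> free. */
typedef enum { ENTRY_FREE, ENTRY_NEW_READ, ENTRY_FINISHED_READ } entry_state_t;

/* Advance an entry one step around the cycle described in the text:
   the CPU fills a free entry, the accelerator completes a new request,
   and the CPU consumes a finished one. */
static entry_state_t next_state(entry_state_t s) {
    switch (s) {
    case ENTRY_FREE:          return ENTRY_NEW_READ;      /* CPU fills request */
    case ENTRY_NEW_READ:      return ENTRY_FINISHED_READ; /* data has returned */
    case ENTRY_FINISHED_READ: return ENTRY_FREE;          /* CPU consumed data */
    }
    return ENTRY_FREE;
}
```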

In the above process, three key problems must be solved:

1. When the CPU issues a request, how does it find a free entry in the Read table?

2. How does the accelerator find the location of a new request entry?

3. How does the CPU find the entry of a returned read request?

To this end, the invention proposes a request management method based on circular queues.

Figure 6 shows the use of three circular queues to manage read-request states: the free entry queue, the new read queue, and the finished read queue, which store the ids of the Read table's free entries, new read-request entries, and finished read-request entries, respectively. All three queues reside in the addressable space. Each queue has two pointers, head and tail, indicating the positions of the head and tail of the queue. In the figure, A denotes a return point: the CPU can decide anew whether to keep issuing read requests or to start processing data. Before returning, no other operation may be interposed; only after returning may an operation be initiated again.
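One of these id queues, using the head == tail emptiness test from the text, can be sketched as follows. The capacity and the keep-one-slot-unused fullness test are our assumptions, since the patent does not say how a full queue is detected.

```c
#include <stdbool.h>
#include <stdint.h>

/* One circular queue of Read table ids, as in Figure 6.
   QSIZE is an assumed capacity; one slot is kept unused so that
   head == tail unambiguously means "empty". */
enum { QSIZE = 16 };

typedef struct {
    uint16_t ids[QSIZE];
    uint32_t head;  /* ids are removed here  */
    uint32_t tail;  /* ids are appended here */
} id_queue_t;

static bool q_empty(const id_queue_t *q) { return q->head == q->tail; }
static bool q_full(const id_queue_t *q)  { return (q->tail + 1) % QSIZE == q->head; }

static void q_push(id_queue_t *q, uint16_t id) {
    q->ids[q->tail] = id;
    q->tail = (q->tail + 1) % QSIZE;
}

static uint16_t q_pop(id_queue_t *q) {
    uint16_t id = q->ids[q->head];
    q->head = (q->head + 1) % QSIZE;
    return id;
}
```

This single-producer, single-consumer discipline is what allows the split ownership described earlier: one side only writes the tail, the other only writes the head, so no locking is needed between the CPU and the accelerator.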

举例说明CPU使用访存加速器发起读请求的操作的过程:Here is an example to illustrate the process of the CPU using the memory access accelerator to initiate a read request:

1.当CPU需要读取内存数据时,先查询free entry queue是否为空。若为空,则说明图4中的Read table被完全占满,暂时还没有空闲的Read table项可用;若非空,则说明Read table中还有空闲项可用。如图6所示,判断free entry queue是否为空的条件是:指针head1与指针tail1重合。1. When the CPU needs to read memory data, first check whether the free entry queue is empty. If it is empty, it means that the Read table in Figure 4 is completely occupied, and there is no free Read table item available for the time being; if it is not empty, it means that there are still free items available in the Read table. As shown in Figure 6, the condition for judging whether the free entry queue is empty is: the pointer head1 coincides with the pointer tail1.

2.CPU从free entry queue的队首取出一个id号,找到该id号对应的Read table项的地址,将新请求的type和addr域填入Read table项。同时,CPU将该id号存放到new read queue的队尾。2. The CPU takes out an id number from the head of the free entry queue, finds the address of the Read table item corresponding to the id number, and fills the type and addr fields of the new request into the Read table item. At the same time, the CPU stores the id number at the end of the new read queue.

CPU对循环队列的操作过程如图6中虚线1所示,此操作完成后,id3将被从head1位置挪到tail2的位置。The operation process of the CPU on the circular queue is shown by the dottedline 1 in Figure 6. After this operation is completed, id3 will be moved from the position of head1 to the position of tail2.

3. The CPU advances the tail2 pointer by one position and sends the new tail2 pointer to the memory access accelerator.

4. The memory access accelerator determines whether the new read queue is empty by comparing the head2 and tail2 pointers. When it detects that the new read queue is non-empty, it automatically takes a new id from the head of the queue, uses the id to locate the corresponding outstanding read request entry in the Read table, processes the request, and writes the returned data into the data field of the entry. When processing completes, it writes the id to the tail of the finished read queue.

The accelerator's operation on the circular queues is shown as dotted line 2 in Figure 6; once it completes, id9 has moved from the head2 position to the tail3 position.

5. When the CPU needs to process data, it first checks whether the finished read queue is empty, again by comparing the head and tail pointers. If the finished read queue is non-empty, the CPU takes an id, locates the corresponding Read table entry, and processes the entry's data field. When processing completes, it writes the id to the tail of the free entry queue.

The CPU's operation on the circular queues is shown as dotted line 3 in Figure 6; once it completes, id2 has moved from the head3 position to the tail1 position.

6. The above process can be repeated.
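Steps 1 to 6 can be traced end to end with a small software model. The Python sketch below simulates the protocol; the memory system is reduced to a dictionary, Python deques stand in for the pointer-managed circular queues, and all names are illustrative.

```python
from collections import deque

# Read table: each entry has type, addr, and data fields (Figure 4).
read_table = [{"type": None, "addr": None, "data": None} for _ in range(4)]

free_entry_queue = deque(range(4))  # all ids start free
new_read_queue = deque()
finished_read_queue = deque()

memory = {0x1000: 0xAB, 0x1008: 0xCD}  # stand-in for the DRAM system


def cpu_issue_read(addr):
    """Steps 1-3: the CPU claims a free entry, fills type/addr, posts the id."""
    if not free_entry_queue:  # free entry queue empty: Read table is full
        return None
    entry_id = free_entry_queue.popleft()
    read_table[entry_id]["type"] = "READ"
    read_table[entry_id]["addr"] = addr
    new_read_queue.append(entry_id)  # advances tail2, visible to the accelerator
    return entry_id


def accelerator_service():
    """Step 4: the accelerator drains new reads, fills data, posts completions."""
    while new_read_queue:  # non-empty exactly when head2 != tail2
        entry_id = new_read_queue.popleft()
        entry = read_table[entry_id]
        entry["data"] = memory[entry["addr"]]
        finished_read_queue.append(entry_id)


def cpu_consume():
    """Step 5: the CPU processes returned data and recycles the id as free."""
    results = []
    while finished_read_queue:
        entry_id = finished_read_queue.popleft()
        results.append(read_table[entry_id]["data"])
        free_entry_queue.append(entry_id)  # the entry becomes free again
    return results
```

Issuing two reads, letting the accelerator service them, and then consuming the completions returns both data words and recycles both ids into the free entry queue.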

Write requests are handled more simply. Since a write request returns no data, the memory access accelerator only needs to forward write requests issued by the CPU to the memory controller, so the management structure can be greatly simplified.

Figure 10 shows the use of a single circular queue, the write queue, to manage write requests in the present invention. Here the type, addr, and data fields of a write request are placed directly into the queue. Like the three queues above, the write queue must also reside in the addressable space.

The write queue is used as follows:

When the CPU needs to issue a new write request, it first checks whether the write queue is full (before issuing a write request, the CPU must confirm that the write queue still has space to buffer the data; a full queue means no space remains in the RAM and no further write requests can be issued). If the queue is not full, the CPU fills the type, address, and write data of the request into the position indicated by tail4.

When the memory access accelerator detects that the write queue is non-empty (i.e., it contains data, meaning write requests remain outstanding, and the accelerator automatically takes them out for execution), it reads the type, addr, and data of a write request from the position indicated by the head4 pointer and sends the write request to the memory controller.
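The write path above can be simulated in the same style. In this illustrative Python sketch, head4 and tail4 are ring indices, one slot is left open so that a full queue is distinguishable from an empty one, and a plain list stands in for the memory controller.

```python
WQ_SLOTS = 4                            # usable write queue capacity
write_queue = [None] * (WQ_SLOTS + 1)   # one slot open: full vs. empty
head4 = 0                               # accelerator's read position
tail4 = 0                               # CPU's write position
sent_to_memory_controller = []          # stand-in for the memory controller


def cpu_issue_write(wtype, addr, data):
    """CPU side: refuse when the write queue is full, else fill at tail4."""
    global tail4
    if (tail4 + 1) % len(write_queue) == head4:  # write queue full
        return False                             # no RAM space: retry later
    write_queue[tail4] = (wtype, addr, data)     # fill the slot at tail4
    tail4 = (tail4 + 1) % len(write_queue)
    return True


def accelerator_drain_writes():
    """Accelerator side: while non-empty, read at head4 and forward."""
    global head4
    while head4 != tail4:                        # non-empty: requests pending
        req = write_queue[head4]                 # type/addr/data at head4
        head4 = (head4 + 1) % len(write_queue)
        sent_to_memory_controller.append(req)    # forward to memory controller
```

After a full drain the two pointers coincide again, which is exactly the emptiness condition the text relies on.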

In summary, the present invention discloses a high-concurrency memory access accelerator based on on-chip RAM. The accelerator is independent of the on-chip Cache and the MSHR and is connected to the on-chip RAM and the memory controller; outstanding memory access requests are sent through the accelerator to the memory controller and on to the memory system.

The number of outstanding memory access requests the accelerator supports depends only on the capacity of the on-chip RAM and is not limited by the number of MSHR entries. The on-chip RAM has an addressable space: the on-chip CPU writes memory access requests into this space, and the accelerator reads and executes them. For a read request, once the data returns from the memory system, the accelerator places it into the addressable space and notifies the CPU, which then processes the data.

The on-chip RAM may belong to the on-chip CPU or be independent of it.

The addressable space holds a read request table (Read table) that stores the information of read requests; each entry of the table corresponds to a fixed id number.

Each entry of the read request table has three fields storing the type, address, and data of a read request; the type and address fields are filled in by the CPU, and the data field is filled in by the memory access accelerator.

Each entry of the read request table is in one of three states: free, new read request, or finished read request. The initial state is free. When the CPU has a memory access request, it fills in the entry and the state becomes new read request; the accelerator sends the request to the memory controller and, once the data returns, fills it into the data field, whereupon the state becomes finished read request; the CPU then fetches the data from the data field and processes it, after which the state returns to free.

The three states are managed through three circular queues, each with position pointers for the head and the tail of the queue.

The invention also discloses a high-concurrency memory access method based on on-chip RAM, comprising providing a memory access accelerator independent of the on-chip Cache and the MSHR; the accelerator is connected to the on-chip RAM and the memory controller, and outstanding memory access requests are sent through the accelerator to the memory controller and on to the memory system.

In this method, the on-chip CPU writes memory access requests into the addressable space of the on-chip RAM, and the accelerator reads and executes them. For a read request, once the data returns from the memory system, the accelerator places it into the space and notifies the CPU, which processes the data.

The invention also discloses a high-concurrency memory access method based on on-chip RAM, characterized by the following steps for the CPU to initiate a read request:

Step S701: the CPU queries the state of the free queue in the addressable space of the on-chip RAM and determines whether it is empty. If empty, return; if non-empty, go to step S702. The CPU judges the free queue empty when its head pointer coincides with its tail pointer.

Step S702: the CPU takes an id from the head of the free queue;

Step S703: the CPU fills in the type and address fields of the read request table entry corresponding to the id;

Step S704: the CPU writes the id to the tail of the new read request queue;

Step S705: the CPU passes the updated tail pointer of the new read request queue to the memory access accelerator;

Step S706: the CPU decides whether to continue issuing read requests; if so, go to step S701; if not, return.

The invention also discloses a high-concurrency memory access method based on on-chip RAM, characterized by the following steps for the CPU to process the data returned by a read request:

Step S801: the CPU queries the state of the finished queue and determines whether it is empty. If empty, return; if non-empty, go to step S802. The finished queue is judged empty when its head pointer coincides with its tail pointer.

Step S802: the CPU takes an id from the head of the finished queue;

Step S803: the CPU operates on the data field of the read request table entry corresponding to the id;

Step S804: the CPU writes the id to the tail of the free queue;

Step S805: the CPU decides whether to continue; if so, go to step S801; if not, return.

The invention also discloses a high-concurrency memory access method based on on-chip RAM, characterized by the following steps for the memory access accelerator to process a read request:

Step S901: the accelerator continuously queries whether the new read request queue is empty; if non-empty, go to step S902; if empty, keep querying at this step;

Step S902: the accelerator takes an id from the head of the new read request queue;

Step S903: the accelerator fetches the type and address fields of the read request table entry corresponding to the id;

Step S904: the accelerator retrieves the data from memory and writes it into the data field of the read request table entry corresponding to the id;

Step S905: the accelerator writes the id to the tail of the finished queue.

In this method, the condition for judging the free queue empty is that the queue head pointer coincides with the queue tail pointer.

The accelerator may issue large numbers of concurrent read requests out of order and return them out of order.
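Out-of-order return works because completions are identified by id rather than by position. The illustrative sketch below lets a simulated accelerator finish requests in an arbitrary order; the finished queue still tells the CPU exactly which entries hold valid data. All names and the completion order are assumptions made for the sketch.

```python
from collections import deque

read_table = [{"addr": None, "data": None} for _ in range(4)]
new_read_queue = deque()
finished_read_queue = deque()


def issue(entry_id, addr):
    read_table[entry_id]["addr"] = addr
    new_read_queue.append(entry_id)


# The CPU issues in order: id0, id1, id2.
for i, addr in enumerate([0x10, 0x20, 0x30]):
    issue(i, addr)

memory = {0x10: "a", 0x20: "b", 0x30: "c"}

# To model out-of-order completion, the simulation lets the accelerator
# finish requests in an arbitrary order; each completion simply appends
# its id to the finished queue.
for entry_id in [2, 0, 1]:            # out-of-order return
    new_read_queue.remove(entry_id)   # this request is picked for service
    read_table[entry_id]["data"] = memory[read_table[entry_id]["addr"]]
    finished_read_queue.append(entry_id)

# The CPU consumes completions in finish order, not issue order.
completed = [(eid, read_table[eid]["data"]) for eid in finished_read_queue]
```

No software bookkeeping beyond the queues is needed to match a returned datum to its request.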

The invention also discloses a high-concurrency memory access method based on on-chip RAM, comprising:

Step 1: when the CPU initiates a write request, it first checks whether the write circular queue is full; if not full, it fills in the type, address, and write data of the request;

Step 2: when the memory access accelerator detects that the write circular queue is non-empty, it automatically reads the type, address, and data of the write request from the queue's head pointer;

Step 3: the accelerator sends the write request to the memory controller.

The invention also discloses a processor adopting the above memory access method and memory access accelerator.

The present invention has the following features:

1. Flexible memory access granularity: granularity information is encoded in the type field, so access granularity is not constrained by the instruction set or by the Cache line size. Every datum fetched is one the software actually needs, improving the effective utilization of main memory bandwidth.

2. Advanced memory access functions can be implemented: by specifying the access type in the type field and letting the accelerator parse and execute it, advanced operations such as scatter/gather and linked-list reads and writes become possible.
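As one way such dispatch might look (the type encodings below are hypothetical, not taken from the patent), a software model of the accelerator can branch on the type field to execute a multi-address gather in a single request:

```python
memory = {0x10: 1, 0x20: 2, 0x30: 3}  # stand-in for the memory system


def accelerator_execute(request):
    """Dispatch on the request's type field (hypothetical encodings)."""
    rtype = request["type"]
    if rtype == "READ":                    # plain single-word read
        return [memory[request["addr"]]]
    if rtype == "GATHER":                  # addr holds a list of addresses
        return [memory[a] for a in request["addr"]]
    raise ValueError(f"unknown request type {rtype!r}")


single = accelerator_execute({"type": "READ", "addr": 0x20})
gathered = accelerator_execute({"type": "GATHER", "addr": [0x10, 0x30]})
```

A gather that would otherwise cost several CPU-issued requests is expressed as one Read table entry, which is the point of encoding the operation in the type field.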

3. The type field can carry upper-layer information such as thread id and priority, enabling the accelerator to perform advanced QoS scheduling.

4. The addressable space should use SRAM for the accelerator to deliver its full benefit. In this design, the CPU and the accelerator must read and write the Read table, the queues, and the queue pointers several times to complete one request, so the addressable space must be fast enough to provide acceleration. SRAM is much faster to access than DRAM and is therefore well suited here.

Technical effects of the present invention:

1. The on-chip RAM holds one or more read request tables (Read tables); each entry contains all the information a read request needs, including the request type field, target address field, and data field. Because the invention records all information of concurrent requests in on-chip RAM, the number of concurrent requests is limited only by the size of the on-chip RAM.

2. Each entry of the read request table is classified by request state into three categories: free, new request, and finished, and the entry addresses of each category are stored in their own circular queue, making the states of read requests easy to manage. Using circular queues to manage large numbers of read and write requests avoids polling the status bits of many requests and greatly reduces the number of queries, yielding a clear speedup for large numbers of concurrent, unrelated, fine-grained memory accesses.
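The saving over status-bit polling can be made concrete. In the illustrative comparison below (names and numbers are assumptions for the sketch), finding k completed requests among N outstanding ones costs a full N-entry scan with status bits, but only one emptiness test plus k dequeues with a finished queue:

```python
N = 1024
status_bits = [False] * N
for completed_id in (3, 700):       # two requests have completed
    status_bits[completed_id] = True

# Status-bit polling: one full pass over all N bits per scan.
polled = [i for i, done in enumerate(status_bits) if done]
polling_checks = N

# Finished-queue approach: one head/tail emptiness comparison, then one
# check per completed request as each id is popped.
finished_queue = [3, 700]
queue_checks = 1 + len(finished_queue)
```

Here the queue needs 3 checks where polling needs 1024, and the gap grows with the number of outstanding requests.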

3. The non-empty state of the "new request" circular queue determines whether a memory access operation should be issued, and reading the queue's contents yields the address at which the read request is stored in on-chip RAM. The accelerator can therefore issue memory accesses out of order on its own, without software control, supporting out-of-order execution and out-of-order return of requests and enabling targeted scheduling of large numbers of memory accesses.

4. CPU software uses the non-empty state of the "finished" circular queue to determine whether any read request has completed, and reads the queue's contents to obtain the address in on-chip RAM of the data returned by the read request, avoiding polling the status bits of many requests and improving software lookup efficiency.

Claims (18)

The on-chip-RAM-based high-concurrency memory access accelerator of claim 6, wherein each circular queue comprises a head pointer and a tail pointer; the head and tail pointers of the free queue and the head pointer of the finished queue are software variables maintained by the CPU; the head and tail pointers of the new read request queue and the tail pointer of the finished queue are hardware registers, the head pointer of the new read request queue being maintained by the memory access accelerator; the tail pointer of the new read request queue is maintained jointly by the CPU and the accelerator, the CPU being write-only and the accelerator read-only; and the tail pointer of the finished queue is maintained jointly by the CPU and the accelerator, the CPU being read-only and the accelerator write-only.
CN201310242398.5A · filed 2013-06-19 · High-concurrency memory access acceleration method and accelerator based on on-chip RAM, and CPU · Active · granted as CN103345429B (en)

Priority Applications (1)

CN201310242398.5A · priority/filing date 2013-06-19 · High-concurrency memory access acceleration method and accelerator based on on-chip RAM, and CPU


Publications (2)

CN103345429A (publication) · 2013-10-09
CN103345429B (grant) · 2018-03-30

Family

ID=49280227

Family Applications (1)

CN201310242398.5A · Active · High-concurrency memory access acceleration method and accelerator based on on-chip RAM, and CPU

Country Status (1)

CN · CN103345429B (en)






Also Published As

CN103345429B (en) · 2018-03-30


Legal Events

C06 / PB01 · Publication
C10 / SE01 · Entry into force of request for substantive examination
GR01 · Patent grant
