CN102214158B


Info

Publication number: CN102214158B
Application number: CN 201110152238
Authority: CN
Inventors: 朱敏; 刘雷波; 王延升; 戚斌; 杨军; 曹鹏; 时龙兴; 尹首一; 魏少军
Original assignee: Tsinghua University
Current assignee: Shenzhen Ziguang Tongchuang Electronics Co ltd
Priority date: 2011-06-08
Filing date: 2011-06-08
Publication date: 2013-05-22
Anticipated expiration: 2031-06-08
Also published as: CN102214158A

Abstract

Description


A Dynamically Reconfigurable Processor with a Fully Interconnected Routing Structure

Technical Field

The invention relates to the field of dynamically reconfigurable processors, and in particular to a dynamically reconfigurable processor with a fully interconnected routing structure.

Background

Reconfigurable computing is a computing approach that combines the flexibility of software with the efficiency of hardware; the field programmable gate array (FPGA) is one concrete example. Unlike an ordinary microprocessor, a reconfigurable device can change not only the control flow but also the structure of the data path, which gives it high performance, low hardware overhead and power consumption, good flexibility, and good scalability. At present it is mainly applied to compute-intensive algorithms such as media processing, pattern recognition, and baseband processing. As embedded processors face general pressure to shorten design cycles and reduce design and development costs, and as uncertainty in end markets and technology grows, reconfigurable processing is gradually becoming an international trend in embedded processor development. It is also used in many high-performance computing fields, including structural analysis, computational fluid dynamics, molecular simulation, bioinformatics, computational chemistry, seismic geology (oil and gas exploration), numerical weather prediction, and cosmology research.

New semiconductor processes bring circuits of tens of millions of gates to reconfigurable hardware and so provide sufficient area; in terms of speed, reconfigurable hardware is approaching application-specific custom chips. Under these influences, reconfigurable computing is moving toward dynamic reconfiguration, coarse-grained parallel hardware, and heterogeneous multi-core designs. For example, the ADRES processor of the European microelectronics center IMEC consists of a tightly coupled very long instruction word (VLIW) processor core and reconfigurable hardware for coarse-grained parallel matrix computation, while the CHESS processor of Hewlett-Packard (HP) consists of a large number of reconfigurable arithmetic array modules.

A reconfigurable processor basically consists of a main controller and reconfigurable computing units. The computing units are organized as an array (the basic form of parallel hardware) to increase processing capability, while a flexible interconnection structure keeps the array general-purpose. Dynamically reconfigurable interconnection between computing units is one of the key technologies for implementing a dynamically reconfigurable processor chip.

The interconnection structure must preserve the flexibility of the array while also taking the bandwidth of the external connection switches into account, so as to raise the computing throughput of the chip. Some existing interconnection structures adopt the connecting-box and switching-box structure used in FPGAs. This is very flexible, but there are too many configuration points and the volume of reconfiguration information is too large, so reconfiguration cannot be completed dynamically, area utilization drops, and the range of applications (for example, embedded use) is limited. Other dynamically reconfigurable array modules use a mesh full interconnection on a network-on-chip (NoC) together with transport protocol packets. The mesh full interconnection has a very large area cost, which reduces the number of computing units that fit in a given chip area; area efficiency is therefore again low, and the ever larger computing scale demanded by applications cannot be met. The protocol packet structure resembles data exchange between Ethernet nodes; it introduces extra protocol circuitry and noticeably lowers transmission efficiency.

Summary of the Invention

The technical problem to be solved by the invention is to provide a dynamically reconfigurable processor with a fully interconnected routing structure that improves area efficiency and transmission efficiency.

To solve the above problem, the invention discloses a dynamically reconfigurable processor with a fully interconnected routing structure, comprising an input buffer array module, a reconfigurable array module, an output buffer array module, connection switch 1, connection switch 2, connection switch 3, and connection switch 4; adjacent columns of the reconfigurable array module are fully interconnected; the width of connection switch 1 equals the column width of the input buffer array module, the widths of connection switch 2 and connection switch 3 equal the column width of the reconfigurable array module, and the width of connection switch 4 equals the column width of the output buffer array module;

the input buffer array module is interconnected with connection switch 1, connection switch 1 is fully interconnected with connection switch 2, connection switch 2 is interconnected with the reconfigurable array module, the reconfigurable array module is interconnected with connection switch 3, connection switch 3 is fully interconnected with connection switch 4, and connection switch 4 is interconnected with the output buffer array module;

wherein connection switch 1 comprises as many sub-switches as the input buffer array module has rows, and each sub-switch of connection switch 1 is connected to one row of buffer units of the input buffer array module;

connection switch 2 comprises as many sub-switches as the reconfigurable array module has rows, and one end of each sub-switch of connection switch 2 is connected to one row of computing units of the reconfigurable array module;

connection switch 3 comprises as many sub-switches as the reconfigurable array module has rows, and one end of each sub-switch of connection switch 3 is connected to one row of computing units of the reconfigurable array module;

connection switch 4 comprises as many sub-switches as the output buffer array module has rows, and one end of each sub-switch of connection switch 4 is connected to one row of buffer units of the output buffer array module.

Preferably, the input buffer array module is an input FIFO and the output buffer array module is an output FIFO.

Preferably, the columns of the reconfigurable array module are interconnected to form a one-dimensional ring structure.

Compared with the prior art, the invention has the following advantages:

The invention adopts a layered full interconnection. The input buffer array module is connected to the reconfigurable array module through two independent connection switches that are fully interconnected with each other, and the reconfigurable array module is connected to the output buffer array module through two further independent, mutually fully interconnected connection switches. This reduces hardware overhead and configuration-information overhead and improves area efficiency.

Brief Description of the Drawings

Fig. 1 is a schematic diagram of a layered fully interconnected structure of the invention;

Fig. 2 is a schematic diagram of the traditional fully interconnected structure;

Fig. 3 is a schematic diagram of the select-one circuit from the input buffer to connection switch 1 in an embodiment of the invention;

Fig. 4 is a schematic diagram of the circuit from connection switch 1 to connection switch 2 in an embodiment of the invention;

Fig. 5 is a schematic diagram of the select-one circuit from connection switch 2 to the reconfigurable array module in an embodiment of the invention.

Detailed Description

To make the above objects, features, and advantages of the invention easier to understand, the invention is described in further detail below with reference to the drawings and specific embodiments.

Fig. 1 shows a schematic diagram of a layered fully interconnected structure of the invention.

The structure comprises input buffer array module 11, reconfigurable array module 14, output buffer array module 17, connection switch 1 (12), connection switch 2 (13), connection switch 3 (15), and connection switch 4 (16). Adjacent columns of the reconfigurable array module are fully interconnected. The width of connection switch 1 (12) equals the column width of input buffer array module 11, the widths of connection switch 2 (13) and connection switch 3 (15) equal the column width of reconfigurable array module 14, and the width of connection switch 4 (16) equals the column width of output buffer array module 17.

Input buffer array module 11 is interconnected with connection switch 1 (12), connection switch 1 (12) is fully interconnected with connection switch 2 (13), connection switch 2 (13) is interconnected with reconfigurable array module 14, reconfigurable array module 14 is interconnected with connection switch 3 (15), connection switch 3 (15) is fully interconnected with connection switch 4 (16), and connection switch 4 (16) is interconnected with output buffer array module 17.

The input buffer array module has size $2^{m_1} \times 2^{n_1}$, where $2^{m_1}$ is the number of rows and $2^{n_1}$ the number of columns. The reconfigurable array module has size $2^{m_2} \times 2^{n_2}$, where $2^{m_2}$ is the number of rows and $2^{n_2}$ the number of columns. The output buffer array module has size $2^{m_3} \times 2^{n_3}$, where $2^{m_3}$ is the number of rows and $2^{n_3}$ the number of columns, and its size is consistent with that of the input buffer.

Data enter through input buffer array module 11, which selects one column of data and passes it to connection switch 1 (12). Connection switch 1 (12) forwards the data to connection switch 2 (13), which selects one column of computing units in reconfigurable array module 14 to receive the data. From that column the data can only pass to the next column of computing units, with which it is fully interconnected, for further processing. When processing in the reconfigurable array module finishes at some column, that column of computing units is connected to connection switch 3 (15) and the data are sent from the reconfigurable array module to connection switch 3 (15), which forwards them to connection switch 4 (16). Connection switch 4 (16) selects one column of the output buffer to connect to and outputs the data.

Each row of the input buffer array module is fully interconnected with connection switch 1 as required; each row of the reconfigurable array module is fully interconnected with connection switch 2 as required; each row of the reconfigurable array module is fully interconnected with connection switch 3 as required; and each row of the output buffer is fully interconnected with connection switch 4 as required.

The input buffer array module is preferably an input FIFO (first in, first out: a data buffer in which the first datum written is the first read), and the output buffer array module is preferably an output FIFO.

Preferably, the columns of the reconfigurable array module are interconnected to form a one-dimensional ring structure. For example, if the reconfigurable array module has 4 columns from left to right, column 1 connects to column 2, column 2 to column 3, column 3 to column 4, and column 4 back to column 1, with full interconnection between each pair of adjacent columns.
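The ring connection described above can be sketched in a few lines of Python. This is only an illustration: the function names `ring_successor` and `ring_links` are hypothetical, and columns are indexed from 0 internally and shifted to 1..4 for printing to match the text.

```python
def ring_successor(col: int, num_cols: int) -> int:
    """Return the column that receives data from `col` in a one-dimensional ring (0-based)."""
    return (col + 1) % num_cols


def ring_links(num_cols: int) -> list:
    """List the (source, destination) column pairs of the ring, 0-based."""
    return [(c, ring_successor(c, num_cols)) for c in range(num_cols)]


# Four columns, renumbered 1..4 as in the text.
links = [(src + 1, dst + 1) for src, dst in ring_links(4)]
print(links)  # [(1, 2), (2, 3), (3, 4), (4, 1)]
```

Each listed pair stands for a full interconnection between the two columns, not a single wire.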

Connection switch 1 comprises as many sub-switches as the input buffer array module has rows, and each sub-switch of connection switch 1 is connected to one row of buffer units of the input buffer array module.

Connection switch 2 comprises as many sub-switches as the reconfigurable array module has rows, and one end of each sub-switch of connection switch 2 is connected to one row of computing units of the reconfigurable array module.

Connection switch 3 comprises as many sub-switches as the reconfigurable array module has rows, and one end of each sub-switch of connection switch 3 is connected to one row of computing units of the reconfigurable array module.

Connection switch 4 comprises as many sub-switches as the output buffer array module has rows, and one end of each sub-switch of connection switch 4 is connected to one row of buffer units of the output buffer array module.

To illustrate the advantages of the invention fully, the traditional fully interconnected structure is described below for comparison.

Fig. 2 shows a schematic diagram of the traditional fully interconnected structure.

In the traditional structure, all units of input buffer module 21 are fully interconnected with all units of reconfigurable array module 22, all units of reconfigurable array module 22 are fully interconnected with one another, and all units of reconfigurable array module 22 are fully interconnected with all units of output buffer module 23. For example, if the input buffer module is a 4x2 array, the reconfigurable array module a 4x4 array, and the output buffer module a 4x2 array, then each of the 8 units of the input buffer module is connected to all 16 units of the reconfigurable array module, the 16 units of the reconfigurable array module are connected to one another, and each of the 8 units of the output buffer module is connected to all 16 units of the reconfigurable array module. Here the input buffer array module has size $2^{m_1} \times 2^{n_1}$, the reconfigurable array module has size $2^{m_2} \times 2^{n_2}$, and the output buffer array module has size $2^{m_3} \times 2^{n_3}$, consistent with the input buffer.
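To give a feel for how quickly the traditional structure grows, the following Python sketch counts point-to-point links for the 4x2 / 4x4 / 4x2 example. The function name `traditional_link_counts` is hypothetical, and counting the links inside the array as unordered pairs of units is an assumption made for illustration; the text only states that the 16 units are connected to one another.

```python
def traditional_link_counts(input_units: int, array_units: int, output_units: int) -> dict:
    """Count point-to-point links when every stage is fully interconnected."""
    return {
        "input_to_array": input_units * array_units,
        # Assumption: one link per unordered pair of array units.
        "array_internal": array_units * (array_units - 1) // 2,
        "array_to_output": array_units * output_units,
    }


counts = traditional_link_counts(4 * 2, 4 * 4, 4 * 2)
print(counts)  # {'input_to_array': 128, 'array_internal': 120, 'array_to_output': 128}
```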

In the traditional fully interconnected structure, the reconfigurable array module must adopt a full-interconnection strategy to guarantee sufficient flexibility. As described above, the interconnection is divided into three parts:

1) Interconnection between the input and the computing units

2) Interconnection between computing units

3) Interconnection between the computing units and the output

The data interconnection patterns shown in the figure are examples only.

In the figure, the input buffer is $2^{m_1}$ data wide, and the data depth required for one operation is $2^{n_1}$ lines (a line here corresponds to one column of the input buffer array module as its size is defined above). (Because the data required by one task graph may exceed the buffer width $2^{m_1}$, the parameter $2^{n_1}$ is greater than or equal to 1.) (Powers of 2 are used because they map directly onto the hardware implementation: the exponent is the number of bits in the encoded address.)

The array has size $2^{m_2} \times 2^{n_2}$, and data flow along the $2^{n_2}$ direction. (Because computing circuits are always directional, the reconfigurable array module can always be reduced to a one-dimensional computing data flow.)

Likewise, the output buffer has size $2^{m_3} \times 2^{n_3}$, consistent with the input buffer.

To make the hardware scale of the interconnection clearer, consider a simple example first: two rows of data are fully interconnected, where the first row holds $2^{m}$ data and is the source, and the second row holds $2^{n}$ data and is the target.

1) First select 1 of the $2^{m}$ source data for target address 1:

Mux cost: $2^{m-1} + 2^{m-2} + \dots + 2^{0} = 2^{m} - 1$ (a mux here is the basic 2-to-1 circuit unit)

2) Then repeat the above operation $2^{n} - 1$ more times so that all $2^{n}$ targets receive input data:

By the multiplication principle, the total is $(2^{m} - 1) \cdot 2^{n}$.
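The mux count in this two-step example can be checked numerically. The sketch below assumes, as the text does, that each target selects its source through a binary tree of 2-to-1 muxes; the function names are hypothetical.

```python
def mux_tree_cost(m: int) -> int:
    """2-to-1 muxes needed to select one of 2**m sources with a binary tree."""
    # Levels of the tree hold 2^(m-1), 2^(m-2), ..., 2^0 muxes.
    return sum(2 ** level for level in range(m))


def full_interconnect_cost(m: int, n: int) -> int:
    """Muxes needed so that each of 2**n targets can select any of 2**m sources."""
    return mux_tree_cost(m) * 2 ** n


print(mux_tree_cost(4))              # 15, i.e. 2^4 - 1
print(full_interconnect_cost(2, 2))  # 12, i.e. (2^2 - 1) * 2^2
```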

The hardware cost of the three parts of the traditional fully interconnected structure (Fig. 2) is therefore as follows:

1) Interconnection between the input and the computing units:

$(2^{m_1} \cdot 2^{n_1} - 1) \cdot (2^{m_2} \cdot 2^{n_2})$ (1.1)

2) Interconnection between computing units:

$(2^{m_2} - 1) \cdot 2^{m_2} \cdot \frac{(2^{n_2} - 1 + 1)}{2} \cdot (2^{n_2} - 1) = (2^{m_2} - 1) \cdot (2^{n_2} - 1) \cdot 2^{m_2} \cdot 2^{n_2 - 1}$ (1.2)

3) Interconnection between the computing units and the output:

$(2^{m_2} \cdot 2^{n_2} - 1) \cdot (2^{m_3} \cdot 2^{n_3})$ (1.3)
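As a rough numerical illustration, formulas (1.1) to (1.3) can be evaluated directly. The function name `traditional_cost` is hypothetical, and the parameter values in the example ($m_1 = m_2 = n_2 = m_3 = 4$, $n_1 = n_3 = 1$) are assumptions chosen to match the worked figures quoted elsewhere in the description, with the output buffer taken to be the same size as the input buffer.

```python
def traditional_cost(m1: int, n1: int, m2: int, n2: int, m3: int, n3: int) -> tuple:
    """Mux counts of the traditional full interconnection, per formulas (1.1)-(1.3)."""
    input_part = (2**m1 * 2**n1 - 1) * (2**m2 * 2**n2)             # (1.1)
    array_part = (2**m2 - 1) * (2**n2 - 1) * 2**m2 * 2**(n2 - 1)   # (1.2)
    output_part = (2**m2 * 2**n2 - 1) * (2**m3 * 2**n3)            # (1.3)
    return input_part, array_part, output_part


# Assumed example: 16x2 input buffer, 16x16 array, 16x2 output buffer.
print(traditional_cost(4, 1, 4, 4, 4, 1))  # (7936, 28800, 8160)
```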

Assuming that $m$ and $n$ grow in proportion, the hardware scale is proportional to $2^{m} \cdot 2^{n}$.

Fig. 1, by contrast, is a schematic diagram of the layered fully interconnected structure of the invention.

The input buffer selects one of its $2^{n_1}$ lines and passes it to connection switch 1; connection switch 1 and connection switch 2 are fully interconnected; and one of the $2^{n_2}$ lines of the array is then selected and connected to connection switch 2. This completes the interconnection from the input buffer to the array. The array is connected to the output buffer in the same way. Inside the reconfigurable array module, full interconnection is used only between one line and the next.

This interconnection scheme exploits the locality of data in algorithms: data that are adjacent in the algorithm are, in most cases, stored close together in memory. The input can therefore draw several data from the same line, and the position at which data enter the array can be adjusted by passing the data straight through idle units in the array until they reach the required input position. When an algorithm is mapped onto the reconfigurable array module, the computation paths should also be kept fairly balanced so that the whole circuit graph achieves high efficiency when executed as a loop pipeline. Line-to-line interconnection is therefore also reasonable from the standpoint of algorithm mapping.

The hardware cost is calculated as follows:

First, selecting one line of $2^{m_1}$ data out of $2^{n_1}$ lines requires $(2^{n_1} - 1) \cdot 2^{m_1}$ muxes. Next, the full interconnection between $2^{m_1}$ and $2^{m_2}$ data requires $(2^{m_1} - 1) \cdot 2^{m_2}$ muxes. Finally, selecting one line of $2^{m_2}$ data out of the $2^{n_2}$ lines of the array requires $(2^{n_2} - 1) \cdot 2^{m_2}$ muxes.

The hardware cost of the three parts of the layered interconnection structure of the invention (Fig. 1) is therefore as follows:

1) Interconnection between the input and the computing units:

$(2^{n_1} - 1) \cdot 2^{m_1} + (2^{m_1} - 1) \cdot 2^{m_2} + (2^{n_2} - 1) \cdot 2^{m_2}$ (2.1)

2) Interconnection between computing units:

$(2^{m_2} - 1) \cdot 2^{m_2} \cdot (2^{n_2} - 1)$ (2.2)

3) Interconnection between the computing units and the output:

$(2^{n_2} - 1) \cdot 2^{m_2} + (2^{m_2} - 1) \cdot 2^{m_3} + (2^{n_3} - 1) \cdot 2^{m_3}$ (2.3)
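Formulas (2.1) to (2.3) can be evaluated in the same way. The function name `hierarchical_cost` is hypothetical, and the example parameters ($m_1 = m_2 = n_2 = m_3 = 4$, $n_1 = n_3 = 1$) are assumptions for illustration rather than values fixed by the description for all three parts.

```python
def hierarchical_cost(m1: int, n1: int, m2: int, n2: int, m3: int, n3: int) -> tuple:
    """Mux counts of the layered interconnection, per formulas (2.1)-(2.3)."""
    input_part = ((2**n1 - 1) * 2**m1      # select one line of the input buffer
                  + (2**m1 - 1) * 2**m2    # switch 1 to switch 2, fully interconnected
                  + (2**n2 - 1) * 2**m2)   # select one line of the array      (2.1)
    array_part = (2**m2 - 1) * 2**m2 * (2**n2 - 1)                            # (2.2)
    output_part = ((2**n2 - 1) * 2**m2     # select one line of the array
                   + (2**m2 - 1) * 2**m3   # switch 3 to switch 4, fully interconnected
                   + (2**n3 - 1) * 2**m3)  # select one line of the output buffer (2.3)
    return input_part, array_part, output_part


print(hierarchical_cost(4, 1, 4, 4, 4, 1))  # (496, 3600, 496)
```

For these parameters formula (1.1) evaluates to 7936 and formula (1.2) to 28800, so the input part shrinks by a factor of 7936 / 496 = 16 and the array part by 28800 / 3600 = 8.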

Comparing the input interconnection of the invention with that of the traditional scheme: $(1.1) / (2.1) \cong \frac{2^{m_1 + n_1 + m_2 + n_2}}{2^{m_1 + n_1} + 2^{m_1 + m_2} + 2^{m_2 + n_2}}$. In general $n_1$ is small (physically: because the input buffer is wide, a single loop iteration uses few buffer lines), so the ratio reduces to approximately $2^{n_1} \cdot \frac{2^{m_1 + n_2}}{2^{m_1} + 2^{n_2}}$. Thus when one loop iteration uses 2 buffer lines, the array has 16 lines, and one line of the input buffer holds 16 data ($n_1 = 1$, $n_2 = 4$, $m_1 = 4$), the hardware overhead of the present scheme is about 1/16 of the original. The scheme makes good use of the characteristics of algorithm mapping and greatly reduces the hardware overhead of the array while preserving its flexibility.

Comparing the array interconnection of the invention with that of the traditional scheme: $(1.2) / (2.2) = 2^{n_2 - 1}$. When the array has 16 lines ($n_2 = 4$), the hardware overhead of the present scheme is about 1/8 of the original.

The factors by which the hardware overhead of the three parts is reduced are as follows:

1) Interconnection between the input and the computing units:

$2^{n_1} \cdot \frac{2^{m_1 + n_2}}{2^{m_1} + 2^{n_2}}$ (3.1)

2) Interconnection between computing units:

$2^{n_2 - 1}$ (3.2)

3) Interconnection between the computing units and the output:

$2^{n_3} \cdot \frac{2^{m_3 + n_2}}{2^{m_3} + 2^{n_2}}$ (3.3)
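The step from the cost formulas to factor (3.1) is not spelled out. One plausible reading, assuming that the $-1$ terms are negligible and that $2^{m_1 + n_1}$ is small compared with the other two terms of the denominator (the "small $n_1$" condition), is:

$$
\begin{aligned}
\frac{(1.1)}{(2.1)} &= \frac{(2^{m_1 + n_1} - 1)\, 2^{m_2 + n_2}}{(2^{n_1} - 1)\, 2^{m_1} + (2^{m_1} - 1)\, 2^{m_2} + (2^{n_2} - 1)\, 2^{m_2}} \\
&\approx \frac{2^{m_1 + n_1 + m_2 + n_2}}{2^{m_1 + n_1} + 2^{m_1 + m_2} + 2^{m_2 + n_2}} \\
&\approx \frac{2^{m_1 + n_1 + m_2 + n_2}}{2^{m_2}\,(2^{m_1} + 2^{n_2})} = 2^{n_1} \cdot \frac{2^{m_1 + n_2}}{2^{m_1} + 2^{n_2}}
\end{aligned}
$$

With $n_1 = 1$ and $m_1 = n_2 = 4$ this gives $2 \cdot 256 / 32 = 16$. The same steps applied to $(1.3)/(2.3)$, dropping $2^{m_3 + n_3}$ instead, give factor (3.3), and $(1.2)/(2.2)$ gives $2^{n_2 - 1}$ exactly.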

Because every mux must be configured, the number of configuration points is reduced by the same factor as the hardware overhead. The amount of configuration information is calculated slightly differently. In the original scheme, assuming all configuration information is concentrated at the computing units, the input part requires $(m_1 + n_1) \cdot 2^{m_2} \cdot 2^{n_2}$, the output part requires $(m_3 + n_3) \cdot 2^{m_2} \cdot 2^{n_2}$, and the interconnection between computing units requires $(m_2 + n_2) \cdot 2^{m_2} \cdot 2^{n_2}$. In the present scheme, the input part requires $n_1 \cdot 2^{n_2} + m_1 \cdot 2^{m_2}$, the output part requires $n_3 \cdot 2^{n_2} + m_3 \cdot 2^{m_2}$, and the interconnection between computing units requires $m_2 \cdot 2^{m_2} \cdot 2^{n_2}$.

The factors by which the amount of configuration information of the three parts is reduced are as follows:

1) Input interconnection configuration information:

$\frac{(m_1 + n_1) \cdot 2^{m_2 + n_2}}{m_1 \cdot 2^{m_2} + n_1 \cdot 2^{n_2}} \cong 2^{n_2}$ (3.4)

2) Computing unit interconnection configuration information:

$\frac{(m_2 + n_2) \cdot 2^{m_2 + n_2}}{m_2 \cdot 2^{m_2 + n_2}} \cong \frac{m_2 + n_2}{m_2}$ (3.5)

3) Output interconnection configuration information:

$\frac{(m_3 + n_3) \cdot 2^{m_2 + n_2}}{m_3 \cdot 2^{m_2} + n_3 \cdot 2^{n_2}} \cong 2^{n_2}$ (3.6)

When the array size is 16x16 ($m_2 = n_2 = 4$), the configuration information at the input is reduced to 1/16 of the original, the computing unit interconnection configuration information to 1/2, and the configuration information at the output to 1/16.
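These reduction factors can be reproduced from the configuration-information expressions given above. In the sketch below the function names are hypothetical, and the buffer parameters $m_1 = m_3 = 4$, $n_1 = n_3 = 1$ are assumptions; the text fixes only $m_2 = n_2 = 4$ for this example.

```python
def traditional_config_bits(m1, n1, m2, n2, m3, n3):
    """Configuration information of the original scheme: (input, array, output)."""
    units = 2**m2 * 2**n2  # all configuration sits at the computing units
    return (m1 + n1) * units, (m2 + n2) * units, (m3 + n3) * units


def hierarchical_config_bits(m1, n1, m2, n2, m3, n3):
    """Configuration information of the layered scheme: (input, array, output)."""
    return (n1 * 2**n2 + m1 * 2**m2,
            m2 * 2**m2 * 2**n2,
            n3 * 2**n2 + m3 * 2**m2)


params = (4, 1, 4, 4, 4, 1)  # assumed: 16x2 buffers, 16x16 array
old = traditional_config_bits(*params)
new = hierarchical_config_bits(*params)
ratios = [o / n for o, n in zip(old, new)]
print(old, new, ratios)  # (1280, 2048, 1280) (80, 1024, 80) [16.0, 2.0, 16.0]
```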

For comparison, the following is an example of the area efficiency of the traditional fully interconnected structure:

The input buffer width is 16 with a configurable depth of 8 per loop, the array size is 8x8, and the output buffer width is 16 with a configurable depth of 8 per loop. The data granularity is 16 bits. Using the tsmc65lp process, the measured area share of each part is as follows:

| Type | Area (um²) | Percentage |
| --- | --- | --- |
| Data storage structure | 249974 | 23.2% |
| Output interconnect | 211800 | 19.6% |
| Input interconnect and array interconnect | 328360 | 30.4% |
| Reconfigurable computing units | 289456 | 26.8% |

Because the array is scalable, the area share of each part can be extrapolated for array sizes of 16x16 and 32x32 as follows (with the input and output structures scaled in proportion):

When the length and width of the array are each expanded by a factor of 4 or more, over 95% of the area of the reconfigurable processor is consumed by interconnection. The interconnection area grows from about 2 times the area of the computing units at 8x8 to more than 30 times at 32x32. At that point less than 5% of the area performs actual computation and more than 95% is used to move data; such an area efficiency is badly out of proportion and unacceptable.

The following is an example of the area efficiency of the invention:

With the layered fully interconnected structure of the invention, the area shares become:

| Type | Area (um²) | Percentage |
| --- | --- | --- |
| Data storage structure | 249974 | 39.7% |
| Output interconnect | 35300 | 5.6% |
| Input interconnect and array interconnect | 54700 | 8.7% |
| Reconfigurable computing units | 289456 | 46.0% |
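The percentages in the two area tables, and the overall reduction in interconnect area, can be recomputed from the listed areas. The sketch below uses only the area figures given in the tables; the dictionary keys and the helper names are hypothetical.

```python
traditional = {
    "data storage": 249974,
    "output interconnect": 211800,
    "input and array interconnect": 328360,
    "computing units": 289456,
}
hierarchical = {
    "data storage": 249974,
    "output interconnect": 35300,
    "input and array interconnect": 54700,
    "computing units": 289456,
}


def shares(areas: dict) -> dict:
    """Percentage of total area per part, rounded to one decimal place."""
    total = sum(areas.values())
    return {name: round(100 * area / total, 1) for name, area in areas.items()}


def interconnect_area(areas: dict) -> int:
    """Sum of the two interconnect entries, in um^2."""
    return areas["output interconnect"] + areas["input and array interconnect"]


print(shares(traditional))   # 23.2, 19.6, 30.4, 26.8
print(shares(hierarchical))  # 39.7, 5.6, 8.7, 46.0
print(interconnect_area(traditional) / interconnect_area(hierarchical))  # about 6.0
```

For this 8x8 array the interconnect shrinks from 540160 um² to 90000 um², roughly a sixfold reduction, while storage and computing areas are unchanged.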

When the array is scaled up:

When the length and width of the array are each expanded by a factor of 4, about 1/3 of the area of the reconfigurable processor is used for interconnection, 1/3 for computation, and 1/3 for storage, which keeps area efficiency reasonable. The interconnection overhead of the input and output parts grows in step with the number of computing units; the main growth comes from the increased array width, which makes the interconnection inside the array grow faster than linearly. In general, even for larger arrays, the area efficiency of the whole reconfigurable processor can still be maintained at

Compared with the traditional fully interconnected structure, this is a very efficient use of area.

Therefore, in terms of both area efficiency and the compression of configuration information (which serves dynamic reconfiguration), the layered fully interconnected structure described above is a very good interconnection structure for a dynamically reconfigurable processor.

图3示出了当输入FIFO宽度为4(也即行数为4)，单次循环数据深度为2(也即为列数为2)，可重构阵列模块大小为4x4，输出FIFO宽度为4(也即行数为4)，单次循环数据深度为2(也即为列数为2)时，输入FIFO到连接开关一的电路示意图。p1为输入FIFO的宽度(也即行数)，p2为输入FIFO的深度(即列数)。 Figure 3 shows that when the input FIFO width is 4 (that is, the number of rows is 4), the data depth of a single cycle is 2 (that is, the number of columns is 2), the reconfigurable array module size is 4x4, and the output FIFO width is 4 (that is, the number of rows is 4), and when the data depth of a single cycle is 2 (that is, the number of columns is 2), the schematic diagram of the circuit from the input FIFO to the connection switch 1. p1 is the width of the input FIFO (that is, the number of rows), and p2 is the depth of the input FIFO (that is, the number of columns). the

Figure 4 is a schematic diagram of the circuit from connection switch 1 to connection switch 2 for the same case: the input FIFO width is 4 (that is, 4 rows) with a per-loop data depth of 2 (that is, 2 columns), the reconfigurable array module is 4x4, and the output FIFO width is 4 (that is, 4 rows) with a per-loop data depth of 2 (that is, 2 columns).

Figure 5 is a schematic diagram of the circuit from connection switch 2 to the reconfigurable array module for the same case: the input FIFO width is 4 (that is, 4 rows) with a per-loop data depth of 2 (that is, 2 columns), the reconfigurable array module is 4x4, and the output FIFO width is 4 (that is, 4 rows) with a per-loop data depth of 2 (that is, 2 columns). q1 is row address 0, and q2 is row address 1.

In switch a1 in the figure, {row address 0, row address 1} = 11 selects the first row, {row address 0, row address 1} = 01 selects the second row, {row address 0, row address 1} = 10 selects the third row, and {row address 0, row address 1} = 00 selects the fourth row.

When {row address 0, row address 1} = 11: switches s1 and s5 are both on, so the first row is selected; switch s2 is off and switch s6 is on, so the second row is not selected; switch s3 is on and switch s7 is off, so the third row is not selected; switches s4 and s8 are both off, so the fourth row is not selected.
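The row selection described above can be sketched as a small truth-table model. This is only an illustrative sketch, not the circuit of Figure 5 itself. It assumes that each row i is gated by two switches in series, s(i) driven by row address 0 (q1) and s(i+4) driven by row address 1 (q2); this pairing is inferred from the single case {row address 0, row address 1} = 11 given in the text. The names `ROW_CODES`, `switch_states` and `selected_row` are hypothetical.

```python
# Hypothetical model of sub-switch a1: row -> (row address 0, row address 1)
# that selects it, as listed in the description.
ROW_CODES = {1: (1, 1), 2: (0, 1), 3: (1, 0), 4: (0, 0)}


def switch_states(q1, q2):
    """Return the on/off state of s1..s8 for address bits q1, q2.

    Assumption: s1..s4 respond to q1 and s5..s8 respond to q2, and a
    switch is on when its address bit matches the code of its row.
    """
    states = {}
    for row, (need_q1, need_q2) in ROW_CODES.items():
        states[f"s{row}"] = (q1 == need_q1)      # first switch of the row
        states[f"s{row + 4}"] = (q2 == need_q2)  # second switch of the row
    return states


def selected_row(q1, q2):
    """A row is selected only when both of its switches are on."""
    states = switch_states(q1, q2)
    rows = [r for r in ROW_CODES
            if states[f"s{r}"] and states[f"s{r + 4}"]]
    assert len(rows) == 1  # the four codes are distinct, so exactly one row
    return rows[0]


if __name__ == "__main__":
    for q1 in (1, 0):
        for q2 in (1, 0):
            print(f"{{q1, q2}} = {q1}{q2} -> row {selected_row(q1, q2)}")
```

Under this assumption, the address 11 reproduces the switch states listed in the preceding paragraph, and each of the four address codes turns on both switches of exactly one row.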

The switch circuit can be implemented in many ways; a transistor, for example, is a switch.

The dynamically reconfigurable processor with a fully interconnected routing structure provided by the present invention has been described in detail above. Specific examples have been used herein to explain the principle and embodiments of the invention, and the description of the above embodiments is intended only to help in understanding the method of the invention and its core idea. At the same time, a person of ordinary skill in the art may, following the idea of the invention, make changes to the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims

1. A dynamically reconfigurable processor with a fully interconnected routing structure, characterized in that:

it comprises an input buffer array module, a reconfigurable array module, an output buffer array module, connection switch 1, connection switch 2, connection switch 3 and connection switch 4; adjacent rows of the reconfigurable array module are fully interconnected; the width of connection switch 1 is the same as the column width of the input buffer array module, the widths of connection switch 2 and connection switch 3 are the same as the column width of the reconfigurable array module, and the width of connection switch 4 is the same as the column width of the output buffer array module;

the input buffer array module is interconnected with connection switch 1, connection switch 1 is fully interconnected with connection switch 2, connection switch 2 is interconnected with the reconfigurable array module, the reconfigurable array module is interconnected with connection switch 3, connection switch 3 is fully interconnected with connection switch 4, and connection switch 4 is interconnected with the output buffer array module;

wherein connection switch 1 comprises as many sub-switches as the input buffer array module has rows, and each sub-switch of connection switch 1 is connected to one row of buffer units of the input buffer array;

connection switch 2 comprises as many sub-switches as the reconfigurable array module has rows, and one end of each sub-switch of connection switch 2 is connected to one row of computing units of the reconfigurable array module;

connection switch 3 comprises as many sub-switches as the reconfigurable array module has rows, and one end of each sub-switch of connection switch 3 is connected to one row of computing units of the reconfigurable array module;

connection switch 4 comprises as many sub-switches as the output buffer array module has rows, and one end of each sub-switch of connection switch 4 is connected to one row of buffer units of the output buffer array module.

2. The dynamically reconfigurable processor with a fully interconnected routing structure as claimed in claim 1, characterized in that:

the input buffer array module is an input FIFO, and the output buffer array module is an output FIFO.

3. The dynamically reconfigurable processor with a fully interconnected routing structure as claimed in claim 1, characterized in that:

the rows of the reconfigurable array module are interconnected to form a one-dimensional ring structure.