CN115437602A


Info

Publication number: CN115437602A
Application number: CN202210990132.8A
Authority: CN
Inventors: Not disclosed
Original assignee: Cambricon Technologies Corp Ltd
Current assignee: Cambricon Technologies Corp Ltd
Priority date: 2021-10-20
Filing date: 2021-10-20
Publication date: 2022-12-06
Also published as: WO2023065701A1; CN114003198B; CN114003198A

Abstract

Translated from Chinese

The present invention relates to an arbitrary-precision computing device, method, and computer-readable storage medium. The method comprises: reading multiple operands from an off-chip memory; splitting the multiple operands into multiple vectors, the multiple vectors including a first vector and a second vector; computing the inner product of the first vector and the second vector according to the lengths of the first vector and the second vector, to obtain an inner product result; integrating the inner product result into a calculation result of the multiple operands; and storing the calculation result in the off-chip memory.

Description


Arbitrary-precision computing accelerator, integrated circuit device, board card, and method

This application is a divisional application of application No. 202111221317.4, filed on October 20, 2021, and entitled "Arbitrary-precision computing accelerator, integrated circuit device, board card, and method".

Technical Field

The present invention relates generally to the field of computers. More specifically, the present invention relates to an arbitrary-precision computing accelerator, an integrated circuit device, a board card, and a method.

Background

Arbitrary-precision computation represents operands with an arbitrary number of bits. It is crucial in many technical fields, such as supernova simulation, climate simulation, atomic simulation, artificial intelligence, and planetary orbit calculation. These fields need to process data with hundreds, thousands, or even millions of bits, and handling such a wide range of bit widths is far beyond the hardware capabilities of conventional processors.

Even when the prior art uses processors with large bit widths, it cannot handle the variable lengths required by arbitrary-precision operations, because the optimal bit width varies widely between algorithms and small differences in bit width lead to significant differences in cost. The prior art has also proposed many techniques for improving computational efficiency at the architecture level, mainly effectual-only computation and approximate computation. The former performs only essential computation, skipping or eliminating ineffectual computation such as that arising from sparsity and repeated data. The latter replaces the original accurate data with less accurate data, such as low-bit-width or quantized data. However, for effectual-only computation, finding repeated data is difficult and expensive, while approximate computation intuitively contradicts the purpose of arbitrary-precision computation, which requires exact calculation to achieve high precision. Finally, these prior-art techniques inevitably cause a large number of inefficient memory accesses.

An efficient arbitrary-precision computation scheme is therefore urgently needed.

Summary of the Invention

In order to at least partly solve the technical problems mentioned in the background, the solution of the present invention provides an arbitrary-precision computing accelerator, an integrated circuit device, a board card, and a method.

In one aspect, the present invention discloses a processing component for computing the inner product of a first vector and a second vector, comprising a conversion unit, a plurality of inner product units, and a synthesis unit. The conversion unit generates a plurality of pattern vectors according to the length and bit width of the first vector. Each inner product unit uses a data vector of the second vector along the length direction as an index and accumulates specific pattern vectors among the plurality of pattern vectors, to form a unit accumulation sequence. The synthesis unit sums the plurality of unit accumulation sequences to obtain the inner product result.

In another aspect, the present invention discloses an arbitrary-precision computing accelerator connected to an off-chip memory. The arbitrary-precision computing accelerator comprises a core memory agent, a core controller, and a processing array. The core memory agent reads multiple operands from the off-chip memory. The core controller splits the multiple operands into multiple vectors, the multiple vectors including a first vector and a second vector. The processing array includes a plurality of processing components, each of which computes the inner product of the first vector and the second vector according to the lengths of the first vector and the second vector, to obtain an inner product result. The core controller integrates the inner product result into a calculation result of the multiple operands, and the core memory agent stores the calculation result in the off-chip memory.

In another aspect, the present invention discloses an integrated circuit device comprising the aforementioned arbitrary-precision computing accelerator, a processing device, and an off-chip memory. The processing device controls the arbitrary-precision computing accelerator, and the off-chip memory includes an LLC (last-level cache). The arbitrary-precision computing accelerator and the processing device communicate through the LLC.

In another aspect, the present invention discloses a board card comprising the aforementioned integrated circuit device.

In another aspect, the present invention discloses a method for computing the inner product of a first vector and a second vector, comprising: generating a plurality of pattern vectors according to the length and bit width of the first vector; using data vectors of the second vector along the length direction as indices, accumulating specific pattern vectors among the plurality of pattern vectors to form a plurality of unit accumulation sequences; and summing the plurality of unit accumulation sequences to obtain the inner product result.

In another aspect, the present invention discloses an arbitrary-precision computing method, comprising: reading multiple operands from an off-chip memory; splitting the multiple operands into multiple vectors, the multiple vectors including a first vector and a second vector; computing the inner product of the first vector and the second vector according to the lengths of the first vector and the second vector, to obtain an inner product result; integrating the inner product result into a calculation result of the multiple operands; and storing the calculation result in the off-chip memory.

In another aspect, the present invention discloses a computer-readable storage medium storing computer program code for arbitrary-precision computation, wherein the aforementioned method is performed when the computer program code is run by a processing device.

The present invention proposes a scheme for arbitrary-precision computation that processes different bit streams in parallel and deploys a complete bit-serial data path, so that high-precision computation can be performed flexibly. The invention makes full use of a simple hardware configuration and reduces repeated computation, thereby achieving arbitrary-precision computation with low energy consumption.

Brief Description of the Drawings

The above and other objects, features, and advantages of exemplary embodiments of the present invention will become readily understood by reading the following detailed description with reference to the accompanying drawings. In the drawings, several embodiments of the present invention are shown by way of example and not limitation, and the same or corresponding reference numerals indicate the same or corresponding parts. In the drawings:

FIG. 1 is a structural diagram of a board card according to an embodiment of the present invention;

FIG. 2 is a structural diagram of an integrated circuit device according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of the internal structure of a computing device according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of an exemplary multiplication operation;

FIG. 5 is a schematic diagram of a conversion unit according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of a generation unit according to an embodiment of the present invention;

FIG. 7 is a schematic diagram of an inner product unit according to an embodiment of the present invention;

FIG. 8 is a schematic diagram of a synthesis unit according to an embodiment of the present invention;

FIG. 9 is a schematic diagram of a full adder group according to an embodiment of the present invention;

FIG. 10 is a flowchart of arbitrary-precision computation according to another embodiment of the present invention; and

FIG. 11 is a flowchart of computing the inner product of a first vector and a second vector according to another embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art on the basis of the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

It should be understood that the terms "first", "second", "third", and "fourth" in the claims, description, and drawings of the present invention are used to distinguish different objects rather than to describe a specific order. The terms "comprising" and "including" used in the description and claims of the present invention indicate the presence of the described features, integers, steps, operations, elements, and/or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or collections thereof.

It should also be understood that the terms used in this description are for the purpose of describing specific embodiments only and are not intended to limit the present invention. As used in the description and claims, the singular forms "a", "an", and "the" are intended to include the plural forms unless the context clearly indicates otherwise. It should further be understood that the term "and/or" used in the description and claims refers to any and all possible combinations of one or more of the associated listed items, and includes these combinations.

As used in this description and the claims, the term "if" may be interpreted, depending on the context, as "when", "once", "in response to determining", or "in response to detecting".

Specific embodiments of the present invention are described in detail below with reference to the accompanying drawings.

Arbitrary-precision computation plays a key role in many fields of science and technology. For example, solving the seemingly trivial equation x³+y³+z³=3 by computer requires more than 200 bits of precision; in Ising theory, evaluating integrals requires more than 1000 bits of precision; and computing the volume of a knot complement in hyperbolic space involves up to 60000 bits of precision. Even a very small precision error can lead to a large difference in the calculation result, so arbitrary-precision computation is a serious technical problem in the computer field.

The present invention proposes an efficient arbitrary-precision computing accelerator architecture. It is modeled mainly on the computational form of the inner product operation and emphasizes the intra-operation parallelism (intra-parallelism) and inter-operation parallelism (inter-parallelism) of the accelerator architecture, in order to implement multiplication of operands.

FIG. 1 shows a schematic structural diagram of a board card 10 according to an embodiment of the present invention. As shown in FIG. 1, the board card 10 includes a chip 101, which is a system on chip (SoC) integrating one or more combined processing devices. A combined processing device is an artificial intelligence computing unit that supports various deep learning and machine learning algorithms and meets the intelligent processing requirements of complex scenarios in fields such as computer vision, speech, natural language processing, and data mining. Deep learning in particular is widely used in cloud intelligence, a notable feature of which is a large volume of input data that places high demands on the storage and computing capacity of the platform. The board card 10 of this embodiment is suited to cloud intelligence applications, having large off-chip storage, on-chip storage, and strong computing capability.

The chip 101 is connected to an external device 103 through an external interface device 102. The external device 103 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card, or a Wi-Fi interface. Data to be processed can be transferred from the external device 103 to the chip 101 through the external interface device 102, and the calculation results of the chip 101 can be sent back to the external device 103 via the external interface device 102. Depending on the application scenario, the external interface device 102 may take different interface forms, such as a PCIe interface.

The board card 10 also includes a storage device 104 for storing data, which includes one or more storage units 105. The storage device 104 is connected to, and exchanges data with, the control device 106 and the chip 101 through a bus. The control device 106 on the board card 10 is configured to regulate the state of the chip 101. To this end, in one application scenario, the control device 106 may include a microcontroller unit (MCU).

FIG. 2 is a structural diagram of the combined processing device in the chip 101 of this embodiment. As shown in FIG. 2, the combined processing device includes a computing device 201, a processing device 202, an off-chip memory 203, a communication node 204, and an interface device 205. In this embodiment, several integration schemes can be used to coordinate the work of the computing device 201, the processing device 202, and the off-chip memory 203: FIG. 2A shows an LLC integration scheme, FIG. 2B shows an SoC integration scheme, and FIG. 2C shows an IO integration scheme.

The computing device 201 is configured to perform user-specified operations. It is implemented mainly as a multi-core intelligent processor for deep learning or machine learning computation, and it can interact with the processing device 202 to jointly complete user-specified operations. The computing device 201 contains the aforementioned arbitrary-precision computing accelerator for linear computation, more specifically the operand multiplications used in operations such as convolution.

The processing device 202, as a general-purpose processor, performs basic control including but not limited to data movement, starting and/or stopping the computing device 201, and nonlinear computation. Depending on the implementation, the processing device 202 may be one or more types of processor among a central processing unit (CPU), a graphics processing unit (GPU), or other general-purpose and/or special-purpose processors, including but not limited to a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, and discrete hardware components, and their number can be determined according to actual needs. When the computing device 201 and the processing device 202 are considered together, they are regarded as forming a heterogeneous multi-core structure.

The off-chip memory 203 stores data to be processed and data that has been processed. Ordered by increasing latency, its hierarchy can be divided into a level-1 cache (L1), a level-2 cache (L2), a level-3 cache (L3, also called the LLC), and physical memory. The physical memory is DDR, typically 16 GB or larger. When the computing device 201 or the processing device 202 wants to read data from the off-chip memory 203, it usually accesses L1 first because L1 is the fastest. If the data is not in L1, it accesses L2; if the data is not in L2 either, it accesses L3; and if the data is still not in L3, it finally accesses DDR. The cache hierarchy of the off-chip memory 203 speeds up data access by keeping the most frequently accessed data in the caches. Compared with the caches, DDR is quite slow. Moving down the hierarchy (L1→L2→LLC→DDR), the access latency grows but so does the storage capacity.
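The lookup order described above can be sketched as a small simulation. This is only an illustration: the function name `lookup`, the `contents` mapping, and the latency figures are hypothetical and are not taken from the patent; only the probing order L1→L2→LLC→DDR comes from the text.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical per-level latencies in cycles, chosen only to show that
# latency grows along L1 -> L2 -> LLC -> DDR. They are not from the patent.
LEVELS: List[Tuple[str, int]] = [("L1", 4), ("L2", 12), ("LLC", 40), ("DDR", 200)]


def lookup(address: int, contents: Dict[str, Set[int]]) -> Tuple[str, int]:
    """Probe each level in order; return (level that served the read, total latency)."""
    total = 0
    for name, latency in LEVELS:
        total += latency  # every probed level adds its latency
        # DDR is modeled as always holding the data (an assumption of this sketch).
        if name == "DDR" or address in contents.get(name, set()):
            return name, total
    raise AssertionError("unreachable: DDR always hits")


# Example: address 3 is only cached in the LLC, address 9 is not cached at all.
caches = {"L1": {1}, "L2": {1, 2}, "LLC": {1, 2, 3}}
print(lookup(3, caches))
print(lookup(9, caches))
```

Under these assumed numbers, a read served by the LLC pays for the L1 and L2 misses first, which is why the text prefers to keep frequently reused high-precision data in a level that is both large and reasonably fast.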

The communication node 204 is a routing node or router in a network-on-chip (NoC). When the computing device 201 or the processing device 202 generates a data packet, the packet is sent to the communication node 204 through a specific interface. The communication node 204 reads the address information in the header flit of the packet and uses a specific routing algorithm to compute the best routing path, thereby establishing a reliable transmission path that delivers the packet to the destination node (for example, the off-chip memory 203). Likewise, when the computing device 201 or the processing device 202 needs to read a data packet from the off-chip memory 203, the communication node 204 computes the best routing path and sends the packet from the off-chip memory 203 to the computing device 201 or the processing device 202.

The interface device 205 is the external input/output interface of the combined processing device. When the combined processing device exchanges information with external devices, the wide variety of external devices means that each has different requirements for the transmitted information. According to the requirements of the sender and the receiver, the interface device 205 performs tasks such as providing data buffering to resolve the mismatch caused by their speed difference, signal level conversion, information conversion logic to meet their respective format requirements, a timing control circuit to synchronize the sender and the receiver, and address decoding.

The LLC integration of FIG. 2A means that the computing device 201 and the processing device 202 communicate through the LLC. The SoC integration of FIG. 2B integrates the computing device 201, the processing device 202, and the off-chip memory 203 through the communication node 204. The IO integration of FIG. 2C integrates the computing device 201, the processing device 202, and the off-chip memory 203 through the interface device 205. These three integration schemes are only examples, and the present invention does not limit the manner of integration.

This embodiment preferably adopts the LLC integration scheme. The core of deep learning and machine learning is the convolution operator, the convolution operator is built on the inner product operation, and the inner product operation is in turn a combination of multiplications and additions. The main task of the computing device 201 is therefore a large number of low-level operations such as multiplication and addition. During training and inference of neural network models, the computing device 201 and the processing device 202 need to interact intensively. Integrating the computing device 201 and the processing device 202 at the LLC and sharing data through the LLC keeps the cost of this interaction low. Furthermore, since high-precision data may have millions of bits and the capacity of L1 and L2 is limited, interacting through L1 and L2 would run into insufficient capacity. The computing device 201 uses the relatively large capacity of the LLC to cache high-precision data and save the time of repeated accesses.

FIG. 3 shows a schematic diagram of the internal structure of the computing device 201, which includes a core memory agent 301, a core controller 302, and a processing array 303.

The core memory agent 301 serves as the management end through which the computing device 201 accesses the off-chip memory 203. When the core memory agent 301 reads operands from the off-chip memory 203, the starting address of each operand is set in the core memory agent 301, and the core memory agent 301 reads multiple operands simultaneously, continuously, and serially by auto-incrementing the addresses. The operands are read in one pass, proceeding from their low-order bits to their high-order bits. For example, when three operands are to be read, the lowest 512 bits of the first operand are read serially from its starting address, then the lowest 512 bits of the second operand, then the lowest 512 bits of the third operand. Once the lowest bits have been read, the addresses are incremented (by 512 bits) and the next-lowest 512 bits of each operand are read serially, and so on until the highest bits of the three operands have been read. When the core memory agent 301 stores calculation results back to the off-chip memory 203, it sends them in parallel. For example, if the core memory agent 301 needs to send three calculation results to the off-chip memory 203, it sends the lowest bits of the three results simultaneously, then the next-lowest bits simultaneously, and so on until the highest bits of the three results have been sent. In general, these operands are expressed as matrices or vectors.
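The interleaved read order described above can be sketched as a generator of (operand index, bit offset) pairs. The name `read_schedule` is hypothetical, and the handling of operands of unequal length (shorter operands simply drop out of later rounds) is an assumption of this sketch; the patent text only describes the round-robin, low-to-high pattern with 512-bit steps.

```python
from typing import Iterator, List, Tuple


def read_schedule(bit_lengths: List[int], chunk_bits: int = 512) -> Iterator[Tuple[int, int]]:
    """Yield (operand_index, bit_offset) in the order the reads would be issued.

    Each round reads one chunk from every operand at the same offset,
    starting at the least significant bits, then the offset is incremented.
    """
    if not bit_lengths:
        return
    longest = max(bit_lengths)
    offset = 0
    while offset < longest:
        for operand, length in enumerate(bit_lengths):
            if offset < length:  # assumption: exhausted operands are skipped
                yield operand, offset
        offset += chunk_bits  # the auto-increment step from the text


# Three 1024-bit operands: lowest 512 bits of each, then the next 512 bits of each.
for operand, offset in read_schedule([1024, 1024, 1024]):
    print(f"operand {operand}: bits {offset}..{offset + 511}")
```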

Based on the computing capability and number of processing components in the processing array 303, the core controller 302 controls the splitting of each operand into multiple data segments, that is, multiple vectors, so that the core memory agent 301 sends data to the processing array 303 in units of data segments.

The processing array 303 performs the multiplication of two operands. For example, the first operand can be split into 8 data segments x₀ to x₇ and the second operand into 4 data segments y₀ to y₃. When the first operand is multiplied by the second operand, the algorithm expands as shown in FIG. 4. The processing array 303 splits the first operand and the second operand, performs the inner product calculations separately, and then shifts, aligns, and sums the intermediate results 401, 402, 403, and 404 to obtain the result of the multiplication.
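The segment-wise multiplication can be illustrated with Python's built-in arbitrary-precision integers. This is a minimal sketch, not the accelerator's data path: the helper names, the 32-bit segment width, and the choice to form one intermediate result per segment of the second operand (one plausible reading of the four intermediate results in FIG. 4) are assumptions.

```python
from typing import List


def split_segments(value: int, seg_bits: int, count: int) -> List[int]:
    """Split a non-negative integer into `count` segments, least significant first."""
    mask = (1 << seg_bits) - 1
    return [(value >> (i * seg_bits)) & mask for i in range(count)]


def segmented_multiply(a: int, b: int, seg_bits: int, a_segs: int, b_segs: int) -> int:
    """Multiply a and b by summing shifted partial products of their segments."""
    xs = split_segments(a, seg_bits, a_segs)  # x0..x7 in the example
    ys = split_segments(b, seg_bits, b_segs)  # y0..y3 in the example
    total = 0
    for j, y in enumerate(ys):
        # One intermediate result per segment of the second operand.
        intermediate = sum((x * y) << (i * seg_bits) for i, x in enumerate(xs))
        # Shift to align with the position of y_j, then accumulate.
        total += intermediate << (j * seg_bits)
    return total


a = 2**256 - 12345  # fits in 8 segments of 32 bits
b = 2**128 - 6789   # fits in 4 segments of 32 bits
print(segmented_multiply(a, b, 32, 8, 4) == a * b)
```

The sketch assumes non-negative operands that fit in the stated number of segments; sign handling and carries between hardware segments are outside its scope.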

为清楚地阐述技术方案，以下统一将上述数据段视为向量来表示，两数据段相乘即是两向量(第一向量及第二向量)做内积，其中第一向量来自第一操作数，第二向量来自第二操作数。In order to clearly explain the technical solution, the above-mentioned data segments are collectively represented as vectors below, and the multiplication of two data segments is the inner product of two vectors (the first vector and the second vector), wherein the first vector comes from the first operand , the second vector from the second operand.

处理阵列303包括多个处理部件304，这些处理部件304以阵列方式排列，图中示例性展示4×8个处理部件304，本发明不限制处理部件304的个数。每个处理部件304用以根据第一向量的长度及第二向量的长度，内积第一向量与第二向量，以获得内积结果。最后，核控制器302控制内存代理器301将内积结果整合或归约成多个操作数的计算结果，发送给核内存代理器301，核内存代理器301将计算结果存储至片外内存203。Theprocessing array 303 includes a plurality ofprocessing units 304 arranged in an array. The figure shows 4×8processing units 304 as an example, and the number of theprocessing units 304 is not limited in the present invention. Eachprocessing unit 304 is configured to inner product the first vector and the second vector according to the length of the first vector and the length of the second vector to obtain an inner product result. Finally, thecore controller 302 controls thememory proxy 301 to integrate or reduce the inner product result into calculation results of multiple operands and send them to thecore memory proxy 301, and thecore memory proxy 301 stores the calculation results in the off-chip memory 203 .

具体来说，计算装置201在控制上采用递推分解算法(recursivedecomposition)，当计算装置201接收到来自处理装置202的指令来执行任意精度计算时，核控制器302将乘法的操作数平均拆分为多个向量，并将它们发送到处理阵列303进行计算，每个处理部件304负责一组向量的计算，例如第一向量与第二向量的内积。在此实施例中，每个处理部件304会基于本身的硬件资源，将一组向量进一步拆分成更小的内积计算单元，以方便进行内积计算。计算装置201在数据路径上采用多比特流，即每个操作数以每周期1比特的速度从核内存代理器301导入处理部件303，但多个操作数同时并行传输，在计算结束后，处理部件304以比特串行方式发送内积结果到核内存代理器301。Specifically, computing device 201 adopts a recursive decomposition algorithm (recursive decomposition) in control. When computing device 201 receives an instruction from processing device 202 to perform arbitrary precision calculation,core controller 302 divides the operands of multiplication into equal parts. For multiple vectors, and send them to theprocessing array 303 for calculation, eachprocessing unit 304 is responsible for the calculation of a group of vectors, such as the inner product of the first vector and the second vector. In this embodiment, eachprocessing unit 304 further splits a group of vectors into smaller inner product calculation units based on its own hardware resources, so as to facilitate inner product calculation. Computing device 201 adopts multi-bit streams on the data path, that is, each operand is imported fromcore memory agent 301 toprocessing unit 303 at a speed of 1 bit per cycle, but multiple operands are transmitted in parallel at the same time. After the calculation is completed, the processing Thecomponent 304 sends the inner product result to thekernel memory agent 301 in a bit-serial manner.

As the core computing unit of the computing device 201, the processing unit 304 is mainly tasked with inner product calculation. The processing unit 304 processes the bit-indexed vector inner product in three stages: the first is the pattern generation stage, the second is the pattern indexing stage, and the third is the weighted synthesis stage.

Take the inner product of the first vector $\vec{x}$ and the second vector $\vec{y}$ as an example. Assume the sizes of $\vec{x}$ and $\vec{y}$ are N×p_x and N×p_y respectively, where N is the length of $\vec{x}$ and $\vec{y}$ (more specifically, the number of row elements), p_x is the bit width of $\vec{x}$, and p_y is the bit width of $\vec{y}$. In this embodiment, to compute the inner product of $\vec{x}$ and $\vec{y}$, $\vec{x}$ is first transposed and then multiplied with $\vec{y}$, that is, (p_x×N)·(N×p_y), to generate an inner product result of size p_x×p_y.

This embodiment decomposes the second vector $\vec{y}$ as:

$$\vec{y} = K \cdot B_{col} \cdot C$$

where K is a fixed binary matrix of size N×2^N, B_col is a binary matrix of size 2^N×p_y, and C is a weight vector of length p_y.

The elements along the length direction of the first vector $\vec{x}$ can be arranged in 2^N patterns. Taking N = 2 as an example, that is, the length of $\vec{x}$ is 2, K is divided into 2^N unit vectors according to the length of $\vec{x}$, so as to enumerate all possible unit vectors of length 2. K is therefore a binary matrix of size 2×2² that covers every possible combination of two elements. There are 4 such combinations, namely (00), (01), (10) and (11), and these four unit vectors form the fixed columns of K.

In other words, once the length of the first vector $\vec{x}$ and the second vector $\vec{y}$ is determined, the size and element values of K are determined.

Each column of B_col is a one-hot vector: only one element is 1 and the rest are 0, and which element is 1 depends on which column of K the corresponding column of the second vector $\vec{y}$ matches. For ease of explanation, the first vector $\vec{x}$ and the second vector $\vec{y}$ are exemplarily set to specific values of length 2, with p_y equal to 4.

Comparing the second vector $\vec{y}$ with K shows that the first column of $\vec{y}$ is the fourth column of K, the second column of $\vec{y}$ is the third column of K, the third column of $\vec{y}$ is the fourth column of K, and the fourth column of $\vec{y}$ is the first column of K. Therefore, when $\vec{y}$ is expressed as K·B_col, B_col is the following index matrix of size 2²×4:

$$B_{col} = \begin{pmatrix} 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \end{pmatrix}$$

In the first column of B_col only the fourth element is 1, indicating that the first column of the second vector $\vec{y}$ is the fourth column of K. In the second column of B_col only the third element is 1, indicating that the second column of $\vec{y}$ is the third column of K. In the third column of B_col only the fourth element is 1, indicating that the third column of $\vec{y}$ is the fourth column of K. In the fourth column of B_col only the first element is 1, indicating that the fourth column of $\vec{y}$ is the first column of K. In summary, once K is determined, the element values of B_col are also determined.

C is a weight vector of length p_y that reflects the powers of the second vector $\vec{y}$, that is, its bit width. Since p_y is 4, the power of $\vec{y}$ is 4, so C has four entries, one weight per bit position of $\vec{y}$.

By decomposing the second vector $\vec{y}$ in this way, this embodiment allows each element of $\vec{y}$ to be represented by the two binary matrices K and B_col. In other words, this embodiment converts the inner product $\vec{x}^T \cdot \vec{y}$ into the computation $\vec{x}^T \cdot K \cdot B_{col} \cdot C$.
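The decomposition can be checked numerically. The following is a minimal sketch, not the patent's implementation. It assumes that column j of K holds the binary digits of j (row i corresponding to element i) and that C lists the weights 2^0 to 2^(p_y−1) with the least significant bit plane first; the text does not reproduce these orderings. The sample values and the helper names `build_K`, `build_B_col` and `matmul` are hypothetical.

```python
def build_K(n):
    # Assumed ordering: column j of K holds the binary digits of j,
    # and row i corresponds to element i of the vector.
    return [[(j >> i) & 1 for j in range(2 ** n)] for i in range(n)]

def build_B_col(y, p_y):
    # Column b is one-hot: it marks the column of K equal to bit plane b of y.
    n = len(y)
    B = [[0] * p_y for _ in range(2 ** n)]
    for b in range(p_y):
        j = sum(((y[i] >> b) & 1) << i for i in range(n))
        B[j][b] = 1
    return B

def matmul(A, B):
    # Plain integer matrix product on lists of lists.
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

x, y, p_y = [5, 9], [13, 6], 4        # hypothetical sample values
K = build_K(len(y))
B_col = build_B_col(y, p_y)
C = [[2 ** b] for b in range(p_y)]    # assumed weights, least significant plane first

y_rebuilt = [row[0] for row in matmul(matmul(K, B_col), C)]
inner = matmul(matmul(matmul([x], K), B_col), C)[0][0]
```

Here `y_rebuilt` reproduces `y`, and `inner` equals the ordinary inner product of `x` and `y`, which is the identity the conversion relies on.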

The processing unit 304 implements the vector inner product $\vec{x}^T \cdot \vec{y}$ based on the foregoing conversion. In the pattern generation stage, the processing unit 304 obtains all possibilities of $\vec{x}^T \cdot K$, that is, it generates the pattern vectors $\vec{z}$. In the pattern indexing stage, the processing unit 304 computes $\vec{z} \cdot B_{col}$. In the weighted synthesis stage, the processing unit 304 accumulates the indexed patterns according to the weights C. With this design, operands of any precision can be converted into indexed patterns for the inner product, which reduces repeated computation and avoids the high bandwidth that arbitrary-precision calculation would otherwise require.
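The three stages can be sketched in software as follows. This is an illustrative model only, assuming non-negative integer elements, pattern index j built with element i in bit i, and power-of-two weights with the least significant bit plane first. The function names are hypothetical and do not come from the patent.

```python
def pattern_generation(x):
    # Stage 1: every subset sum of the elements of x (2^N pattern vectors).
    n = len(x)
    return [sum(x[i] for i in range(n) if (j >> i) & 1) for j in range(2 ** n)]

def pattern_indexing(z, y, p_y):
    # Stage 2: bit plane b of y forms an index that selects one pattern.
    n = len(y)
    return [z[sum(((y[i] >> b) & 1) << i for i in range(n))] for b in range(p_y)]

def weighted_synthesis(selected):
    # Stage 3: accumulate the selected patterns with weight 2^b.
    return sum(s << b for b, s in enumerate(selected))

def inner_product(x, y, p_y):
    return weighted_synthesis(pattern_indexing(pattern_generation(x), y, p_y))

result = inner_product([3, 7, 1, 12], [9, 4, 15, 2], 4)
```

The pattern vectors depend only on the first vector, so in this model they can be computed once and reused for every bit plane of the second vector.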

FIG. 3 further shows a schematic structural diagram of the processing unit 304. To implement the three stages above, the processing unit 304 includes a processing unit memory agent unit 305, a processing unit control unit 306, a conversion unit 307, a plurality of inner product units 308 and a synthesis unit 309.

The processing unit memory agent unit 305 serves as the interface through which the processing unit 304 accesses the core memory agent 301, and receives the two vectors on which the inner product is to be performed, for example the first vector $\vec{x}$ and the second vector $\vec{y}$ described above.

The processing unit control unit 306 coordinates and manages the work of the units in the processing unit 304.

The conversion unit 307 implements the pattern generation stage. It receives the first vector $\vec{x}$ from the processing unit memory agent unit 305, implements the binary matrix K in hardware, and executes $\vec{x}^T \cdot K$ to generate a plurality of pattern vectors $\vec{z}$. FIG. 5 shows a schematic diagram of the conversion unit 307, which includes N bit stream input terminals 501, a generation component 502 and 2^N bit stream output terminals 503.

The N bit stream input terminals 501 correspond to the length N of the first vector $\vec{x}$ and receive N data vectors respectively. FIG. 5 illustrates the case where the length of $\vec{x}$ is 4: $\vec{x}$ includes 4 data vectors x₀, x₁, x₂ and x₃, each with bit width p_x, that is, each data vector has p_x bits.

The generation component 502 is the core element that executes $\vec{x}^T \cdot K$. Because K has 2^N unit vectors, the generation component 502 includes 2^N generating units, each of which simulates one unit vector, to generate the 2^N pattern vectors $\vec{z}$ respectively. As shown in FIG. 5, the first vector $\vec{x}$ is split into 4 data vectors x₀, x₁, x₂ and x₃, which are input in parallel from the left side of the generation component 502. Since an inner product in binary amounts to adding bits, the generation component 502 directly simulates all the unit vectors of K in hardware and adds the bits of x₀, x₁, x₂ and x₃ in sequence. In more detail, each cycle inputs the bits of x₀, x₁, x₂ and x₃ at the same bit position: the first cycle inputs their least significant bits, the second cycle inputs their second least significant bits, and so on until the p_x-th cycle inputs their most significant bits. The required bandwidth is only N bits per cycle, which is 4 bits per cycle in this example.

When the length of the first vector $\vec{x}$ is 4, the generation component 502 includes 16 generating units that simulate the 16 unit vectors of K respectively. These unit vectors are (0000), (0001), (0010), (0011), (0100), (0101), (0110), (0111), (1000), (1001), (1010), (1011), (1100), (1101), (1110) and (1111).

FIG. 6 shows a schematic diagram of the generating unit 504 whose unit vector is (1011). Because the generating unit 504 simulates the unit vector (1011), it includes 3 element registers 601, an adder 602 and a carry register 603. The 3 element registers 601 receive and temporarily store the bit values of the data vectors that correspond to the simulated unit vector, that is, the bit values of x₀, x₁ and x₃, and simply ignore the bit value of x₂. This structure implements z₁₁ = x₀ + x₁ + x₃.

The values in the element registers 601 are sent to the adder 602 for accumulation. If the accumulation produces a carry, the carry value is temporarily stored in the carry register 603 and added to the bit values of x₀, x₁ and x₃ input in the next cycle, until the p_x-th cycle adds the most significant bits of x₀, x₁ and x₃. Every generating unit is designed according to the same logic; based on the structure of the generating unit 504 for unit vector (1011) in FIG. 6, those skilled in the art can readily derive the structure of the other generating units without creative effort, so it is not described further. Note that some generating units need neither an adder 602 nor a carry register 603, for example the generating units that simulate the unit vectors (0000), (0001), (0010), (0100) and (1000). These generating units have at most one input in a given cycle, so no addition takes place and no carry can occur.
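The bit-serial behaviour of a generating unit can be modelled in a few lines. This is a sketch under stated assumptions, not the circuit of FIG. 6: the unit vector is written as a string whose rightmost character selects x₀ (so "1011" selects x₀, x₁ and x₃), and any carry left after the p_x-th cycle is flushed for as many extra cycles as needed. The names `generating_unit` and `bits_to_int` are hypothetical.

```python
def generating_unit(unit_vector, xs, p_x):
    # Keep only the data vectors selected by the unit vector
    # (rightmost character of the string corresponds to xs[0]).
    flags = [int(c) for c in reversed(unit_vector)]
    selected = [x for flag, x in zip(flags, xs) if flag]
    carry = 0            # models the carry register
    out_bits = []        # one output bit per cycle, least significant first
    for cycle in range(p_x):
        total = carry + sum((x >> cycle) & 1 for x in selected)
        out_bits.append(total & 1)
        carry = total >> 1
    while carry:         # assumption: flush any remaining carry bits
        out_bits.append(carry & 1)
        carry >>= 1
    return out_bits

def bits_to_int(bits):
    return sum(b << i for i, b in enumerate(bits))

z11 = bits_to_int(generating_unit("1011", [13, 7, 9, 15], 4))
```

With these sample inputs the model adds x₀ = 13, x₁ = 7 and x₃ = 15 and ignores x₂ = 9.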

Returning to FIG. 5, the 2^N bit stream output terminals 503 are connected to the outputs of the adders 602 of the respective generating units, to output the 2^N pattern vectors $\vec{z}$. In FIG. 5, since N is 4, the 16 bit stream output terminals 503 output 16 pattern vectors $\vec{z}$ in total. The bit width of these pattern vectors $\vec{z}$ may be p_x (if adding the most significant bits produces no carry) or p_x+1 (if adding the most significant bits produces a carry). As FIG. 5 shows, the pattern vectors $\vec{z}$ are all possible additive combinations of x₀, x₁, x₂ and x₃, namely:

z₀ = 0

z₁ = x₀

z₂ = x₁

z₃ = x₀ + x₁

z₄ = x₂

z₅ = x₀ + x₂

z₆ = x₁ + x₂

z₇ = x₀ + x₁ + x₂

z₈ = x₃

z₉ = x₀ + x₃

z₁₀ = x₁ + x₃

z₁₁ = x₀ + x₁ + x₃

z₁₂ = x₂ + x₃

z₁₃ = x₀ + x₂ + x₃

z₁₄ = x₁ + x₂ + x₃

z₁₅ = x₀ + x₁ + x₂ + x₃
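The list above follows a simple rule: bit i of the index j decides whether xᵢ takes part in zⱼ. The sketch below enumerates the pattern vectors in software. Note that it reuses previously computed sums to save additions, which is a software convenience and not the hardware described here, where each pattern has its own generating unit. The name `pattern_vectors` is hypothetical.

```python
def pattern_vectors(x):
    # z[j] is the sum of x[i] over every set bit i of j.
    z = [0] * (1 << len(x))
    for j in range(1, len(z)):
        low = (j & -j).bit_length() - 1   # index of the lowest set bit of j
        z[j] = z[j & (j - 1)] + x[low]    # reuse the sum without that bit
    return z

z = pattern_vectors([10, 20, 30, 40])
```

For example, `z[11]` is x₀ + x₁ + x₃ and `z[15]` is the sum of all four elements.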

The pattern vectors $\vec{z}$ are sent to the inner product units 308. This embodiment has multiple inner product units 308, each equivalent to a processor core, which implement the pattern indexing stage and the weighted synthesis stage; the present invention does not limit the number of inner product units 308. The inner product unit 308 receives the second vector $\vec{y}$ from the processing unit memory agent unit 305, uses the data vectors of $\vec{y}$ in the length direction as indices, selects the corresponding specific pattern vector from all the pattern vectors $\vec{z}$ according to each index, and accumulates these specific pattern vectors. It generates one bit of intermediate result per cycle and forms a unit accumulation sequence over p_x or p_x+1 consecutive cycles. This operation amounts to executing $\vec{z} \cdot B_{col} \cdot C$.

FIG. 7 shows a schematic diagram of the inner product unit 308 of this embodiment. To implement $\vec{z} \cdot B_{col} \cdot C$, the inner product unit 308 includes p_y multiplexers 701 and p_y−1 serial full adders 702.

The p_y multiplexers 701 implement the pattern indexing stage. Each multiplexer 701 receives all the pattern vectors $\vec{z}$ (z₀ to z₁₅) and, according to a same-position data vector of the second vector $\vec{y}$ in the length direction, passes a specific pattern vector among all the pattern vectors $\vec{z}$. Since the length of $\vec{y}$ is N, $\vec{y}$ can be decomposed into N data vectors. Because N is 4, $\vec{y}$ can be decomposed into 4 data vectors y₀, y₁, y₂ and y₃, each with bit width p_y, so from the perspective of bit position these data vectors can be decomposed into p_y same-position data vectors. For example, the most significant bits of y₀, y₁, y₂ and y₃ form the most significant same-position data vector 703, their second most significant bits form the second most significant same-position data vector 704, and so on, until their least significant bits form the least significant same-position data vector 705.

The multiplexer 701 determines which unit vector of the binary matrix K equals the input same-position data vector and outputs the specific pattern vector corresponding to that unit vector. For example, the most significant same-position data vector 703 is input to the first multiplexer as the selection signal. Assuming it is (0101), the same as the unit vector 505 in FIG. 5, the first multiplexer outputs the specific pattern vector z₅ corresponding to the unit vector 505. As another example, the second most significant same-position data vector 704 is input to the second multiplexer as the selection signal. Assuming it is (0010), the same as the unit vector 506 in FIG. 5, the second multiplexer outputs the specific pattern vector z₂ corresponding to the unit vector 506. Finally, the least significant same-position data vector 705 is input to the p_y-th multiplexer as the selection signal. Assuming it is (1110), the same as the unit vector 507 in FIG. 5, the p_y-th multiplexer outputs the specific pattern vector z₁₄ corresponding to the unit vector 507. This completes the computation of $\vec{z} \cdot B_{col}$.
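The selection performed by the multiplexers can be illustrated as follows. This is a minimal sketch: it assumes a same-position data vector is written with y₃ on the left and y₀ on the right, so that its value as a binary number is directly the index of the selected pattern vector. The sample vector `y` is hypothetical and was chosen so that its three bit planes are (1110), (0010) and (0101), matching the selection signals in the example above. The function names are also hypothetical.

```python
def same_position_vector(y, b):
    # Gather bit b of every data vector; y[i] contributes bit i of the index.
    return sum(((yi >> b) & 1) << i for i, yi in enumerate(y))

def multiplex(z, y, p_y):
    # One multiplexer per bit position; results are listed from the least
    # significant position upwards.
    return [z[same_position_vector(y, b)] for b in range(p_y)]

z_labels = list(range(16))          # stand-ins for z0..z15, so z_labels[j] == j
y = [4, 3, 5, 1]                    # hypothetical y0..y3 with bit width 3
selected = multiplex(z_labels, y, 3)
```

Here `selected` lists the chosen pattern indices from the least significant bit position to the most significant one, so the first multiplexer of the text corresponds to the last entry.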

The serial full adders 702 implement the weighted synthesis stage. The p_y−1 serial full adders 702 are connected in series as shown in the figure, receive the specific pattern vectors output by the multiplexers 701, and accumulate them in sequence to obtain the unit accumulation sequence. Note that accumulation must start from the low-order bits and propagate any carry to the next bit so that the next bit is accumulated correctly. The specific pattern vector corresponding to the least significant same-position data vector 705 must therefore be input to the outermost serial full adder 702, so that the pattern vectors of lower-order same-position data vectors are accumulated first. The pattern vectors of higher-order same-position data vectors are input to serial full adders 702 further inside, and the pattern vector corresponding to the most significant same-position data vector 703 must be input to the innermost serial full adder 702, so that it is accumulated last. Only this ordering guarantees correct accumulation, that is, it applies the length-p_y weight vector C to reflect the powers of the second vector $\vec{y}$. The unit accumulation sequence thus applies the weighting of C on top of $\vec{z} \cdot B_{col}$. This yields the intermediate results 401, 402, 403 and 404 shown in FIG. 4.
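Arithmetically, the weighted synthesis amounts to a shift-and-add over the selected pattern vectors. The sketch below models only that arithmetic, not the bit-serial timing of the adder chain, and assumes the weight of bit position b is 2^b. The function name and sample values are hypothetical.

```python
def weighted_synthesis(selected):
    # selected[b] is the pattern vector chosen for bit position b
    # (b = 0 is the least significant position).
    acc = selected[0]
    additions = 0
    for b in range(1, len(selected)):
        acc += selected[b] << b    # apply weight 2^b, then accumulate
        additions += 1
    return acc, additions

total, additions = weighted_synthesis([5, 9, 14, 5])
```

For p_y selected patterns the loop performs p_y−1 additions, which matches the number of serial full adders 702 in the inner product unit.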

The synthesis unit 309 performs the summation 405 shown in FIG. 4. The synthesis unit 309 receives the unit accumulation sequences from the inner product units 308. Each unit accumulation sequence is like one of the intermediate results 401, 402, 403 and 404 in FIG. 4, and these intermediate results have already been aligned in the inner product units 308. The synthesis unit 309 then sums the aligned unit accumulation sequences to obtain the inner product result of the first vector $\vec{x}$ and the second vector $\vec{y}$.

FIG. 8 shows a schematic diagram of the synthesis unit 309 of this embodiment. The synthesis unit 309 in the figure exemplarily receives the outputs of 8 inner product units 308, namely the unit accumulation sequences 801 to 808. These unit accumulation sequences 801 to 808 are the intermediate results obtained after the first vector $\vec{x}$ and the second vector $\vec{y}$ are split into 8 data segments and handed to the 8 inner product units 308 for inner product calculation. The synthesis unit 309 includes 7 full adder groups 809 to 815. Since the lowest-position operation 816 and the highest-position operation 817 each have only one intermediate result, they need no full adder group. Like x₀y₀ (lowest position) and x₇y₃ (highest position) in FIG. 4, they need not be added to other intermediate results and are output directly. In other words, only the operations from the second-lowest to the second-highest position need full adder groups to perform the summation 405 shown in FIG. 4.

FIG. 9 shows a schematic diagram of the full adder groups 810 to 815. Each of the full adder groups 810 to 815 includes a first full adder 901 and a second full adder 902, which include multiplexers 903 and 904 respectively. The inputs of the multiplexer 903 are connected to the carry output of the adder and the value 0, and the inputs of the multiplexer 904 are connected to the carry output of the adder and the value 1. The values 0 and 1 simulate, respectively, no carry and a carry from the sum of the intermediate results at the previous position. The first full adder 901 therefore generates the sum of the intermediate results assuming no carry from the previous position, and the second full adder 902 generates the sum assuming a carry from the previous position. With this structure there is no need to wait for the intermediate result of the previous position to decide whether a carry occurs; computing the no-carry and carry cases simultaneously reduces the operation latency. Each of the full adder groups 810 to 815 also includes a multiplexer 905. Both sums are input to the multiplexer 905, which selects the carry sum or the no-carry sum according to whether the result of the previous position actually produced a carry. The accumulated output 818 is the inner product result of the first vector $\vec{x}$ and the second vector $\vec{y}$.
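Computing both the no-carry and the carry case in advance and selecting afterwards is the idea behind a carry-select adder. The sketch below illustrates that idea on fixed-width segments of two integers. It is a generic model rather than the exact wiring of FIG. 9, and the function names, segment width and operands are hypothetical.

```python
def carry_select_add(a, b, width, carry_in):
    mask = (1 << width) - 1
    sum0 = a + b        # models the first full adder: assumes carry-in 0
    sum1 = a + b + 1    # models the second full adder: assumes carry-in 1
    chosen = sum1 if carry_in else sum0   # models the selecting multiplexer
    return chosen & mask, chosen >> width

def chained_add(a, b, width, segments):
    # Add two integers segment by segment, lowest segment first.
    mask = (1 << width) - 1
    carry, result = 0, 0
    for s in range(segments):
        seg_sum, carry = carry_select_add((a >> (s * width)) & mask,
                                          (b >> (s * width)) & mask,
                                          width, carry)
        result |= seg_sum << (s * width)
    return result | (carry << (segments * width))

total = chained_add(0xBEEF, 0x1234, 4, 4)
```

In hardware the two candidate sums of every segment are available at the same time, so only the final selection has to wait for the carry of the previous segment.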

Returning to FIG. 8, since the lowest-position operation cannot produce a carry, the full adder group 809 at the second-lowest position includes only the first full adder 901, which directly generates the no-carry intermediate result; no second full adder 902 or multiplexer 905 is needed.

According to FIG. 8, FIG. 9 and the related description, when the synthesis unit 309 of this embodiment is to sum M unit accumulation sequences, it is configured with M−1 full adder groups, which include M−1 first full adders 901, M−2 second full adders 902 and M−2 multiplexers 905.

In other cases, the synthesis unit 309 can flexibly enable or disable full adder groups. For example, when the first vector $\vec{x}$ and the second vector $\vec{y}$ produce fewer than M unit accumulation sequences, a corresponding number of full adder groups can be disabled, which supports various possible split counts and broadens the application scenarios of the synthesis unit 309.

Returning to FIG. 3, after the synthesis unit 309 obtains the inner product result of the first vector $\vec{x}$ and the second vector $\vec{y}$, it sends the result to the processing unit memory agent unit 305. The processing unit memory agent unit 305 receives the inner product result and sends it to the core memory agent 301. The core memory agent 301 integrates the inner product results of all the processing units 304 to generate the calculation result and sends it to the off-chip memory 203, completing the multiplication of the first operand and the second operand.
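How segment-level results combine into the product of two large operands can be sketched as follows. This is an assumption-laden illustration: the text does not spell out how segments are paired, so the sketch uses the schoolbook arrangement in which the products xᵢ·yⱼ with the same i + j are summed and weighted by their position. The function names, the 32-bit segment width and the operands are hypothetical.

```python
def split(operand, seg_bits, count):
    # Cut an integer into `count` segments of `seg_bits` bits, lowest first.
    mask = (1 << seg_bits) - 1
    return [(operand >> (i * seg_bits)) & mask for i in range(count)]

def multiply_by_segments(a, b, seg_bits, a_count, b_count):
    xs = split(a, seg_bits, a_count)
    ys = split(b, seg_bits, b_count)
    # partial[k] collects every product x_i * y_j with i + j == k.
    partial = [0] * (a_count + b_count - 1)
    for i, xi in enumerate(xs):
        for j, yj in enumerate(ys):
            partial[i + j] += xi * yj
    # Integrate the partial sums, aligned by their segment position.
    return sum(p << (k * seg_bits) for k, p in enumerate(partial))

a = 2 ** 255 - 19
b = 2 ** 127 - 1
product = multiply_by_segments(a, b, 32, 8, 4)
```

With 8 and 4 segments, the lowest partial sum contains only x₀y₀ and the highest only x₇y₃, so neither needs to be added to another product.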

Based on the above structure, the computing device 201 of this embodiment performs a different number of inner product operations depending on the length of the operands. Further, the processing array 303 can have the indices shared among the processing units 304 in the vertical direction and the pattern vectors shared among the processing units 304 in the horizontal direction, so as to compute efficiently.

For data path management, this embodiment uses a two-level architecture: the core memory agent 301 and the processing unit memory agent unit 305. The starting addresses of the operands in the LLC are recorded in the core memory agent 301, which reads multiple operands from the LLC simultaneously, continuously and serially by auto-incrementing the addresses. Because the source addresses auto-increment, the order of the data blocks is deterministic. The core controller 302 decides which processing units 304 receive the data blocks, and the processing unit control unit 306 then decides which inner product units 308 receive them.

Another embodiment of the present invention is an arbitrary-precision calculation method, which can be implemented with the hardware structure of the foregoing embodiments. FIG. 10 shows a flowchart of this embodiment.

In step 1001, multiple operands are read from the off-chip memory. When operands are read from the off-chip memory, their starting addresses are set in the core memory agent, which reads multiple operands simultaneously, continuously and serially by auto-incrementing the addresses, reading each operand in a single pass from its low-order bits to its high-order bits.

In step 1002, the multiple operands are split into multiple vectors, which include a first vector and a second vector. Based on the computing capability and number of the processing units in the processing array, the core controller has each operand split into multiple data segments, that is, multiple vectors, so that the core memory agent sends data to the processing array in units of data segments.

In step 1003, the inner product of the first vector and the second vector is computed according to their lengths, to obtain an inner product result. The processing array includes a plurality of processing units arranged in an array, and each processing unit computes the inner product of the first vector and the second vector according to the length of the first vector and the length of the second vector, to obtain the inner product result. In more detail, this step first executes the pattern generation stage, then the pattern indexing stage, and finally the weighted synthesis stage.

Take the inner product of the first vector $\vec{x}$ and the second vector $\vec{y}$ as an example. Assume the sizes of $\vec{x}$ and $\vec{y}$ are N×p_x and N×p_y respectively, where N is the length of $\vec{x}$ and $\vec{y}$, p_x is the bit width of $\vec{x}$, and p_y is the bit width of $\vec{y}$. This embodiment likewise decomposes the second vector $\vec{y}$ as:

$$\vec{y} = K \cdot B_{col} \cdot C$$

where K is a fixed binary matrix of size N×2^N, B_col is a binary matrix of size 2^N×p_y, and C is a weight vector of length p_y. The definitions of K, B_col and C are the same as in the foregoing embodiment and are not repeated. By decomposing the second vector $\vec{y}$ in this way, this embodiment converts the inner product $\vec{x}^T \cdot \vec{y}$ into the computation $\vec{x}^T \cdot K \cdot B_{col} \cdot C$.

In the pattern generation stage, this embodiment obtains all possibilities of $\vec{x}^T \cdot K$, that is, it generates the pattern vectors $\vec{z}$. In the pattern indexing stage, this embodiment computes $\vec{z} \cdot B_{col}$. In the weighted synthesis stage, the indexed patterns are accumulated according to the weights C. With this design, operands of any precision can be converted into indexed patterns for the inner product, which reduces repeated computation and avoids the high bandwidth that arbitrary-precision calculation would otherwise require. FIG. 11 further shows a flowchart of computing the inner product of the first vector and the second vector.

In step 1101, a plurality of pattern vectors are generated according to the length and bit width of the first vector. First, N data vectors are received, corresponding to the length N of the first vector $\vec{x}$. Next, because K has 2^N unit vectors, each unit vector is simulated in hardware to generate the 2^N pattern vectors $\vec{z}$ respectively. Since an inner product in binary amounts to adding bits, the generation component of this embodiment directly simulates all the unit vectors of K and adds the bits of the data vectors of the first vector $\vec{x}$ in sequence. In more detail, each cycle inputs the bits of the data vectors of $\vec{x}$ at the same bit position: the first cycle inputs the least significant bits of the data vectors, the second cycle inputs their second least significant bits, and so on until the p_x-th cycle inputs their most significant bits. The required bandwidth is only N bits per cycle.
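The input schedule of step 1101 can be sketched as a generator that delivers one bit of every data vector per cycle, least significant bit first. This is an illustration only; the name `bit_stream` and the sample values are hypothetical.

```python
def bit_stream(xs, p_x):
    # Yield one tuple per cycle holding bit `cycle` of every data vector,
    # starting from the least significant bit.
    for cycle in range(p_x):
        yield tuple((x >> cycle) & 1 for x in xs)

cycles = list(bit_stream([5, 3], 3))
```

Each yielded tuple has exactly N entries, which reflects the N bits per cycle of input bandwidth, and there are p_x tuples in total.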

To simulate a unit vector, the bit values of the data vectors corresponding to that unit vector are first received and temporarily stored, and then accumulated. If the accumulation produces a carry, the carry value is temporarily stored in the carry register and added to the bit values of the data vectors input in the next cycle, until the p_x-th cycle adds the most significant bit values of the data vectors.

Finally, the accumulated result is received; this is the pattern vector. In summary, the pattern vectors are all the possible addition combinations of the data vectors of the first vector.
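Read as arithmetic, "all addition combinations" of N data vectors are the 2^N subset sums of those values. The sketch below builds such a table under that reading; the name `pattern_vectors` and the convention that table entry k corresponds to the unit vector whose bits are the binary digits of k are assumptions for illustration, not details given in the text.

```python
def pattern_vectors(values):
    """Return a table of 2**N subset sums of `values`.

    Entry k is the sum of values[i] over every set bit i of k, so the
    integer k plays the role of one unit vector (an assumed encoding).
    """
    n = len(values)
    patterns = [0] * (1 << n)
    for k in range(1, 1 << n):
        low = (k & -k).bit_length() - 1  # position of the lowest set bit of k
        # Reuse the sum already computed for k with that bit cleared.
        patterns[k] = patterns[k & (k - 1)] + values[low]
    return patterns
```

For the sample values 5, 3 and 6 the table is [0, 5, 3, 8, 6, 11, 9, 14]. Each entry reuses an earlier one, which mirrors the point that the sums are computed once and then looked up rather than recomputed.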

In step 1102, with the data vectors of the second vector in the length direction as indices, specific pattern vectors among the plurality of pattern vectors are accumulated to form a plurality of unit accumulation sequences. This step implements the pattern indexing stage and the weighted synthesis stage.

In more detail, according to the co-located data vectors of the second vector in the length direction, specific pattern vectors among all the pattern vectors are allowed to pass. Since the length of the second vector is N, the second vector can be decomposed into N data vectors, each with a bit width of p_y; viewed by bit position, these data vectors can therefore be decomposed into p_y co-located data vectors.

Next, it is determined which unit vector of the binary matrix K is identical to the input co-located data vector, and the specific pattern vector corresponding to that unit vector is output. This completes the pattern indexing operation.
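The indexing step can be sketched as follows. This is an illustrative model only: the function names, the packing of each co-located data vector into an integer, and the use of list indexing in place of the hardware comparison against the unit vectors of K are all assumptions.

```python
def colocated_bit_vectors(y, p_y):
    """Split N values of width p_y into p_y co-located bit vectors.

    The j-th result packs bit j of every element of y, with element i
    placed at bit i (an assumed packing order).
    """
    planes = []
    for j in range(p_y):
        index = 0
        for i, value in enumerate(y):
            index |= ((value >> j) & 1) << i
        planes.append(index)
    return planes


def select_patterns(patterns, y, p_y):
    """Let through the pattern vector matched by each co-located bit vector.

    `patterns` is assumed to be a table of 2**N entries in which entry k
    belongs to the unit vector whose bits are the binary digits of k.
    """
    return [patterns[index] for index in colocated_bit_vectors(y, p_y)]
```

For example, with the second-vector values 3, 1 and 2 at a width of 2 bits, the co-located bit vectors are 0b011 and 0b101, so entries 3 and 5 of the pattern table are selected.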

Finally, these specific pattern vectors are accumulated in sequence to obtain the unit accumulation sequences. Particular care must be taken to ensure that the accumulation is correct, that is, the weight vector C is applied according to p_y so as to reflect the powers of the second vector. The unit accumulation sequences further apply the weighting of C on the basis of the preceding indexing operation. Each unit accumulation sequence is like the intermediate results 401, 402, 403 and 404 in FIG. 4, and these intermediate results are already aligned.
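A small end-to-end sketch may help show why the weighting keeps the accumulation correct. It assumes unsigned operands and takes the weight for bit position j to be 2^j, which is one natural reading of "the powers of the second vector" but is not spelled out in the text; the function name `indexed_inner_product` is likewise hypothetical.

```python
def indexed_inner_product(x, y, p_y):
    """Inner product of x and y computed by pattern lookup and shifted accumulation.

    x, y -- equal-length lists of unsigned integers; every y[i] fits in p_y bits
    """
    n = len(x)
    # Pattern table: entry k is the sum of x[i] over the set bits i of k.
    patterns = [0] * (1 << n)
    for k in range(1, 1 << n):
        low = (k & -k).bit_length() - 1
        patterns[k] = patterns[k & (k - 1)] + x[low]

    result = 0
    for j in range(p_y):
        # Co-located bit vector: bit j of every element of y, element i at bit i.
        index = sum(((v >> j) & 1) << i for i, v in enumerate(y))
        # Shifting by j applies the assumed weight 2**j and aligns the partial sum.
        result += patterns[index] << j
    return result
```

With x = [5, 3, 6] and y = [3, 1, 2], the two selected patterns are 8 and 11, and 8 + (11 << 1) = 30, which equals 5·3 + 3·1 + 6·2. The number of lookups depends only on p_y, not on the magnitude of the values in x.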

In step 1103, the plurality of unit accumulation sequences are summed to obtain the inner product result. To achieve synchronous calculation, this embodiment splits the first vector and the second vector into a plurality of data segments and performs the inner product calculation on each segment to obtain intermediate results. Because the lowest-order operation and the highest-order operation each have only one intermediate result, they need no addition; like x₀y₀ (lowest order) and x₇y₃ (highest order) in FIG. 4, they are output directly without being added to other intermediate results. In other words, only the operations from the second-lowest order to the second-highest order require addition.

This embodiment computes the no-carry and carry cases synchronously to reduce operation latency. The sums of the intermediate results without carry and with carry are obtained at the same time, and then, depending on whether the calculation result of the previous digit produces a carry, either the sum with carry or the sum without carry is selected for output. The accumulated output is the inner product result of the first vector and the second vector.
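Computing both candidate sums and then choosing between them resembles the well-known carry-select adder technique. The sketch below models that general technique in software under stated assumptions (unsigned operands, a fixed segment width `seg_bits`, hypothetical function name); it is not taken from the circuit of this embodiment, and in hardware the two candidate sums of a segment would be produced concurrently rather than one after the other.

```python
def carry_select_add(a, b, width, seg_bits):
    """Add two unsigned `width`-bit integers segment by segment.

    For every segment both candidate sums are prepared, one assuming no
    carry-in and one assuming a carry-in; the carry out of the previous
    segment then selects which candidate is used.
    """
    mask = (1 << seg_bits) - 1
    segments = -(-width // seg_bits)  # ceiling division
    carry = 0
    result = 0
    for s in range(segments):
        lo = s * seg_bits
        seg_a = (a >> lo) & mask
        seg_b = (b >> lo) & mask
        sum_no_carry = seg_a + seg_b        # candidate without carry-in
        sum_with_carry = seg_a + seg_b + 1  # candidate with carry-in
        chosen = sum_with_carry if carry else sum_no_carry
        result |= (chosen & mask) << lo
        carry = chosen >> seg_bits          # selects the next segment's candidate
    return result | (carry << (segments * seg_bits))
```

The selection step costs only a multiplexer delay per segment, which is why preparing both candidates in advance can shorten the critical path compared with waiting for each carry to ripple through.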

Returning to FIG. 10, in step 1004 the inner product result is integrated into the calculation result of the plurality of operands. The core controller controls the memory agent to integrate or reduce the inner product result into the calculation result of the plurality of operands and to send it to the core memory agent.

In step 1005, the calculation results are stored in the off-chip memory. The core memory agent sends the calculation results in parallel: it first sends the lowest bits of these calculation results simultaneously, then the second-lowest bits simultaneously, and so on until the highest bits have been sent simultaneously.
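The lowest-bit-first, parallel transmission order can be modelled as a simple generator and a matching receiver. This is only an illustrative sketch: the names `send_bit_serial` and `receive_bit_serial`, the fixed `width`, and the unsigned interpretation of the results are assumptions.

```python
def send_bit_serial(results, width):
    """Yield one tuple per cycle holding bit `cycle` of every result, lowest bit first."""
    for cycle in range(width):
        yield tuple((r >> cycle) & 1 for r in results)


def receive_bit_serial(stream):
    """Rebuild the integers from the per-cycle bit tuples produced above."""
    values = None
    for cycle, bits in enumerate(stream):
        if values is None:
            values = [0] * len(bits)
        for i, bit in enumerate(bits):
            values[i] |= bit << cycle
    return values or []
```

Sending the results 5 and 2 at a width of 3 bits produces the cycles (1, 0), (0, 1) and (1, 0), and the receiver reconstructs [5, 2] from them.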

Another embodiment of the present invention is a computer-readable storage medium storing computer program code for arbitrary-precision calculation; when the computer program code is run by a processor, the method of FIG. 10 or FIG. 11 is executed. In some implementation scenarios, the integrated units described above may be implemented in the form of software program modules. If implemented as a software program module and sold or used as an independent product, the integrated unit may be stored in a computer-readable memory. On this basis, when the solution of the present invention is embodied as a software product (for example, a computer-readable storage medium), the software product may be stored in a memory and may include a number of instructions that cause a computer device (for example, a personal computer, a server, or a network device) to execute some or all of the steps of the methods described in the embodiments of the present invention. The memory may include, but is not limited to, a USB flash drive, a flash disk, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disc, or any other medium that can store program code.

The present invention proposes a novel architecture for efficiently handling arbitrary-precision calculation. However high the precision of the operands, the present invention can decompose them and use indexing to process fixed-length bit streams in parallel, avoiding bit-level redundancy such as sparsity and repeated calculation. The effects of flexible use and large-bit-width calculation are thus achieved without configuring high-bit-width hardware.

Depending on the application scenario, the electronic device or apparatus of the present invention may include servers, cloud servers, server clusters, data processing apparatuses, robots, computers, printers, scanners, tablet computers, smart terminals, PC devices, Internet of Things terminals, mobile terminals, mobile phones, driving recorders, navigators, sensors, webcams, cameras, video cameras, projectors, watches, earphones, mobile storage, wearable devices, vision terminals, autonomous driving terminals, vehicles, household appliances, and/or medical equipment. The vehicles include airplanes, ships and/or road vehicles; the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves and range hoods; the medical equipment includes nuclear magnetic resonance instruments, B-mode ultrasound scanners and/or electrocardiographs. The electronic device or apparatus of the present invention may also be applied in fields such as the Internet, the Internet of Things, data centers, energy, transportation, public administration, manufacturing, education, power grids, telecommunications, finance, retail, construction sites and medical care. Further, the electronic device or apparatus of the present invention may be used in cloud, edge and terminal application scenarios related to artificial intelligence, big data and/or cloud computing. In one or more embodiments, an electronic device or apparatus with high computing power according to the solution of the present invention may be applied to a cloud device (for example, a cloud server), while an electronic device or apparatus with low power consumption may be applied to a terminal device and/or an edge device (for example, a smartphone or a webcam). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or edge device are mutually compatible, so that, according to the hardware information of the terminal device and/or edge device, suitable hardware resources can be matched from the hardware resources of the cloud device to simulate the hardware resources of the terminal device and/or edge device, thereby achieving unified management, scheduling and collaborative work for device-cloud integration or cloud-edge-device integration.

It should be noted that, for the sake of brevity, the present invention describes some methods and their embodiments as a series of actions and combinations thereof, but those skilled in the art will understand that the solution of the present invention is not limited by the order of the described actions. Accordingly, based on the disclosure or teaching of the present invention, those skilled in the art will understand that some of the steps may be performed in another order or simultaneously. Further, those skilled in the art will understand that the embodiments described in the present invention may be regarded as optional embodiments, that is, the actions or modules involved are not necessarily required for implementing one or more solutions of the present invention. In addition, depending on the solution, the descriptions of some embodiments of the present invention have different emphases. In view of this, for parts not described in detail in one embodiment of the present invention, those skilled in the art may refer to the relevant descriptions of other embodiments.

In terms of specific implementation, based on the disclosure and teaching of the present invention, those skilled in the art will understand that the embodiments disclosed in the present invention may also be implemented in ways not disclosed herein. For example, the units in the electronic device or apparatus embodiments described above are divided on the basis of logical function, but other divisions are possible in actual implementation. As another example, multiple units or components may be combined or integrated into another system, or some features or functions of a unit or component may be selectively disabled. As for the connections between different units or components, the connections discussed above with reference to the drawings may be direct or indirect couplings between the units or components. In some scenarios, such direct or indirect coupling involves a communication connection through an interface, where the communication interface may support electrical, optical, acoustic, magnetic or other forms of signal transmission.

In the present invention, units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units. Such components or units may be located in one place or distributed over multiple network units. In addition, some or all of the units may be selected according to actual needs to achieve the purpose of the solutions described in the embodiments of the present invention. Furthermore, in some scenarios, multiple units in the embodiments of the present invention may be integrated into one unit, or each unit may exist physically on its own.

In other implementation scenarios, the integrated units described above may also be implemented in hardware, that is, as specific hardware circuits, which may include digital circuits and/or analog circuits. The physical implementation of the hardware structure of a circuit may include, but is not limited to, physical devices, and the physical devices may include, but are not limited to, devices such as transistors or memristors. In view of this, the various apparatuses described herein (for example, computing apparatuses or other processing apparatuses) may be implemented by suitable hardware processors, such as central processing units (CPUs), GPUs, FPGAs, DSPs and ASICs. Further, the storage unit or storage apparatus described above may be any suitable storage medium (including a magnetic storage medium or a magneto-optical storage medium), for example a resistive random access memory (RRAM), a dynamic random access memory (DRAM), a static random access memory (SRAM), an enhanced dynamic random access memory (EDRAM), a high bandwidth memory (HBM), a hybrid memory cube (HMC), a ROM or a RAM.

The embodiments of the present invention have been described in detail above, and specific examples have been used herein to explain the principles and implementations of the present invention. The description of the above embodiments is intended only to help in understanding the method of the present invention and its core idea. At the same time, those of ordinary skill in the art may, in accordance with the idea of the present invention, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.