CN118349212A - An in-memory computing method and chip design - Google Patents

An in-memory computing method and chip design

Info

Publication number
CN118349212A
Authority
CN
China
Prior art keywords
value
module
weight value
multiplication
bit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410469034.9A
Other languages
Chinese (zh)
Other versions
CN118349212B (en)
Inventor
潘彪
薛震宇
张涵
易文特
张悦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei Innovation Research Institute of Beihang University
Original Assignee
Hefei Innovation Research Institute of Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei Innovation Research Institute of Beihang University
Priority to CN202410469034.9A
Publication of CN118349212A
Application granted
Publication of CN118349212B
Status: Active
Anticipated expiration


Abstract

The application discloses an in-memory computing method and a chip design. The method comprises the following steps: storing input values in an input buffer module and weight values in a static random-access memory (SRAM) array; reading out the input values and weight values and transmitting them to a multiplier module; in the multiplier module, obtaining the multiply-accumulate result of the input values and weight values through XOR logic operations; in an adder tree module, performing addition on the multiply-accumulate results; and in a shift-accumulate module, accumulating the adder tree module's results over multiple cycles, each added to its corresponding bias value, to obtain the output data. The application adopts a reconfigurable dataflow scheme that schedules resources across different in-memory computing cores at the architecture level, handles computations of different precisions, and greatly reduces the power consumption, area, and other overheads of a digital in-memory computing chip.

Description

Translated from Chinese
An in-memory computing method and chip design

Technical Field

The present application belongs to the field of data processing technology, and in particular relates to an in-memory computing method and chip design.

Background

In recent years, fields such as big data, large models, VR/AR, autonomous driving, and AIoT have developed rapidly, and the insufficient computing power and high power consumption of traditional-architecture chips urgently need to be addressed. Compute-in-memory (CIM) chips offer high bandwidth, low power consumption, and a high energy-efficiency ratio; compared with traditional von Neumann chips, their structure is inherently suited to deep-learning workloads such as convolutions and large numbers of multiply-accumulate operations. With the development of neural networks and related fields, AI application scenarios place ever higher demands on CIM chips (for example, cloud deep learning requires high computing power and high precision). Digital CIM chips outperform analog CIM chips in such scenarios thanks to their higher and wider precision range, high energy efficiency, high reliability, and mature process technology.

This application is designed as a reconfigurable digital CIM chip rather than a reconfigurable analog CIM chip, improving precision and energy efficiency on top of flexibility so that all three are balanced. Similar existing implementations include:

1. A reconfigurable digital in-memory computing core with weight and input bit precision variable from 1 to 16 bits, proposed by Hyunjoon Kim's team at Nanyang Technological University in 2019. Its drawback is that its multiplier module uses a traditional multiplier, which is inefficient, large in area, and power-hungry.

2. A scalable, reconfigurable digital in-memory computing core array designed by Naveen Verma's team at Princeton University in 2021. Its reconfigurable mode is relatively complex and hard to implement, and it builds the reconfigurable mode from units of four in-memory computing cores, limiting flexibility; it implements multiply-accumulate in the traditional way, with low energy efficiency and large area.

3. An XNOR multiplication used in BNNs (Binary Neural Networks). Its drawback is that it supports only 1-bit × 1-bit computation and cannot be extended to multi-precision computation.

Summary of the Invention

To address the deficiencies of the prior art, the present application proposes an in-memory computing method and chip design.

In a first aspect, an in-memory computing method is proposed, implemented in an in-memory computing core comprising a static random-access memory (SRAM) array, a multiplier module, an adder tree module, and a shift-accumulate module. The method includes:

storing input values in an input buffer module and weight values in the SRAM array;

reading out the input values and weight values and transmitting them to the multiplier module;

in the multiplier module, obtaining the multiply-accumulate result of the input values and weight values through XOR logic operations;

in the adder tree module, performing addition on the multiply-accumulate results;

in the shift-accumulate module, accumulating the adder tree module's results over multiple cycles, each added to its corresponding bias value, to obtain the output data.

In a preferred embodiment of the present application, the XOR logic operation includes the following steps:

Step S3.1: determine whether the input value is 1 bit; if so, go to step S3.2.A; if not, go to step S3.2.B;

Step S3.2.A: determine whether the weight value is 1 bit; if so, go to step S3.3.A; if not, go to step S3.3.C;

Step S3.2.B: determine whether the weight value is 1 bit; if so, go to step S3.3.B; if not, go to step S3.3.D;

Step S3.3.A: perform the multiply-accumulate operation of a 1-bit input value and a 1-bit weight value;

Step S3.3.B: perform the multiply-accumulate operation of a multi-bit input value and a 1-bit weight value;

Step S3.3.C: perform the multiply-accumulate operation of a 1-bit input value and a multi-bit weight value;

Step S3.3.D: perform the multiply-accumulate operation of a multi-bit input value and a multi-bit weight value.

In a preferred embodiment of the present application, the multiply-accumulate operation of a 1-bit input value and a 1-bit weight value is specifically: the mapped value 1 represents an input value or weight value of -1, and the mapped value 0 represents an input value or weight value of +1; an XOR logic operation on the mapped input value and the mapped weight value yields the first multiply-accumulate result.

In a preferred embodiment of the present application, the multiply-accumulate operation of a multi-bit input value and a 1-bit weight value is specifically: a bitwise XOR logic operation on the input value and the weight value yields the second multiply-accumulate result, where the input value is supplied in two's-complement form.

In a preferred embodiment of the present application, the multiply-accumulate operation of a 1-bit input value and a multi-bit weight value is specifically: a bitwise XOR logic operation on the input value and the weight value yields the second multiply-accumulate result, where the weight value is supplied in two's-complement form.

In a preferred embodiment of the present application, the multiply-accumulate operation of a multi-bit input value and a multi-bit weight value includes:

the input value is supplied in two's-complement form; the mapped value 1 represents a weight value of -1, and the mapped value 0 represents a weight value of +1;

each bit position of the weight value is assigned a power-of-two place weight, each weight bit is replaced by its mapped value, and a bitwise XOR logic operation with the input value yields the third multiply-accumulate result.

In a second aspect, a chip design is proposed, comprising multiple cores; each core includes a binarization module, a reshaping module, a CIM core array, an integration module, a single-process processing module, and a serial processing module:

binarization module: binarizes the data transferred from the global buffer into the in-memory computing cores;

reshaping module: rearranges the binarized data;

CIM core array: composed of in-memory computing cores, which compute the rearranged data using the in-memory computing method described in the first aspect;

integration module: integrates the data results computed by the CIM core array;

single-process processing module: processes the integrated results in parallel;

serial processing module: serially processes the parallel-processing results;

the binarization module, reshaping module, CIM core array, integration module, single-process processing module, and serial processing module are connected in sequence.

The technical solutions provided by some embodiments of the present application bring at least the following beneficial effects:

(1) The application constructs a multiply-accumulate computing paradigm based on different mapping schemes that replaces multiplication with XOR logic operations. This paradigm resolves many problems of traditional multipliers: one multiplier in this application requires only 10 transistors, greatly reducing circuit area overhead and improving computational efficiency.

(2) The chip design is more flexible and general: by scheduling resources across 16, 32, or even more in-memory computing cores, it supports operations with weight and input bit precisions up to 8 or 16 bits.

(3) The application deploys a complete computing paradigm onto the CIM chip, enabling operations of arbitrary bit precision subject to chip resources and supporting the deployment of inference for more neural network models.

Other features and advantages of the present application will be described in the following specification, and will in part become apparent from the specification or be understood by practicing the application. The objectives and other advantages of the application are realized and obtained by the structures particularly pointed out in the specification, claims, and drawings.

To make the above objectives, features, and advantages of the present application clearer and easier to understand, preferred embodiments are described in detail below with reference to the accompanying drawings.

Advantages of additional aspects of the application will be given in part in the following description, will in part become apparent from it, or will be learned through practice of the application.

Brief Description of the Drawings

To explain the technical solutions of the embodiments more clearly, the drawings needed for the embodiments or the prior-art description are briefly introduced below. The drawings described below are only some embodiments of the application; those of ordinary skill in the art can derive other drawings from them without creative effort.

FIG. 1 is a structural diagram of the in-memory computing core according to an embodiment of the present application;

FIG. 2 is a flowchart of the in-memory computing method according to an embodiment of the present application;

FIG. 3 is a flowchart of the XOR logic operation steps according to an embodiment of the present application;

FIG. 4 is an example diagram of the multiply-accumulate operation of a 1-bit input value and a 1-bit weight value according to an embodiment of the present application;

FIG. 5 is an example diagram of the multiply-accumulate operation of a 4-bit input value and a 1-bit weight value according to an embodiment of the present application;

FIG. 6 is an example diagram of the multiply-accumulate operation of a 1-bit input value and a 4-bit weight value according to an embodiment of the present application;

FIG. 7 is an example diagram of the multiply-accumulate operation of a 4-bit input value and a 4-bit weight value according to an embodiment of the present application;

FIG. 8 is a design diagram of the CIM chip according to an embodiment of the present application.

Detailed Description

Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments can, however, be implemented in many forms and should not be construed as limited to the examples set forth here; rather, these embodiments are provided so that the disclosure is thorough and complete and fully conveys the concepts of the example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.

The accompanying drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. Identical reference numerals in the figures denote identical or similar parts, so repeated descriptions of them are omitted. Some block diagrams shown in the drawings are functional entities that do not necessarily correspond to physically or logically independent entities; they may be implemented in software, in one or more hardware modules or integrated circuits, or in different network and/or processor and/or microcontroller devices.

The data-processing principle inside the in-memory computing core is shown in FIG. 1. The dataflow is: (1) input values are stored in the input buffer, and weight values are stored in the SRAM (static random-access memory) array; (2) input values and weight values are read out from the input buffer and the SRAM array, respectively, and transmitted to the multiplier module; (3) the XOR-multiplication results from the multiplier module are transmitted to the adder tree module, which performs addition on them; (4) the adder tree module's results enter the shift-accumulate module, which shifts and accumulates the results of several cycles to produce the output data.
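The per-cycle part of this flow can be sketched in software. The following is a minimal simulation under stated assumptions: the helper names `xor_multiply`, `adder_tree`, and `core_cycle` are illustrative, and the 1-bit mapping (bit 0 stands for +1, bit 1 for -1) is taken from the multiplier description later in this document.

```python
# Minimal sketch of one core cycle: XOR "multiplication" followed by an
# adder-tree reduction. Inputs and weights are already mapped bits
# (0 -> +1, 1 -> -1), as in the 1-bit case of the multiplier module.

def xor_multiply(input_bits, weight_bits):
    """Multiplier stage: one XOR gate per (input, weight) pair."""
    return [a ^ b for a, b in zip(input_bits, weight_bits)]

def adder_tree(values):
    """Adder-tree stage: pairwise reduction, level by level."""
    while len(values) > 1:
        pairs = [values[i] + values[i + 1] for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:          # an odd element passes through to the next level
            pairs.append(values[-1])
        values = pairs
    return values[0]

def core_cycle(input_bits, weight_bits):
    # An XOR result of 0 encodes the product +1, and 1 encodes -1.
    products = [1 - 2 * p for p in xor_multiply(input_bits, weight_bits)]
    return adder_tree(products)
```

Over several cycles, the shift-accumulate module would combine such per-cycle sums with their bias values; this sketch deliberately omits that stage.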

The application adopts a reconfigurable dataflow scheme: at the architecture level, the in-memory computing core array can be composed of any number of in-memory computing core units in any arrangement, so resources can be scheduled more effectively across cores to handle computations of different precisions, greatly reducing the power consumption, area, and other overheads of a digital CIM chip.

The in-memory computing method and chip design provided by the present application are further described below with reference to FIGS. 2 to 8.

Embodiment 1

The present application proposes an in-memory computing method implemented in an in-memory computing core; as shown in FIG. 2, it includes the following steps:

Step S1: store the input values in the input buffer module and the weight values in the static random-access memory array;

Step S2: read out the input values and weight values and transmit them to the multiplier module;

Step S3: in the multiplier module, obtain the multiply-accumulate result of the input values and weight values through XOR logic operations;

Specifically in this embodiment, the XOR logic operation, as shown in FIG. 3, includes the following steps:

Step S3.1: determine whether the input value is 1 bit; if so, go to step S3.2.A; if not, go to step S3.2.B;

Step S3.2.A: determine whether the weight value is 1 bit; if so, go to step S3.3.A; if not, go to step S3.3.C;

Step S3.2.B: determine whether the weight value is 1 bit; if so, go to step S3.3.B; if not, go to step S3.3.D;

Step S3.3.A: perform the multiply-accumulate operation of a 1-bit input value and a 1-bit weight value;

In a specific implementation, the mapped value 1 represents an input value or weight value of -1 and the mapped value 0 represents an input value or weight value of +1; an XOR logic operation on the mapped input value and the mapped weight value yields the first multiply-accumulate result;

Step S3.3.B: perform the multiply-accumulate operation of a multi-bit input value and a 1-bit weight value;

In a specific implementation, a bitwise XOR logic operation on the input value and the weight value yields the second multiply-accumulate result, where the input value is supplied in two's-complement form;

Step S3.3.C: perform the multiply-accumulate operation of a 1-bit input value and a multi-bit weight value;

In a specific implementation, a bitwise XOR logic operation on the input value and the weight value yields the second multiply-accumulate result, where the weight value is supplied in two's-complement form;

Step S3.3.D: perform the multiply-accumulate operation of a multi-bit input value and a multi-bit weight value;

In a specific implementation, the input value is supplied in two's-complement form; the mapped value 1 represents a weight value of -1 and the mapped value 0 represents a weight value of +1;

each bit position of the weight value is assigned a power-of-two place weight, each weight bit is replaced by its mapped value, and a bitwise XOR logic operation with the input value yields the third multiply-accumulate result.

Step S4: in the adder tree module, perform addition on the multiply-accumulate results;

Step S5: in the shift-accumulate module, accumulate the adder tree module's results over multiple cycles, each added to its corresponding bias value, to obtain the output data.

Specifically in this embodiment, the bias value is determined by the number of 1s, and this count can be obtained off-chip; that is, the bias value, like the inputs and weights, is supplied as an external input;

Specifically, the number of 1s is determined through software-level training; the results are as follows:

for the multiply-accumulate operation of a 1-bit input value and a 1-bit weight value, the bias value is 0;

for the multiply-accumulate operation of a multi-bit two's-complement input value and a 1-bit weight value, the bias value is the number of weight values whose mapped value is 1;

for the multiply-accumulate operation of a multi-bit two's-complement input value and a multi-bit weight value, the bias value is the sum of the power-of-two place weights of the weight bits whose mapped value is 1.

To explain the technical solution of this in-memory computing method more clearly, a concrete chip example is given below:

This example chip combines a specific mapping scheme with XOR logic operations and uses training-time statistical bias values to accelerate the computation. In this example the XOR logic operation is divided into four operation modes: 1-bit input × 1-bit weight multiply-accumulate, multi-bit input × 1-bit weight multiply-accumulate, 1-bit input × multi-bit weight multiply-accumulate, and multi-bit input × multi-bit weight multiply-accumulate. The design of the multiplier module in this example chip is as follows:

a. For the multiply-accumulate operation of a 1-bit input value and a 1-bit weight value, as shown in FIG. 4, the mapped value 1 represents an input value or weight value of -1 and the mapped value 0 represents an input value or weight value of +1. After this mapping, multiplication can be replaced by an XOR logic operation; every result is exact, so no bias is needed and the XOR results are accumulated directly to obtain the multiply-accumulate result. At the hardware level, an XOR gate incurs far less circuit power and area than a multiplier while greatly improving computational efficiency.
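Case a can be sketched as follows (the function names `map_pm1`, `unmap`, and `mac_1b1b` are illustrative, not from the patent): mapping ±1 values to bits with 1 standing for -1 makes XOR compute the exact product, so a ±1 dot product needs no bias term.

```python
# Sketch of the 1-bit x 1-bit case: XOR on mapped bits replaces multiplication.

def map_pm1(v):
    """Map +1 -> 0 and -1 -> 1 (the encoding described above)."""
    return 0 if v == +1 else 1

def unmap(m):
    """Decode an XOR result back to a ±1 product."""
    return +1 if m == 0 else -1

def mac_1b1b(inputs, weights):
    """Multiply-accumulate over ±1 vectors using XOR instead of multiplication."""
    total = 0
    for x, w in zip(inputs, weights):
        xor = map_pm1(x) ^ map_pm1(w)   # one XOR gate per product
        total += unmap(xor)             # each XOR result is already exact
    return total
```

Because the XOR result equals the true product for every ±1 pair, the accumulated sum is exact with no correction.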

b. For the multiply-accumulate operation of a multi-bit two's-complement input value and a 1-bit weight value (FIG. 5 shows the 4-bit × 1-bit case), the mapped value 1 again represents a weight value of -1 and the mapped value 0 represents a weight value of +1. After this mapping, a bitwise XOR of the 4-bit two's-complement input value with the mapped 1-bit weight value gives a first-step result; whenever the mapped weight bit is 1, this result is 1 less than the true product. The number of weight values whose mapped value is 1 is therefore counted during training, and this count is added to the first-step result as a bias, yielding the exact multiply-accumulate result. At the hardware level, this method greatly reduces area overhead compared with the method of Scheme 2 in the background section, and counting the 1s during training further reduces circuit overhead.
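Case b can be sketched as follows (a model under stated assumptions: 4-bit two's-complement inputs; `mac_4b1b` and the helpers are illustrative names). XOR-ing all bits of x with the weight's mapped bit gives x when w = +1 and ~x = -x - 1 when w = -1, so adding a bias equal to the count of -1 weights restores the exact result.

```python
# Sketch of the 4-bit (two's-complement input) x 1-bit weight case with bias.

BITS = 4

def to_twos(x, bits=BITS):
    """Encode a signed int as an unsigned two's-complement bit pattern."""
    return x & ((1 << bits) - 1)

def from_twos(u, bits=BITS):
    """Decode an unsigned two's-complement pattern back to a signed int."""
    return u - (1 << bits) if u & (1 << (bits - 1)) else u

def mac_4b1b(inputs, weights):
    """inputs: signed ints in [-8, 7]; weights: ±1 values."""
    acc = 0
    bias = 0
    for x, w in zip(inputs, weights):
        m = 1 if w == -1 else 0                  # mapped weight bit
        mask = (1 << BITS) - 1 if m else 0
        acc += from_twos(to_twos(x) ^ mask)      # bitwise XOR "multiplication"
        bias += m                                # count weights mapped to 1
    return acc + bias                            # bias corrects the -1 offsets
```

Each XOR partial is off by exactly 1 when the weight is -1, so the single bias addition makes the accumulated result exact.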

c. For the multiply-accumulate operation of a 1-bit input value and a multi-bit two's-complement weight value (FIG. 6 shows the 1-bit × 4-bit case), the mapped value 1 likewise represents an input value of -1 and the mapped value 0 represents an input value of +1. After this mapping, a bitwise XOR of the mapped 1-bit input value with the 4-bit two's-complement weight value gives a first-step result; whenever the mapped input bit is 1, this result is 1 less than the true product. The number of input values whose mapped value is 1 is therefore counted during training, and this count is added to the first-step result as a bias, yielding the exact multiply-accumulate result. At the hardware level, this method greatly reduces area overhead compared with the method of Scheme 2 in the background section, and counting the 1s during training further reduces circuit overhead.

d. For the multiply-accumulate operation of a multi-bit two's-complement input value and a multi-bit weight value (FIG. 7 shows the 4-bit × 4-bit case), the mapped value 1 represents a weight bit of -1 and the mapped value 0 represents a weight bit of +1;

further, each bit position of the weight value is assigned a power-of-two place weight, each weight bit is replaced by its mapped value and XOR-multiplied with the input value, giving a first-step result;

further, the bit positions whose mapped weight value is 1 are counted, and their power-of-two place weights (each multiplied by 1) are added to the first-step result as a bias, yielding the exact multiply-accumulate result.
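Case d can be sketched as follows (a model under the description's weight encoding, in which each mapped bit m_i contributes ±2^i to the weight, with sign -1 when m_i = 1; `mul_4b4b` and the helpers are illustrative names). Each per-bit XOR partial is off by 1 whenever m_i = 1, so adding 2^i for every such bit restores the exact product.

```python
# Sketch of the multi-bit x multi-bit case: per-bit XOR partials weighted by
# powers of two, corrected by a power-of-two bias for every bit mapped to 1.

BITS = 4

def to_twos(x, bits=BITS):
    """Encode a signed int as an unsigned two's-complement bit pattern."""
    return x & ((1 << bits) - 1)

def from_twos(u, bits=BITS):
    """Decode an unsigned two's-complement pattern back to a signed int."""
    return u - (1 << bits) if u & (1 << (bits - 1)) else u

def mul_4b4b(x, weight_maps):
    """x: signed int in [-8, 7]; weight_maps: mapped weight bits m_i, LSB first.

    The weight is interpreted as w = sum over i of s_i * 2^i, with s_i = -1
    when m_i = 1 and s_i = +1 when m_i = 0 (the encoding assumed above).
    """
    acc = 0
    bias = 0
    mask = (1 << BITS) - 1
    for i, m in enumerate(weight_maps):
        xor = to_twos(x) ^ (mask if m else 0)    # per-bit XOR "multiplication"
        acc += (1 << i) * from_twos(xor)         # weight each partial by 2^i
        bias += (1 << i) * m                     # bias: 2^i for every m_i == 1
    return acc + bias

# weight_maps [1, 0, 0, 1] encodes w = -1 + 2 + 4 - 8 = -3
```

The same structure extends to wider inputs and weights by raising `BITS` and the length of `weight_maps`.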

This example chip achieves better resource scheduling across different in-memory computing cores, handles computations of different precisions, and greatly reduces the power consumption, area, and other overheads of a digital CIM chip.

本实施例相对于技术背景现有技术的优点主要有:The advantages of this embodiment over the prior art are mainly as follows:

1.方案1采用传统乘法器,效率较低、面积较大、功耗较高;本申请构建了一套基于不同映射方案、以异或逻辑运算代替乘法的乘累加计算范式;此范式能解决传统乘法器的诸多问题,一个异或乘法器采用的晶体管仅需10个,大大减小了电路的面积开销,提升了计算效率;1. Scheme 1 uses a traditional multiplier, which has low efficiency, large area and high power consumption. This application constructs a set of multiplication and accumulation calculation paradigms based on different mapping schemes and replaces multiplication with XOR logic operations. This paradigm can solve many problems of traditional multipliers. An XOR multiplier only needs 10 transistors, which greatly reduces the area overhead of the circuit and improves the calculation efficiency.

2.方案2可重构模式较为复杂,不易实现,且其以4个存内计算核为一个单元建立可重构模式,灵活性不足;其实现乘累加操作的方式为传统方式,能效较低、面积较大;本申请存内计算核通过控制数据流通路方案自由选择阵列组成单元,设计更灵活、更通用;采用16、32甚至更多存内计算核进行资源调度,可支持高至8比特、16比特权重输入位精度的运算;2. The reconfigurable mode of Scheme 2 is relatively complex and hard to implement, and it builds its reconfigurable mode from units of 4 in-memory computing cores, which limits flexibility; moreover, it implements the multiply-accumulate operation in the traditional way, with low energy efficiency and large area. The in-memory computing core of this application freely selects the array units by controlling the data flow path, making the design more flexible and more general; by scheduling resources across 16, 32, or even more in-memory computing cores, it can support operations with weight and input bit precision up to 8 bits or 16 bits;
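One way to picture scheduling several lower-precision cores to reach a higher weight precision is bit slicing. The Python sketch below is our illustration under assumed slicing details, not the chip's actual data path: an 8-bit two's-complement weight is split across two 4-bit slices, each standing in for one core, and the per-slice products are recombined with a shift-add:

```python
def split_weight(w8):
    """Split an 8-bit signed weight into (unsigned low nibble, signed high nibble)."""
    pattern = w8 & 0xFF          # 8-bit two's-complement pattern
    low = pattern & 0xF          # low slice, treated as unsigned
    high = pattern >> 4          # high slice carries the sign
    if high & 0x8:
        high -= 16
    return low, high

def two_core_mac(x, w8):
    """Recombine the two per-slice products: x*low + (x*high << 4)."""
    low, high = split_weight(w8)
    return x * low + ((x * high) << 4)

# The recombined result matches the full-precision product for every weight.
for w in range(-128, 128):
    for x in (-3, -1, 0, 2, 7):
        assert two_core_mac(x, w) == x * w
```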

3.方案3的同或逻辑运算仅支持1比特×1比特计算,不能扩展到多比特计算;异或门是结合一系列映射方案、计算模式下的选择,本申请通过异或计算逻辑实现了多比特的扩展,构建了一套全新的计算范式并完成了能够实现多精度的可重构式数字存算芯片设计;将完备的计算范式部署至存算芯片,可依据芯片上资源实现任意位精度的运算,支持更多神经网络模型推理的部署与应用。3. The XNOR logic operation of Scheme 3 supports only 1-bit × 1-bit computation and cannot be extended to multi-bit computation. The XOR gate here is a choice made in combination with a series of mapping schemes and computing modes: this application achieves the multi-bit extension through XOR computing logic, constructs a completely new computing paradigm, and completes the design of a reconfigurable digital in-memory computing chip capable of multiple precisions. Deploying this complete computing paradigm on the chip enables operations of arbitrary bit precision according to the on-chip resources, supporting the deployment and application of more neural network model inference.

实施例二Embodiment 2

本申请提出了一种芯片设计,如图8所示;包括:多个核;所述核包括:二值化模块、重塑模块、CIM核阵列、整合模块、单进程处理模块以及串行处理模块。The present application proposes a chip design, as shown in FIG. 8, including multiple cores; each core includes: a binarization module, a reshaping module, a CIM core array, an integration module, a single-process processing module, and a serial processing module.

二值化模块:用于将全局缓冲区传输进存内计算核的数据进行二值化;Binarization module: used to binarize the data transferred from the global buffer to the in-memory computing core;

具体的,依据神经网络的需求将全局缓冲区输入到存内计算核中的值量化到对应的比特位数;例如:8bit的数可以根据127为界量化到“0”或者“1”;Specifically, the values input from the global buffer into the in-memory computing core are quantized to the required number of bits according to the needs of the neural network; for example, an 8-bit number can be quantized to "0" or "1" with 127 as the boundary;
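As a concrete sketch of this quantization (the handling of the boundary value 127 itself is our assumption, since the text only gives 127 as the boundary):

```python
def binarize(values, threshold=127):
    """Quantize 8-bit values to 1 bit: 1 above the threshold, otherwise 0."""
    return [1 if v > threshold else 0 for v in values]

# 0 and 127 fall at or below the boundary; 128 and 255 fall above it.
assert binarize([0, 127, 128, 255]) == [0, 0, 1, 1]
```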

重塑模块:用于将二值化后的数据进行重新排列;Reshaping module: used to rearrange the binarized data;

具体的,排列成卷积的形式(3×3),方便进入存内计算核阵列中进行卷积(也就是乘累加)计算;Specifically, the data are arranged into convolutional form (3×3) so that they can enter the in-memory computing core array for convolution (i.e., multiply-accumulate) computation;
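The rearrangement can be sketched as an im2col-style patch extraction; stride 1 and no padding are assumptions here, since the text only specifies the 3×3 convolutional form:

```python
def reshape_3x3(feature_map):
    """Return the flattened 3x3 patches of a 2-D feature map in row-major order."""
    h, w = len(feature_map), len(feature_map[0])
    patches = []
    for r in range(h - 2):
        for c in range(w - 2):
            patches.append([feature_map[r + i][c + j]
                            for i in range(3) for j in range(3)])
    return patches

fm = [[x + 4 * y for x in range(4)] for y in range(4)]   # a 4x4 test map
patches = reshape_3x3(fm)
assert len(patches) == 4                                 # (4-2) * (4-2) patches
assert patches[0] == [0, 1, 2, 4, 5, 6, 8, 9, 10]        # top-left 3x3 window
```

Each flattened patch then lines up with a stored 3×3 kernel so that one convolution output reduces to one multiply-accumulate in the array.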

CIM(compute-in-memory,存算一体)核阵列:由存内计算核组成,其中存内计算核采用实施例一所述的存内计算方法计算重新排列后的数据;CIM (compute-in-memory) core array: composed of in-memory computing cores, where the in-memory computing cores compute the rearranged data using the in-memory computing method described in Embodiment 1;

具体的,在存内计算核中是SRAM阵列,通常SRAM阵列个数比较多(64*64、256*16等),这个是存储阵列,也称存储单元阵列;Specifically, the in-memory computing core contains an SRAM array; the array is usually fairly large (64*64, 256*16, etc.) and serves as the storage array, also called the storage cell array;

整合模块:用于整合经CIM核阵列计算后数据结果;Integration module: used to integrate the data results calculated by the CIM core array;

具体的,由于单个存内计算核处理的数据有限,需要进行数据整合后再输出结果;Specifically, since a single in-memory computing core can process only a limited amount of data, the data must be integrated before the results are output;

单进程处理模块:用于对整合结果进行并行处理(比如加偏置等);Single-process processing module: used for parallel processing of integration results (such as adding bias, etc.);

串行处理模块:用于对并行处理结果进行串行处理(比如串行相加等);Serial processing module: used to perform serial processing on parallel processing results (such as serial addition, etc.);

其中,二值化模块、重塑模块、CIM核阵列、整合模块、单进程处理模块以及串行处理模块依次顺序连接。Among them, the binarization module, the reshaping module, the CIM core array, the integration module, the single-process processing module and the serial processing module are connected in sequence.

数据处理流程为:The data processing flow is:

(1)外部输入值首先会通过全局缓冲区分别进入多个核,例如核1、核2、核3…核n,依据外部需求调整数据流通路,重构各核之间的资源调配与调度。(1) External input values first enter the multiple cores (core 1, core 2, core 3, …, core n) through the global buffer; the data flow paths are adjusted according to external requirements, reconfiguring the resource allocation and scheduling among the cores.

(2)每个核通过二值化模块、重塑模块、CIM核阵列、整合模块、单进程处理模块以及串行处理模块进行数据二值化、重塑、CIM核阵列、整合、并行处理、串行处理步骤。(2) Each core performs the binarization, reshaping, CIM-array computation, integration, parallel-processing, and serial-processing steps through its binarization module, reshaping module, CIM core array, integration module, single-process processing module, and serial processing module.

(3)经过核处理之后的数据传回全局缓冲区中作进一步处理。(3) The data processed by the kernel is transferred back to the global buffer for further processing.
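The per-core flow in steps (1)-(3) can be sketched as a chain of stages. Only the module ordering comes from the text; the stage bodies below are toy stand-ins rather than the actual hardware behavior:

```python
def make_core(stages):
    """Chain the six module stages into one core-level function."""
    def run(data):
        for stage in stages:
            data = stage(data)
        return data
    return run

core = make_core([
    lambda d: [1 if v > 127 else 0 for v in d],   # binarization module
    lambda d: [d],                                # reshaping module (trivial here)
    lambda d: [sum(p) for p in d],                # CIM core array (MAC stand-in)
    lambda d: sum(d),                             # integration module
    lambda d: d + 1,                              # single-process module: add a bias
    lambda d: d,                                  # serial processing module
])

assert core([0, 128, 255, 10]) == 3   # two activations above 127, plus bias 1
```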

相比现有技术,本实施例的有益效果在于:Compared with the prior art, the beneficial effects of this embodiment are:

通过本实施例的芯片设计设计,其中CIM核阵列中存内计算核采用实施例一所述的存内计算方法,实现了如实施例一同样的有益效果,另外本实施例还支持多种神经网络格式(比如二值化后就可以支持BNN的部署),其他精度的神经网络只是量化到对应精度。Through the chip design of this embodiment, in which the in-memory computing cores of the CIM core array adopt the in-memory computing method described in Embodiment 1, the same beneficial effects as in Embodiment 1 are achieved. In addition, this embodiment supports multiple neural network formats (for example, after binarization it can support the deployment of BNNs); neural networks of other precisions are simply quantized to the corresponding precision.

本领域技术人员在考虑说明书及实践这里公开的发明后,将容易想到本公开的其它实施方案。本申请旨在涵盖本公开的任何变型、用途或者适应性变化,这些变型、用途或者适应性变化遵循本公开的一般性原理并包括本公开未公开的本技术领域中的公知常识或惯用技术手段。说明书和实施例仅被视为示例性的,本公开的真正范围和精神由所附的权利要求指出。Those skilled in the art will readily appreciate other embodiments of the present disclosure after considering the specification and practicing the invention disclosed herein. This application is intended to cover any modification, use or adaptation of the present disclosure, which follows the general principles of the present disclosure and includes common knowledge or customary techniques in the art that are not disclosed in the present disclosure. The specification and examples are intended to be exemplary only, and the true scope and spirit of the present disclosure are indicated by the appended claims.

Claims (7)

Translated from Chinese
1.一种存内计算方法,在存内计算核中实现,所述存内计算核包括:静态随机存取存储器阵列、乘法器模块、加法器树模块和移位累加模块;其特征在于,包括:1. An in-memory computing method implemented in an in-memory computing core, the in-memory computing core comprising: a static random access memory array, a multiplier module, an adder tree module and a shift-accumulate module; characterized in that it comprises:

将输入值存入输入缓冲区模块,将权重值存入静态随机存取存储器阵列;storing input values into an input buffer module and storing weight values into the static random access memory array;

读出输入值和权重值并传输进乘法器模块中;reading out the input value and the weight value and transmitting them into the multiplier module;

在乘法器模块中,通过异或逻辑运算得到输入值和权重值的乘累加结果;in the multiplier module, obtaining the multiply-accumulate result of the input value and the weight value through an XOR logic operation;

在加法器树模块中,对乘累加结果进行加法计算;in the adder tree module, performing addition on the multiply-accumulate results;

在移位累加模块中,将多个周期内加法器树模块的计算结果分别与对应偏置值相加的结果进行累加,得到输出数据。in the shift-accumulate module, accumulating the results of adding the adder tree module's calculation results over multiple cycles to the corresponding offset values, to obtain the output data.

2.根据权利要求1所述的存内计算方法,其特征在于,所述异或逻辑运算的步骤,包括:2. The in-memory computing method according to claim 1, wherein the step of the XOR logic operation comprises:

步骤S3.1:判断输入值是否为1位,若是,执行步骤S3.2.A,若为否,则转向步骤S3.2.B;Step S3.1: Determine whether the input value is 1 bit; if so, execute step S3.2.A; if not, go to step S3.2.B;

步骤S3.2.A:判断权重值是否为1位,若是,执行步骤S3.3.A,若为否,则转向步骤S3.3.C;Step S3.2.A: Determine whether the weight value is 1 bit; if so, execute step S3.3.A; if not, go to step S3.3.C;

步骤S3.2.B:判断权重值是否为1位,若是,执行步骤S3.3.B,若为否,则转向步骤S3.3.C;Step S3.2.B: Determine whether the weight value is 1 bit; if so, execute step S3.3.B; if not, go to step S3.3.C;

步骤S3.3.A:进行1位输入值与1位权重值的乘累加操作;Step S3.3.A: Perform the multiply-accumulate operation of a 1-bit input value and a 1-bit weight value;

步骤S3.3.B:进行多位输入值与1位权重值的乘累加操作;Step S3.3.B: Perform the multiply-accumulate operation of a multi-bit input value and a 1-bit weight value;

步骤S3.3.C:进行1位输入值与多位权重值的乘累加操作;Step S3.3.C: Perform the multiply-accumulate operation of a 1-bit input value and a multi-bit weight value;

步骤S3.3.D:进行多位输入值与多位权重值的乘累加操作。Step S3.3.D: Perform the multiply-accumulate operation of a multi-bit input value and a multi-bit weight value.

3.根据权利要求2所述的存内计算方法,其特征在于,所述1位输入值与1位权重值的乘累加操作,具体为:3. The in-memory computing method according to claim 2, wherein the multiply-accumulate operation of the 1-bit input value and the 1-bit weight value is specifically:

使用映射值1分别代表输入值-1和权重值-1,映射值0分别代表输入值+1和权重值+1,对输入值的映射值与权重值的映射值进行异或逻辑运算得到第一乘累加结果。using the mapping value 1 to represent the input value -1 and the weight value -1, and the mapping value 0 to represent the input value +1 and the weight value +1, and performing an XOR logic operation on the mapping value of the input value and the mapping value of the weight value to obtain a first multiply-accumulate result.

4.根据权利要求2所述的存内计算方法,其特征在于,所述多位输入值与1位权重值的乘累加操作,具体为:4. The in-memory computing method according to claim 2, wherein the multiply-accumulate operation of the multi-bit input value and the 1-bit weight value is specifically:

对输入值与权重值进行按位异或逻辑运算得到第二乘累加结果,所述输入值采用补码形式输入。performing a bitwise XOR logic operation on the input value and the weight value to obtain a second multiply-accumulate result, wherein the input value is input in two's-complement form.

5.根据权利要求2所述的存内计算方法,其特征在于,所述1位输入值与多位权重值的乘累加操作,具体为:5. The in-memory computing method according to claim 2, wherein the multiply-accumulate operation of the 1-bit input value and the multi-bit weight value is specifically:

对输入值与权重值进行按位异或逻辑运算得到第二乘累加结果,所述权重值采用补码形式输入。performing a bitwise XOR logic operation on the input value and the weight value to obtain a second multiply-accumulate result, wherein the weight value is input in two's-complement form.

6.根据权利要求2所述的存内计算方法,其特征在于,所述多位输入值与多位权重值的乘累加操作,包括:6. The in-memory computing method according to claim 2, wherein the multiply-accumulate operation of the multi-bit input value and the multi-bit weight value comprises:

输入值采用补码形式输入,使用映射值1代表权重值-1,映射值0代表权重值+1;the input value is input in two's-complement form, with the mapping value 1 representing the weight value -1 and the mapping value 0 representing the weight value +1;

将权重值的不同位数按位赋以2的幂次方权重,每一位权重值用映射值代替,再与输入值进行按位异或逻辑运算得到第三乘累加结果。each bit of the weight value is assigned a power-of-2 positional weight, each weight bit is replaced by its mapping value, and a bitwise XOR logic operation is then performed with the input value to obtain a third multiply-accumulate result.

7.一种芯片设计,其特征在于,包括:多个核;所述核包括:二值化模块、重塑模块、CIM核阵列、整合模块、单进程处理模块以及串行处理模块;7. A chip design, characterized in that it comprises: a plurality of cores; each core comprising: a binarization module, a reshaping module, a CIM core array, an integration module, a single-process processing module and a serial processing module;

二值化模块:用于将全局缓冲区传输进存内计算核的数据进行二值化;the binarization module is used to binarize the data transferred from the global buffer into the in-memory computing core;

重塑模块:用于将二值化后的数据进行重新排列;the reshaping module is used to rearrange the binarized data;

CIM核阵列:由存内计算核组成,其中存内计算核采用权利要求1~6中任一权利要求所述的存内计算方法计算重新排列后的数据;the CIM core array is composed of in-memory computing cores, wherein the in-memory computing cores compute the rearranged data using the in-memory computing method according to any one of claims 1 to 6;

整合模块:用于整合经CIM核阵列计算后数据结果;the integration module is used to integrate the data results computed by the CIM core array;

单进程处理模块:用于对整合结果进行并行处理;the single-process processing module is used for parallel processing of the integration results;

串行处理模块:用于对并行处理结果进行串行处理。the serial processing module is used to perform serial processing on the parallel processing results.
CN202410469034.9A | 2024-04-18 | 2024-04-18 | In-memory computing method and chip design | Active | CN118349212B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202410469034.9A | CN118349212B (en) | 2024-04-18 | 2024-04-18 | In-memory computing method and chip design

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202410469034.9A | CN118349212B (en) | 2024-04-18 | 2024-04-18 | In-memory computing method and chip design

Publications (2)

Publication Number | Publication Date
CN118349212A true | 2024-07-16
CN118349212B (en) | 2024-12-31

Family

ID=91812959

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202410469034.9A | Active | CN118349212B (en) | 2024-04-18 | 2024-04-18 | In-memory computing method and chip design

Country Status (1)

Country | Link
CN (1) | CN118349212B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN119807122A (en)* | 2025-03-13 | 2025-04-11 | 深圳市迈迪杰电子科技有限公司 | Data transmission acceleration method and system for multi-core heterogeneous ASIC computing motherboard

Citations (7)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN111126579A (en)* | 2019-11-05 | 2020-05-08 | 复旦大学 | An in-memory computing device suitable for binary convolutional neural network computing
CN112711394A (en)* | 2021-03-26 | 2021-04-27 | 南京后摩智能科技有限公司 | Circuit based on digital domain memory computing
CN115080501A (en)* | 2022-04-27 | 2022-09-20 | 北京大学 | SRAM (static random Access memory) storage integrated chip based on local capacitance charge sharing
CN116484929A (en)* | 2023-03-14 | 2023-07-25 | 大连理工大学 | Point cloud target detection neural network accelerator based on FPGA and acceleration method
CN116561052A (en)* | 2023-04-14 | 2023-08-08 | 中科南京智能技术研究院 | Memory inner multiplication accumulation calculation circuit, method and memory
CN116821048A (en)* | 2022-03-22 | 2023-09-29 | 华为技术有限公司 | Integrated memory chip and operation method thereof
CN116932456A (en)* | 2022-06-28 | 2023-10-24 | 台湾积体电路制造股份有限公司 | Circuit, in-memory computing circuit and operation method

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN111126579A (en)* | 2019-11-05 | 2020-05-08 | 复旦大学 | An in-memory computing device suitable for binary convolutional neural network computing
CN112711394A (en)* | 2021-03-26 | 2021-04-27 | 南京后摩智能科技有限公司 | Circuit based on digital domain memory computing
WO2022199684A1 (en)* | 2021-03-26 | 2022-09-29 | 南京后摩智能科技有限公司 | Circuit based on digital domain in-memory computing
CN116821048A (en)* | 2022-03-22 | 2023-09-29 | 华为技术有限公司 | Integrated memory chip and operation method thereof
CN115080501A (en)* | 2022-04-27 | 2022-09-20 | 北京大学 | SRAM (static random Access memory) storage integrated chip based on local capacitance charge sharing
CN116932456A (en)* | 2022-06-28 | 2023-10-24 | 台湾积体电路制造股份有限公司 | Circuit, in-memory computing circuit and operation method
CN116484929A (en)* | 2023-03-14 | 2023-07-25 | 大连理工大学 | Point cloud target detection neural network accelerator based on FPGA and acceleration method
CN116561052A (en)* | 2023-04-14 | 2023-08-08 | 中科南京智能技术研究院 | Memory inner multiplication accumulation calculation circuit, method and memory

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
WENTE YI et al.: "RDCIM: RISC-V Supported Full-Digital Computing-in-Memory Processor With High Energy Efficiency and Low Area Overhead", 《IEEE》, 17 April 2024 (2024-04-17) *
孔鑫; 陈刚; 龚国良; 鲁华祥; 毛文宇: "一种面向卷积神经网络加速器的高性能乘累加器" [A high-performance multiply-accumulator for convolutional neural network accelerators], 西安电子科技大学学报 (Journal of Xidian University), no. 04, 30 April 2020 (2020-04-30) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN119807122A (en)* | 2025-03-13 | 2025-04-11 | 深圳市迈迪杰电子科技有限公司 | Data transmission acceleration method and system for multi-core heterogeneous ASIC computing motherboard
CN119807122B (en)* | 2025-03-13 | 2025-05-23 | 深圳市迈迪杰电子科技有限公司 | Data transmission acceleration method and system for multi-core heterogeneous ASIC computing main board

Also Published As

Publication number | Publication date
CN118349212B (en) | 2024-12-31

Similar Documents

Publication | Publication Date | Title
CN112711394B (en) | Circuit based on digital domain memory computing
CN111831254B (en) | Image processing acceleration method, image processing model storage method and corresponding device
CN112114776A (en) | Quantum multiplication method and device, electronic device and storage medium
CN112698811B (en) | Neural network random number generator shared circuit, shared method, and processor chip
CN114707647B (en) | Precision lossless calculation integrated device and method suitable for multi-precision neural network
CN112862091A (en) | Resource multiplexing type neural network hardware accelerating circuit based on quick convolution
CN115018062A (en) | An FPGA-based Convolutional Neural Network Accelerator
CN118349212A (en) | An in-memory computing method and chip design
CN112836813A (en) | A Reconfigurable Systolic Array System for Mixed-Precision Neural Network Computing
CN213934855U (en) | A Neural Network Random Number Generator Shared Circuit Based on Random Computing
CN116362314A (en) | A storage and calculation integrated device and calculation method
Shu et al. | High energy efficiency FPGA-based accelerator for convolutional neural networks using weight combination
CN115495152A (en) | Memory computing circuit with variable length input
CN117271436A (en) | SRAM-based current mirror complementary in-memory calculation macro circuit and chip
CN114970831B (en) | Digital-analog mixed storage and calculation integrated equipment
CN102201817B (en) | Low-power-consumption LDPC decoder based on optimization of memory folding architecture
CN118034643B (en) | Carry-free multiplication and calculation array based on SRAM
CN115756388B (en) | Multi-mode storage and calculation integrated circuit, chip and calculation device
CN111710356A (en) | Encoding type flash memory device and encoding method
CN107679010B (en) | Operator mapping system and method for reconfigurable computing array
CN117332809A (en) | Neural network inference chips, methods and terminal equipment
CN114237551A (en) | A systolic array-based multi-precision accelerator and its data processing method
CN114722000B (en) | A coarse-grained reconfigurable processor architecture based on stochastic computing
Yang et al. | DATIC: A Data-Aware Time-Domain Computing-in-Memory-Based CNN Processor with Dynamic Channel Skipping and Mapping
CN111258641A (en) | Computing method, device and related products

Legal Events

Date | Code | Title | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
