CN113867792A - Computing device, integrated circuit chip, board card, electronic equipment and computing method - Google Patents

Computing device, integrated circuit chip, board card, electronic equipment and computing method

Info

Publication number
CN113867792A
Authority
CN
China
Prior art keywords
data, array, processing circuits, processing, dimensional
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010619458.0A
Other languages
Chinese (zh)
Inventor
Not disclosed (不公告发明人)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Cambricon Information Technology Co Ltd
Original Assignee
Shanghai Cambricon Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Cambricon Information Technology Co Ltd
Priority to CN202010619458.0A
Priority to PCT/CN2021/095703 (WO2022001498A1)
Publication of CN113867792A

Abstract


The present disclosure discloses a computing device, an integrated circuit chip, a board, and a method for performing computing operations using the aforementioned computing device. Wherein the computing device may be included in a combined processing device, and the combined processing device may also include a universal interconnection interface and other processing devices. The computing device interacts with other processing devices to jointly complete the computing operation specified by the user. The combined processing device may further include storage devices, which are respectively connected with the device and other processing devices, and are used for storing data of the device and other processing devices. The solution of the present disclosure can improve the operation efficiency of various data processing fields including artificial intelligence, thereby reducing the overall overhead and cost of the operation.


Description

Computing device, integrated circuit chip, board card, electronic equipment and computing method
Technical Field
The present disclosure relates generally to the field of computing. More particularly, the present disclosure relates to a computing device, an integrated circuit chip, a board, an electronic apparatus, and a computing method.
Background
In computing systems, an instruction set is a set of instructions for performing computations and controlling the computing system, and plays a critical role in improving the performance of a computing chip (e.g., a processor) in the computing system. Various types of computing chips (particularly those in the field of artificial intelligence) currently utilize associated instruction sets to perform various general or specific control operations and data processing operations. However, current instruction sets suffer from a number of drawbacks. For example, existing instruction sets are limited by hardware architectures and perform poorly in terms of flexibility. Further, many instructions can each complete only a single operation, so multiple operations often require multiple instructions, potentially leading to increased on-chip I/O data throughput. In addition, current instructions leave room for improvement in execution speed, execution efficiency, and chip power consumption.
In addition, the arithmetic instructions of a conventional CPU are designed to perform basic single-data scalar arithmetic operations, where a single-data operation refers to an instruction whose every operand is a scalar datum. However, in tasks such as image processing and pattern recognition, the operands involved are often multidimensional vector data types (i.e., tensor data), and hardware restricted to scalar operations cannot complete such operation tasks efficiently. Therefore, how to efficiently execute multidimensional tensor operations is also an urgent problem to be solved in the current computing field.
Disclosure of Invention
To address at least the above-identified problems in the prior art, the present disclosure provides a hardware architecture with an array of processing circuits. By utilizing this hardware architecture to execute computing instructions, aspects of the present disclosure may achieve technical advantages in a number of respects, including enhancing the processing performance of hardware, reducing power consumption, increasing the execution efficiency of computing operations, and avoiding extra computation overhead. Further, the disclosed solution supports efficient access and processing of tensor data on the basis of the aforementioned hardware architecture, thereby accelerating tensor operations and reducing the computation overhead they incur when computation instructions include multidimensional vector operands.
In a first aspect, the present disclosure provides a computing device comprising: a processing circuit array formed by connecting a plurality of processing circuits in a one-dimensional or multi-dimensional array structure, wherein the processing circuit array is configured as a plurality of processing circuit sub-arrays and performs a multi-threaded operation in response to receiving a plurality of operation instructions obtained by parsing a computation instruction received by the computing device, wherein an operand of the computation instruction includes a descriptor indicating the shape of a tensor, the descriptor being used to determine a storage address of the data corresponding to the operand, and wherein at least one processing circuit sub-array is configured to execute at least one of the plurality of operation instructions according to the storage address.
In a second aspect, the present disclosure provides an integrated circuit chip comprising a computing device as described above and in a number of embodiments below.
In a third aspect, the present disclosure provides a board card comprising an integrated circuit chip as described above and in the following embodiments.
In a fourth aspect, the present disclosure provides an electronic device comprising an integrated circuit chip as described above and in a number of embodiments below.
In a fifth aspect, the present disclosure provides a method of performing a computation using the aforementioned computing device, wherein the computing device includes a processing circuit array formed by connecting a plurality of processing circuits in a one-dimensional or multi-dimensional array structure, and the processing circuit array is configured as a plurality of processing circuit sub-arrays, the method comprising: receiving a computation instruction at the computing device and parsing it to obtain a plurality of operation instructions, wherein an operand of the computation instruction comprises a descriptor indicating the shape of a tensor, the descriptor being used to determine a storage address of the data corresponding to the operand; and, in response to receiving the plurality of operation instructions, performing a multi-threaded operation with the plurality of processing circuit sub-arrays, wherein at least one of the processing circuit sub-arrays is configured to execute at least one of the plurality of operation instructions according to the storage address.
By using the computing device, integrated circuit chip, board, electronic device and method of the present disclosure, an appropriate processing circuit array can be constructed according to the computing requirements, so that the computing instructions can be executed efficiently, the computing overhead can be reduced, and the throughput of I/O data can be reduced. In addition, since the processing circuit of the present disclosure can be configured to support corresponding operations according to the operation requirements, the number of operands of the calculation instruction of the present disclosure can be increased or decreased according to the operation requirements, and the type of the operation code can be arbitrarily selected and combined in the operation types supported by the processing circuit matrix, thereby expanding the application scenarios and the adaptability of the hardware architecture.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. In the drawings, several embodiments of the disclosure are illustrated by way of example and not by way of limitation, and like or corresponding reference numerals indicate like or corresponding parts and in which:
FIG. 1a is a block diagram illustrating a computing device according to one embodiment of the present disclosure;
FIG. 1b is a schematic diagram illustrating a data storage space according to one embodiment of the present disclosure;
FIG. 2a is a block diagram illustrating a computing device according to another embodiment of the present disclosure;
FIG. 2b is a block diagram illustrating a computing device according to yet another embodiment of the present disclosure;
FIG. 3 is a block diagram illustrating a computing device according to yet another embodiment of the present disclosure;
FIG. 4 is an example block diagram illustrating an array of various types of processing circuits of a computing device in accordance with embodiments of the disclosure;
FIGS. 5a, 5b, 5c and 5d are schematic diagrams illustrating various connections of processing circuits according to embodiments of the present disclosure;
FIGS. 6a, 6b, 6c and 6d are schematic diagrams illustrating further various connections of processing circuits according to embodiments of the present disclosure;
FIGS. 7a, 7b, 7c and 7d are schematic diagrams illustrating various looping structures of processing circuits according to embodiments of the present disclosure;
FIGS. 8a, 8b and 8c are schematic diagrams illustrating additional various looping structures of processing circuits according to embodiments of the present disclosure;
FIGS. 9a, 9b, 9c and 9d are schematic diagrams illustrating data stitching operations performed by pre-operation circuitry according to embodiments of the present disclosure;
FIGS. 10a, 10b and 10c are schematic diagrams illustrating data compression operations performed by post-operation circuitry according to embodiments of the present disclosure;
FIG. 11 is a simplified flow diagram illustrating a method of performing an arithmetic operation using a computing device in accordance with an embodiment of the present disclosure;
FIG. 12 is a block diagram illustrating a combined processing device according to an embodiment of the present disclosure; and
FIG. 13 is a schematic diagram illustrating a structure of a board card according to an embodiment of the disclosure.
Detailed Description
The disclosed solution provides a hardware architecture that supports multi-threaded operations. When the hardware architecture is implemented in a computing device, the computing device includes at least a plurality of processing circuits, wherein the plurality of processing circuits are connected according to different configurations to form a one-dimensional or multi-dimensional array structure. Depending on the implementation, the processing circuit array may be configured as a plurality of processing circuit sub-arrays, and each processing circuit sub-array may be configured to execute at least one of a plurality of operation instructions. When tensor operations are involved, the operands of the computation instructions of the present disclosure may include descriptors indicating the shape of a tensor, which may be used to determine the storage address of the data (e.g., the tensor) corresponding to the operand, so that a processing circuit sub-array may read and save the tensor data at that storage address to perform tensor operations associated with the tensor. By means of the hardware architecture and the operation instructions disclosed herein, computing operations including tensor operations can be executed efficiently, the application scenarios of computation are expanded, and the computation overhead is reduced.
The technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some, not all embodiments of the present disclosure. All other embodiments, which can be derived by one skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the scope of protection of the present disclosure.
FIG. 1a illustrates a block diagram of a computing device 80 according to an embodiment of the present disclosure. As shown in FIG. 1a, the computing device 80 includes a processing circuit array formed by a plurality of processing circuits 104. Specifically, the plurality of processing circuits are connected in a two-dimensional array structure to form the processing circuit array, which includes a plurality of processing circuit sub-arrays, such as the one-dimensional processing circuit sub-arrays M1, M2, ..., Mn shown in the figure. It should be understood that the two-dimensional processing circuit array and the plurality of one-dimensional processing circuit sub-arrays included therein are merely exemplary and not restrictive; the processing circuit array of the present disclosure may be configured with an array structure of different dimensions according to different operation scenarios, and one or more closed loops may be formed within a processing circuit sub-array or between a plurality of processing circuit sub-arrays, as in the exemplary connections shown in FIGS. 5 to 8 described later.
In one embodiment, in response to receiving a plurality of operation instructions, the processing circuit array of the present disclosure may be configured to perform multi-threaded operations, such as executing single instruction, multiple threads ("SIMT") instructions. Further, each of the processing circuit sub-arrays may be configured to execute at least one of the aforementioned plurality of operation instructions. In the context of the present disclosure, the aforementioned operation instructions may be microinstructions or control signals executed within a computing device (or processing circuit, processor), which may include (or indicate) one or more arithmetic operations to be executed by the computing device. The arithmetic operations may include, but are not limited to, various operations such as addition operations, multiplication operations, convolution operations, pooling operations, and the like, and may also involve tensor operations, according to different operational scenarios. To this end, the operands of the computation instructions of the present disclosure may include descriptors that indicate the shape of a tensor. By using the storage address determined by the descriptor, the one or more processing circuit sub-arrays executing an operation instruction can quickly access one or more tensors (or tensor data) to be used in the arithmetic operation.
In one embodiment, the plurality of operation instructions may include at least one multi-stage pipelined operation. In one scenario, such a multi-stage pipelined operation may include at least two operation instructions. Depending on different execution requirements, the operation instructions of the present disclosure may include predicates, and each of the processing circuits determines whether to execute the operation instruction associated with it according to the predicate. Depending on its configuration, the processing circuit disclosed herein can flexibly perform various types of operations including, but not limited to, arithmetic operations, logical operations, comparison operations, and table lookup operations.
Taking the processing circuit array shown in FIG. 1a and the sub-arrays M1 to Mn included therein as an example, each processing circuit sub-array may perform one stage of an n-stage pipelined operation: the processing circuit sub-array M1 may act as the first-stage pipeline operation unit in the pipelined operation, and the processing circuit sub-array M2 may act as the second-stage pipeline operation unit. By analogy, the processing circuit sub-array Mn may act as the nth-stage pipeline operation unit. In executing the n-stage pipelined operation, execution may begin at the first-stage pipeline operation unit and proceed stage by stage until the n-stage pipelined operation is completed.
From the above exemplary description of a processing circuit sub-array, it can be understood that the processing circuit array of the present disclosure may, in some scenarios, be a one-dimensional array, with one or more processing circuits in the array configured as one processing circuit sub-array. In other scenarios, the processing circuit array of the present disclosure is a two-dimensional array, in which one or more rows of processing circuits are configured as one processing circuit sub-array; or one or more columns of processing circuits are configured as one processing circuit sub-array; or one or more diagonals of processing circuits are configured as one processing circuit sub-array, as sketched below.
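As an illustration only, the following C++ sketch shows one way such a grouping of a two-dimensional array into sub-arrays might be expressed; the function assign_subarrays, the GroupBy enumeration, and the wrapped-diagonal convention are assumptions made for illustration, not the patent's actual configuration mechanism (which is carried out through configuration registers and instructions).

    #include <cstddef>
    #include <vector>

    // Hypothetical grouping modes mirroring the row / column / diagonal
    // options described above (illustrative names, not from the patent).
    enum class GroupBy { Row, Column, Diagonal };

    // Returns, for each processing circuit (r, c) in an R x C array, the
    // index of the sub-array it is assigned to. Diagonals wrap modulo R,
    // so each wrapped diagonal forms one sub-array.
    std::vector<std::vector<int>> assign_subarrays(std::size_t R, std::size_t C,
                                                   GroupBy mode) {
      std::vector<std::vector<int>> id(R, std::vector<int>(C));
      for (std::size_t r = 0; r < R; ++r) {
        for (std::size_t c = 0; c < C; ++c) {
          switch (mode) {
            case GroupBy::Row:      id[r][c] = static_cast<int>(r); break;
            case GroupBy::Column:   id[r][c] = static_cast<int>(c); break;
            case GroupBy::Diagonal: id[r][c] = static_cast<int>((r + c) % R); break;
          }
        }
      }
      return id;
    }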
To implement multi-stage pipelined operations, the present disclosure may also provide corresponding computation instructions, and configure and construct the processing circuit array based on those instructions. Depending on the operational scenario, a computation instruction of the present disclosure may include a plurality of opcodes, which may represent a plurality of operations performed by the processing circuit array. For example, when n in FIG. 1a is 4 (i.e., when performing a 4-stage pipelined operation), the computation instruction according to the disclosed aspect may be represented by the following equation (1):
Result = convert((((src0 op0 src1) op1 src2) op2 src3) op3 src4)    (1)
where src0 to src4 are source operands (which, in some computation scenarios, may be tensors represented by descriptors of the present disclosure), op0 to op3 are opcodes, and convert denotes a data conversion operation performed on the data obtained after executing opcode op3. According to various embodiments, the aforementioned data conversion operation may be performed by processing circuits in the processing circuit array, or by an additional operation circuit, such as the post-operation circuit described in detail later in connection with FIG. 3. According to the scheme of the disclosure, since the processing circuits can be configured to support corresponding operations according to the operation requirements, the number of operands of the computation instruction of the disclosure can be increased or decreased according to the operation requirements, and the opcode types can be arbitrarily selected and combined from the operation types supported by the processing circuit matrix.
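To make the evaluation order of equation (1) concrete, here is a minimal C++ sketch that folds the source operands through the opcode chain and applies the final conversion; scalar doubles stand in for tensor operands, and all names are illustrative assumptions rather than the patent's interfaces.

    #include <cstddef>
    #include <functional>
    #include <vector>

    using BinOp = std::function<double(double, double)>;

    // Evaluates convert((((src0 op0 src1) op1 src2) op2 src3) op3 src4).
    // Each iteration corresponds to one pipeline stage handled by one
    // processing circuit sub-array in the n-stage pipeline described above.
    double evaluate_equation1(const std::vector<double>& src,      // src0..src4
                              const std::vector<BinOp>& ops,       // op0..op3
                              const std::function<double(double)>& convert) {
      double acc = src[0];
      for (std::size_t i = 0; i < ops.size(); ++i) {
        acc = ops[i](acc, src[i + 1]);  // stage i consumes source operand i+1
      }
      return convert(acc);              // final data conversion step
    }

With the ops chosen as multiply and add, this reduces to the fused multiply-add pipelines of Examples 1 and 2 given later.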
According to different application scenarios, the connection between the processing circuits of the present disclosure may be a hardware-based configuration connection (or "hard connection"), or a logical configuration connection (or "soft connection") may be performed through software configuration (e.g., through configuration instructions) based on a specific hardware connection. In one embodiment, the array of processing circuits may form a closed loop in at least one of the one-dimensional or multi-dimensional directions, i.e. a "looped structure" in the context of the present disclosure.
As described above, the arithmetic operation of the present disclosure further includes acquiring information about tensor shape using descriptors to determine storage addresses of tensor data, thereby acquiring and saving the tensor data by the aforementioned storage addresses.
In one possible implementation, the shape of N-dimensional tensor data may be indicated by a descriptor, where N is a positive integer, e.g., N = 1, 2, or 3, or zero. A tensor may have various forms of data composition and different dimensionalities: for example, a scalar can be regarded as a 0-dimensional tensor, a vector as a 1-dimensional tensor, and a matrix as a 2-dimensional tensor, while tensors may also have more than 2 dimensions. The shape of a tensor includes information such as the dimensions of the tensor and the size of each dimension. For example, for a tensor:
[a 2 × 4 tensor example, shown as an image in the original]
the shape of the tensor can be described by a descriptor as (2, 4), i.e. the tensor is represented by two parameters as a two-dimensional tensor, with the size of the first dimension (column) of the tensor being 2 and the size of the second dimension (row) being 4. It should be noted that the manner in which the descriptors indicate the tensor shape is not limited in the present application.
In one possible implementation, the value of N may be determined according to the dimension (order) of the tensor data, or may be set according to the usage requirement of the tensor data. For example, when the value of N is 3, the tensor data is three-dimensional tensor data, and the descriptor may be used to indicate the shape (e.g., offset, size, etc.) of the three-dimensional tensor data in three dimensional directions. It should be understood that the value of N can be set by those skilled in the art according to practical needs, and the disclosure does not limit this.
In one possible implementation, the descriptor may include an identifier of the descriptor and/or the content of the descriptor. The identifier of the descriptor is used to distinguish descriptors from one another; for example, the identifier may be the descriptor's number. The content of the descriptor may include at least one shape parameter representing the shape of the tensor data. For example, if the tensor data is 3-dimensional and the shape parameters of two of its three dimensions are fixed, the content of its descriptor may include a shape parameter representing the remaining dimension.
In one possible implementation, the identity and/or content of the descriptor may be stored in a descriptor storage space (internal memory), such as a register, an on-chip SRAM or other media cache, or the like. The tensor data indicated by the descriptors may be stored in a data storage space (internal memory or external memory), such as an on-chip cache or an off-chip memory, etc. The present disclosure does not limit the specific locations of the descriptor storage space and the data storage space.
In one possible implementation, the identifier, the content, and the tensor data indicated by the descriptor may be stored in the same block of internal memory; for example, a contiguous block of on-chip cache at addresses ADDR0-ADDR1023 may be used to store the relevant content of the descriptors. Within that block, addresses ADDR0-ADDR63 can be used as the descriptor storage space to store the identifiers and contents of the descriptors, and addresses ADDR64-ADDR1023 can be used as the data storage space to store the tensor data indicated by the descriptors. Within the descriptor storage space, the identifiers of the descriptors may be stored at addresses ADDR0-ADDR31, and the contents of the descriptors at addresses ADDR32-ADDR63. It should be understood that ADDR is not limited to one bit or one byte; it is used here to denote one address, i.e., one address unit. The descriptor storage space, the data storage space, and their specific addresses may be determined by those skilled in the art in practice, and the present disclosure is not limited in this respect.
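Assuming one element per address unit, the split described above can be captured by a few constants, as in this illustrative C++ sketch (the names are assumptions):

    #include <cstdint>

    // Address split of the contiguous on-chip cache block described above:
    //   ADDR0  - ADDR31   descriptor identifiers
    //   ADDR32 - ADDR63   descriptor contents
    //   ADDR64 - ADDR1023 tensor data indicated by the descriptors
    constexpr std::uint32_t kDescriptorIdBase      = 0;     // ADDR0
    constexpr std::uint32_t kDescriptorContentBase = 32;    // ADDR32
    constexpr std::uint32_t kTensorDataBase        = 64;    // ADDR64
    constexpr std::uint32_t kStorageEnd            = 1024;  // one past ADDR1023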
In one possible implementation, the identifier, the content, and the tensor data indicated by the descriptor may be stored in different areas of internal memory. For example, a register may be used as the descriptor storage space to store the identifier and content of the descriptor, while an on-chip cache is used as the data storage space to store the tensor data indicated by the descriptor.
In one possible implementation, where a register is used to store the identity and content of a descriptor, the number of the register may be used to represent the identity of the descriptor. For example, when the number of the register is 0, the identifier of the descriptor stored therein is set to 0. When the descriptor in the register is valid, an area in the buffer space can be allocated for storing the tensor data according to the size of the tensor data indicated by the descriptor.
In one possible implementation, the identity and content of the descriptors may be stored in an internal memory and the tensor data indicated by the descriptors may be stored in an external memory. For example, the identification and content of the descriptors may be stored on-chip, and the tensor data indicated by the descriptors may be stored under-chip.
In one possible implementation, the data address of the data storage space corresponding to each descriptor may be a fixed address. For example, separate data storage spaces may be divided for tensor data, with the start address of each data storage space corresponding one-to-one to a descriptor. In this case, a circuit or module responsible for parsing the computation instruction (e.g., an entity external to the disclosed computing device, or the control circuit 102 shown in FIGS. 2a to 3) may determine, from the descriptor, the data address in the data storage space of the data corresponding to the operand.
In one possible implementation, when the data address of the data storage space corresponding to the descriptor is a variable address, the descriptor may further be used to indicate the address of the N-dimensional tensor data, in which case the content of the descriptor may further include at least one address parameter indicating that address. For example, if the tensor data is 3-dimensional and the descriptor points to its address, the content of the descriptor may include one address parameter indicating the address of the tensor data, such as its starting physical address, or may include a plurality of address parameters, such as the start address of the tensor data plus an address offset, or address parameters for each dimension of the tensor data. The address parameters can be set by those skilled in the art according to practical needs, and the disclosure does not limit this.
In one possible implementation, the address parameter of the tensor data may include a reference address of a data reference point of the descriptor in a data storage space of the tensor data. Wherein the reference address may be different according to a variation of the data reference point. The present disclosure does not limit the selection of data reference points.
In one possible implementation, the base address may include a start address of the data storage space. When the data reference point of the descriptor is the first data block of the data storage space, the reference address of the descriptor is the start address of the data storage space. When the data reference point of the descriptor is data other than the first data block in the data storage space, the reference address of the descriptor is the address of the data block in the data storage space.
In one possible implementation, the shape parameters of the tensor data include at least one of: the size of the data storage space in at least one of N dimensional directions, the size of the storage area in at least one of N dimensional directions, the offset of the storage area in at least one of N dimensional directions, the positions of at least two vertices located at diagonal positions in the N dimensional directions relative to the data reference point, and the mapping relationship between the data description positions of tensor data indicated by the descriptors and the data addresses. Where the data description position is a mapping position of a point or a region in the tensor data indicated by the descriptor, for example, when the tensor data is 3-dimensional data, the descriptor may represent a shape of the tensor data using three-dimensional space coordinates (x, y, z), and the data description position of the tensor data may be a position of a point or a region in the three-dimensional space to which the tensor data is mapped, which is represented using three-dimensional space coordinates (x, y, z).
It should be understood that shape parameters representing tensor data can be selected by one skilled in the art based on practical considerations, which are not limited by the present disclosure. By using the descriptor in the data access process, the association between the data can be established, thereby reducing the complexity of data access and improving the instruction processing efficiency.
In one possible implementation, the content of the descriptor of the tensor data may be determined according to a reference address of a data reference point of the descriptor in a data storage space of the tensor data, a size of the data storage space in at least one of N dimensional directions, a size of the storage area in at least one of the N dimensional directions, and/or an offset of the storage area in at least one of the N dimensional directions.
FIG. 1b shows a schematic diagram of a data storage space according to an embodiment of the present disclosure. As shown in FIG. 1b, the data storage space 21 stores two-dimensional data in a row-first manner, which can be represented by (X, Y) (where the X axis extends horizontally to the right and the Y axis extends vertically downward). The size in the X-axis direction (the size of each row) is ori_x (not shown in the figure), the size in the Y-axis direction (the total number of rows) is ori_y (not shown in the figure), and the start address PA_start (the reference address) of the data storage space 21 is the physical address of the first data block 22. The data block 23 is partial data in the data storage space 21; its offset 25 in the X-axis direction is denoted offset_x, its offset 24 in the Y-axis direction is denoted offset_y, its size in the X-axis direction is denoted size_x, and its size in the Y-axis direction is denoted size_y.
In a possible implementation, when the descriptor is used to define the data block 23, the data reference point of the descriptor may be the first data block of the data storage space 21, and the reference address of the descriptor may be agreed to be the start address PA_start of the data storage space 21. The content of the descriptor of the data block 23 may then be determined from the size ori_x of the data storage space 21 along the X axis and the size ori_y along the Y axis, together with the offset offset_y of the data block 23 in the Y-axis direction, the offset offset_x in the X-axis direction, the size size_x in the X-axis direction, and the size size_y in the Y-axis direction.
In one possible implementation, the content of the descriptor can be represented using the following equation (2):
[Equation (2), shown as an image in the original: the descriptor content expressed in terms of ori_x, ori_y, offset_x, offset_y, size_x and size_y]
it should be understood that although the content of the descriptor is represented by a two-dimensional space in the above examples, a person skilled in the art can set the specific dimension of the content representation of the descriptor according to practical situations, and the disclosure does not limit this.
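Gathering the six parameters discussed around FIG. 1b into one structure gives a convenient reference point for the address formulas that follow; this C++ sketch is an assumed encoding for illustration, not the patent's concrete descriptor format.

    // Two-dimensional descriptor content implied by FIG. 1b and equation (2).
    struct Descriptor2D {
      int ori_x;     // size of each row of the data storage space (X direction)
      int ori_y;     // total number of rows of the data storage space (Y direction)
      int offset_x;  // offset of the described block in the X direction
      int offset_y;  // offset of the described block in the Y direction
      int size_x;    // size of the described block in the X direction
      int size_y;    // size of the described block in the Y direction
    };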
In one possible implementation manner, a reference address of a data reference point of the descriptor in the data storage space may be defined, and based on the reference address, the content of the descriptor of the tensor data is determined according to the positions of at least two vertexes located at diagonal positions in the N-dimensional directions relative to the data reference point.
For example, a reference address PA_base of the data reference point of the descriptor in the data storage space may be agreed upon. For instance, one datum (for example, the datum at position (2, 2)) may be selected as the data reference point in the data storage space 21, and its physical address in the data storage space used as the reference address PA_base. The content of the descriptor of the data block 23 in FIG. 1b can then be determined from the positions of two diagonal vertices relative to the data reference point. First, the positions of at least two diagonal vertices of the data block 23 relative to the data reference point are determined, for example using the diagonal vertices in the top-left-to-bottom-right direction, where the relative position of the top-left vertex is (x_min, y_min) and the relative position of the bottom-right vertex is (x_max, y_max). The content of the descriptor of the data block 23 can then be determined from the reference address PA_base, the relative position (x_min, y_min) of the top-left vertex, and the relative position (x_max, y_max) of the bottom-right vertex.
In one possible implementation, the content of the descriptor (with reference address PA_base) can be represented using the following equation (3):
[Equation (3), shown as an image in the original: the descriptor content expressed in terms of the reference address PA_base and the relative vertex positions (x_min, y_min) and (x_max, y_max)]
it should be understood that although the above examples use the vertex of two diagonal positions of the upper left corner and the lower right corner to determine the content of the descriptor, the skilled person can set the specific vertex of at least two vertices of the diagonal positions according to the actual needs, and the disclosure does not limit this.
In one possible implementation manner, the content of the descriptor of the tensor data can be determined according to a reference address of the data reference point of the descriptor in the data storage space and a mapping relation between the data description position and the data address of the tensor data indicated by the descriptor. For example, when tensor data indicated by the descriptor is three-dimensional space data, the mapping relationship between the data description position and the data address may be defined by using a function f (x, y, z).
In one possible implementation, the content of the descriptor can be represented using the following equation (4):
[Equation (4), shown as an image in the original: the descriptor content expressed in terms of the reference address and the mapping function f(x, y, z)]
in one possible implementation, the descriptor is further configured to indicate an address of the N-dimensional tensor data, where the content of the descriptor further includes at least one address parameter indicating the address of the tensor data, for example, the content of the descriptor may be:
[Descriptor content shown as an image in the original: the shape parameters together with the address parameter PA]
where PA is the address parameter, which may be a logical address or a physical address. The descriptor parsing circuit may obtain the corresponding data address by taking PA as any one of a vertex, a midpoint, or a preset point of the tensor shape, in combination with the shape parameters in the X and Y directions.
In one possible implementation, the address parameter of the tensor data includes a reference address of a data reference point of the descriptor in a data storage space of the tensor data, and the reference address includes a start address of the data storage space.
In one possible implementation, the descriptor may further include at least one address parameter representing an address of the tensor data, for example, the content of the descriptor may be:
[Descriptor content shown as an image in the original: the shape parameters together with the reference address parameter PA_start]
where PA_start is the reference address parameter, which is not described again here.
It should be understood that, the mapping relationship between the data description location and the data address can be set by those skilled in the art according to practical situations, and the disclosure does not limit this.
In a possible implementation, a default reference address can be set for a task, to be used by the descriptors in the instructions of that task, and the descriptor content may then include shape parameters based on that reference address. This reference address may be determined by setting an environment parameter for the task. The relevant description and usage of the reference address can be found in the above embodiments. In this implementation, the content of the descriptor can be mapped to the data address more quickly.
In one possible implementation, the reference address may be included in the content of each descriptor, and the reference address of each descriptor may be different. Compared with a mode of setting a common reference address by using environment parameters, each descriptor in the mode can describe data more flexibly and use a larger data address space.
In one possible implementation, the data address in the data storage space of the data corresponding to the operand of the processing instruction may be determined according to the content of the descriptor. The calculation of the data address is automatically completed by hardware, and the calculation methods of the data address are different when the content of the descriptor is represented in different ways. The present disclosure does not limit the specific calculation method of the data address.
For example, suppose the content of the descriptor in the operand is expressed by equation (2), the offsets of the tensor data indicated by the descriptor in the data storage space are offset_x and offset_y respectively, and the size is size_x × size_y. Then the start data address PA1_(x,y) of the tensor data indicated by the descriptor in the data storage space can be determined using the following equation (5):

PA1_(x,y) = PA_start + (offset_y - 1) * ori_x + offset_x    (5)

From the data start address PA1_(x,y) determined according to equation (5), combined with the offsets offset_x and offset_y and the sizes size_x and size_y of the storage area, the storage area of the tensor data indicated by the descriptor in the data storage space can be determined.
In a possible implementation manner, when the operand further includes a data description location for the descriptor, a data address of data corresponding to the operand in the data storage space may be determined according to the content of the descriptor and the data description location. In this way, a portion of the data (e.g., one or more data) in the tensor data indicated by the descriptor may be processed.
For example, suppose the content of the descriptor in the operand is expressed by equation (2), the offsets of the tensor data indicated by the descriptor in the data storage space are offset_x and offset_y respectively, the size is size_x × size_y, and the data description position for the descriptor included in the operand is (x_q, y_q). Then the data address PA2_(x,y) of the tensor data indicated by the descriptor in the data storage space can be determined using the following equation (6):

PA2_(x,y) = PA_start + (offset_y + y_q - 1) * ori_x + (offset_x + x_q)    (6)
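Reusing the Descriptor2D sketch from above, equations (5) and (6) translate directly into the following C++ sketch; treating PA_start as an integer base address and assuming one element per address unit are simplifications for illustration (in the disclosure this computation is completed automatically by hardware).

    // Equation (5): start address of the tensor data indicated by the descriptor.
    int start_address_eq5(const Descriptor2D& d, int PA_start) {
      return PA_start + (d.offset_y - 1) * d.ori_x + d.offset_x;
    }

    // Equation (6): address of the element at data description position (x_q, y_q).
    int element_address_eq6(const Descriptor2D& d, int PA_start, int x_q, int y_q) {
      return PA_start + (d.offset_y + y_q - 1) * d.ori_x + (d.offset_x + x_q);
    }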
the computing device of the present disclosure is described above with reference to fig. 1a and 1b, and by utilizing one or more processing circuit arrays in the computing device and based on the operating functions of the processing circuits, the computing instructions of the present disclosure are efficiently executed on the computing device to complete multithreading operations, thereby improving the execution efficiency of parallel operations and reducing the computation overhead. In addition, by using the descriptors to perform the operations for the tensor, the disclosed solution also significantly improves the access and processing efficiency of tensor data, and reduces the overhead for tensor operations.
FIG. 2a is a block diagram illustrating a computing device 100 according to another embodiment of the present disclosure. As can be seen, the computing device 100 includes a control circuit 102 in addition to the same processing circuits 104 as the computing device 80. In one embodiment, the control circuit 102 may be configured to obtain the computation instruction described above and parse it to obtain the plurality of operation instructions corresponding to the plurality of operations represented by the opcodes, for example as represented by equation (1). In another embodiment, the control circuit configures the processing circuit array according to the plurality of operation instructions to obtain the plurality of processing circuit sub-arrays, such as the processing circuit sub-arrays M1, M2, ..., Mn shown in FIG. 1a.
In one application scenario, the control circuit may include a register for storing configuration information, and the control circuit may extract corresponding configuration information according to the plurality of operation instructions and configure the processing circuit array according to the configuration information to obtain the plurality of processing circuit sub-arrays. In another application scenario, the aforementioned register or other register of the control circuit may be configured to store information about the descriptor of the present disclosure, such as an identification of the descriptor and/or the content of the descriptor, so that the descriptor may be utilized to determine the storage address of the tensor data.
In one embodiment, the control circuit may comprise one or more registers storing configuration information about the processing circuit array, and the control circuit is configured to read the configuration information from the registers according to configuration instructions and send it to the processing circuits, so that the processing circuits connect according to the configuration information. In one application scenario, the configuration information may include preset position information of the processing circuits constituting the one or more processing circuit arrays, and the position information may include, for example, coordinate information or label information of the processing circuits.
When the processing circuit array configuration forms a closed loop, the configuration information may further include looping configuration information regarding the processing circuit array forming a closed loop. Alternatively, in one embodiment, the configuration information may be carried directly by the configuration instruction instead of being read from the register. In this case, the processing circuit may be configured directly according to the position information in the received configuration instruction to form an array without a closed loop or further form an array with a closed loop with other processing circuits.
In configuring the connections to form a two-dimensional array by configuration instructions or by configuration information obtained from registers, the processing circuits located in the two-dimensional array are configured to connect with the remaining one or more of the processing circuits in the same row, column or diagonal in at least one of their row, column or diagonal directions in a predetermined two-dimensional pattern of intervals so as to form one or more closed loops. Here, the aforementioned predetermined two-dimensional spacing pattern is associated with the number of processing circuits spaced in the connection.
Further, when the connection is configured to form a three-dimensional array in accordance with the aforementioned configuration instruction or configuration information, the processing circuit arrays are connected in a loop of a three-dimensional array constituted by a plurality of layers, wherein each layer includes a two-dimensional array of a plurality of the processing circuits arranged in a row direction, a column direction, and a diagonal direction, and wherein: the processing circuits located in the three-dimensional array are configured to connect with the remaining one or more processing circuits in the same row, column, diagonal, or different layers in at least one of their row, column, diagonal, and layer directions in a predetermined three-dimensional spacing pattern so as to form one or more closed loops. Here, the predetermined three-dimensional spacing pattern is associated with the number of spaces and the number of layers of spaces between the processing circuits to be connected.
FIG. 2b is a block diagram illustrating a computing device 200 according to another embodiment of the present disclosure. As can be seen, the computing device 200 in FIG. 2b includes a storage circuit 106 in addition to the control circuit 102 and the plurality of processing circuits 104 that are the same as in the computing device 100.
In an application scenario, the storage circuit may be configured with interfaces for data transmission in multiple directions so as to be connected to the processing circuits 104, so that the data to be operated on by the processing circuits, the intermediate results obtained during an operation, and the operation results obtained after an operation can be stored accordingly. In view of the foregoing, in one application scenario, the storage circuit of the present disclosure may include a main storage module and/or a main cache module, wherein the main storage module is configured to store data used for operations in the processing circuit array and the operation results after the operations, and the main cache module is configured to cache the intermediate operation results produced during operations in the processing circuit array. In one application scenario, the aforementioned operation results and intermediate operation results may be tensors, which may be stored in the storage circuit according to the storage address determined by the descriptor of the present disclosure. Further, the storage circuit may also have an interface for data transmission with an off-chip storage medium, so that data transfer between the on-chip system and off-chip storage can be achieved.
FIG. 3 is a block diagram illustrating a computing device 300 according to yet another embodiment of the present disclosure. As can be seen, in addition to including the same control circuit 102, plurality of processing circuits 104, and storage circuit 106 as the computing device 200, the computing device 300 in FIG. 3 also includes a data operation circuit 108, which includes a pre-operation circuit 110 and a post-operation circuit 112. Based on such a hardware architecture, the pre-operation circuit 110 is configured to perform pre-processing of the input data (e.g., tensor-type data) of at least one of the operation instructions, and the post-operation circuit 112 is configured to perform post-processing of the output data (e.g., tensor-type data) of at least one operation instruction. In one embodiment, the pre-processing performed by the pre-operation circuit may include data placement and/or table lookup operations, and the post-processing performed by the post-operation circuit may include data type conversion and/or compression operations.
In one application scenario, in performing a table lookup operation, the pre-operation circuitry is configured to look up one or more tables by an index value to obtain one or more constant terms associated with the operand from the one or more tables. Additionally or alternatively, the pre-operation circuitry is configured to determine an associated index value by the operand and to look up one or more tables by the index value to obtain one or more constant terms associated with the operand from the one or more tables.
In an application scenario, the pre-operation circuit may split the operation data according to the type of the operation data and the logical address of each processing circuit, and transmit the plurality of sub-data obtained after splitting to each corresponding processing circuit in the array for operation. In another application scenario, the pre-operation circuit may select one data splicing mode from multiple data splicing modes according to the parsed instruction, so as to perform a splicing operation on the two input data. In one application scenario, the post-operation circuitry may be configured to perform compression operations on the data, including filtering the data with a mask or by comparison of a given threshold to a data size, to achieve compression of the data.
Based on the hardware architecture of FIG. 3 described above, the computing device of the present disclosure may execute computing instructions that include the aforementioned pre-processing and post-processing. Based on this, the data conversion operation of the calculation instruction as expressed in the foregoing equation (1) can be performed by the post-operation circuit described above. Two illustrative examples of computational instructions according to aspects of the present disclosure are given below:
example 1: TMUADCO ═ MULT + ADD + RELU (N) + CONVERTFP2FIX (7)
The instruction expressed in equation (7) above is a computation instruction that takes three input operands and produces one output operand, and it can be implemented by a processing circuit matrix according to the present disclosure by means of a three-stage pipelined operation (i.e., multiply + add + activate). Specifically, the ternary operation is A * B + C, where the MULT microinstruction performs the multiplication between operands A and B to obtain a product value, i.e., the first-stage pipeline operation. Next, the ADD microinstruction performs the addition of the product value and C to obtain the sum "N", i.e., the second-stage pipeline operation. The activation operation RELU, i.e., the third-stage pipeline operation, is then performed on that result. After the three-stage pipelined operation, the microinstruction CONVERTFP2FIX may finally be executed by the post-operation circuit described above to convert the type of the activated result data from a floating-point number to a fixed-point number, to be output as the final result or fed as an intermediate result to a fixed-point operator for further computation.
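A scalar C++ sketch of equation (7)'s data path follows: three pipeline stages (MULT, ADD, RELU) and the post-operation float-to-fixed conversion. The Q7.8 output format chosen for CONVERTFP2FIX is an assumption made purely for illustration.

    #include <algorithm>
    #include <cmath>
    #include <cstdint>

    std::int16_t tmuadco(float a, float b, float c) {
      float product   = a * b;                // stage 1: MULT
      float n         = product + c;          // stage 2: ADD  -> the sum "N"
      float activated = std::max(0.0f, n);    // stage 3: RELU(N)
      // post-operation circuit: CONVERTFP2FIX, here to a 16-bit value
      // with 8 fraction bits (an assumed fixed-point format)
      return static_cast<std::int16_t>(std::lround(activated * 256.0f));
    }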
Example 2: TSEADMUAD = SEARCHADD + MULT + ADD    (8)
The instruction expressed in equation (8) above is a computation instruction that takes three input operands and produces one output operand, and it includes microinstructions that may be performed by a processing circuit matrix according to the present disclosure by means of a two-stage pipelined operation (i.e., multiply + add). Specifically, the ternary operation is A * B + C, where the microinstruction SEARCHADD may be completed by the pre-operation circuit to obtain the lookup result A. The multiplication between operands A and B is then performed by the first-stage pipeline operation to obtain the product value. Here, the operands A and B and the product value may be tensors read and saved according to descriptors of the present disclosure. Next, the ADD microinstruction performs the addition of the product value and C to obtain the sum "N", i.e., the second-stage pipeline operation. Likewise, when the aforementioned sum is a tensor, it can also be saved according to the storage address determined by the descriptor of the present disclosure.
As described above, the computing instruction of the present disclosure can be flexibly designed and determined according to the requirement of computing, so that the hardware architecture including a plurality of processing circuit sub-matrices of the present disclosure can be designed and connected according to the computing instruction and the operation specifically performed by the computing instruction, thereby improving the execution efficiency of the instruction and reducing the computing overhead.
FIG. 4 is an example block diagram illustrating an array of various types of processing circuits of a computing device 400 according to an embodiment of this disclosure. As can be seen from the figure, the computing device 400 shown in FIG. 4 has an architecture similar to that of the computing device 300 shown in FIG. 3, so the description of the computing device 300 in FIG. 3 also applies to the same details shown in FIG. 4 and is not repeated here.
As can be seen in FIG. 4, the plurality of processing circuits may include, for example, a plurality of first-type processing circuits 104-1 and a plurality of second-type processing circuits 104-2 (distinguished by different background colors in the figure). The plurality of processing circuits may be arranged through physical connections to form a two-dimensional array. For example, as shown in the figure, the two-dimensional array contains M rows and N columns (denoted M × N) of first-type processing circuits, where M and N are positive integers greater than 0. The first-type processing circuits may be used to perform arithmetic and logical operations, which may include, for example, linear operations such as addition, subtraction and multiplication, comparison operations, and non-linear operations, or any combination of the foregoing. Further, on the left and right sides of the periphery of the M × N array of first-type processing circuits there are two columns each of second-type processing circuits (M × 2 + M × 2 circuits), and on the lower side of the periphery there are two rows of second-type processing circuits (N × 2 + 8 circuits), so that the processing circuit array has (M × 2 + M × 2 + N × 2 + 8) second-type processing circuits in total. In one embodiment, the second-type processing circuits may be adapted to perform non-linear operations on the received data, such as comparison operations, table lookup operations, or shift operations. In one or more embodiments, the first-type processing circuits may form a first processing circuit sub-array of the present disclosure, and the second-type processing circuits may form a second processing circuit sub-array of the present disclosure, for performing multi-threaded operations. In one scenario, when the multi-threaded operation involves a plurality of operation instructions and those instructions form one multi-stage pipelined operation, the first processing circuit sub-array may perform several stages of the multi-stage pipelined operation while the second processing circuit sub-array performs the other stages. In another scenario, when the multi-threaded operation involves a plurality of operation instructions and those instructions constitute two multi-stage pipelined operations, the first processing circuit sub-array may perform the first multi-stage pipelined operation and the second processing circuit sub-array may perform the second.
In some application scenarios, the memory circuits employed by the first-type and second-type processing circuits may differ in storage size and storage manner. For example, a predicate storage circuit in a first-type processing circuit may store predicate information using a plurality of numbered registers; the first-type processing circuit may then access the predicate information in the correspondingly numbered register according to the register number specified in the received parsed instruction. As another example, a second-type processing circuit may store its predicate information in a static random access memory ("SRAM"). Specifically, the second-type processing circuit may determine the storage address of the predicate information in the SRAM according to the offset of the predicate information's location specified in the received parsed instruction, and may perform a predetermined read or write operation on the predicate information at that storage address. A sketch of these two schemes follows.
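The following C++ sketch illustrates the two predicate-storage schemes; the register count, the SRAM size, and the byte-per-predicate encoding are all assumptions for illustration, not the patent's concrete layout.

    #include <array>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // First-type processing circuit: predicates in numbered registers,
    // addressed by the register number carried in the parsed instruction.
    struct FirstTypePredicates {
      std::array<std::uint8_t, 16> regs{};
      bool read(unsigned reg_no) const { return regs.at(reg_no) != 0; }
    };

    // Second-type processing circuit: predicates in SRAM, addressed by the
    // offset carried in the parsed instruction.
    struct SecondTypePredicates {
      std::vector<std::uint8_t> sram = std::vector<std::uint8_t>(1024);
      bool read(std::size_t offset) const { return sram.at(offset) != 0; }
      void write(std::size_t offset, bool v) { sram.at(offset) = v ? 1 : 0; }
    };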
Figs. 5a, 5b, 5c and 5d are schematic diagrams illustrating various connection relationships of processing circuits according to embodiments of the present disclosure. As previously mentioned, the processing circuits of the present disclosure may be connected in a hard-wired manner, or logically connected according to configuration instructions, thereby forming the topology of a one-dimensional or multi-dimensional array. When a plurality of processing circuits are connected in a multi-dimensional array, the multi-dimensional array may be a two-dimensional array, and a processing circuit located in the two-dimensional array may be connected, in at least one of its row, column or diagonal directions, with the remaining one or more processing circuits in the same row, column or diagonal according to a predetermined two-dimensional spacing pattern, where the predetermined two-dimensional spacing pattern may be associated with the number of processing circuits spaced apart in the connection. Figs. 5a to 5c illustrate various forms of two-dimensional array topologies between a plurality of processing circuits.
As shown in fig. 5a, five processing circuits (each represented by a box) are connected to form a simple two-dimensional array. Specifically, with one processing circuit as the center of the two-dimensional array, one processing circuit is connected in each of the four horizontal and vertical directions, thereby forming a two-dimensional array of three rows and three columns. Further, since the processing circuit at the center of the two-dimensional array is directly connected to the adjacent processing circuits in the previous and next columns of the same row, and to the adjacent processing circuits in the previous and next rows of the same column, the number of processing circuits spaced between connected circuits (simply referred to as the "interval number") is 0.
As shown in fig. 5b, four rows and four columns of processing circuits may be connected to form a two-dimensional Torus array, in which each processing circuit is connected to the adjacent processing circuits in the previous and next rows and the previous and next columns, i.e., the interval number between connected adjacent processing circuits is 0. Further, the first processing circuit in each row or column of the two-dimensional Torus array is also connected to the last processing circuit in that row or column, with an interval number of 2 between these end-to-end connected processing circuits.
As shown in fig. 5c, four rows and four columns of processing circuits may be connected to form a two-dimensional array in which the interval number between adjacent processing circuits is 0 and the interval number between non-adjacent processing circuits is 1. Specifically, processing circuits adjacent to each other in the same row or column of the two-dimensional array are directly connected (interval number 0), while non-adjacent processing circuits in the same row or column are connected with an interval number of 1. It can be seen that when a plurality of processing circuits are connected to form a two-dimensional array, different interval numbers may exist between processing circuits in the same row or column, as shown in figs. 5b and 5c. Similarly, in some scenarios, processing circuits in the diagonal direction may be connected with different interval numbers.
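As an illustration of the wrap-around connectivity of fig. 5b, the following Python sketch computes the neighbors of a circuit in a two-dimensional Torus array; the function name and array sizes are assumptions for demonstration only.

def torus_neighbors(r, c, rows, cols):
    """Fig. 5b: each circuit connects to the adjacent circuits in its row and
    column, and the first and last circuit of every row/column are also joined
    (wrap-around), which is exactly modular arithmetic on the indices."""
    return [((r - 1) % rows, c), ((r + 1) % rows, c),
            (r, (c - 1) % cols), (r, (c + 1) % cols)]

print(torus_neighbors(0, 0, 4, 4))  # [(3, 0), (1, 0), (0, 3), (0, 1)]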
As shown in fig. 5d, four two-dimensional Torus arrays as shown in fig. 5b may be arranged at predetermined intervals and connected to form a three-dimensional Torus array. On the basis of the two-dimensional Torus array, the three-dimensional Torus array is connected between layers using a spacing scheme similar to that between rows and columns. For example, the processing circuits of adjacent layers in the same row and column are first connected directly (interval number 0). Then, the processing circuits of the first and last layers in the same row and column are connected (interval number 2). A three-dimensional Torus array of four layers, four rows and four columns is thus finally formed.
From the above examples, those skilled in the art will appreciate that connection relationships of other multi-dimensional arrays of processing circuits may be formed by adding new dimensions and increasing the number of processing circuits on the basis of a two-dimensional array. In some application scenarios, aspects of the present disclosure may also configure logical connections between processing circuits through the use of configuration instructions. In other words, although hard-wired connections may exist between processing circuits, aspects of the present disclosure may selectively connect some processing circuits, or selectively bypass others, through configuration instructions so as to form one or more logical connections. In some embodiments, the aforementioned logical connections may also be adjusted according to the requirements of the actual operation (e.g., conversion of data types). Further, aspects of the present disclosure may configure the connections of the processing circuits for different computational scenarios, including, for example, connection as a matrix or as one or more closed computational loops.
Figs. 6a, 6b, 6c and 6d are schematic diagrams illustrating further connection relationships of processing circuits according to embodiments of the present disclosure. As can be seen, figs. 6a to 6d show further exemplary connection relationships for the multi-dimensional arrays formed by the plurality of processing circuits shown in figs. 5a to 5d. In view of this, the technical details described in connection with figs. 5a to 5d also apply to what is shown in figs. 6a to 6d.
As shown in fig. 6a, the two-dimensional array includes a central processing circuit located at the center of the array and three processing circuits connected in each of the four directions in the same row and the same column as the central processing circuit. Accordingly, the interval numbers of the connections between the central processing circuit and the remaining processing circuits are 0, 1 and 2, respectively. As shown in fig. 6b, the two-dimensional array includes a central processing circuit located at the center of the array, three processing circuits in each of the two opposite directions in the same row as the central processing circuit, and one processing circuit in each of the two opposite directions in the same column. Accordingly, the interval numbers of the connections between the central processing circuit and the processing circuits in the same row are 0 and 2, respectively, while the interval number of the connections between the central processing circuit and the processing circuits in the same column is 0.
As previously illustrated in connection with fig. 5d, the multi-dimensional array formed by the plurality of processing circuits may be a three-dimensional array made up of a plurality of layers. Each layer of the three-dimensional array may comprise a two-dimensional array of a plurality of the processing circuits arranged in its row and column directions. A processing circuit located in the three-dimensional array may be connected, in at least one of its row, column, diagonal and layer directions, with the remaining one or more processing circuits in the same row, same column, same diagonal or a different layer according to a predetermined three-dimensional spacing pattern. Further, the predetermined three-dimensional spacing pattern may be related to the number of processing circuits spaced apart in the connection and to the number of spaced layers. The connections of the three-dimensional array are further described with reference to figs. 6c and 6d.
Figure 6c shows a three-dimensional array of multiple layers, rows and columns formed by connecting a plurality of processing circuits. Take the processing circuit located at the l-th layer, r-th row and c-th column (denoted as (l, r, c)) as an example: it is located at the center of the array and is connected, respectively, to the processing circuits of the previous column (l, r, c-1) and the next column (l, r, c+1) in the same row of the same layer, to the processing circuits of the previous row (l, r-1, c) and the next row (l, r+1, c) in the same column of the same layer, and to the processing circuits of the previous layer (l-1, r, c) and the next layer (l+1, r, c) in the same row and column of the adjacent layers. Further, the interval numbers at which the processing circuit at (l, r, c) is connected to the other processing circuits in the row, column and layer directions are all 0.
Fig. 6d shows a three-dimensional array in which the interval numbers of the connections between processing circuits in the row, column and layer directions are all 1. Taking the processing circuit located at the central position (l, r, c) of the array as an example, it is connected to the processing circuits at (l, r, c-2) and (l, r, c+2), two columns before and after it in the same row of the same layer, and to the processing circuits at (l, r-2, c) and (l, r+2, c), two rows before and after it in the same column of the same layer. Further, it is connected to the processing circuits at (l-2, r, c) and (l+2, r, c), two layers before and after it in the same row and column. Similarly, among the remaining processing circuits of the same layer, those spaced one column apart are connected to each other: the processing circuits at (l, r, c-3) and (l, r, c-1), and likewise those at (l, r, c+1) and (l, r, c+3). Then, the processing circuits at (l, r-3, c) and (l, r-1, c), spaced one row apart in the same column of the same layer, are connected to each other, as are those at (l, r+1, c) and (l, r+3, c). In addition, the processing circuits at (l-3, r, c) and (l-1, r, c), spaced one layer apart in the same row and column, are connected to each other, as are those at (l+1, r, c) and (l+3, r, c).
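The interval-based neighbor rule of figs. 6c and 6d can be sketched in a few lines of Python. The function below is an illustrative model, not the disclosed hardware; an interval number k corresponds to a coordinate step of k + 1.

def neighbors(l, r, c, interval, dims):
    """Coordinates connected to the processing circuit at (l, r, c) in the
    layer, row and column directions when `interval` circuits are skipped:
    interval 0 connects adjacent circuits (fig. 6c), interval 1 skips one
    circuit in each direction (fig. 6d)."""
    step = interval + 1
    L, R, C = dims
    out = []
    for dl, dr, dc in [(step, 0, 0), (-step, 0, 0), (0, step, 0),
                       (0, -step, 0), (0, 0, step), (0, 0, -step)]:
        nl, nr, nc = l + dl, r + dr, c + dc
        if 0 <= nl < L and 0 <= nr < R and 0 <= nc < C:
            out.append((nl, nr, nc))
    return out

# interval 1: (l±2, r, c), (l, r±2, c), (l, r, c±2), as described for fig. 6d
print(neighbors(2, 2, 2, 1, (8, 8, 8)))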
The connection relationships of the multi-dimensional arrays formed by the plurality of processing circuits have been exemplarily described above; the different loop structures formed by the plurality of processing circuits are further exemplarily described below with reference to figs. 7 and 8.
Figs. 7a, 7b, 7c and 7d are schematic diagrams respectively illustrating various loop structures of processing circuits according to embodiments of the present disclosure. Depending on the application scenario, the processing circuits may be connected not only according to their physical connection relationships, but also according to logical relationships configured by the received parsed instructions. The plurality of processing circuits may be configured to be connected using such logical connection relationships so as to form a closed loop.
As shown in fig. 7a, four adjacent processing circuits are numbered sequentially as 0, 1, 2 and 3. The four processing circuits are connected in sequence in the clockwise direction starting from processing circuit 0, and processing circuit 3 is connected back to processing circuit 0, so that the four processing circuits are connected in series to form a closed loop (simply referred to as a "loop"). In this loop, the interval number between processing circuits is 0 or 2; e.g., the interval number between processing circuits 0 and 1 is 0, while the interval number between processing circuits 3 and 0 is 2. Further, the physical addresses (which may also be referred to as physical coordinates in the context of the present disclosure) of the four processing circuits in the illustrated loop may be represented as 0-1-2-3, and their logical addresses (which may also be referred to as logical coordinates in the context of the present disclosure) may likewise be represented as 0-1-2-3. It should be noted that the connection sequence shown in fig. 7a is only exemplary and not limiting; those skilled in the art may also connect the four processing circuits in series in the counterclockwise direction to form a closed loop, according to actual calculation requirements.
In some practical scenarios, when the data bit width supported by one processing circuit cannot meet the bit width requirement of the operation data, a plurality of processing circuits may be combined into one processing circuit group to represent one piece of data. For example, assume that one processing circuit can process 8-bit data. When 32-bit data needs to be processed, 4 processing circuits may be combined into one processing circuit group, so that 4 pieces of 8-bit data are concatenated to form one piece of 32-bit data. Further, one processing circuit group formed of the aforementioned 4 8-bit processing circuits can serve as one processing circuit 104 shown in fig. 7b, so that operations of a higher bit width can be supported.
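Viewed in software terms, grouping four 8-bit lanes into one 32-bit value is a simple shift-and-OR composition, as the following illustrative Python sketch shows; the function name and lane order (lowest lane first) are assumptions.

def combine_lanes(lanes):
    """Fig. 7b context: view four 8-bit values (lowest lane first) as one
    32-bit value, modeling a processing circuit group of 4 8-bit circuits."""
    assert len(lanes) == 4 and all(0 <= v < 256 for v in lanes)
    value = 0
    for i, v in enumerate(lanes):
        value |= v << (8 * i)  # lane i contributes bits [8*i, 8*i+8)
    return value

print(hex(combine_lanes([0x78, 0x56, 0x34, 0x12])))  # 0x12345678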
As can be seen from fig. 7b, the layout of the processing circuits shown is similar to that of fig. 7a, but the interval numbers of the connections between the processing circuits differ. Fig. 7b shows four processing circuits numbered sequentially 0, 1, 2 and 3, connected in the clockwise direction starting from processing circuit 0 in the order processing circuit 1, processing circuit 3 and processing circuit 2, with processing circuit 2 connected back to processing circuit 0 to form a closed loop in series. As can be seen from this loop, the interval number between the processing circuits shown in fig. 7b is 0 or 1; e.g., the interval number between processing circuits 0 and 1 is 0, while that between processing circuits 1 and 3 is 1. Further, the physical addresses of the four processing circuits in the closed loop shown may be 0-1-2-3, while in the illustrated loop their logical addresses may be represented as 0-1-3-2. Thus, when data of a high bit width needs to be split and allocated to different processing circuits, the data order can be rearranged and allocated according to the logical addresses of the processing circuits.
The splitting and rearranging operations described above may be performed by the pre-operation circuit described in connection with fig. 3. In particular, the pre-operation circuit may rearrange the input data according to the physical and logical addresses of the plurality of processing circuits so as to satisfy the requirements of the data operation. Assuming that four sequentially arranged processing circuits 0 to 3 are connected as shown in fig. 7a, since both the physical addresses and the logical addresses of the connection are 0-1-2-3, the pre-operation circuit may transfer the input data (e.g., pixel data) aa0, aa1, aa2 and aa3 to the corresponding processing circuits in sequence. However, when the four processing circuits are connected as shown in fig. 7b, their physical addresses remain 0-1-2-3 while their logical addresses become 0-1-3-2; in this case the pre-operation circuit needs to rearrange the input data aa0, aa1, aa2 and aa3 into aa0-aa1-aa3-aa2 for transmission to the corresponding processing circuits. Based on this rearrangement of the input data, the disclosed scheme can ensure the correctness of the data operation order. Similarly, if the sequence of the four operation output results (e.g., pixel data) obtained as described above is bb0-bb1-bb3-bb2, the sequence can be restored to bb0-bb1-bb2-bb3 by the post-operation circuit described in conjunction with fig. 2, ensuring consistency of arrangement between the input data and the output result data.
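The rearrangement and restoration just described are plain permutations, as the following illustrative Python sketch shows. The function names are assumptions; logical_addrs[p] denotes the logical address of the circuit at physical position p (fig. 7b: [0, 1, 3, 2]).

def to_logical_order(data, logical_addrs):
    """Pre-operation step: reorder the input so that the circuit at physical
    position p receives the datum numbered logical_addrs[p]."""
    return [data[a] for a in logical_addrs]

def from_logical_order(results, logical_addrs):
    """Post-operation step: apply the inverse permutation to restore the
    original order of the output results."""
    out = [None] * len(results)
    for p, a in enumerate(logical_addrs):
        out[a] = results[p]
    return out

addrs = [0, 1, 3, 2]
print(to_logical_order(["aa0", "aa1", "aa2", "aa3"], addrs))    # aa0, aa1, aa3, aa2
print(from_logical_order(["bb0", "bb1", "bb3", "bb2"], addrs))  # bb0, bb1, bb2, bb3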
Figs. 7c and 7d show further processing circuits arranged and connected in different ways to form closed loops. As shown in fig. 7c, the 16 processing circuits 104 numbered sequentially 0, 1, …, 15 are connected and combined two at a time, starting from processing circuit 0, to form processing circuit groups (i.e., processing circuit sub-arrays of the present disclosure). For example, as shown in the figure, processing circuit 0 is connected with processing circuit 1 to form one processing circuit group. By analogy, processing circuit 14 is connected with processing circuit 15 to form one processing circuit group, finally yielding eight processing circuit groups. Further, the eight processing circuit groups may also be connected in a manner similar to the connection of the processing circuits described above, including being connected according to, for example, predetermined logical addresses to form a closed loop of processing circuit groups.
As shown in fig. 7d, the plurality of processing circuits 104 are connected in an irregular or non-uniform manner to form a processing circuit matrix having a closed loop. In particular, fig. 7d shows that the processing circuits may be connected with interval numbers of 0 or 3 to form a closed loop; for example, processing circuit 0 may be connected to processing circuit 1 (interval number 0) and to processing circuit 4 (interval number 3), respectively.
As will be appreciated from the above description in connection with figs. 7a to 7d, the processing circuits of the present disclosure may be connected into a closed loop across different numbers of intervening processing circuits. When the total number of processing circuits changes, the number of intervening circuits can be dynamically configured to any value so as to connect the circuits into a closed loop. It is also possible to combine a plurality of processing circuits into processing circuit groups and connect them into a closed loop of processing circuit groups. In addition, the connection of the plurality of processing circuits may be a hard connection configured by hardware or a soft connection configured by software.
Figures 8a, 8b and 8c are schematic diagrams illustrating additional loop structures of processing circuits according to embodiments of the present disclosure. A plurality of processing circuits as described in connection with fig. 6 may form a closed loop, and each processing circuit in the closed loop may be configured with a respective logical address. Further, the pre-operation circuit described in conjunction with fig. 2 may be configured to split the operation data according to its type (e.g., 32-bit, 16-bit or 8-bit data) and the logical addresses, and to transfer the multiple pieces of sub-data obtained after splitting to the corresponding processing circuits in the loop for subsequent operations.
The upper diagram in fig. 8a shows four processing circuits connected to form a closed loop, whose physical addresses in right-to-left order may be denoted 0-1-2-3. The lower diagram of fig. 8a shows that the logical addresses of the four processing circuits in this loop, in right-to-left order, are 0-3-1-2. For example, the processing circuit shown with logical address "3" in the lower diagram of fig. 8a has physical address "1" in the upper diagram of fig. 8a.
In some application scenarios, assume that the granularity of the operation data is the lower 128 bits of the input data, such as the original sequence "15, 14, …, 2, 1, 0" in the figure (each number corresponds to one piece of 8-bit data), and that the logical addresses of the 16 pieces of 8-bit data are numbered 0 to 15 from low to high. Further, according to the logical addresses shown in the lower diagram of fig. 8a, the pre-operation circuit may encode or arrange the data of different logical addresses according to the different data types.
When the processing circuits operate on a data bit width of 32 bits, the 4 numbers with logical addresses (3,2,1,0), (7,6,5,4), (11,10,9,8) and (15,14,13,12) represent the 0th to 3rd pieces of 32-bit data, respectively. The pre-operation circuit may transfer the 0th piece of 32-bit data to the processing circuit with logical address "0" (physical address "0"), the 1st piece to the processing circuit with logical address "1" (physical address "2"), the 2nd piece to the processing circuit with logical address "2" (physical address "3"), and the 3rd piece to the processing circuit with logical address "3" (physical address "1"). This rearrangement of the data satisfies the subsequent operation requirements of the processing circuits. The mapping relationship between the logical and physical addresses of the final data is therefore (15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0) -> (11,10,9,8,7,6,5,4,15,14,13,12,3,2,1,0).
When the processing circuits operate on a data bit width of 16 bits, the 8 numbers with logical addresses (1,0), (3,2), (5,4), (7,6), (9,8), (11,10), (13,12) and (15,14) represent the 0th to 7th pieces of 16-bit data, respectively. The pre-operation circuit may transfer the 0th and 4th pieces of 16-bit data to the processing circuit with logical address "0" (physical address "0"), the 1st and 5th pieces to the processing circuit with logical address "1" (physical address "2"), the 2nd and 6th pieces to the processing circuit with logical address "2" (physical address "3"), and the 3rd and 7th pieces to the processing circuit with logical address "3" (physical address "1"). The mapping relationship between the logical and physical addresses of the final data is therefore: (15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0) -> (13,12,5,4,11,10,3,2,15,14,7,6,9,8,1,0).
When the processing circuits operate on a data bit width of 8 bits, the 16 numbers with logical addresses 0 to 15 represent the 0th to 15th pieces of 8-bit data, respectively. According to the connection shown in fig. 8a, the pre-operation circuit may transfer the 0th, 4th, 8th and 12th pieces of 8-bit data to the processing circuit with logical address "0" (physical address "0"); the 1st, 5th, 9th and 13th pieces to the processing circuit with logical address "1" (physical address "2"); the 2nd, 6th, 10th and 14th pieces to the processing circuit with logical address "2" (physical address "3"); and the 3rd, 7th, 11th and 15th pieces to the processing circuit with logical address "3" (physical address "1"). The mapping relationship between the logical and physical addresses of the final data is therefore: (15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0) -> (14,10,6,2,13,9,5,1,15,11,7,3,12,8,4,0).
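The three mappings above can be reproduced by a short Python model of the round-robin distribution; the function and parameter names are illustrative assumptions, not the disclosed circuit.

def distribute(num_bytes=16, width=4, log2phys=(0, 2, 3, 1)):
    """Model of the pre-operation circuit's distribution over a loop of
    circuits: byte i of the input carries the number i, `width` is the byte
    width of one datum, and log2phys[k] is the physical address of the
    circuit with logical address k (fig. 8a: logical 0-1-2-3 at physical
    0-2-3-1)."""
    n = len(log2phys)
    phys = [[] for _ in range(n)]                    # per-circuit bytes, low to high
    for d in range(num_bytes // width):              # datum d = bytes [d*width, (d+1)*width)
        phys[log2phys[d % n]].extend(range(d * width, (d + 1) * width))
    flat = [b for circuit in phys for b in circuit]  # physical layout, low byte first
    return list(reversed(flat))                      # printed high byte first, as in the text

print(distribute(width=4))  # [11,10,9,8,7,6,5,4,15,14,13,12,3,2,1,0]
print(distribute(width=2))  # [13,12,5,4,11,10,3,2,15,14,7,6,9,8,1,0]
print(distribute(width=1))  # [14,10,6,2,13,9,5,1,15,11,7,3,12,8,4,0]

With num_bytes and log2phys changed accordingly, the same round-robin rule reproduces the layouts of figs. 8b and 8c discussed below.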
The upper diagram in fig. 8b shows eight sequentially numbered processing circuits 0 to 7 connected to form a closed loop, the eight processing circuits having physical addresses 0-1-2-3-4-5-6-7. The lower diagram of fig. 8b shows the logical addresses of these eight processing circuits as 0-7-1-6-2-5-3-4. For example, the processing circuit shown in the upper diagram of fig. 8b with physical address "6" corresponds to logical address "3" in the lower diagram.
The operation shown in fig. 8b of rearranging data for different data types and then transferring it to the corresponding processing circuits is similar to that shown in fig. 8a, so the technical solution described in conjunction with fig. 8a also applies to fig. 8b, and the data rearrangement process is not repeated here. Further, the connection relationship of the processing circuits shown in fig. 8b is similar to that of fig. 8a, but fig. 8b shows eight processing circuits, twice as many as in fig. 8a. Thus, in application scenarios operating on different data types, the granularity of the operation data described in connection with fig. 8b may be twice that described in connection with fig. 8a. The granularity of the operation data in this example may therefore be the lower 256 bits of the input data, as opposed to the lower 128 bits in the previous example, such as the original data sequence "31, 30, …, 2, 1, 0" shown in the figure, in which each number corresponds to 8 bits of data.
With respect to the above-mentioned original data sequence, when the bit widths of the data operated on by the processing circuits are 32, 16 and 8 bits, respectively, the arrangement results of the data in the looped processing circuits are likewise shown in the figure. For example, when the bit width of the data to be operated on is 32 bits, the 1 piece of 32-bit data in the processing circuit with logical address "1" is (7,6,5,4), and the corresponding physical address of that processing circuit is "2". When the bit width is 16 bits, the 2 pieces of 16-bit data in the processing circuit with logical address "3" are (23,22,7,6), and the corresponding physical address is "6". When the bit width is 8 bits, the 4 pieces of 8-bit data in the processing circuit with logical address "6" are (30,22,14,6), and the corresponding physical address is "3".
The above description addresses data operations on different data types for the case, shown in figs. 8a and 8b, where a plurality of processing circuits of a single type (e.g., the first type processing circuit shown in fig. 3) are connected to form a closed loop. Data operations on different data types for the case, shown in fig. 8c, where a plurality of processing circuits of different types (such as the first type and second type processing circuits shown in fig. 4) are connected to form a closed loop are further described below.
The upper diagram in figure 8c shows twenty processing circuits of multiple types, numbered sequentially 0, 1, …, 19, connected to form a closed loop (the numbers being the physical addresses of the processing circuits shown in the diagram). The sixteen processing circuits numbered 0 through 15 are first type processing circuits (forming a first processing circuit sub-array of the present disclosure), and the four processing circuits numbered 16 through 19 are second type processing circuits (forming a second processing circuit sub-array of the present disclosure). Similarly, the physical address of each of the twenty processing circuits has a mapping relationship with the logical address of the corresponding processing circuit shown in the lower diagram of fig. 8c.
Further, when operating on different data types, for example on the original sequence of 80 pieces of 8-bit data shown in the figure, fig. 8c also shows the results of operating on the aforementioned original data for the different data types supported by the processing circuits. For example, when the bit width of the data to be operated on is 32 bits, the 1 piece of 32-bit data in the processing circuit with logical address "1" is (7,6,5,4), and the corresponding physical address of that processing circuit is "2". When the bit width is 16 bits, the 2 pieces of 16-bit data in the processing circuit with logical address "11" are (63,62,23,22), and the corresponding physical address is "9". When the bit width is 8 bits, the 4 pieces of 8-bit data in the processing circuit with logical address "17" are (77,57,37,17), and the corresponding physical address is "18".
Figs. 9a, 9b, 9c and 9d are schematic diagrams illustrating data splicing operations performed by the pre-processing circuit according to embodiments of the present disclosure. As previously mentioned, the pre-processing circuit described in connection with fig. 2 of the present disclosure may be further configured to select a data splicing mode from a plurality of data splicing modes according to the parsed instruction, so as to perform a splicing operation on two pieces of input data. With respect to the multiple data splicing modes, in one embodiment the disclosed scheme forms different modes by dividing the two pieces of data to be spliced into minimum data units and numbering them, and then extracting different minimum data units of the data based on a specified rule. For example, the extraction and splicing may be performed alternately based on the parity of the numbers, or on whether the numbers are integer multiples of a specified number, thereby forming different data splicing modes. Depending on the calculation scenario (e.g., different data bit widths), the minimum data unit here may simply be 1 bit of data, or data of 2, 4, 8, 16 or 32 bits in length. Further, when extracting the differently numbered portions of the two pieces of data, the scheme of the present disclosure may extract them alternately by single minimum data units, or by multiples of the minimum data unit, for example alternately extracting partial data of two or three minimum data units at a time from the two pieces of data as a group, to be spliced group by group.
Based on the above description of the data splicing modes, the data splicing modes of the present disclosure will be exemplarily explained with specific examples in conjunction with figs. 9a to 9c. In the illustrated diagrams, the input data are In1 and In2, and since each square in the diagrams represents one minimum data unit, both pieces of input data have a bit width of 8 minimum data units. As previously described, the minimum data unit may represent a different number of bits for data of different bit widths. For example, for data with a bit width of 8 bits, the minimum data unit represents 1 bit of data, and for data with a bit width of 16 bits, it represents 2 bits of data. For another example, for data with a bit width of 32 bits, the minimum data unit represents 4 bits of data.
As shown in fig. 9a, the two pieces of input data to be spliced, In1 and In2, are each composed of eight minimum data units numbered 1, 2, …, 8 sequentially from right to left. Data splicing is performed according to an odd-even interleaving principle, with numbers from small to large, In1 before In2, and odd numbers before even numbers. Specifically, when the data bit width of the operation is 8 bits, the data In1 and In2 each represent one piece of 8-bit data, and each minimum data unit represents 1 bit of data (i.e., one square represents 1 bit). According to this bit width and the splicing principle, the minimum data units of In1 numbered 1, 3, 5 and 7 are extracted first and arranged at the low end. Next, the four odd-numbered minimum data units of In2 are arranged in sequence. Similarly, the minimum data units of In1 numbered 2, 4, 6 and 8, and then the four even-numbered minimum data units of In2, are arranged in sequence. Finally, one piece of 16-bit data (or two pieces of 8-bit data) is formed from the 16 minimum data units, as shown by the second row of squares in fig. 9a.
As shown in fig. 9b, when the data bit width is 16 bits, the data In1 and In2 each represent one piece of 16-bit data, and each minimum data unit represents 2 bits of data (i.e., one square represents 2 bits). According to this bit width and the aforementioned interleaving principle, the minimum data units of In1 numbered 1, 2, 5 and 6 may be extracted first and arranged at the low end. Then, the minimum data units of In2 numbered 1, 2, 5 and 6 are arranged in sequence. Similarly, the minimum data units of In1 numbered 3, 4, 7 and 8, and then those of In2 with the same numbers, are arranged in sequence, so as to splice one piece of 32-bit data (or two pieces of 16-bit data) composed of the final 16 minimum data units, as shown in the second row of squares in fig. 9b.
As shown in fig. 9c, when the data bit width is 32 bits, the data In1 and In2 each represent one piece of 32-bit data, and each minimum data unit represents 4 bits of data (i.e., one square represents 4 bits). According to this bit width and the aforementioned interleaving principle, the minimum data units of In1 numbered 1, 2, 3 and 4, followed by those of In2 with the same numbers, may be extracted first and arranged at the low end. Then, the minimum data units of In1 numbered 5, 6, 7 and 8, followed by those of In2 with the same numbers, are arranged in sequence, so as to splice one piece of 64-bit data (or two pieces of 32-bit data) composed of the final 16 minimum data units.
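All three interleaving modes of figs. 9a to 9c follow one rule: groups of 1, 2 or 4 minimum data units are taken alternately, even-indexed groups first. The following Python sketch expresses that rule; the function name and stand-in unit labels are illustrative assumptions.

def splice(in1, in2, group):
    """Interleaved splicing of figs. 9a-9c: `in1`/`in2` are lists of 8 minimum
    data units (lowest unit first); `group` is the number of units per
    extraction group: 1 (fig. 9a), 2 (fig. 9b) or 4 (fig. 9c)."""
    def groups(data, parity):
        out = []
        for g in range(len(data) // group):
            if g % 2 == parity:
                out.extend(data[g * group:(g + 1) * group])
        return out
    # even-indexed groups of In1, then of In2, then odd-indexed groups of each
    return groups(in1, 0) + groups(in2, 0) + groups(in1, 1) + groups(in2, 1)

in1 = [f"a{i}" for i in range(1, 9)]  # units numbered 1..8, lowest first
in2 = [f"b{i}" for i in range(1, 9)]
print(splice(in1, in2, 1))  # a1,a3,a5,a7, b1,b3,b5,b7, a2,a4,a6,a8, b2,b4,b6,b8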
Exemplary data splicing modes of the present disclosure are described above in connection with figs. 9a to 9c. However, it will be appreciated that in some computing scenarios, data splicing does not involve the interleaved arrangement described above, but is rather a simple concatenation of the two pieces of data in which the respective original data positions are maintained, as shown in fig. 9d. As can be seen from fig. 9d, the two pieces of data In1 and In2 are not interleaved as in figs. 9a to 9c; instead, only the last minimum data unit of In1 and the first minimum data unit of In2 are connected in series, thereby obtaining a new data type with increased (e.g., doubled) bit width. In some scenarios, the disclosed scheme may also perform group splicing based on data attributes. For example, neuron data or weight data having the same feature map may be grouped and arranged to form a continuous portion of the spliced data.
Figs. 10a, 10b and 10c are schematic diagrams illustrating data compression operations performed by the post-processing circuit according to embodiments of the present disclosure. The compression operation may include screening the data with a mask, or compressing by comparing the data against a given threshold. For data compression operations, the data may likewise be divided into minimum data units and numbered, as previously described. Similar to what was described in connection with figs. 9a to 9d, the minimum data unit may be, for example, 1 bit of data, or data of 2, 4, 8, 16 or 32 bits in length. Different data compression modes are exemplarily described below in conjunction with figs. 10a to 10c.
As shown in fig. 10a, the original data is composed of eight squares (i.e., eight minimum data units) numbered 1, 2, …, 8 sequentially from right to left, assuming that each minimum data unit represents 1 bit of data. When performing a data compression operation according to a mask, the post-processing circuit may screen the original data with the mask. In one embodiment, the bit width of the mask corresponds to the number of minimum data units of the original data. For example, if the original data has 8 minimum data units, the mask bit width is 8 bits; the minimum data unit numbered 1 corresponds to the least significant bit of the mask, the unit numbered 2 to the second least significant bit, and so on, with the unit numbered 8 corresponding to the most significant bit. In one application scenario, when the 8-bit mask is "10010011", the compression principle may be set to extract the minimum data units of the original data corresponding to the mask bits that are "1". Here, the numbers of the minimum data units corresponding to mask bits of "1" are 1, 2, 5 and 8. Thus, the minimum data units numbered 1, 2, 5 and 8 may be extracted and arranged in order from lower to higher numbers to form the compressed new data, as shown in the second row of fig. 10a.
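A minimal Python sketch of the mask-based screening follows, together with the threshold mode of fig. 10c described below; the function names and the stand-in unit values are assumptions for illustration only.

def mask_compress(units, mask):
    """Fig. 10a: keep the units whose mask bit is 1; unit k (numbered from 1)
    corresponds to bit k-1 of the mask, and kept units are packed from lower
    to higher numbers."""
    return [u for k, u in enumerate(units) if (mask >> k) & 1]

def threshold_compress(units, threshold):
    """Fig. 10c: keep the units whose value is greater than or equal to the
    threshold."""
    return [u for u in units if u >= threshold]

units = list(range(1, 9))                # stand-in values for units numbered 1..8
print(mask_compress(units, 0b10010011))  # units 1, 2, 5, 8 -> [1, 2, 5, 8]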
Fig. 10b shows original data similar to that of fig. 10a, and as can be seen in the second row of fig. 10b, the data sequence passing through the post-processing circuit maintains the original arrangement order and content. It will thus be appreciated that the data compression of the present disclosure may also include a disabled or non-compressed mode, in which no compression operation is performed as the data passes through the post-processing circuit.
As shown in fig. 10c, the original data is composed of eight squares arranged in sequence, numbered 1, 2, …, 8 from right to left (the number above each square indicates its number), and it is assumed that each minimum data unit is 8-bit data. Further, the number in each square represents the decimal value of that minimum data unit. Taking the minimum data unit numbered 1 as an example, its decimal value is "8", and the corresponding 8-bit data is "00001000". When performing a data compression operation according to a threshold, assuming the threshold is the decimal value "8", the compression rule may be set to extract all minimum data units of the original data that are greater than or equal to the threshold "8". Thus, the minimum data units numbered 1, 4, 7 and 8 may be extracted. All the extracted minimum data units are then arranged in descending order of number to obtain the final data result, as shown in the second row of fig. 10c.

Fig. 11 is a simplified flow diagram illustrating a method 1100 of performing an arithmetic operation using a computing device according to an embodiment of the present disclosure. From the foregoing, it will be appreciated that the computing device here may be the computing device described in connection with figs. 1 (including figs. 1a and 1b) to 4, having the processing circuit connections shown in figs. 5 to 10 and supporting the various classes of operations described above.
As shown in fig. 11, at step 1110, the method 1100 receives a computation instruction at the computing device and parses it to obtain a plurality of operation instructions. In one embodiment, an operand of the computation instruction includes a descriptor indicating the shape of a tensor, the descriptor being used to determine the storage address of the data corresponding to the operand. Next, at step 1120, the method 1100 performs a multi-threaded operation with the plurality of processing circuit sub-arrays in response to receiving the plurality of operation instructions, wherein at least one processing circuit sub-array of the plurality is configured to execute at least one of the plurality of operation instructions according to the storage address.
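As a purely illustrative sketch of this control flow, the two steps can be modeled in Python as follows; all names, fields and addresses below are hypothetical stand-ins, not the disclosed instruction format.

def parse(computation_instruction):
    """Hypothetical parser: step 1110 splits one computation instruction into
    its constituent operation instructions (real decoding is hardware-specific
    and not specified here)."""
    return computation_instruction["ops"]

def storage_address(descriptor):
    """Resolve an operand's storage address from its tensor descriptor; the
    base/offset fields are illustrative stand-ins."""
    return descriptor["base"] + descriptor["offset"]

def run(computation_instruction, sub_arrays):
    # Step 1120: each operation instruction is dispatched to a processing
    # circuit sub-array together with the address resolved from its descriptor.
    for instr, sub_array in zip(parse(computation_instruction), sub_arrays):
        sub_array(instr["opcode"], storage_address(instr["descriptor"]))

run({"ops": [{"opcode": "add", "descriptor": {"base": 0x1000, "offset": 16}}]},
    [lambda op, addr: print(op, hex(addr))])  # prints: add 0x1010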
For the sake of simplicity, the calculation method of the present disclosure has been described above only in conjunction with fig. 11. Those skilled in the art will appreciate that the method may include further steps according to the disclosure herein, and that the execution of these steps may implement the various operations of the present disclosure described above in conjunction with figs. 1 to 10, which are not repeated here.
Fig. 12 is a block diagram illustrating a combined processing device 1200 according to an embodiment of the present disclosure. As shown in fig. 12, the combined processing device 1200 includes a computing processing device 1202, an interface device 1204, other processing devices 1206, and a storage device 1208. Depending on the application scenario, one or more computing devices 1210 may be included in the computing processing device and may be configured to perform the operations described herein in conjunction with figs. 1 to 11.
In various embodiments, the computing processing device of the present disclosure may be configured to perform user-specified operations. In an exemplary application, the computing processing device may be implemented as a single-core artificial intelligence processor or a multi-core artificial intelligence processor. Similarly, one or more computing devices included within the computing processing device may be implemented as an artificial intelligence processor core or as part of the hardware structure of an artificial intelligence processor core. When multiple computing devices are so implemented, the computing processing device of the present disclosure may be considered to have a single-core structure or a homogeneous multi-core structure.
In an exemplary operation, the computing processing device of the present disclosure may interact with other processing devices through the interface device to collectively perform user-specified operations. Depending on the implementation, the other processing devices of the present disclosure may include one or more types of general-purpose and/or special-purpose processors, such as central processing units (CPUs), graphics processing units (GPUs) and artificial intelligence processors. These processors may include, but are not limited to, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic, discrete hardware components, etc., and their number may be determined based on actual needs. As previously mentioned, the computing processing device of the present disclosure considered alone may be regarded as having a single-core structure or a homogeneous multi-core structure. However, when considered together, the computing processing device and the other processing devices may be regarded as forming a heterogeneous multi-core structure.
In one or more embodiments, the other processing devices can serve as an interface between the computing processing device of the present disclosure (which may be embodied as an artificial intelligence computing device, e.g., one associated with neural network operations) and external data and controls, performing basic controls including, but not limited to, data handling and the starting and/or stopping of the computing device. In further embodiments, the other processing devices may also cooperate with the computing processing device to collectively perform computational tasks.
In one or more embodiments, the interface device may be used to transfer data and control instructions between the computing processing device and the other processing devices. For example, the computing processing device may obtain input data from the other processing devices via the interface device and write it into a storage device (or memory) on the computing processing device. Further, the computing processing device may obtain control instructions from the other processing devices via the interface device and write them into an on-chip control cache of the computing processing device. Alternatively or optionally, the interface device may also read data from the storage device of the computing processing device and transmit it to the other processing devices.
Additionally or alternatively, the combined processing device of the present disclosure may further include a storage device. As shown in the figure, the storage device is connected to the computing processing device and the other processing devices, respectively. In one or more embodiments, the storage device may be used to hold data of the computing processing device and/or the other processing devices, for example, data that cannot be fully retained in the internal or on-chip storage of the computing processing device or other processing devices.
In some embodiments, the present disclosure also discloses a chip (e.g., the chip 1302 shown in fig. 13). In one implementation, the chip is a system on chip (SoC) integrating one or more combined processing devices as shown in fig. 12. The chip may be connected to other associated components through an external interface device, such as the external interface device 1306 shown in fig. 13. The associated component may be, for example, a camera, a display, a mouse, a keyboard, a network card or a Wi-Fi interface. In some application scenarios, other processing units (e.g., video codecs) and/or interface modules (e.g., DRAM interfaces) may also be integrated on the chip. In some embodiments, the present disclosure also discloses a chip package structure including the above chip. In some embodiments, the present disclosure also discloses a board card including the above chip package structure. The board card is described in detail below with reference to fig. 13.
Fig. 13 is a schematic diagram illustrating the structure of a board card 1300 according to an embodiment of the present disclosure. As shown in fig. 13, the board card includes a memory device 1304 for storing data, which includes one or more memory cells 1310. The memory device may be connected to, and transfer data with, the control device 1308 and the chip 1302 described above by means of, for example, a bus. Further, the board card also includes an external interface device 1306 configured for data relay or transfer between the chip (or a chip in the chip package structure) and an external device 1312 (such as a server or a computer). For example, the data to be processed may be transferred to the chip by the external device through the external interface device. For another example, the calculation result of the chip may be transmitted back to the external device via the external interface device. According to different application scenarios, the external interface device may take different interface forms; for example, it may adopt a standard PCIe interface.
In one or more embodiments, the control device in the disclosed board card may be configured to regulate the state of the chip. To this end, in one application scenario, the control device may include a microcontroller unit (MCU) for controlling the operating state of the chip.
From the above description in conjunction with figs. 12 and 13, it will be understood by those skilled in the art that the present disclosure also discloses an electronic device or apparatus, which may include one or more of the above board cards, one or more of the above chips and/or one or more of the above combined processing devices.
According to different application scenarios, the electronic device or apparatus of the present disclosure may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet computer, a smart terminal, a PC device, an internet-of-things terminal, a mobile terminal, a mobile phone, a dashboard camera, a navigator, a sensor, a camera, a video camera, a projector, a watch, an earphone, mobile storage, a wearable device, a visual terminal, an autonomous-driving terminal, a vehicle, a household appliance, and/or a medical device. Vehicles include airplanes, ships and/or cars; household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves and range hoods; medical devices include nuclear magnetic resonance apparatuses, B-ultrasound scanners and/or electrocardiographs. The electronic device or apparatus of the present disclosure may also be applied to fields such as the internet, the internet of things, data centers, energy, transportation, public administration, manufacturing, education, power grids, telecommunications, finance, retail, construction sites and healthcare. Further, the electronic device or apparatus disclosed herein may also be used in application scenarios related to artificial intelligence, big data and/or cloud computing, such as the cloud, the edge and the terminal. In one or more embodiments, a computationally powerful electronic device or apparatus according to the present disclosure may be applied to a cloud device (e.g., a cloud server), while a lower-power electronic device or apparatus may be applied to a terminal device and/or an edge device (e.g., a smartphone or a camera). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that appropriate hardware resources can be matched from the hardware resources of the cloud device to simulate the hardware resources of the terminal device and/or the edge device according to the hardware information of the terminal device and/or the edge device, thereby completing unified management, scheduling and cooperative work of end-cloud integration or cloud-edge-end integration.
It is noted that, for the sake of brevity, the present disclosure describes some methods and embodiments thereof as a series of acts and combinations thereof, but those skilled in the art will appreciate that the aspects of the present disclosure are not limited by the order of the acts described. Accordingly, certain steps may be performed in other orders or simultaneously, in accordance with the disclosure or teachings of the present disclosure. Further, those skilled in the art will appreciate that the embodiments described in this disclosure may be regarded as alternative embodiments, in that the acts or modules involved are not necessarily all required to practice one or more aspects of the disclosure. In addition, depending on the solution, the description of some embodiments in the present disclosure may be given different emphasis. In view of the above, those skilled in the art will understand that portions of the present disclosure not described in detail in one embodiment may be found in the descriptions of other embodiments.
In particular implementations, based on the disclosure and teachings of the present disclosure, those skilled in the art will appreciate that the several embodiments disclosed herein may also be implemented in ways not described here. For example, the units in the foregoing embodiments of the electronic device or apparatus are divided based on logical functions, and other division manners are possible in actual implementation. As another example, multiple units or components may be combined or integrated into another system, or some features or functions of a unit or component may be selectively disabled. In terms of the connections between different units or components, the connections discussed above in connection with the figures may be direct or indirect couplings between the units or components. In some scenarios, the aforementioned direct or indirect coupling involves a communication connection utilizing an interface, where the communication interface may support electrical, optical, acoustic, magnetic or other forms of signal transmission.
In the present disclosure, units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed across multiple network elements. In addition, according to actual needs, part or all of the units can be selected to achieve the purpose of the solution of the embodiment of the present disclosure. In addition, in some scenarios, multiple units in embodiments of the present disclosure may be integrated into one unit or each unit may exist physically separately.
In some implementation scenarios, the integrated units may be implemented in the form of software program modules. If implemented in the form of a software program module and sold or used as a stand-alone product, the integrated units may be stored in a computer-readable memory. On this basis, when aspects of the present disclosure are embodied in the form of a software product (e.g., a computer-readable storage medium), the software product may be stored in a memory and may include instructions for causing a computer device (e.g., a personal computer, a server or a network device, etc.) to perform some or all of the steps of the methods described in the embodiments of the present disclosure. The memory may include, but is not limited to, a USB flash disk, a flash disk, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk or an optical disc.
In other implementation scenarios, the integrated units may also be implemented in hardware, that is, as specific hardware circuits, which may include digital circuits and/or analog circuits, etc. The physical implementation of the hardware structure of a circuit may include, but is not limited to, physical devices, which may include, but are not limited to, devices such as transistors or memristors. In view of this, the various devices described herein (e.g., the computing device or other processing devices) may be implemented by suitable hardware processors, such as CPUs, GPUs, FPGAs, DSPs and ASICs. Further, the aforementioned storage unit or storage device may be any suitable storage medium (including magnetic or magneto-optical storage media, etc.), and may be, for example, a resistive random access memory (RRAM), a dynamic random access memory (DRAM), a static random access memory (SRAM), an enhanced dynamic random access memory (EDRAM), a high bandwidth memory (HBM), a hybrid memory cube (HMC), a ROM or a RAM.
The foregoing may be better understood in light of the following clauses:
clause 1, a computing device, comprising:
a processing circuit array formed by connecting a plurality of processing circuits in a one-dimensional or multi-dimensional array structure, wherein the processing circuit array is configured as a plurality of processing circuit sub-arrays and performs a multi-threaded operation in response to receiving a plurality of operation instructions,
wherein the plurality of operation instructions are obtained by parsing a computation instruction received by the computing device, and wherein an operand of the computation instruction comprises a descriptor indicating a shape of a tensor, the descriptor being used to determine a storage address of data corresponding to the operand,
wherein at least one processing circuit sub-array is configured to execute at least one of the plurality of operation instructions according to the storage address.
Clause 2, the computing apparatus of clause 1, wherein the computation instruction comprises an identifier of the descriptor and/or content of the descriptor, the content comprising at least one shape parameter representing the shape of tensor data.
Clause 3, the computing device ofclause 2, wherein the contents of the descriptor further include at least one address parameter representing an address of tensor data.
Clause 4, the computing device ofclause 3, wherein the address parameters of the tensor data comprise a reference address of a data reference point of the descriptor in a data storage space of the tensor data.
Clause 5, the computing device ofclause 4, wherein the shape parameters of the tensor data comprise at least one of:
the size of the data storage space in at least one of N dimensional directions, the size of a storage region of the tensor data in at least one of the N dimensional directions, the offset of the storage region in at least one of the N dimensional directions, the positions of at least two vertexes at diagonal positions of the N dimensional directions relative to the data reference point, and the mapping relationship between the data description position of the tensor data indicated by the descriptor and the data address, wherein N is an integer greater than or equal to zero.
Clause 6, the computing apparatus according to clause 1, wherein an opcode of the computation instruction represents a plurality of operations to be performed by the processing circuit array, the computing apparatus further comprising a control circuit configured to fetch the computation instruction and parse it to obtain the plurality of operation instructions corresponding to the plurality of operations represented by the opcode, and, when an operand of the computation instruction includes the descriptor, to determine the storage address of the data corresponding to the operand according to the descriptor.
Clause 7, the computing device ofclause 6, wherein the control circuitry configures the array of processing circuits according to the plurality of operational instructions to obtain the plurality of sub-arrays of processing circuits.
Clause 8, the computing device ofclause 7, wherein the control circuit includes a register for storing configuration information, and the control circuit extracts corresponding configuration information according to the plurality of arithmetic instructions and configures the processing circuit array according to the configuration information to obtain the plurality of processing circuit sub-arrays.
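A minimal sketch of the register-driven configuration of clauses 7 and 8, assuming a hypothetical table CONFIG_REGISTERS that maps each operation instruction to the rows of the array forming its sub-array:

    # Hypothetical configuration register contents.
    CONFIG_REGISTERS = {
        "MUL": {"rows": (0, 1)},
        "ADD": {"rows": (2, 3)},
    }

    def configure(array, operation_instructions):
        # Partition a two-dimensional processing circuit array into one
        # sub-array per operation instruction, per the stored config info.
        sub_arrays = {}
        for inst in operation_instructions:
            cfg = CONFIG_REGISTERS[inst]        # extract configuration info
            sub_arrays[inst] = [array[r] for r in cfg["rows"]]
        return sub_arrays

    array = [[f"PE{r}{c}" for c in range(4)] for r in range(4)]
    print(configure(array, ["MUL", "ADD"]))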
Clause 9, the computing device according to clause 1, wherein the plurality of operation instructions includes at least one multi-stage pipelined operation, and wherein the multi-stage pipelined operation includes at least two operation instructions.
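To make the multi-stage pipelining of clause 9 concrete, the sketch below chains two operation instructions so that the second stage consumes the first stage's output; the stage functions are arbitrary placeholders:

    def run_pipeline(stages, operands):
        # Feed each operand through every stage in order; each stage stands
        # in for one operation instruction.
        for stage in stages:
            operands = [stage(x) for x in operands]
        return operands

    stages = [lambda x: x * 2, lambda x: x + 1]   # at least two operation instructions
    print(run_pipeline(stages, [1, 2, 3]))        # [3, 5, 7]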
Clause 10, the computing device of clause 1, wherein the operation instructions include a predicate, and each of the processing circuits determines whether to execute the operation instruction associated therewith in accordance with the predicate.
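The predicated execution of clause 10 can be pictured as each processing circuit consulting its own predicate bit before executing; a minimal sketch with hypothetical names:

    def execute_predicated(values, operation, predicates):
        # Each circuit executes the operation instruction only when its
        # predicate bit is set, and otherwise leaves its value unchanged.
        return [operation(v) if p else v for v, p in zip(values, predicates)]

    print(execute_predicated([1, 2, 3, 4], lambda x: x * 10, [1, 0, 1, 0]))
    # -> [10, 2, 30, 4]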
Clause 11, the computing device of clause 1, wherein the array of processing circuits is a one-dimensional array, and one or more processing circuits in the array of processing circuits are configured as one sub-array of processing circuits.
Clause 12, the computing device of clause 1, wherein the array of processing circuits is a two-dimensional array, and wherein:
one or more rows of processing circuits in the array of processing circuits are configured as one sub-array of processing circuits; or
one or more columns of processing circuits in the array of processing circuits are configured as one sub-array of processing circuits; or
one or more lines of processing circuits in a diagonal direction in the array of processing circuits are configured as one sub-array of processing circuits.
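The row, column, and diagonal groupings of clause 12 can be illustrated on a small grid of circuit identifiers; the 4x4 size is arbitrary:

    grid = [[4 * r + c for c in range(4)] for r in range(4)]

    row_sub  = grid[0:2]                          # two rows as one sub-array
    col_sub  = [[grid[r][0] for r in range(4)]]   # one column as one sub-array
    diag_sub = [[grid[i][i] for i in range(4)]]   # main diagonal as one sub-array

    print(row_sub, col_sub, diag_sub, sep="\n")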
Clause 13, the computing device of clause 12, wherein the plurality of processing circuits located in the two-dimensional array are configured to be connected, in at least one of their row, column, or diagonal directions and in a predetermined two-dimensional spacing pattern, with the remaining one or more processing circuits in the same row, the same column, or the same diagonal.
Clause 14, the computing device of clause 13, wherein the predetermined two-dimensional spacing pattern is associated with the number of processing circuits spaced apart in the connection.
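One plausible reading of the two-dimensional spacing pattern of clauses 13 and 14, in which the pattern is parameterized by the number of circuits skipped between connected neighbours (an interpretation offered for illustration, not a definition from the disclosure):

    def spaced_connections(length, spacing):
        # Connections along one direction of the array where `spacing`
        # circuits are skipped between connected neighbours; spacing == 0
        # reduces to ordinary nearest-neighbour wiring.
        return [(i, i + spacing + 1) for i in range(length - spacing - 1)]

    print(spaced_connections(8, 0))   # adjacent circuits in a row of 8
    print(spaced_connections(8, 1))   # every other circuit in the same row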
Clause 15, the computing device of clause 1, wherein the processing circuit array is a three-dimensional array, and one three-dimensional sub-array or a plurality of three-dimensional sub-arrays in the processing circuit array are configured as one sub-array of processing circuits.
Clause 16, the computing device of clause 15, wherein the three-dimensional array is composed of a plurality of layers, each layer comprising a two-dimensional array of a plurality of the processing circuits arranged in a row direction, a column direction, and a diagonal direction, wherein:
the processing circuits located in the three-dimensional array are configured to be connected, in at least one of their row, column, diagonal, and layer directions and in a predetermined three-dimensional spacing pattern, with the remaining one or more processing circuits in the same row, the same column, the same diagonal, or on a different layer.
Clause 17, the computing device of clause 16, wherein the predetermined three-dimensional spacing pattern is associated with the number of spaced circuits and the number of spaced layers between the processing circuits to be connected.
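Extending the same reading to the layered array of clauses 16 and 17, the sketch below adds a layer-direction parameter; again this is only one interpretation of the predetermined three-dimensional spacing pattern:

    def spaced_connections_3d(layers, rows, cols, spacing, layer_spacing):
        # Connect each circuit to a peer `spacing` circuits away in its row
        # and to a peer `layer_spacing` layers away in the layer direction.
        conns = []
        for l in range(layers):
            for r in range(rows):
                for c in range(cols):
                    if c + spacing + 1 < cols:           # within one layer
                        conns.append(((l, r, c), (l, r, c + spacing + 1)))
                    if l + layer_spacing + 1 < layers:   # across layers
                        conns.append(((l, r, c), (l + layer_spacing + 1, r, c)))
        return conns

    print(len(spaced_connections_3d(2, 2, 4, 1, 0)))   # 16 connections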
Clause 18, the computing device of any one of clauses 11-17, wherein the plurality of processing circuits in the sub-array of processing circuits form one or more closed loops.
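The closed loops of clause 18 amount to ring wiring within a sub-array, with the last circuit wrapping back to the first; a minimal sketch:

    def closed_loop(circuit_ids):
        # Wire each processing circuit to the next; the last wraps to the
        # first, so the sub-array forms one closed loop.
        n = len(circuit_ids)
        return [(circuit_ids[i], circuit_ids[(i + 1) % n]) for i in range(n)]

    print(closed_loop([0, 1, 2, 3]))   # [(0, 1), (1, 2), (2, 3), (3, 0)]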
Clause 19, the computing device of clause 1, wherein each of the processing circuit sub-arrays is adapted to perform at least one of the following operations: arithmetic operations, logical operations, comparison operations, and table lookup operations.
Clause 20, the computing device of clause 1, further comprising data manipulation circuitry comprising pre-manipulation circuitry and/or post-manipulation circuitry, wherein the pre-manipulation circuitry is configured to perform pre-processing of input data of at least one of the operation instructions, and the post-manipulation circuitry is configured to perform post-processing of output data of at least one of the operation instructions.
Clause 21, the computing device of clause 20, wherein the preprocessing comprises data placement and/or table lookup operations, and the post-processing comprises data type conversion and/or compression operations.
Clause 22, the computing device of clause 21, wherein the data placement includes splitting or combining the input data and/or the output data of the operation instruction according to the data type of the input data and/or the output data, and then transmitting the split or combined input data and/or output data to the corresponding processing circuit for operation.
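For the data placement of clauses 21 and 22, splitting by data type can be pictured as cutting the raw input into dtype-sized lanes, one lane per processing circuit; the int16 operands below are an arbitrary example:

    import struct

    def split_by_type(raw: bytes, dtype_size: int):
        # Pre-operation data placement: one dtype-sized lane per circuit.
        return [raw[i:i + dtype_size] for i in range(0, len(raw), dtype_size)]

    def combine(lanes):
        # Post-operation counterpart: merge the lanes back together.
        return b"".join(lanes)

    raw = struct.pack("<4h", 1, 2, 3, 4)   # four int16 operands
    lanes = split_by_type(raw, 2)
    assert combine(lanes) == raw
    print(lanes)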
Clause 23, an integrated circuit chip comprising the computing device of any one of clauses 1-22.
Clause 24, a board card comprising the integrated circuit chip of clause 23.
Clause 25, an electronic device, comprising the integrated circuit chip of clause 23.
Clause 26, a method of performing a computation using a computing device, wherein the computing device includes an array of processing circuits connected by a plurality of processing circuits in a one-dimensional or multi-dimensional array configuration, and the array of processing circuits is configured as a plurality of sub-arrays of processing circuits, the method comprising:
receiving a computation instruction at the computing device and parsing it to obtain a plurality of operation instructions, wherein an operand of the computation instruction comprises a descriptor for indicating a shape of a tensor, the descriptor for determining a storage address of data corresponding to the operand;
in response to receiving the plurality of operation instructions, performing a multi-threaded operation with the plurality of processing circuit sub-arrays, wherein at least one sub-array of processing circuits of the plurality of sub-arrays is configured to execute at least one of the plurality of operation instructions according to the storage address.
Clause 27, the method of clause 26, wherein the computing instruction comprises an identification of a descriptor and/or a content of a descriptor comprising at least one shape parameter representing a shape of the tensor data.
Clause 28, the method of clause 27, wherein the contents of the descriptor further comprise at least one address parameter representing an address of tensor data.
Clause 29, the method of clause 28, wherein the address parameters of the tensor data comprise a reference address of a data reference point of the descriptor in the data storage space of the tensor data.
Clause 30, the method of clause 29, wherein the shape parameters of the tensor data comprise at least one of:
the size of the data storage space in at least one of N dimensional directions, the size of a storage region of the tensor data in at least one of the N dimensional directions, the offset of the storage region in at least one of the N dimensional directions, the positions of at least two vertexes at diagonal positions of the N dimensional directions relative to the data reference point, and the mapping relationship between the data description position of the tensor data indicated by the descriptor and the data address, wherein N is an integer greater than or equal to zero.
Clause 31, the method of clause 26, wherein the opcode of the computation instruction represents a plurality of operations performed by the array of processing circuits, the computing device further comprising control circuitry, the method comprising fetching the computation instruction with the control circuitry and parsing the computation instruction to obtain the plurality of operation instructions corresponding to the plurality of operations represented by the opcode.
Clause 32, the method of clause 31, wherein the control circuitry is utilized to configure the processing circuit array according to the plurality of operation instructions to obtain the plurality of processing circuit sub-arrays.
Clause 33, the method of clause 32, wherein the control circuit includes a register for storing configuration information, and the method includes utilizing the control circuit to extract corresponding configuration information according to the plurality of operation instructions and to configure the array of processing circuits according to the configuration information to obtain the plurality of processing circuit sub-arrays.
Clause 34, the method of clause 26, wherein the plurality of operation instructions comprises at least one multi-stage pipelined operation comprising at least two operation instructions.
Clause 35, the method of clause 26, wherein the operation instructions include a predicate, and the method further comprises determining, with each of the processing circuits, whether to execute the operation instruction associated therewith in accordance with the predicate.
Clause 36, the method of clause 26, wherein the array of processing circuits is a one-dimensional array, and the method comprises configuring one or more processing circuits in the array of processing circuits as a sub-array of the processing circuits.
Clause 37, the method of clause 26, wherein the array of processing circuits is a two-dimensional array, and the method further comprises:
configuring one or more rows of processing circuits in the array of processing circuits as a sub-array of the processing circuits; or
configuring one or more columns of processing circuits in the array of processing circuits as one sub-array of processing circuits; or
configuring one or more lines of processing circuits in a diagonal direction in the processing circuit array as one sub-array of processing circuits.
Clause 38, the method of clause 37, wherein the plurality of processing circuits located in the two-dimensional array are configured to be connected, in at least one of their row, column, or diagonal directions and in a predetermined two-dimensional spacing pattern, with the remaining one or more processing circuits in the same row, the same column, or the same diagonal.
Clause 39, the method of clause 38, wherein the predetermined two-dimensional spacing pattern is associated with a number of processing circuits spaced in the connection.
Clause 40, the method of clause 26, wherein the processing circuit array is a three-dimensional array, and the method comprises configuring one three-dimensional sub-array or a plurality of three-dimensional sub-arrays in the processing circuit array as one sub-array of processing circuits.
Clause 41, the method of clause 40, wherein the three-dimensional array is a three-dimensional array comprised of a plurality of layers, wherein each layer comprises a two-dimensional array of a plurality of the processing circuits arranged in a row direction, a column direction, and a diagonal direction, the method comprising:
configuring the processing circuits located in the three-dimensional array to be connected in at least one of a row direction, a column direction, a diagonal direction, and a layer direction thereof with the remaining one or more processing circuits in the same row, the same column, the same diagonal, or a different layer in a predetermined three-dimensional spacing pattern.
Clause 42, the method of clause 41, wherein the predetermined three-dimensional spacing pattern is associated with a number of spaces and a number of spacing layers between processing circuits to be connected.
Clause 43, the method of any one of clauses 36-42, wherein the plurality of processing circuits in the sub-array of processing circuits form one or more closed loops.
Clause 44, the method of clause 26, wherein each of the sub-arrays of processing circuitry is adapted to perform at least one of the following operations: arithmetic operations, logical operations, comparison operations, and table lookup operations.
Clause 45, the method of clause 26, wherein the computing device further comprises a data manipulation circuit comprising a pre-manipulation circuit and/or a post-manipulation circuit, the method comprising performing pre-processing of input data of at least one of the operation instructions with the pre-manipulation circuit and/or performing post-processing of output data of at least one of the operation instructions with the post-manipulation circuit.
Clause 46, the method of clause 45, wherein the preprocessing comprises operations directed to data placement and/or table lookup, and the post-processing comprises data type conversion and/or compression operations.
Clause 47, the method of clause 46, wherein the data placement includes splitting or combining the input data and/or the output data of the operation instruction according to the data type of the input data and/or the output data, and then transmitting the split or combined input data and/or output data to the corresponding processing circuit for operation.
While various embodiments of the present disclosure have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Numerous modifications, changes, and substitutions will occur to those skilled in the art without departing from the spirit and scope of the present disclosure. It should be understood that various alternatives to the embodiments of the disclosure described herein may be employed in practicing the disclosure. It is intended that the following claims define the scope of the disclosure and that equivalents or alternatives within the scope of these claims be covered thereby.

Claims (47)

1. A computing device, comprising:
a processing circuit array formed by connecting a plurality of processing circuits in a one-dimensional or multi-dimensional array structure, wherein the processing circuit array is configured as a plurality of processing circuit sub-arrays and performs a multi-threaded operation in response to receiving a plurality of operation instructions,
wherein the plurality of operation instructions are obtained by parsing a computation instruction received by the computing device, and wherein an operand of the computation instruction comprises a descriptor for indicating a shape of a tensor, the descriptor being used to determine a storage address of data corresponding to the operand,
wherein at least one of the processing circuit sub-arrays is configured to execute at least one of the plurality of operation instructions according to the storage address.
2. The computing device of claim 1, wherein the computing instructions comprise an identification of a descriptor and/or content of a descriptor comprising at least one shape parameter representing a shape of tensor data.
3. The computing device of claim 2, wherein contents of the descriptor further include at least one address parameter representing an address of tensor data.
4. The computing device of claim 3, wherein address parameters of the tensor data comprise a base address of a data reference point of the descriptor in a data storage space of the tensor data.
5. The computing device of claim 4, wherein shape parameters of the tensor data comprise at least one of:
the size of the data storage space in at least one of N dimensional directions, the size of a storage region of the tensor data in at least one of the N dimensional directions, the offset of the storage region in at least one of the N dimensional directions, the positions of at least two vertexes at diagonal positions of the N dimensional directions relative to the data reference point, and the mapping relationship between the data description position of the tensor data indicated by the descriptor and the data address, wherein N is an integer greater than or equal to zero.
6. The computing device of claim 1, wherein an opcode of the computation instruction represents a plurality of operations to be performed by the array of processing circuits, the computing device further comprising control circuitry configured to fetch and parse the computation instruction to obtain the plurality of operation instructions corresponding to the plurality of operations represented by the opcode, and when an operand of the computation instruction includes the descriptor, the control circuitry is configured to determine a storage address for data corresponding to the operand from the descriptor.
7. The computing device of claim 6, wherein the control circuitry configures the processing circuit array according to the plurality of operational instructions to obtain the plurality of processing circuit sub-arrays.
8. The computing device of claim 7, wherein the control circuitry includes registers to store configuration information, and the control circuitry extracts corresponding configuration information from the plurality of operational instructions and configures the processing circuit array to obtain the plurality of processing circuit sub-arrays according to the configuration information.
9. The computing device of claim 1, the plurality of operation instructions comprising at least one multi-stage pipelined operation comprising at least two operation instructions.
10. The computing apparatus of claim 1 wherein the operational instructions comprise predicates and each of the processing circuits determines whether to execute the operational instructions associated therewith in dependence upon the predicates.
11. The computing device of claim 1, wherein the array of processing circuits is a one-dimensional array, and one or more processing circuits in the array of processing circuits are configured as one sub-array of processing circuits.
12. The computing device of claim 1, wherein the array of processing circuits is a two-dimensional array, and wherein:
one or more rows of processing circuits in the array of processing circuits are configured as a sub-array of the processing circuits; or
one or more columns of processing circuits in the array of processing circuits are configured as a sub-array of the processing circuits; or
one or more lines of processing circuits in a diagonal direction in the array of processing circuits are configured as one sub-array of the processing circuits.
13. The computing device of claim 12, wherein the plurality of processing circuits located in the two-dimensional array are configured to connect with a remaining one or more of the processing circuits in a same row, a same column, or a same diagonal in at least one of their row, column, or diagonal directions in a predetermined two-dimensional spacing pattern.
14. The computing device of claim 13, wherein the predetermined two-dimensional spacing pattern is associated with a number of processing circuits spaced in the connection.
15. The computing device of claim 1, wherein the processing circuit array is a three-dimensional array, and a three-dimensional sub-array or a plurality of three-dimensional sub-arrays in the processing circuit array is configured to be one of the processing circuit sub-arrays.
16. The computing device of claim 15, wherein the three-dimensional array is a three-dimensional array of a plurality of layers, wherein each layer comprises a two-dimensional array of a plurality of the processing circuits arranged in a row direction, a column direction, and a diagonal direction, wherein:
the processing circuits located in the three-dimensional array are configured to be connected with the remaining one or more processing circuits in the same row, the same column, the same diagonal, or on a different layer in at least one of a row direction, a column direction, a diagonal direction, and a layer direction thereof in a predetermined three-dimensional spacing pattern.
17. The computing device of claim 16, wherein the predetermined three-dimensional spacing pattern is associated with a number of spaces and a number of layers of spaces between processing circuits to be connected.
18. The computing device of any of claims 11-17, wherein a plurality of processing circuits in the processing circuit sub-array form one or more closed loops.
19. The computing device of claim 1, wherein each of the sub-arrays of processing circuitry is adapted to perform at least one of the following operations: arithmetic operations, logical operations, comparison operations, and table lookup operations.
20. The computing device of claim 1, further comprising data manipulation circuitry comprising pre-manipulation circuitry and/or post-manipulation circuitry, wherein the pre-manipulation circuitry is configured to perform pre-processing of input data of at least one of the operational instructions and the post-manipulation circuitry is configured to perform post-processing of output data of at least one of the operational instructions.
21. The computing device of claim 20, wherein the pre-processing comprises data placement and/or table lookup operations and the post-processing comprises data type conversion and/or compression operations.
22. The computing device of claim 21, wherein the data arrangement includes, according to a data type of input data and/or output data of the operation instruction, splitting or combining the input data and/or the output data, and then transmitting the split or combined input data and/or output data to a corresponding processing circuit for operation.
23. An integrated circuit chip comprising the computing device of any of claims 1-22.
24. A board card comprising the integrated circuit chip of claim 23.
25. An electronic device comprising the integrated circuit chip of claim 23.
26. A method of performing a computation using a computing device, wherein the computing device includes a processing circuit array formed by a plurality of processing circuits connected in a one-dimensional or multi-dimensional array configuration, and the processing circuit array is configured as a plurality of processing circuit sub-arrays, the method comprising:
receiving a computation instruction at the computing device and parsing it to obtain a plurality of operation instructions, wherein an operand of the computation instruction comprises a descriptor for indicating a shape of a tensor, the descriptor for determining a storage address of data corresponding to the operand;
in response to receiving the plurality of operation instructions, performing a multi-threaded operation with the plurality of processing circuit sub-arrays, wherein at least one sub-array of processing circuits of the plurality of sub-arrays is configured to execute at least one of the plurality of operation instructions according to the storage address.
27. The method of claim 26, wherein the computation instruction comprises an identification of a descriptor and/or content of a descriptor comprising at least one shape parameter representing a shape of tensor data.
28. The method of claim 27, wherein the content of the descriptor further comprises at least one address parameter representing an address of tensor data.
29. The method of claim 28, wherein address parameters of the tensor data comprise a reference address of a data reference point of the descriptor in a data storage space of the tensor data.
30. The method of claim 29, wherein shape parameters of the tensor data comprise at least one of:
the size of the data storage space in at least one of N dimensional directions, the size of a storage region of the tensor data in at least one of the N dimensional directions, the offset of the storage region in at least one of the N dimensional directions, the positions of at least two vertexes at diagonal positions of the N dimensional directions relative to the data reference point, and the mapping relationship between the data description position of the tensor data indicated by the descriptor and the data address, wherein N is an integer greater than or equal to zero.
31. The method of claim 26, wherein an opcode of the computation instruction represents a plurality of operations to be performed by the array of processing circuits, the computing device further comprising control circuitry, the method comprising fetching and parsing the computation instruction with the control circuitry to obtain the plurality of operation instructions corresponding to the plurality of operations represented by the opcode.
32. The method of claim 31, wherein the control circuitry is utilized to configure the processing circuit array according to the plurality of operational instructions to obtain the plurality of processing circuit sub-arrays.
33. The method of claim 32, wherein the control circuitry includes registers for storing configuration information, and the method includes utilizing the control circuitry to extract corresponding configuration information from the plurality of operational instructions and configure the array of processing circuits according to the configuration information to obtain the plurality of sub-arrays of processing circuits.
34. The method of claim 26, wherein the plurality of operation instructions comprises at least one multi-stage pipelined operation comprising at least two operation instructions.
35. The method of claim 26, wherein the operational instructions include a predicate, and the method further comprises determining, with each of the processing circuits, whether to execute the operational instruction associated therewith based on the predicate.
36. The method of claim 26, wherein the array of processing circuits is a one-dimensional array, and the method comprises configuring one or more processing circuits in the array of processing circuits as one sub-array of processing circuits.
37. The method of claim 26, wherein the array of processing circuits is a two-dimensional array, and the method further comprises:
configuring one or more rows of processing circuits in the array of processing circuits as a sub-array of the processing circuits; or
configuring one or more columns of processing circuits in the array of processing circuits as a sub-array of processing circuits; or
configuring one or more lines of processing circuits in a diagonal direction in the processing circuit array as one sub-array of the processing circuits.
38. The method of claim 37, wherein the plurality of processing circuits located in the two-dimensional array are configured to be connected in at least one of a row direction, a column direction, or a diagonal direction thereof with a predetermined two-dimensional spacing pattern with a remaining one or more of the processing circuits in a same row, a same column, or a same diagonal.
39. The method of claim 38, wherein the predetermined two-dimensional spacing pattern is associated with a number of processing circuits spaced in the connection.
40. The method of claim 26, wherein the array of processing circuits is a three-dimensional array, and the method comprises configuring a three-dimensional sub-array or a plurality of three-dimensional sub-arrays in the array of processing circuits as one sub-array of processing circuits.
41. The method of claim 40, wherein the three-dimensional array is a three-dimensional array of a plurality of layers, wherein each layer comprises a two-dimensional array of a plurality of the processing circuits arranged in a row direction, a column direction, and a diagonal direction, the method comprising:
configuring the processing circuits located in the three-dimensional array to be connected in at least one of a row direction, a column direction, a diagonal direction, and a layer direction thereof with the remaining one or more processing circuits in the same row, the same column, the same diagonal, or a different layer in a predetermined three-dimensional spacing pattern.
42. The method of claim 41, wherein the predetermined three-dimensional spacing pattern is associated with a number of spaces and a number of spacing layers between processing circuits to be connected.
43. The method of any of claims 36-42, wherein a plurality of processing circuits in the sub-array of processing circuits form one or more closed loops.
44. The method of claim 26, wherein each of said sub-arrays of processing circuitry is adapted to perform at least one of the following operations: arithmetic operations, logical operations, comparison operations, and table lookup operations.
45. The method of claim 26, wherein the computing device further comprises a data manipulation circuit comprising a pre-manipulation circuit and/or a post-manipulation circuit, the method comprising performing pre-processing of input data of at least one of the operation instructions with the pre-manipulation circuit and/or performing post-processing of output data of at least one of the operation instructions with the post-manipulation circuit.
46. The method of claim 45, wherein the pre-processing comprises operations directed to data placement and/or table lookup, and the post-processing comprises data type conversion and/or compression operations.
47. The method of claim 46, wherein the data arrangement comprises, according to a data type of input data and/or output data of the operation instruction, splitting or combining the input data and/or the output data, and then transmitting the split or combined input data and/or output data to a corresponding processing circuit for operation.
CN202010619458.0A | Priority date: 2020-06-30 | Filing date: 2020-06-30 | Title: Computing device, integrated circuit chip, board card, electronic equipment and computing method | Status: Pending | Publication: CN113867792A (en)

Priority Applications (2)

Application Number | Publication | Priority Date | Filing Date | Title
CN202010619458.0A | CN113867792A (en) | 2020-06-30 | 2020-06-30 | Computing device, integrated circuit chip, board card, electronic equipment and computing method
PCT/CN2021/095703 | WO2022001498A1 (en) | 2020-06-30 | 2021-05-25 | Computing apparatus, integrated circuit chip, board, electronic device and computing method

Applications Claiming Priority (1)

Application Number | Publication | Priority Date | Filing Date | Title
CN202010619458.0A | CN113867792A (en) | 2020-06-30 | 2020-06-30 | Computing device, integrated circuit chip, board card, electronic equipment and computing method

Publications (1)

Publication Number | Publication Date
CN113867792A (en) | 2021-12-31

Family

ID: 78981749

Family Applications (1)

Application Number | Status | Publication | Priority Date | Filing Date | Title
CN202010619458.0A | Pending | CN113867792A (en) | 2020-06-30 | 2020-06-30 | Computing device, integrated circuit chip, board card, electronic equipment and computing method

Country Status (2)

Country | Link
CN (1) | CN113867792A (en)
WO (1) | WO2022001498A1 (en)


Family Cites Families (7)

* Cited by examiner, † Cited by third party

Publication number | Priority date | Publication date | Assignee | Title
US9003165B2 (en)* | 2008-12-09 | 2015-04-07 | Shlomo Selim Rakib | Address generation unit using end point patterns to scan multi-dimensional data structures
JP2014112327A (en)* | 2012-12-05 | 2014-06-19 | Fujitsu Ltd | Conversion program, converter, and converting method
US11630800B2 (en)* | 2015-05-01 | 2023-04-18 | Nvidia Corporation | Programmable vision accelerator
US10318307B2 (en)* | 2015-06-17 | 2019-06-11 | Mediatek, Inc. | Scalarization of vector processing
CN110503179B (en)* | 2018-05-18 | 2024-03-01 | Shanghai Cambricon Information Technology Co., Ltd. | Calculation method and related product
CN110727911B (en)* | 2018-07-17 | 2022-09-02 | Spreadtrum Communications (Shanghai) Co., Ltd. | Matrix operation method and device, storage medium and terminal
US10853067B2 (en)* | 2018-09-27 | 2020-12-01 | Intel Corporation | Computer processor for higher precision computations using a mixed-precision decomposition of operations

Patent Citations (5)

Publication number | Priority date | Publication date | Assignee | Title
CN1261966A (en)* | 1997-06-30 | 2000-08-02 | BOPS, Inc. | Manifold array processor
CN103019656A (en)* | 2012-12-04 | 2013-04-03 | Institute of Semiconductors, Chinese Academy of Sciences | Dynamically reconfigurable multi-stage parallel single instruction multiple data array processing system
CN109997154A (en)* | 2017-10-30 | 2019-07-09 | Shanghai Cambricon Information Technology Co., Ltd. | Information processing method and terminal device
CN110163360A (en)* | 2018-02-13 | 2019-08-23 | Shanghai Cambricon Information Technology Co., Ltd. | Computing device and method
CN110163353A (en)* | 2018-02-13 | 2019-08-23 | Shanghai Cambricon Information Technology Co., Ltd. | Computing device and method

Non-Patent Citations (1)

Title
Yao Shuncai et al., "机器学习基础教程" [Basic Tutorial of Machine Learning], Xidian University Press, 31 March 2020, pp. 104-105 *

Also Published As

Publication number | Publication date
WO2022001498A1 (en) | 2022-01-06

Similar Documents

Publication | Title
US11531540B2 (en) | Processing apparatus and processing method with dynamically configurable operation bit width
CN109657782B (en) | Operation method, device and related product
WO2023045445A1 (en) | Data processing device, data processing method, and related product
CN111353591A (en) | Computing device and related product
CN111488963A (en) | Neural network computing device and method
CN109740729B (en) | Operation method, device and related product
WO2022134873A1 (en) | Data processing device, data processing method, and related product
JP7483764B2 (en) | Computing apparatus, integrated circuit chip, board card, electronic device and computing method
CN109711538B (en) | Operation method, device and related product
CN114692840A (en) | Data processing device, data processing method and related product
CN114692844B (en) | Data processing device, data processing method and related products
CN112766473A (en) | Arithmetic device and related product
CN113867800B (en) | Computing device, integrated circuit chip, board, electronic device and computing method
CN113867792A (en) | Computing device, integrated circuit chip, board card, electronic equipment and computing method
CN114692838B (en) | Data processing device, data processing method and related products
JP7266121B2 (en) | Computing equipment, chips, board cards, electronic devices and computing methods
CN113867799A (en) | Computing device, integrated circuit chip, board card, electronic equipment and computing method
WO2022134872A1 (en) | Data processing apparatus, data processing method and related product
CN113867788A (en) | Computing device, chip, board card, electronic equipment and computing method
CN114692841B (en) | Data processing device, data processing method and related products
CN113867790A (en) | Computing device, integrated circuit chip, board and computing method
CN113867798A (en) | Integrated computing device, integrated circuit chip, board and computing method
CN112766471A (en) | Computing devices and related products
CN111367567A (en) | Neural network computing device and method
JP7634027B2 (en) | Computing device, integrated circuit chip, board card, electronic device and computing method

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
