CN118690803A - Graph neural network acceleration method and graph neural network acceleration structure

Info

Publication number
CN118690803A
Authority
CN
China
Prior art keywords: matrix, row vector, row, vector, target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410693570.7A
Other languages
Chinese (zh)
Inventor
蒋东东
董刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Metabrain Intelligent Technology Co Ltd
Original Assignee
Suzhou Metabrain Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Metabrain Intelligent Technology Co Ltd
Priority to CN202410693570.7A
Publication of CN118690803A

Abstract

Translated from Chinese

An embodiment of the present application provides a graph neural network acceleration method and a graph neural network acceleration structure, relating to the field of deep learning. The method includes: sequentially reading a compressed feature matrix from an external memory according to storage addresses; performing a first matrix operation on the read compressed feature matrix and a weight matrix to obtain an intermediate matrix, where the specified matrix operations include the first matrix operation; sequentially reading a compressed relationship matrix from the external memory according to storage addresses, where the compressed relationship matrix is obtained by compressing a specified relationship matrix corresponding to the graph structure data, and the specified relationship matrix describes the relationships between nodes in the graph structure data; and performing a second matrix operation on the read compressed relationship matrix and the intermediate matrix to obtain a target matrix, where the specified matrix operations also include the second matrix operation. The present application solves the problem of low acceleration efficiency in graph neural network acceleration methods in the related art.

Description

Translated from Chinese

Graph neural network acceleration method and graph neural network acceleration structure

Technical Field

The embodiments of the present application relate to the field of deep learning, and in particular to a graph neural network acceleration method and a graph neural network acceleration structure.

Background Art

Graph neural network acceleration methods usually optimize the two-stage matrix operations in each network layer, namely aggregation and combination (update). Aggregation and combination (update) require successive computation over three matrices: the relationship matrix, the feature matrix, and the weight matrix. The relationship matrix and the feature matrix are sparse matrices, meaning that their non-zero elements account for only a small fraction of the total number of matrix elements while the rest are zero; the relationship matrix is an extremely sparse matrix, meaning that non-zero elements make up less than 1% of the matrix; the weight matrix is usually dense.
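
As a concrete illustration of the per-layer computation just described, the following is a minimal numpy sketch (not the patented accelerator; the sizes, values, and ReLU activation are illustrative assumptions):

```python
import numpy as np

# Toy sizes (illustrative): 6 nodes, 4 input features, 3 output features.
num_nodes, f_in, f_out = 6, 4, 3

A = np.zeros((num_nodes, num_nodes))   # relationship matrix, extremely sparse
A[0, 1] = A[1, 0] = A[2, 4] = 1.0
X = np.random.rand(num_nodes, f_in)    # feature matrix, sparse or dense by scenario
X[X < 0.7] = 0.0                       # make X sparse for the example
W = np.random.rand(f_in, f_out)        # weight matrix, dense

# One hidden layer: aggregation and combination/update, i.e. sigma(A X W).
H = np.maximum(A @ X @ W, 0.0)         # ReLU chosen as an example activation
print(H.shape)                         # (6, 3)
```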

The two matrix computations differ greatly in the sparsity of their input data, their cache requirements, and the reusability of their data. Moreover, in matrix computation, data must be loaded from memory into a processor (such as a CPU or GPU). Because of the gap in data transfer speed between memory and processor, the computation cannot fully utilize the processor's compute capability while the processor waits for data to load from memory, creating a performance bottleneck known as the memory wall problem. At present, graph neural network acceleration methods in the related art suffer from low acceleration efficiency.

Summary of the Invention

The embodiments of the present application provide a graph neural network acceleration method and a graph neural network acceleration structure, to at least solve the problem of low acceleration efficiency in graph neural network acceleration methods in the related art.

According to an embodiment of the present application, a graph neural network acceleration method is provided, applied to a graph neural network acceleration structure, where the graph neural network is used to process graph structure data, the graph neural network includes a hidden layer, the hidden layer is used to perform specified matrix operations, the graph neural network acceleration structure includes a graph neural network accelerator and an external memory of the graph neural network accelerator, and the graph neural network accelerator is used to implement the specified matrix operations. The method includes: sequentially reading a compressed feature matrix from the external memory according to storage addresses, where the compressed feature matrix is obtained by compressing a specified feature matrix corresponding to the graph structure data; performing a first matrix operation on the read compressed feature matrix and a weight matrix to obtain an intermediate matrix, where the specified matrix operations include the first matrix operation; sequentially reading a compressed relationship matrix from the external memory according to storage addresses, where the compressed relationship matrix is obtained by compressing a specified relationship matrix corresponding to the graph structure data, and the specified relationship matrix describes the relationships between nodes in the graph structure data; and performing a second matrix operation on the read compressed relationship matrix and the intermediate matrix to obtain a target matrix, where the specified matrix operations also include the second matrix operation.

According to another embodiment of the present application, a graph neural network acceleration structure is provided, where the graph neural network is used to process graph structure data, the graph neural network includes a hidden layer, the hidden layer is used to perform specified matrix operations, the graph neural network acceleration structure includes a graph neural network accelerator, and the graph neural network accelerator is used to implement the specified matrix operations. The structure includes: the graph neural network accelerator, configured to sequentially read a compressed feature matrix from the external memory according to storage addresses, where the compressed feature matrix is obtained by compressing a specified feature matrix corresponding to the graph structure data; perform a first matrix operation on the read compressed feature matrix and a weight matrix to obtain an intermediate matrix, where the specified matrix operations include the first matrix operation; sequentially read a compressed relationship matrix from the external memory according to storage addresses, where the compressed relationship matrix is obtained by compressing a specified relationship matrix corresponding to the graph structure data, and the specified relationship matrix describes the relationships between nodes in the graph structure data; and perform a second matrix operation on the read compressed relationship matrix and the intermediate matrix to obtain a target matrix, where the specified matrix operations also include the second matrix operation.

According to another aspect of the embodiments of the present application, a computer-readable storage medium is provided. The computer-readable storage medium includes a stored program, where the program, when run, executes the steps of any one of the above method embodiments.

According to another aspect of the embodiments of the present application, an electronic device is provided, including a memory and a processor, where a computer program is stored in the memory and the processor is configured to execute the steps of any one of the above method embodiments through the computer program.

According to another aspect of the present application, a computer program product is provided, including a computer program, where the computer program, when executed by a processor, implements the steps of any one of the above method embodiments.

Through the embodiments of the present application, the matrix computation order is adjusted and the data in the external memory is read in address order. Since both matrix operations after reordering are sparse-dense matrix multiplications, the sparsity of the sparse matrices can be exploited to compress them before storing them in the external memory, reducing storage space. Since the data in the external memory is read sequentially in storage-address order, the memory wall problem is alleviated and data reading efficiency is improved, solving the problem of low acceleration efficiency in graph neural network acceleration methods in the related art.

Brief Description of the Drawings

FIG. 1 is a flow chart of a graph neural network acceleration method according to an embodiment of the present application.

FIG. 2 is a schematic diagram of aggregation and combination update according to an embodiment of the present application.

FIG. 3 is a schematic diagram of HyGCN (a graph convolutional network accelerator with a hybrid architecture) according to an embodiment of the present application.

FIG. 4 is a schematic diagram of COO compression according to an embodiment of the present application.

FIG. 5 is a schematic diagram of replacing row addresses with end-of-row flags according to an embodiment of the present application.

FIG. 6 is a schematic diagram of a graph neural network structure according to an embodiment of the present application.

FIG. 7 is a schematic diagram of the two matrix computations in a single hidden layer according to an embodiment of the present application.

FIG. 8 is another schematic diagram of the two matrix computations in a single hidden layer according to an embodiment of the present application.

FIG. 9 is a schematic diagram of matrix row multiplication according to an embodiment of the present application.

FIG. 10 is a schematic diagram of a row convolution computation according to an embodiment of the present application.

FIG. 11 is a schematic diagram of multiply-add units and caches according to an embodiment of the present application.

FIG. 12 is a schematic diagram of XW row multiplication according to an embodiment of the present application.

FIG. 13 is a schematic diagram of an XW storage and computation scheme according to an embodiment of the present application.

FIG. 14 is a schematic diagram of a graph neural network acceleration structure according to an embodiment of the present application.

FIG. 15 is a structural block diagram of a computer system of an optional electronic device provided in an embodiment of the present application.

Detailed Description

The embodiments of the present application will be described in detail below with reference to the accompanying drawings and in combination with the embodiments.

It should be noted that the terms "first", "second", etc. in the description and claims of the embodiments of the present application and in the above drawings are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence.

Unless otherwise defined, all technical and scientific terms used herein have the same meanings as commonly understood by those skilled in the art to which this application belongs. The terms used herein are only for the purpose of describing the embodiments of this application and are not intended to limit this application.

This embodiment provides a graph neural network acceleration method. FIG. 1 is a flow chart of a graph neural network acceleration method according to an embodiment of the present application. As shown in FIG. 1, the process includes the following steps:

Step S102: sequentially read a compressed feature matrix from an external memory according to storage addresses, where the compressed feature matrix is obtained by compressing a specified feature matrix corresponding to the graph structure data.

The graph neural network acceleration method in this embodiment can be applied to graph neural networks. A graph neural network is a deep learning model for processing graph structure data; it solves tasks such as node classification, graph classification, and link prediction by propagating information and learning features over the graph structure. Graph neural networks can process various types of graphs, including undirected graphs, directed graphs, weighted graphs, and multigraphs.

A graph convolutional neural network is a specific type of graph neural network, mainly applied in artificial intelligence fields where the input data is correlated, such as interpersonal relationships and video recommendation. Compared with other artificial intelligence networks, the unique feature of a graph convolutional neural network is that each layer requires two stages of matrix computation, namely aggregation (AX) and combination update (AXW). The two matrix computations differ greatly in the sparsity of their input data, their cache requirements, and the reusability of their data.

As shown in FIG. 2, the feature matrix X (the raw feature data) may be sparse or dense depending on the scenario, and may contain hundreds of millions of points. Although an overly large matrix can be partitioned, the partitions must not be too small, or the correlations between data will be lost. The relationship matrix A is an extremely sparse matrix whose size can be hundreds of millions × hundreds of millions. The weight matrix W is a dense matrix with a small amount of data and can be cached on chip.

Since the computational data sparsity and cache requirements of the two stages are extremely different, graph neural network acceleration structures usually adopt one of the following two structures.

1. A single-core acceleration structure such as a general-purpose GPU. The advantage is that only a single type of acceleration unit is required. However, since it targets general-purpose computing, it cannot fully exploit the characteristics of graph neural networks or optimize as a whole for the different resource requirements of the two computation steps, resulting in a serious waste of memory and data bandwidth.

2. Two independent hardware computing units, such as HyGCN in FIG. 3. The advantage is that each of the two stages can be optimized separately. The problem is that the loads on the two accelerators are uneven, so one accelerator may be heavily loaded while the other waits, wasting hardware resources.

To reduce hardware resources, this embodiment uses a single accelerator structure compatible with both matrix computation steps. To exploit matrix sparsity and reduce the data bandwidth requirements of the external memory, the computation order is adjusted to first compute XW and then compute A(XW), so that both matrix computation steps are multiplications of a sparse matrix by a dense matrix.
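
Since matrix multiplication is associative, (AX)W = A(XW). A brief sketch of the reordering rationale follows, using the node and feature counts given later in this description (the flop-count helper is an illustrative assumption, not part of the patent):

```python
def matmul_flops(m, k, n):
    # Nominal multiply count for an (m x k) @ (k x n) dense product.
    return m * k * n

N, F, C = 370_000, 128, 256   # nodes, input features, output features

ax_then_w = matmul_flops(N, N, F) + matmul_flops(N, F, C)   # (A X) W
xw_then_a = matmul_flops(N, F, C) + matmul_flops(N, N, C)   # A (X W)
# Dense counts alone would favor (A X) W here, but A is >99% zeros:
# computing X W first makes BOTH steps sparse-times-dense products, so the
# zero entries of A and X can be compressed away and never read from memory.
```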

A sparse matrix is a matrix in which most elements are zero; it usually occupies a large amount of storage space, so sparse matrices can be compressed during processing. In this embodiment, taking the specified feature matrix X as a sparse matrix as an example, the compressed feature matrix is obtained by compressing the specified feature matrix corresponding to the graph structure data. The data compression methods include, but are not limited to, at least one of the following (a brief COO/CSR sketch follows this list):

Coordinate list (COO) format: this format stores the row indices, column indices, and values of the non-zero elements of the matrix. It is suitable for preliminary compression of sparse matrices because only non-zero elements are stored.

Compressed sparse row (CSR) format: the CSR format represents a sparse matrix as three arrays: an array containing the non-zero elements, an array containing the row start positions, and an array containing the column indices. It is suitable for fast matrix-vector multiplication and matrix transpose operations.

Compressed sparse column (CSC) format: the CSC format is similar to CSR, but stores arrays of column indices and column start positions, which makes column-vector multiplication and matrix transpose operations more efficient.

Triplet format: this format is similar to the COO format, but is usually used during the construction phase of a matrix because new elements can be added at any time.

Block compressed sparse row (BSR) format: BSR is a variant of CSR that divides the matrix into equally sized sub-blocks and stores each sub-block as one element. This format is suitable for multi-core and vector processors and can reduce cache misses.

Hash table: a hash table can store the non-zero elements of a sparse matrix in a dynamic data structure, which can improve the efficiency of element insertion and lookup.

Binary decision diagram (BDD): certain sparse matrices, such as binary matrices, can be stored using a binary decision diagram; this data structure can efficiently compress and operate on sparse matrices.

Sparse matrix packages: mathematical packages such as LAPACK, BLAS, and SPARSKIT provide a variety of algorithms for sparse matrix storage and manipulation.

Numerical compression: some specific sparse matrices, such as matrices with particular patterns, can be compressed using numerical methods, for example by reducing the rank of the matrix through singular value decomposition (SVD).
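
As a brief illustration of the first two formats in the list above, the following scipy sketch builds a small COO matrix and converts it to CSR (the example matrix is illustrative, not the one in the figures):

```python
import numpy as np
from scipy.sparse import coo_matrix

# A small 4x4 matrix: only the non-zero values need to be stored.
dense = np.array([[5, 0, 0, 0],
                  [0, 8, 0, 0],
                  [0, 0, 3, 0],
                  [0, 6, 0, 0]])

coo = coo_matrix(dense)
print(coo.row)     # row indices of the non-zeros: [0 1 2 3]
print(coo.col)     # column indices:               [0 1 2 1]
print(coo.data)    # values:                       [5 8 3 6]

csr = coo.tocsr()  # CSR: values, column indices, and row start offsets
print(csr.indptr)  # row start positions:          [0 1 2 3 4]
```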

The compressed feature matrix can be stored in an external memory; external memory refers to off-chip memory such as DDR. If the external memory is read sequentially by address (0/1/2/3...), the read rate can reach the upper limit of the read speed; if reads jump between addresses, the read rate is halved; and if every data item jumps to a new address, the read rate drops to 10% of the maximum.

To increase the data reading speed and alleviate the memory wall phenomenon, the compressed feature matrix can be read sequentially from the external memory according to storage addresses. Sequential reading lets the memory controller read data in order of increasing addresses and make greater use of the memory's internal prefetch mechanism, improving the data transfer rate. During sequential reading, the memory can preload the data about to be accessed into the cache, reducing access latency and improving read efficiency. Sequential reading also reduces random jumps inside the memory, so that all parts of the memory are used more evenly, improving overall memory utilization.

Step S104: perform a first matrix operation on the read compressed feature matrix and the weight matrix to obtain an intermediate matrix, where the specified matrix operations include the first matrix operation.

Here, the first matrix operation may refer to multiplying the read compressed feature matrix X by the weight matrix W to obtain the intermediate matrix, that is, computing XW.

Step S106: sequentially read a compressed relationship matrix from the external memory according to storage addresses, where the compressed relationship matrix is obtained by compressing a specified relationship matrix corresponding to the graph structure data, and the specified relationship matrix describes the relationships between nodes in the graph structure data.

Graph structure data consists of nodes (vertices) and edges and can represent complex relationships and interactions. In a graph convolutional neural network, the relationship matrix is a key component used to capture the relationships between nodes in the graph. The specified relationship matrix A is very sparse. Similarly to the preceding embodiments, to exploit its sparsity, the specified relationship matrix corresponding to the graph structure data can be compressed to obtain the compressed relationship matrix. To increase the data reading speed, the compressed relationship matrix can be read sequentially from the external memory according to storage addresses.

Step S108: perform a second matrix operation on the read compressed relationship matrix and the intermediate matrix to obtain a target matrix, where the specified matrix operations also include the second matrix operation.

Here, the second matrix operation may refer to multiplying the read compressed relationship matrix A by the intermediate matrix XW to obtain the target matrix, that is, computing A(XW).

Through the above steps of the embodiments provided by the present application, a compressed feature matrix is read sequentially from the external memory according to storage addresses, where the compressed feature matrix is obtained by compressing a specified feature matrix corresponding to the graph structure data; a first matrix operation is performed on the read compressed feature matrix and the weight matrix to obtain an intermediate matrix, where the specified matrix operations include the first matrix operation; a compressed relationship matrix is read sequentially from the external memory according to storage addresses, where the compressed relationship matrix is obtained by compressing a specified relationship matrix corresponding to the graph structure data, and the specified relationship matrix describes the relationships between nodes in the graph structure data; and a second matrix operation is performed on the read compressed relationship matrix and the intermediate matrix to obtain a target matrix, where the specified matrix operations also include the second matrix operation. This solves the problem of low acceleration efficiency in graph neural network acceleration methods in the related art.

In an exemplary embodiment, the compressed feature matrix and the compressed relationship matrix are both stored in the external memory by row, and the external memory also stores first end-of-row flags corresponding to the elements of the compressed feature matrix and second end-of-row flags corresponding to the elements of the compressed relationship matrix, where a first end-of-row flag identifies whether the corresponding element is the end of a row vector, and a second end-of-row flag identifies whether the corresponding element is the end of a row vector.

Sequentially reading the compressed feature matrix from the external memory according to storage addresses includes:

S11: sequentially read the elements of the compressed feature matrix from the external memory according to storage addresses until an element whose first end-of-row flag has the first flag value is read, obtaining a first target row vector, where the first flag value identifies the corresponding element as the end of a row vector, and the first target row vector is a row vector of the compressed feature matrix. Sequentially reading the compressed relationship matrix from the external memory according to storage addresses includes:

S12: sequentially read the elements of the compressed relationship matrix from the external memory according to storage addresses until an element whose second end-of-row flag has the second flag value is read, obtaining a second target row vector, where the second flag value identifies the corresponding element as the end of a row vector, and the second target row vector is a row vector of the compressed relationship matrix.

This embodiment takes compressing a sparse matrix in COO format as an example. The standard COO format, shown in FIG. 4, stores the non-zero elements (values), row coordinates (row indices), and column coordinates (column indices), so three off-chip storage spaces are required to store the data, the row coordinates, and the column coordinates. To save COO storage space, the data format that records the row address of each data item is changed to record whether the data item is the last item of its row. Because COO has already removed zero-valued data, all data items and their corresponding row and column addresses are valid. Therefore, in this scheme, the row address is replaced with an end-of-row flag: 0 if the element is not the last column (end of row) of some row, and 1 if it is. This not only meets the computation requirements but also reduces the storage space needed for row addresses to 1/32 (an address is 32 bits, while the end-of-row flag needs only 1 bit), as shown in FIG. 5. Here, a byte usually consists of 8 binary digits and is used to store and represent data.
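
The conversion from COO row indices to 1-bit end-of-row flags can be sketched as follows (a hypothetical helper, assuming the elements are stored row by row as described above):

```python
def coo_rows_to_eor_flags(row_indices):
    """Replace COO row indices with 1-bit end-of-row flags.

    A flag is 1 if the element is the last stored (non-zero) element of its
    row, else 0: 1 bit per element instead of a 32-bit row address.
    """
    flags = []
    for i, r in enumerate(row_indices):
        is_last = (i == len(row_indices) - 1) or (row_indices[i + 1] != r)
        flags.append(1 if is_last else 0)
    return flags

# Row indices of the non-zeros, stored row by row (illustrative):
print(coo_rows_to_eor_flags([0, 0, 1, 2, 2, 2]))  # [0, 1, 1, 0, 0, 1]
```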

Storing both the compressed feature matrix and the compressed relationship matrix in the external memory by row may mean that both matrices are stored in the external memory in row-coordinate order; that is, the elements of the same row are stored consecutively in the external memory, and the flag value of the end-of-row flag identifies whether the corresponding element is the last non-zero element of its row.

For the compressed feature matrix, multiple corresponding off-chip storage spaces can be used to separately store the data (non-zero elements) and the first end-of-row flags corresponding to the non-zero elements; that is, one off-chip storage space stores the non-zero elements of the compressed feature matrix, and another stores the corresponding first end-of-row flags.

The elements of the compressed feature matrix are read sequentially from the external memory according to storage addresses until an element whose first end-of-row flag has the first flag value is read, obtaining the first target row vector, where the first flag value identifies the corresponding element as the end of a row vector, the first target row vector is a row vector of the compressed feature matrix, and the first target row vector corresponds to the non-zero elements of one row vector of the specified feature matrix.

For the compressed relationship matrix, multiple corresponding off-chip storage spaces can likewise be used to separately store its data (non-zero elements) and the second end-of-row flags corresponding to those non-zero elements; that is, one off-chip storage space stores the non-zero elements of the compressed relationship matrix, and another stores the corresponding second end-of-row flags.

The elements of the compressed relationship matrix are read sequentially from the external memory according to storage addresses until an element whose second end-of-row flag has the second flag value is read, obtaining the second target row vector, where the second flag value identifies the corresponding element as the end of a row vector, the second target row vector is a row vector of the compressed relationship matrix, and the second target row vector corresponds to the non-zero elements of one row vector of the specified relationship matrix.
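
A minimal sketch of this sequential, flag-delimited row reading (the generator and its inputs are illustrative; in the accelerator the values and flags come from separate off-chip storage spaces):

```python
def read_rows(values, col_indices, eor_flags):
    """Yield one compressed row vector at a time from sequential storage.

    Elements are consumed in storage-address order; a row is complete as soon
    as the end-of-row flag of the current element is 1.
    """
    row = []
    for v, c, f in zip(values, col_indices, eor_flags):
        row.append((c, v))   # (column coordinate, value) pairs of the row
        if f == 1:
            yield row
            row = []

# Matches the 4x4 COO example above: one non-zero per row.
for row in read_rows([5, 8, 3, 6], [0, 1, 2, 1], [1, 1, 1, 1]):
    print(row)
```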

Through this embodiment, replacing the row coordinates (row indices) of the standard COO format with end-of-row flags that identify whether the corresponding non-zero element is the last non-zero element of its row reduces the storage space.

In an exemplary embodiment, the compressed feature matrix and the compressed relationship matrix are both stored in the external memory by row and read by row; the row vector read from the compressed feature matrix is the first target row vector, and the row vector read from the compressed relationship matrix is the second target row vector.

Performing the first matrix operation on the read compressed feature matrix and the weight matrix to obtain the intermediate matrix includes:

S21: use the read first target row vector as the first current row vector and perform the following first operation to obtain the row vector of the intermediate matrix corresponding to the first target row vector: multiply the m-th element of the first current row vector by the m-th row vector of the weight matrix, respectively, to obtain n first reference row vectors, where m is less than or equal to the total number of columns of the specified feature matrix and n is the number of elements contained in the first current row vector; then perform a column-wise addition operation on the elements of the n first reference row vectors to obtain the row vector of the intermediate matrix corresponding to the first current row vector, where the column-wise addition operation adds up the elements in the same column.

Performing the second matrix operation on the read compressed relationship matrix and the intermediate matrix to obtain the target matrix includes:

S22: use the read second target row vector as the second current row vector and perform the following second operation to obtain the row vector of the target matrix corresponding to the second target row vector: multiply the p-th element of the second current row vector by the p-th row vector of the intermediate matrix, respectively, to obtain q second reference row vectors, where p ≤ the total number of columns of the specified relationship matrix and q is the number of elements contained in the second current row vector; then perform a column-wise addition operation on the elements of the q second reference row vectors to obtain the row vector of the target matrix corresponding to the second current row vector, where the column-wise addition operation adds up the elements in the same column.

In this embodiment, with reference to FIG. 6, take as an example a graph neural network with 4 hidden layers, each of which must perform the two matrix computations of aggregation and combination update. The two matrix computations in a single hidden layer can be as shown in FIG. 7. For ease of explanation, assumptions are made about all data sizes: assume the graph features are partitioned into 370,000 points, each with 128 features, so the original feature matrix X is 370,000 × 128. The weight matrices W of the 4 layers are 128 × 256, 256 × 512, 512 × 1024, and 1024 × 176, respectively. The relationship matrix A is the square of the number of points in size, that is, 370,000 × 370,000.

For the matrix computation XW or A(XW), take A(XW) with W of size 128 × 256 (the first layer) as an example; the two matrix computations of the first hidden layer are shown in FIG. 8.

If column multiplication is used, that is, dot products of the rows of A with the columns of (XW), then parallel computation requires 370,000 multiply-add PE units, and each row of data in the A matrix must be read 256 times to complete the computation of one row. The memory bandwidth requirement is too large and data reuse is not considered; with serial computation, it takes too long.

If column-by-row multiplication is used, that is, matrix multiplication of the columns of A by the rows of (XW), the intermediate results that must be cached are 370,000 × 256 in size, which cannot be cached on chip and would require additional external memory read/write overhead.

Based on this, this embodiment uses row multiplication: each point of a row of A is multiplied by the corresponding row of (XW) and the results are accumulated. The two matrix computations of the hidden layer are shown in FIG. 9.

Since most of the data in A is sparse, very little data needs to participate in the computation. As shown in FIG. 9, to compute the intermediate result for the first row of the output, it suffices to multiply-add element a of the A matrix with the corresponding row of the weights W, and to accumulate the multiply-add result of element b with its corresponding weight row; this yields the first row of the final result, as shown in FIG. 10.

Based on this, none of the multiply-add units performs useless zero-value computation due to data sparsity, and the required cache size is only one row of the computation result (256 elements), which can be kept entirely on chip without off-chip storage units. Moreover, the computation only needs to read the A matrix data once, which reduces the read/write bandwidth requirements of the whole model. In this embodiment, since the largest weight matrix has 1024 columns, 1024 acceleration PE units and 1024 cache units are needed to cover the maximum parallel computation requirement; the multiply-add units and caches are shown in FIG. 11.

Specifically, with reference to FIG. 12, the XW computation step can use row multiplication, which can include the following steps:

Use the read first target row vector as the first current row vector and perform the first operation to obtain the row vector of the intermediate matrix corresponding to the first target row vector: multiply the m-th element of the first current row vector by the m-th row vector of the weight matrix, respectively, to obtain n first reference row vectors, where m is less than or equal to the total number of columns of the specified feature matrix and n is the number of elements contained in the first current row vector; then perform a column-wise addition operation on the elements of the n first reference row vectors to obtain the row vector of the intermediate matrix corresponding to the first current row vector, where the column-wise addition operation adds up the elements in the same column.

Specifically, the row multiplication for A(XW) can include the following steps:

Use the read second target row vector as the second current row vector and perform the second operation to obtain the row vector of the target matrix corresponding to the second target row vector: multiply the p-th element of the second current row vector by the p-th row vector of the intermediate matrix, respectively, to obtain q second reference row vectors, where p ≤ the total number of columns of the specified relationship matrix and q is the number of elements contained in the second current row vector; then perform a column-wise addition operation on the elements of the q second reference row vectors to obtain the row vector of the target matrix corresponding to the second current row vector, where the column-wise addition operation adds up the elements in the same column.
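
A minimal Python sketch of the row multiplication used for both steps, XW and A(XW) (the helper name and toy matrices are illustrative assumptions; in the accelerator the inner loop runs on parallel multiply-add PE units):

```python
import numpy as np

def row_multiply(sparse_rows, dense):
    """Row multiplication: for each compressed row, scale the matching rows
    of the dense right-hand matrix and accumulate column-wise into one
    output row.

    sparse_rows: list of rows, each a list of (column, value) pairs.
    dense:       dense right-hand matrix (W, or the intermediate XW).
    """
    out = np.zeros((len(sparse_rows), dense.shape[1]))
    for i, row in enumerate(sparse_rows):
        acc = np.zeros(dense.shape[1])   # one output row, buffered on chip
        for col, val in row:             # only non-zeros participate
            acc += val * dense[col]      # scale the matching dense row
        out[i] = acc                     # write back, then reuse the buffer
    return out

# XW first, then A(XW), both as sparse-times-dense row multiplications:
X_rows = [[(0, 2.0), (2, 1.0)], [(1, 3.0)]]   # compressed X (illustrative)
W = np.arange(12, dtype=float).reshape(3, 4)  # dense weights
XW = row_multiply(X_rows, W)
A_rows = [[(0, 1.0), (1, 1.0)], [(1, 1.0)]]   # compressed A (illustrative)
out = row_multiply(A_rows, XW)
```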

Through this embodiment, the matrix product is obtained by row multiplication: the non-zero elements of a row of the left matrix are read in parallel and multiplied by the corresponding rows of the right matrix, and the products are accumulated column-wise. This can be done entirely on chip without off-chip storage units, reducing the read/write bandwidth requirements of the whole model.

In an exemplary embodiment, before the read first target row vector is used as the first current row vector to perform the first operation, the method further includes: caching the read first target row vector, where the first operation is performed on the cached first target row vector.

After the read first target row vector is used as the first current row vector to perform the first operation, the method further includes:

S31: store the row vector of the obtained intermediate matrix corresponding to the first target row vector into a first memory, where the first memory stores the row vectors of the intermediate matrix and is an on-chip memory of the processing chip in which the graph neural network accelerator is integrated; and clear the cached first target row vector.

Before the read second target row vector is used as the second current row vector to perform the second operation, the method further includes:

S32: cache the read second target row vector, where the second operation is performed on the cached second target row vector.

S33: after the read second target row vector is used as the second current row vector to perform the second operation, the method further includes: storing the row vector of the obtained target matrix corresponding to the second target row vector into a second memory, where the second memory stores the row vectors of the target matrix and is an on-chip memory of the processing chip in which the graph neural network accelerator is integrated; and clearing the cached second target row vector.

Before the read first target row vector is used as the first current row vector to perform the first operation, the first target row vector is read row by row and cached; the first operation is performed on the cached first target row vector, that is, each element of the first target row vector is multiplied by the corresponding row vector of the weight matrix. Here, the column coordinate of each multiplied element of the first target row vector equals the row coordinate of the corresponding row vector of the weight matrix, and the column coordinate of an element of the first target row vector equals the column coordinate of that non-zero element in the specified feature matrix.

After the read first target row vector is used as the first current row vector to perform the first operation, the row vector of the obtained intermediate matrix corresponding to the first target row vector is stored into the first memory, where the first memory stores the row vectors of the intermediate matrix and is an on-chip memory of the processing chip in which the graph neural network accelerator is integrated.

After the computation of the current row vector of the compressed feature matrix is completed, to facilitate cyclically computing the next row vector of the compressed feature matrix, the cached first target row vector is cleared, so that the next row vector of the compressed feature matrix can be stored in the cache and the first operation can be performed on it.

Similarly to the preceding embodiment, before the read second target row vector is used as the second current row vector to perform the second operation, the read second target row vector is cached, where the second operation is performed on the cached second target row vector.

After the read second target row vector is used as the second current row vector to perform the second operation, the row vector of the obtained target matrix corresponding to the second target row vector is stored into the second memory, where the second memory stores the row vectors of the target matrix and is an on-chip memory of the processing chip in which the graph neural network accelerator is integrated; and the cached second target row vector is cleared.

Through this embodiment, row vectors are read and cached row by row based on the end-of-row flags of the sparse matrix, matrix row multiplication is performed row by row, and after the computation of a row is completed, the cache is cleared so that the next row vector can be read and cached. This increases the data reading speed and further improves the inference acceleration efficiency.

In an exemplary embodiment, before the first matrix operation is performed on the read compressed feature matrix and the weight matrix, the method further includes:

S41: store the weight matrix by column into R third memories, where R is the total number of columns of the weight matrix, each third memory stores one column vector of the weight matrix, and the third memories are on-chip memories of the processing chip in which the graph neural network accelerator is integrated.

S42: before the m-th element of the first current row vector is multiplied by the m-th row vector of the weight matrix, the method further includes: reading in parallel the elements at the m-th position of the column vectors stored in O third memories to obtain the m-th row vector of the weight matrix.

Since the W matrix is very small, it is suitable for storage in on-chip memory. To meet the parallel computation requirement of outputting a row of W in parallel for computation, W can be stored by column into multiple RAM caches; the storage and computation scheme is shown in FIG. 13.

When the graph neural network includes multiple hidden layers and different hidden layers correspond to weight matrices of different sizes, to meet the computation requirements, the number of on-chip memories used to store the weight matrices can be designed to match the number of columns of the weight matrix with the most columns among the weight matrices of the graph neural network.

For example, in this embodiment, where the graph neural network has 4 hidden layers and the weights W of the 4 layers are 128 × 256, 256 × 512, 512 × 1024, and 1024 × 176, the largest weight matrix has 1024 columns, so 1024 on-chip memories can be designed to store the weight matrices by column.

Optionally, after the weight matrix corresponding to the current hidden layer is determined, and before the m-th element of the first current row vector is multiplied by the m-th row vector of the weight matrix, the elements at the m-th position of the column vectors stored in O third memories are read in parallel to obtain the m-th row vector of the weight matrix. Here, O is the number of columns of the weight matrix corresponding to the current hidden layer, and O may be less than or equal to R; the elements at the m-th position of the column vectors stored in the O third memories form the m-th row vector of the weight matrix, where m ranges from 1 to the number of rows of the weight matrix corresponding to the current hidden layer.
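
A minimal sketch of the column-banked weight storage and the parallel read of row m (the bank layout and helper name are illustrative assumptions):

```python
import numpy as np

W = np.random.rand(128, 256)   # layer-1 weights (sizes from the example above)

# Store W by column: one bank (RAM cache) per column, up to the 1024-column
# maximum across all layers.
banks = [W[:, j].copy() for j in range(W.shape[1])]

def read_weight_row(banks, m, num_cols):
    """Read element m of the first num_cols banks in parallel; together they
    form row m of the current layer's weight matrix (num_cols = O <= R)."""
    return np.array([banks[j][m] for j in range(num_cols)])

row_5 = read_weight_row(banks, 5, W.shape[1])
assert np.allclose(row_5, W[5])   # equals row 5 of W
```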

Since on-chip memory is integrated with the processor chip, its access speed is usually faster than that of off-chip memory; off-chip memory must communicate with the processor over a bus, which increases access latency.

Through this embodiment, storing the dense weight matrix by column in on-chip memory increases the data reading speed; the larger sparse matrix only needs to be read once while the dense matrix is reused, reducing the data read bandwidth requirement.

In an exemplary embodiment, performing the column-wise addition operation on the elements of the n first reference row vectors to obtain the row vector of the intermediate matrix corresponding to the first current row vector includes:

S51: after each first reference row vector is obtained, perform a column-wise addition operation on the currently obtained first reference row vector and the cached first intermediate row vector to obtain a first vector addition result, where the first intermediate row vector records the row vector obtained by performing the column-wise addition operation on the first reference row vectors obtained so far.

S52: use the first vector addition result to update the cached first intermediate row vector to obtain an updated first intermediate row vector, where the row vector of the intermediate matrix corresponding to the first current row vector is the first intermediate row vector after the last update.

Performing the column-wise addition operation on the elements of the q second reference row vectors to obtain the row vector of the target matrix corresponding to the second current row vector includes:

S53: after each second reference row vector is obtained, perform a column-wise addition operation on the currently obtained second reference row vector and the cached second intermediate row vector to obtain a second vector addition result, where the second intermediate row vector records the row vector obtained by performing the column-wise addition operation on the second reference row vectors obtained so far.

S54: use the second vector addition result to update the cached second intermediate row vector to obtain an updated second intermediate row vector, where the row vector of the target matrix corresponding to the second current row vector is the second intermediate row vector after the last update.

分别将第一当前行向量的第m个元素与权重矩阵的第m个行向量相乘,得到n个第一参考行向量,对n个第一参考行向量的元素执行按列相加操作,得到中间矩阵中与第一当前行向量对应的行向量可以是逐步累加的。The m-th element of the first current row vector is multiplied by the m-th row vector of the weight matrix to obtain n first reference row vectors, and the column-by-column addition operation is performed on the elements of the n first reference row vectors to obtain the row vectors in the intermediate matrix corresponding to the first current row vector, which can be gradually accumulated.

例如，在本实施例中，以第一当前行向量包括a、b两个非零元素、权重矩阵中的对应行向量分别为(a1,......,at)、(b1,......,bt)为例：第一个第一参考行向量a(a1,......,at)先被缓存为第一中间行向量；在得到第二个第一参考行向量b(b1,......,bt)之后，对当前得到的第一参考行向量b(b1,......,bt)与缓存的第一中间行向量a(a1,......,at)执行按列相加操作，得到第一向量相加结果：For example, in this embodiment, taking the case where the first current row vector includes two non-zero elements a and b and the corresponding row vectors in the weight matrix are (a1, ..., at) and (b1, ..., bt) respectively: the first reference row vector a(a1, ..., at) is first cached as the first intermediate row vector; after the second reference row vector b(b1, ..., bt) is obtained, a column-wise addition operation is performed on the currently obtained reference row vector b(b1, ..., bt) and the cached first intermediate row vector a(a1, ..., at) to obtain the first vector addition result:

a(a1,......,at)+b(b1,......,bt)=(a*a1+b*b1,......,a*at+b*bt),

其中,第一中间行向量用于记录对已得到的第一参考行向量执行按列相加操作所得到的行向量;The first intermediate row vector is used to record a row vector obtained by performing a column-wise addition operation on the obtained first reference row vector;

使用第一向量相加结果(a*a1+b*b1,......,a*at+b*bt)更新缓存的第一中间行向量a(a1,......,at),得到更新后的第一中间行向量(a*a1+b*b1,......,a*at+b*bt),其中,中间矩阵中与第一当前行向量对应的行向量为最后一次更新后的第一中间行向量;Using the first vector addition result (a*a1 +b*b1 , ..., a*at +b*bt ) to update the cached first intermediate row vector a(a1 , ...,at ) to obtain an updated first intermediate row vector (a*a1 +b*b1 , ...,a*at +b*bt ), wherein the row vector in the intermediate matrix corresponding to the first current row vector is the first intermediate row vector after the last update;

与对n个第一参考行向量的元素执行按列相加操作,得到中间矩阵中与第一当前行向量对应的行向量类似,对q个第二参考行向量的元素执行按列相加操作,得到目标矩阵中与第二当前行向量对应的行向量也可以是逐步累加的。Similar to performing a column-wise addition operation on the elements of the n first reference row vectors to obtain a row vector in the intermediate matrix corresponding to the first current row vector, performing a column-wise addition operation on the elements of the q second reference row vectors to obtain a row vector in the target matrix corresponding to the second current row vector can also be gradually accumulated.

在每得到一个第二参考行向量之后,对当前得到的第二参考行向量与缓存的第二中间行向量执行按列相加操作,得到第二向量相加结果,其中,第二中间行向量用于记录对已得到的第二参考行向量执行按列相加操作所得到的行向量;使用第二向量相加结果更新缓存的第二中间行向量,得到更新后的第二中间行向量,其中,目标矩阵中与第二当前行向量对应的行向量为最后一次更新后的第二中间行向量。After each second reference row vector is obtained, a column-wise addition operation is performed on the currently obtained second reference row vector and the cached second intermediate row vector to obtain a second vector addition result, wherein the second intermediate row vector is used to record the row vector obtained by performing the column-wise addition operation on the obtained second reference row vector; the second vector addition result is used to update the cached second intermediate row vector to obtain an updated second intermediate row vector, wherein the row vector in the target matrix corresponding to the second current row vector is the second intermediate row vector after the last update.
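
下面给出一个示意性的Python代码草图（仅用于帮助理解上述逐步累加过程，accumulate_row等名称为示例性假设），演示每得到一个参考行向量即与缓存的中间行向量按列相加并更新缓存的做法。A minimal Python sketch follows (provided only to illustrate the incremental accumulation described above; names such as accumulate_row are hypothetical), demonstrating how each newly obtained reference row vector is added column-wise to the cached intermediate row vector, which is then updated in place.

def accumulate_row(nonzeros, right_rows, width):
    """Incrementally accumulate one output row.
    nonzeros:   (value, row_index) pairs of one sparse left-matrix row
    right_rows: dense right matrix, indexable by row
    width:      number of columns of the right matrix"""
    intermediate = [0.0] * width                   # cached intermediate row vector
    for value, m in nonzeros:
        reference = [value * w for w in right_rows[m]]   # one reference row vector
        # column-wise add, then update the cache immediately
        intermediate = [acc + r for acc, r in zip(intermediate, reference)]
    return intermediate                            # result row after the last update

W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
# sparse row with a=2 at column 0 and b=1 at column 2 -> 2*W[0] + 1*W[2]
print(accumulate_row([(2.0, 0), (1.0, 2)], W, 2))  # [7.0, 10.0]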

可选地，在本实施例中，缓存单元的数量可以设计为大于或者等于最大权重列数，即，最大的权重矩阵列数。Optionally, in this embodiment, the number of cache units may be designed to be greater than or equal to the maximum number of weight columns, that is, the maximum number of columns among the weight matrices.

通过本实施例，通过逐步累加并更新缓存的中间结果（即第一中间行向量、第二中间行向量），可以减少中间结果占用的存储空间，避免在计算完左矩阵某一行的所有非零元素之后再统一相加，否则每一个非零元素与右矩阵对应行向量相乘的乘积都需要占用一定的存储空间。Through this embodiment, by gradually accumulating and updating the cached intermediate results (i.e., the first intermediate row vector and the second intermediate row vector), the storage space occupied by intermediate results can be reduced; this avoids summing only after all non-zero elements of a row of the left matrix have been processed, in which case the product of each non-zero element and the corresponding row vector of the right matrix would each occupy a certain amount of storage space.

在一个示例性实施例中,权重矩阵的数量为多个,图神经网络加速器包括计算引擎PE阵列,PE阵列包含按照一维方向排列的S个PE单元,S≥列数最多的权重矩阵所包含的列数;In an exemplary embodiment, the number of weight matrices is multiple, and the graph neural network accelerator includes a computing engine PE array, the PE array includes S PE units arranged in a one-dimensional direction, and S ≥ the number of columns included in the weight matrix with the largest number of columns;

对n个第一参考行向量的元素执行按列相加操作,得到中间矩阵中与第一当前行向量对应的行向量,包括:Performing a column-wise addition operation on the elements of the n first reference row vectors to obtain a row vector corresponding to the first current row vector in the intermediate matrix, including:

S61,使用S个PE单元中的至少部分PE单元分别对n个第一参考行向量的元素执行按列相加操作,得到中间矩阵中与第一当前行向量对应的行向量;S61, using at least some of the S PE units to perform column-wise addition operations on the elements of the n first reference row vectors, respectively, to obtain a row vector corresponding to the first current row vector in the intermediate matrix;

对q个第二参考行向量的元素执行按列相加操作,得到目标矩阵中与第二当前行向量对应的行向量,包括:Performing a column-wise addition operation on the elements of the q second reference row vectors to obtain a row vector corresponding to the second current row vector in the target matrix, including:

S62,使用S个PE单元中的至少部分PE单元分别对q个第二参考行向量的元素执行按列相加操作,得到目标矩阵中与第二当前行向量对应的行向量。S62, use at least some of the S PE units to perform column-wise addition operations on the elements of the q second reference row vectors, to obtain a row vector corresponding to the second current row vector in the target matrix.

图神经网络是一种用于处理图结构数据的深度学习模型,它能够捕捉图结构数据中节点之间的关系并学习节点的表示,隐藏层是图神经网络中的中间层,用于表示和处理图结构数据中的节点特征。图神经网络通常包含一个或多个隐藏层,每个隐藏层都负责处理和转换节点特征,以便在后续层中进行更高层次的抽象和学习;权重矩阵是图神经网络中的关键参数,用于定义节点特征之间的转换关系。每个隐藏层都有一个对应的权重矩阵,用于将输入节点特征映射到输出节点特征。权重矩阵的大小通常与节点特征的维度相关。在图神经网络的每个隐藏层中,节点特征会通过权重矩阵进行线性变换,然后通常还会应用非线性激活函数。这样的变换有助于捕捉节点间的复杂关系,提取有用的特征表示。图神经网络中的信息传递是通过邻接矩阵(也称关系矩阵)和权重矩阵共同实现的。邻接矩阵表示图结构数据中节点之间的连接关系,而权重矩阵则定义了节点特征之间的转换关系。通过将邻接矩阵与权重矩阵相乘,可以实现节点间的消息传递和特征聚合。在图神经网络的训练过程中,权重矩阵的参数需要通过梯度下降等优化算法进行调整,这个过程旨在最小化预测误差,使得模型能够更准确地学习图结构数据中的节点表示。A graph neural network is a deep learning model for processing graph-structured data. It can capture the relationship between nodes in graph-structured data and learn the representation of nodes. The hidden layer is the middle layer in the graph neural network, which is used to represent and process node features in graph-structured data. A graph neural network usually contains one or more hidden layers, each of which is responsible for processing and transforming node features so that higher-level abstraction and learning can be performed in subsequent layers; the weight matrix is a key parameter in the graph neural network, which is used to define the transformation relationship between node features. Each hidden layer has a corresponding weight matrix, which is used to map input node features to output node features. The size of the weight matrix is usually related to the dimension of the node features. In each hidden layer of the graph neural network, the node features are linearly transformed by the weight matrix, and then a nonlinear activation function is usually applied. Such a transformation helps capture the complex relationship between nodes and extract useful feature representations. Information transfer in the graph neural network is achieved through the adjacency matrix (also called the relationship matrix) and the weight matrix. The adjacency matrix represents the connection relationship between nodes in the graph-structured data, while the weight matrix defines the transformation relationship between node features. By multiplying the adjacency matrix with the weight matrix, message passing and feature aggregation between nodes can be achieved. During the training process of graph neural networks, the parameters of the weight matrix need to be adjusted through optimization algorithms such as gradient descent. This process aims to minimize the prediction error so that the model can learn the node representation in the graph structure data more accurately.

即,在图神经网络中,隐藏层负责处理和转换节点特征,而权重矩阵则定义了节点特征之间的转换关系。与前述实施例类似的,图神经网络可以有多个隐藏层,每个隐藏层对应一个权重矩阵,即,图神经网络中的权重矩阵的数量可以为多个,图神经网络加速器包括计算引擎PE阵列,PE阵列包含按照一维方向排列的S个PE单元,S≥列数最多的权重矩阵所包含的列数。为了尽可能不让计算阵列出现等待的情况,采用了PE单核计算性能较弱,但是数量较多的硬件架构方案(类似FPGA的DSP阵列,单片有9000多个计算阵列)。将PE阵列按照一维方向排好,PE计算单元的数量大于极限情况下的图神经网络的权重矩阵的行长度。That is, in the graph neural network, the hidden layer is responsible for processing and converting node features, while the weight matrix defines the conversion relationship between node features. Similar to the aforementioned embodiment, the graph neural network can have multiple hidden layers, each hidden layer corresponds to a weight matrix, that is, the number of weight matrices in the graph neural network can be multiple, and the graph neural network accelerator includes a computing engine PE array, and the PE array contains S PE units arranged in a one-dimensional direction, and S ≥ the number of columns contained in the weight matrix with the largest number of columns. In order to prevent the computing array from waiting as much as possible, a hardware architecture solution with weak single-core computing performance but a large number of PEs is adopted (similar to the DSP array of FPGA, with more than 9,000 computing arrays on a single chip). The PE array is arranged in a one-dimensional direction, and the number of PE computing units is greater than the row length of the weight matrix of the graph neural network in the extreme case.

对n个第一参考行向量的元素执行按列相加操作所使用的PE单元的数量与当前隐藏层对应的权重矩阵的列数一致,相应的,对q个第二参考行向量的元素执行按列相加操作所使用的PE单元的数量也与当前隐藏层对应的权重矩阵的列数一致。The number of PE units used to perform column-by-column addition operations on the elements of the n first reference row vectors is consistent with the number of columns of the weight matrix corresponding to the current hidden layer. Correspondingly, the number of PE units used to perform column-by-column addition operations on the elements of the q second reference row vectors is also consistent with the number of columns of the weight matrix corresponding to the current hidden layer.
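
下面给出一个示意性的Python代码草图（仅为示意模型，PE、row_multiply等名称为示例性假设），用于说明一维PE阵列中每个PE单元负责一个输出列的乘加累积方式。An illustrative Python sketch follows (a schematic model only; names such as PE and row_multiply are hypothetical), showing how each PE unit in the one-dimensional PE array performs multiply-accumulate for one output column.

class PE:
    """One processing element: multiply-accumulate for a single output column.
    A schematic software model; a hardware PE would do one MAC per clock cycle."""
    def __init__(self):
        self.acc = 0.0
    def mac(self, a, w):
        self.acc += a * w                 # fused multiply-accumulate
    def clear(self):
        self.acc = 0.0

def row_multiply(pe_array, nonzeros, right_rows):
    """Broadcast each non-zero element to all PEs; PE j accumulates column j."""
    for value, m in nonzeros:
        row = right_rows[m]
        for j, pe in enumerate(pe_array):
            pe.mac(value, row[j])
    out = [pe.acc for pe in pe_array]
    for pe in pe_array:                   # clear the accumulators at end of row
        pe.clear()
    return out

S = 2                                     # number of PE units >= max weight columns
pes = [PE() for _ in range(S)]
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(row_multiply(pes, [(2.0, 0), (1.0, 2)], W))   # [7.0, 10.0]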

通过本实施例,通过设计与图神经网络的权重矩阵的列数匹配的PE单元,可以实现按行读取稀疏矩阵的非零元素与稠密矩阵(右矩阵)的对应行相乘并累加,提高数据读取速度的同时不影响PE阵列的计算速度。Through this embodiment, by designing a PE unit that matches the number of columns of the weight matrix of the graph neural network, it is possible to read the non-zero elements of the sparse matrix row by row and multiply and accumulate them with the corresponding rows of the dense matrix (right matrix), thereby improving the data reading speed without affecting the calculation speed of the PE array.

虽然本申请描述的实施方式如上,但上述描述内容和定义只是为了便于理解本申请的实施方式,并非用以限定本申请,任何在不脱离本申请精神和范围前提下所做的修改与变化,特别是任何基于此算法的软件及硬件实现方案等均在本发明的保护范围之内。Although the implementation methods described in this application are as above, the above descriptions and definitions are only for the purpose of facilitating the understanding of the implementation methods of this application, and are not intended to limit this application. Any modifications and changes made without departing from the spirit and scope of this application, especially any software and hardware implementation solutions based on this algorithm, are within the scope of protection of the present invention.

需要说明的是,对于前述的各方法实施例,为了简单描述,故将其都表述为一系列的动作组合,但是本领域技术人员应该知悉,本申请并不受所描述的动作顺序的限制,因为依据本申请,某些步骤可以采用其他顺序或者同时进行。其次,本领域技术人员也应该知悉,说明书中所描述的实施例均属于优选实施例,所涉及的动作和模块并不一定是本申请所必须的。It should be noted that, for the above-mentioned method embodiments, for the sake of simplicity, they are all expressed as a series of action combinations, but those skilled in the art should be aware that the present application is not limited by the order of the actions described, because according to the present application, certain steps can be performed in other orders or simultaneously. Secondly, those skilled in the art should also be aware that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present application.

根据本申请实施例的另一方面，还提供了一种图神经网络的加速结构，该结构用于实现上述实施例中所提供的图神经网络的加速方法，已经进行过说明的不再赘述。如以下所使用的，术语"模块"可以是实现预定功能的软件和/或硬件的组合。尽管以下实施例所描述的装置较佳地以软件来实现，但是硬件，或者软件和硬件的组合的实现也是可能并被构想的。According to another aspect of the embodiments of the present application, an acceleration structure of a graph neural network is also provided. The structure is used to implement the acceleration method of the graph neural network provided in the above-mentioned embodiments; what has already been described will not be repeated here. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the following embodiments is preferably implemented in software, implementation in hardware, or a combination of software and hardware, is also possible and conceivable.

在本实施例中,图神经网络用于处理图结构数据,图神经网络包括隐藏层,隐藏层用于执行指定矩阵运算,图神经网络加速结构包括图神经网络加速器,图神经网络加速器用于实现指定矩阵运算;In this embodiment, the graph neural network is used to process graph structure data, the graph neural network includes a hidden layer, the hidden layer is used to perform specified matrix operations, and the graph neural network acceleration structure includes a graph neural network accelerator, and the graph neural network accelerator is used to implement specified matrix operations;

包括:图神经网络加速器,用于按照存储地址从外部存储器中顺序读取压缩特征矩阵,其中,压缩特征矩阵是对与图结构数据对应的指定特征矩阵进行数据压缩得到的;对读取的压缩特征矩阵与权重矩阵执行第一矩阵运算,得到中间矩阵,其中,指定矩阵运算包括第一矩阵运算;按照存储地址从外部存储器中顺序读取压缩关系矩阵,其中,压缩关系矩阵是对与图结构数据对应的指定关系矩阵进行数据压缩得到的,指定关系矩阵用于描述图结构数据中的节点之间的关系;对读取的压缩关系矩阵和中间矩阵执行第二矩阵运算,得到目标矩阵,其中,指定矩阵运算还包括第二矩阵运算。It includes: a graph neural network accelerator, which is used to sequentially read a compressed feature matrix from an external memory according to a storage address, wherein the compressed feature matrix is obtained by compressing a specified feature matrix corresponding to graph structure data; perform a first matrix operation on the read compressed feature matrix and a weight matrix to obtain an intermediate matrix, wherein the specified matrix operation includes a first matrix operation; sequentially read a compressed relationship matrix from an external memory according to a storage address, wherein the compressed relationship matrix is obtained by compressing a specified relationship matrix corresponding to the graph structure data, and the specified relationship matrix is used to describe the relationship between nodes in the graph structure data; perform a second matrix operation on the read compressed relationship matrix and the intermediate matrix to obtain a target matrix, wherein the specified matrix operation also includes a second matrix operation.

可选地,上述实施例中的各个方法步骤S202至步骤S208可以是由图神经网络的加速结构中的图神经网络加速器执行的,已经进行过说明的,在此不做赘述。Optionally, each method step S202 to step S208 in the above embodiment can be executed by a graph neural network accelerator in the acceleration structure of the graph neural network, which has been explained and will not be repeated here.

在一个示例性实施例中,以加速PE单元和缓存单元分别设计1024个为例,对本实施例所提出的图神经网络加速结构进行解释说明。In an exemplary embodiment, taking the example of designing 1024 acceleration PE units and 1024 cache units respectively, the graph neural network acceleration structure proposed in this embodiment is explained.

在本实施例中的图神经网络加速结构可以如图14所示,对图14中所包括的技术术语做出如下解释说明:The graph neural network acceleration structure in this embodiment can be shown in FIG14 , and the technical terms included in FIG14 are explained as follows:

PCIe-core:PCI Express(PCIe)核心是一种硬件接口,它允许计算机内部的设备通过高速串行通信链路进行连接。在GNN领域,GPU或其他加速器可能通过PCIe核心与系统的CPU连接,以便于数据传输和处理。PCIe-core: PCI Express (PCIe) core is a hardware interface that allows devices inside a computer to be connected via a high-speed serial communication link. In the field of GNN, a GPU or other accelerator may be connected to the system's CPU via a PCIe core to facilitate data transfer and processing.

PCIe BUS:PCIe总线(BUS)是一种点对点串行扩展总线标准,用于计算机硬件的高带宽连接。它主要用于连接中央处理单元(CPU)和外围设备,如图形处理单元(GPU)。PCIe BUS: PCIe bus (BUS) is a point-to-point serial expansion bus standard for high-bandwidth connections of computer hardware. It is mainly used to connect central processing units (CPUs) and peripheral devices such as graphics processing units (GPUs).

Local Bus:局部总线是连接计算机内各组件的通信线路。它通常用于连接CPU、内存和其他外围设备,如输入/输出控制器。局部总线的速度通常低于PCIe总线。Local Bus: A local bus is a communication line that connects components within a computer. It is usually used to connect the CPU, memory, and other peripherals such as input/output controllers. The speed of a local bus is usually lower than that of a PCIe bus.

DMA:直接内存访问(DMA)是一种允许硬件子系统在没有CPU介入的情况下直接读写系统内存的技术。在GNN领域,DMA可以用于加速数据传输,尤其是在处理大量图数据时。DMA: Direct memory access (DMA) is a technology that allows hardware subsystems to directly read and write system memory without CPU intervention. In the field of GNN, DMA can be used to accelerate data transfer, especially when processing large amounts of graph data.

PE array:处理元素(Processing Element,PE)阵列是一种用于并行计算的硬件架构。在GNN领域,PE阵列可以用于同时处理多个图节点,从而提高图神经网络的计算效率。PE array: Processing Element (PE) array is a hardware architecture for parallel computing. In the field of GNN, PE array can be used to process multiple graph nodes at the same time, thereby improving the computational efficiency of graph neural networks.

HBM2:高带宽存储器2(High Bandwidth Memory 2,HBM2)是一种高性能的3D堆叠DRAM技术，用于提供高速、高带宽的内存解决方案。HBM2可以用于提高GNN的内存访问速度，从而提高整体计算性能。HBM技术通过垂直堆叠多层DRAM芯片，并使用TSV(Through Silicon Via,穿透硅通孔)和μbumps技术实现与逻辑芯片的连接。这种堆叠和连接方式使得在小尺寸的封装中实现了高带宽和高传输速度，非常适合高性能人工智能服务器中的GPU显存需求。HBM技术的优点包括高带宽、低延迟和高密度，能够提供比传统DRAM更高的性能。使用HBM技术，可以有效减少处理器等待数据加载的时间，提高人工智能算力的利用率和计算效率。HBM2: High Bandwidth Memory 2 (HBM2) is a high-performance 3D stacked DRAM technology used to provide high-speed, high-bandwidth memory solutions. HBM2 can be used to increase the memory access speed of GNNs, thereby improving overall computing performance. HBM technology stacks multiple layers of DRAM chips vertically and uses TSV (Through Silicon Via) and μbumps technology to connect to logic chips. This stacking and connection method enables high bandwidth and high transmission speed in a small package, which is very suitable for GPU memory requirements in high-performance artificial intelligence servers. The advantages of HBM technology include high bandwidth, low latency and high density, which can provide higher performance than traditional DRAM. Using HBM technology, the time the processor waits for data to load can be effectively reduced, and the utilization and computing efficiency of artificial intelligence computing power can be improved.

位宽BRAM组:位宽(Bit Width)是指数据传输时一次可以传输的数据位数。块随机存取存储器(Block Random Access Memory,BRAM)是一种用于现场可编程门阵列(FPGA)的存储器资源。位宽BRAM组是指具有特定位宽的BRAM资源集合,可以用于存储GNN的中间数据或权重。Bit width BRAM group: Bit width refers to the number of data bits that can be transmitted at one time during data transmission. Block Random Access Memory (BRAM) is a memory resource used in Field Programmable Gate Arrays (FPGAs). Bit width BRAM group refers to a collection of BRAM resources with a specific bit width, which can be used to store intermediate data or weights of GNNs.

reg:"reg"是"register"的缩写，指的是寄存器。寄存器是一种高速存储器，用于存储指令、数据或地址。在GNN领域，寄存器可以用来存储图节点的特征、权重或其他中间计算结果。reg: "reg" is the abbreviation of "register", which refers to a register. A register is a high-speed memory used to store instructions, data, or addresses. In the field of GNN, registers can be used to store features, weights, or other intermediate calculation results of graph nodes.

Bias HBM2(High Bandwidth Memory 2):HBM2是一种高性能的3D堆叠DRAM(动态随机存取存储器)技术,常用于高性能计算系统。Bias HBM2可能是指在HBM2存储器中应用的某种偏置技术,用于改善性能或功耗。但这个术语不是一个标准的行业术语,可能是特定应用或设计中的一个概念。Bias HBM2 (High Bandwidth Memory 2): HBM2 is a high-performance 3D stacked DRAM (dynamic random access memory) technology commonly used in high-performance computing systems. Bias HBM2 may refer to a certain bias technology applied in HBM2 memory to improve performance or power consumption. However, this term is not a standard industry term and may be a concept in a specific application or design.

Data HBM2:这可能是指在HBM2存储器中存储的数据。HBM2是一种高速存储器技术,适用于需要大量数据传输的应用,如高性能计算、图形处理单元(GPU)和AI加速器。Data HBM2: This likely refers to data stored in HBM2 memory, a high-speed memory technology suitable for applications that require large amounts of data transfer, such as high-performance computing, graphics processing units (GPUs), and AI accelerators.

BRAM(Block RAM):在FPGA中,BRAM是一种内置的存储器资源,可以作为片上存储器使用。BRAM块可以配置为不同的存储器类型,如单端口或双端口RAM,FIFOs等。BRAM是FPGA设计中的一个关键组成部分,用于实现快速的数据存储和访问。BRAM (Block RAM): In FPGA, BRAM is a built-in memory resource that can be used as on-chip memory. BRAM blocks can be configured as different memory types, such as single-port or dual-port RAM, FIFOs, etc. BRAM is a key component in FPGA design for fast data storage and access.

Pong BRAM:通常用于描述FPGA中的一种特殊的BRAM配置。在某些FPGA架构中，两个BRAM块可以被配对形成一个更大的存储器，其中一个BRAM块被称为"Pong"。这种配置可以提高存储器的容量和性能。Pong BRAM: Often used to describe a special BRAM configuration in FPGAs. In some FPGA architectures, two BRAM blocks can be paired to form a larger memory, where one BRAM block is called the "Pong". This configuration can increase memory capacity and performance.

Ping BRAM:与Pong BRAM相对应,Ping BRAM是配对BRAM块中的另一个。在某些FPGA架构中,Ping和Pong BRAM可以用于实现更高效的数据传输,例如在数据流水线或双缓存系统中。Ping BRAM: The counterpart of Pong BRAM, Ping BRAM is the other in a paired BRAM block. In some FPGA architectures, Ping and Pong BRAM can be used to achieve more efficient data transfer, such as in data pipelining or double buffering systems.
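
下面给出一个示意性的Python代码草图（仅为双缓存机制的示意模型，PingPongBuffer等名称为示例性假设），用于说明上文所述Ping/Pong缓存交替读写的工作方式。An illustrative Python sketch follows (a schematic model of the double-buffering mechanism only; names such as PingPongBuffer are hypothetical), showing how the Ping/Pong buffers described above alternate between writing and reading.

class PingPongBuffer:
    """Schematic double-buffer model: the producer writes one bank while the
    consumer reads the other; swap() flips the two roles."""
    def __init__(self, size):
        self.banks = [[0] * size, [0] * size]
        self.write_bank = 0
    def write(self, i, v):
        self.banks[self.write_bank][i] = v
    def read(self, i):
        return self.banks[1 - self.write_bank][i]   # read from the other bank
    def swap(self):
        self.write_bank = 1 - self.write_bank

buf = PingPongBuffer(4)
buf.write(0, 42)     # stage the results of one computation pass into the ping bank
buf.swap()           # the next pass reads them while new results stream into pong
assert buf.read(0) == 42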

在本实施例中，采用单一计算单元兼容图神经网络的两步矩阵计算，采用先计算XW，再计算A(XW)的方式，保留图神经网络的数据稀疏性；两步矩阵计算都采用行乘的计算方式，可以最大限度地降低图神经网络最大数据的读写需求；乘加缓存的行乘计算硬件结构；权重数据的按列单独存储结构；权重数据的并行读取结构；关系矩阵COO压缩后的行尾标志1bit存储方式，降低存储需求；根据行尾标志进行数据的转存和缓存的清零控制方式；针对稀疏矩阵，无时间间隔连续有效数据计算方式；多行并行计算PE结构；无清零控制需求的流水线稀疏矩阵有效数计算；计算结果的乒乓缓存结构（双HBM）；feature矩阵缓存大小，只需要一行数据的大小；关系矩阵的缓存只需要不超过32个；最大的关系矩阵A，只需要读取一遍；第一步矩阵计算的中间计算结果缓存只需要一行数据大小；第二步矩阵计算的中间结果只需要读取一行数据；流水线自动屏蔽0值结果；复用片上的权重数据（数据量最小）；根据行尾标志来对片上计算缓存清零。In this embodiment, a single computing unit is used to be compatible with the two-step matrix calculation of the graph neural network, computing XW first and then A(XW), which preserves the data sparsity of the graph neural network; both matrix calculations use row multiplication, which minimizes the read/write requirements of the largest data of the graph neural network; a multiply-accumulate-cache hardware structure for row multiplication; a column-wise separate storage structure for weight data; a parallel read structure for weight data; a 1-bit storage scheme for the end-of-row flag after COO compression of the relationship matrix, reducing the storage requirement; a control scheme that transfers data and clears caches according to the end-of-row flag; for sparse matrices, a continuous valid-data calculation mode with no idle intervals; a multi-row parallel-computing PE structure; pipelined valid-element computation for sparse matrices with no clear-control requirement; a ping-pong cache structure for calculation results (dual HBM); the feature matrix cache only needs the size of one row of data; no more than 32 caches are needed for the relationship matrix; the largest relationship matrix A only needs to be read once; the intermediate result cache of the first-step matrix calculation only needs the size of one row of data; the intermediate result of the second-step matrix calculation only needs one row of data to be read; the pipeline automatically masks zero-value results; the on-chip weight data (the smallest data volume) is reused; and the on-chip calculation caches are cleared according to the end-of-row flag.

通过本实施例,针对图神经网络的计算特点,即两步矩阵计算的稀疏度,缓存要求,计算密度不同等,设计了一种兼容两步矩阵计算的图神经网络加速器。通过调整计算顺序,调整矩阵计算方式为行乘,调整存储结构,使得进行稀疏矩阵计算时,没有无效0值计算,并且保证数据量最大的矩阵只需要读取一次,减少读写外部存储器的数据带宽需求,实现了无空闲计算周期的高密度计算结构,提高了图神经网络计算的硬件资源使用率,由于是单一加速器结构,也降低了相关芯片的设计难度。有效提高了图神经网络推理计算的加速能力。Through this embodiment, a graph neural network accelerator compatible with two-step matrix calculations is designed in view of the calculation characteristics of the graph neural network, namely the sparsity of the two-step matrix calculations, cache requirements, and different calculation densities. By adjusting the calculation order, adjusting the matrix calculation method to row multiplication, and adjusting the storage structure, there is no invalid 0 value calculation when performing sparse matrix calculations, and ensuring that the matrix with the largest amount of data only needs to be read once, reducing the data bandwidth requirements for reading and writing external memory, and realizing a high-density computing structure with no idle computing cycles, the hardware resource utilization rate of the graph neural network calculation is improved, and because it is a single accelerator structure, the design difficulty of related chips is also reduced. Effectively improve the acceleration capability of graph neural network reasoning calculations.

在一个示例性实施例中,压缩特征矩阵和压缩关系矩阵均按行存储在外部存储器中,外部存储器中还存储有与压缩特征矩阵中的元素对应的第一行尾标志以及与压缩特征矩阵中的元素对应的第二行尾标志,其中,第一行尾标志用于标识对应的元素是否为一个行向量的结尾,第二行尾标志用于标识对应的元素是否为一个行向量的结尾;In an exemplary embodiment, the compressed feature matrix and the compressed relationship matrix are both stored in an external memory by row, and the external memory also stores a first row end flag corresponding to the elements in the compressed feature matrix and a second row end flag corresponding to the elements in the compressed feature matrix, wherein the first row end flag is used to identify whether the corresponding element is the end of a row vector, and the second row end flag is used to identify whether the corresponding element is the end of a row vector;

图神经网络加速器,用于按照存储地址从外部存储器中顺序读取压缩特征矩阵中的元素,直到读取到对应的第一行尾标志的标志值为第一标志值,得到第一目标行向量,其中,第一标志值用于标识对应的元素为一个行向量的结尾,第一目标行向量为压缩特征矩阵中的一个行向量;A graph neural network accelerator is used to sequentially read elements in a compressed feature matrix from an external memory according to a storage address until a flag value of a corresponding first row end flag is read as a first flag value, thereby obtaining a first target row vector, wherein the first flag value is used to identify that the corresponding element is the end of a row vector, and the first target row vector is a row vector in the compressed feature matrix;

图神经网络加速器,用于按照存储地址从外部存储器中顺序读取压缩关系矩阵中的元素,直到读取到对应的第二行尾标志的标志值为第二标志值,得到第二目标行向量,其中,第二标志值用于标识对应的元素为一个行向量的结尾,第二目标行向量为压缩关系矩阵中的一个行向量。A graph neural network accelerator is used to sequentially read elements in a compressed relationship matrix from an external memory according to a storage address until a flag value of a corresponding second row end flag is read as a second flag value, thereby obtaining a second target row vector, wherein the second flag value is used to identify the corresponding element as the end of a row vector, and the second target row vector is a row vector in the compressed relationship matrix.
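
下面给出一个示意性的Python代码草图（仅为草图，read_row、elements、eol_flags等名称为示例性假设），用于说明按存储地址顺序读取元素、直到读到行尾标志为止从而得到一个行向量的过程。A minimal Python sketch follows (names such as read_row, elements and eol_flags are hypothetical assumptions), illustrating how elements are read sequentially by storage address until the 1-bit end-of-row flag is set, yielding one row vector.

def read_row(elements, eol_flags, start):
    """Read elements sequentially from address `start` until the 1-bit
    end-of-row flag is 1; return the row vector and the next address."""
    row, addr = [], start
    while True:
        row.append(elements[addr])
        if eol_flags[addr] == 1:          # flag value 1 marks the end of a row vector
            return row, addr + 1
        addr += 1

# a compressed matrix [[7, 3], [5]] stored as a flat element stream + 1-bit flags
elements  = [7, 3, 5]
eol_flags = [0, 1, 1]
row0, nxt = read_row(elements, eol_flags, 0)    # row0 == [7, 3]
row1, _   = read_row(elements, eol_flags, nxt)  # row1 == [5]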

在一个示例性实施例中,压缩特征矩阵和压缩关系矩阵均按行存储在外部存储器中、并按行被读取,读取的压缩特征矩阵的行向量为第一目标行向量,读取的压缩关系矩阵的行向量为第二目标行向量;In an exemplary embodiment, the compressed feature matrix and the compressed relationship matrix are both stored in an external memory by row and read by row, the row vector of the compressed feature matrix read is the first target row vector, and the row vector of the compressed relationship matrix read is the second target row vector;

图神经网络加速器,用于将读取的第一目标行向量作为第一当前行向量执行以下的第一运算操作,得到中间矩阵中与第一目标行向量对应的行向量:分别将第一当前行向量的第m个元素与权重矩阵的第m个行向量相乘,得到n个第一参考行向量,其中,m小于或者等于指定特征矩阵的总列数,n为第一当前行向量中包含的元素的数量;对n个第一参考行向量的元素执行按列相加操作,得到中间矩阵中与第一当前行向量对应的行向量,其中,按列相加操作为相同列上的元素分别相加的操作;A neural network accelerator is used to perform the following first operation on the read first target row vector as the first current row vector to obtain a row vector corresponding to the first target row vector in the intermediate matrix: respectively multiplying the mth element of the first current row vector with the mth row vector of the weight matrix to obtain n first reference row vectors, wherein m is less than or equal to the total number of columns of the specified feature matrix, and n is the number of elements contained in the first current row vector; performing a column-by-column addition operation on the elements of the n first reference row vectors to obtain a row vector corresponding to the first current row vector in the intermediate matrix, wherein the column-by-column addition operation is an operation of respectively adding elements on the same column;

图神经网络加速器,用于将读取的第二目标行向量作为第二当前行向量执行以下的第二运算操作,得到目标矩阵中与第二目标行向量对应的行向量:分别将第二当前行向量的第p个元素与中间矩阵的第p个行向量相乘,得到q个第二参考行向量,其中,p≤指定关系矩阵的总列数,q为第二当前行向量中包含的元素的数量;对q个第二参考行向量的元素执行按列相加操作,得到目标矩阵中与第二当前行向量对应的行向量,其中,按列相加操作为相同列上的元素分别相加的操作。A graph neural network accelerator is used to perform the following second operation on the read second target row vector as the second current row vector to obtain a row vector in the target matrix corresponding to the second target row vector: multiply the p-th element of the second current row vector by the p-th row vector of the intermediate matrix respectively to obtain q second reference row vectors, wherein p≤the total number of columns of the specified relationship matrix, and q is the number of elements contained in the second current row vector; perform a column-wise addition operation on the elements of the q second reference row vectors to obtain a row vector in the target matrix corresponding to the second current row vector, wherein the column-wise addition operation is an operation of adding elements on the same column respectively.
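
下面给出一个示意性的Python代码草图（仅为在上述假设下的草图，spmm_rows等名称为示例性假设），用于说明先按行计算XW得到中间矩阵、再按行计算A(XW)得到目标矩阵的两步行乘流程。An illustrative Python sketch follows (a sketch under the stated assumptions only; names such as spmm_rows are hypothetical), showing the two-step row-multiplication flow: the intermediate matrix XW is computed row by row first, and then the target matrix A(XW) is computed row by row.

def spmm_rows(sparse_rows, dense):
    """Row-multiply SpMM: for each sparse row, multiply every non-zero element
    by the corresponding row of the dense right matrix and add column-wise."""
    width = len(dense[0])
    out = []
    for nonzeros in sparse_rows:                 # nonzeros: list of (value, col)
        acc = [0.0] * width
        for value, m in nonzeros:
            acc = [a + value * w for a, w in zip(acc, dense[m])]
        out.append(acc)
    return out

# first step: intermediate = X @ W; second step: target = A @ intermediate
X = [[(2.0, 0)], [(1.0, 1)]]                     # compressed feature matrix rows
A = [[(1.0, 0), (1.0, 1)], [(1.0, 1)]]           # compressed relation matrix rows
W = [[1.0, 2.0], [3.0, 4.0]]
XW = spmm_rows(X, W)                             # [[2.0, 4.0], [3.0, 4.0]]
target = spmm_rows(A, XW)                        # [[5.0, 8.0], [3.0, 4.0]]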

在一个示例性实施例中,图神经网络加速器,用于在将读取的第一目标行向量作为第一当前行向量执行以下的第一运算操作之前,缓存读取的第一目标行向量,其中,第一运算操作是对缓存的第一目标行向量执行的;In an exemplary embodiment, a graph neural network accelerator is used to cache the read first target row vector before performing the following first operation on the read first target row vector as a first current row vector, wherein the first operation is performed on the cached first target row vector;

图神经网络加速器,用于在将读取的第一目标行向量作为第一当前行向量执行以下的第一运算操作之后,将得到的中间矩阵中与第一目标行向量对应的行向量存储到第一存储器中,其中,第一存储器用于存储中间矩阵的行向量,第一存储器为图神经网络加速器所集成在的处理芯片的片上存储器;清除缓存的第一目标行向量;A graph neural network accelerator, configured to store the row vector corresponding to the first target row vector in the intermediate matrix obtained into a first memory after performing the following first operation on the first target row vector read as the first current row vector, wherein the first memory is used to store the row vectors of the intermediate matrix, and the first memory is an on-chip memory of a processing chip in which the graph neural network accelerator is integrated; clear the cached first target row vector;

图神经网络加速器,用于在将读取的第二目标行向量作为第二当前行向量执行以下的第二运算操作之前,缓存读取的第二目标行向量,其中,第二运算操作是对缓存的第二目标行向量执行的;A neural network accelerator, configured to cache the read second target row vector before performing the following second operation using the read second target row vector as a second current row vector, wherein the second operation is performed on the cached second target row vector;

图神经网络加速器,用于在将读取的第二目标行向量作为第二当前行向量执行以下的第二运算操作之后,将得到的目标矩阵中与第二目标行向量对应的行向量存储到第二存储器中,其中,第二存储器用于存储目标矩阵的行向量,第二存储器为图神经网络加速器所集成在的处理芯片的片上存储器;清除缓存的第二目标行向量。A graph neural network accelerator is used to store the row vector corresponding to the second target row vector in the obtained target matrix into a second memory after performing the following second operation on the read second target row vector as the second current row vector, wherein the second memory is used to store the row vectors of the target matrix, and the second memory is an on-chip memory of the processing chip in which the graph neural network accelerator is integrated; and clear the cached second target row vector.

在一个示例性实施例中,图神经网络加速器,用于在对读取的压缩特征矩阵与权重矩阵执行第一矩阵运算之前,将权重矩阵按列存储到R个第三存储器中,其中,R为权重矩阵的总列数,第三存储器用于存储权重矩阵的列向量,第三存储器为图神经网络加速器所集成在的处理芯片的片上存储器;In an exemplary embodiment, a graph neural network accelerator is used to store the weight matrix by column in R third memories before performing a first matrix operation on the read compressed feature matrix and the weight matrix, wherein R is the total number of columns of the weight matrix, the third memory is used to store the column vector of the weight matrix, and the third memory is an on-chip memory of a processing chip in which the graph neural network accelerator is integrated;

图神经网络加速器,用于在分别将第一当前行向量的第m个元素与权重矩阵的第m个行向量相乘之前,并行读取O个第三存储器所存储的列向量中第m个位置上的元素,得到权重矩阵的第m个行向量。A graph neural network accelerator is used to read in parallel the elements at the m-th position in the column vectors stored in O third memories before multiplying the m-th element of the first current row vector with the m-th row vector of the weight matrix respectively, so as to obtain the m-th row vector of the weight matrix.

在一个示例性实施例中,图神经网络加速器,用于在每得到一个第一参考行向量之后,对当前得到的第一参考行向量与缓存的第一中间行向量执行按列相加操作,得到第一向量相加结果,其中,第一中间行向量用于记录对已得到的第一参考行向量执行按列相加操作所得到的行向量;使用第一向量相加结果更新缓存的第一中间行向量,得到更新后的第一中间行向量,其中,中间矩阵中与第一当前行向量对应的行向量为最后一次更新后的第一中间行向量;In an exemplary embodiment, a graph neural network accelerator is used to perform a column-by-column addition operation on the currently obtained first reference row vector and the cached first intermediate row vector after each first reference row vector is obtained to obtain a first vector addition result, wherein the first intermediate row vector is used to record the row vector obtained by performing the column-by-column addition operation on the obtained first reference row vector; and use the first vector addition result to update the cached first intermediate row vector to obtain an updated first intermediate row vector, wherein the row vector in the intermediate matrix corresponding to the first current row vector is the first intermediate row vector after the last update;

图神经网络加速器,用于在每得到一个第二参考行向量之后,对当前得到的第二参考行向量与缓存的第二中间行向量执行按列相加操作,得到第二向量相加结果,其中,第二中间行向量用于记录对已得到的第二参考行向量执行按列相加操作所得到的行向量;使用第二向量相加结果更新缓存的第二中间行向量,得到更新后的第二中间行向量,其中,目标矩阵中与第二当前行向量对应的行向量为最后一次更新后的第二中间行向量。A graph neural network accelerator is used to perform a column-by-column addition operation on the currently obtained second reference row vector and the cached second intermediate row vector after each second reference row vector is obtained to obtain a second vector addition result, wherein the second intermediate row vector is used to record the row vector obtained by performing the column-by-column addition operation on the obtained second reference row vector; and use the second vector addition result to update the cached second intermediate row vector to obtain an updated second intermediate row vector, wherein the row vector in the target matrix corresponding to the second current row vector is the second intermediate row vector after the last update.

在一个示例性实施例中,权重矩阵的数量为多个,图神经网络加速器包括计算引擎PE阵列,PE阵列包含按照一维方向排列的S个PE单元,S≥列数最多的权重矩阵所包含的列数;In an exemplary embodiment, the number of weight matrices is multiple, and the graph neural network accelerator includes a computing engine PE array, the PE array includes S PE units arranged in a one-dimensional direction, and S ≥ the number of columns included in the weight matrix with the largest number of columns;

图神经网络加速器,用于使用S个PE单元中的至少部分PE单元分别对n个第一参考行向量的元素执行按列相加操作,得到中间矩阵中与第一当前行向量对应的行向量;使用S个PE单元中的至少部分PE单元分别对q个第二参考行向量的元素执行按列相加操作,得到目标矩阵中与第二当前行向量对应的行向量。A graph neural network accelerator is used to use at least some of the S PE units to perform column-by-column addition operations on the elements of n first reference row vectors, respectively, to obtain a row vector in an intermediate matrix corresponding to the first current row vector; and use at least some of the S PE units to perform column-by-column addition operations on the elements of q second reference row vectors, respectively, to obtain a row vector in a target matrix corresponding to the second current row vector.

根据本申请实施例的又一方面,还提供了一种计算机可读存储介质,该计算机可读存储介质中存储有计算机程序,其中,该计算机程序被设置为运行时执行上述任一项方法实施例中的步骤。According to another aspect of the embodiments of the present application, a computer-readable storage medium is provided, in which a computer program is stored, wherein the computer program is configured to execute the steps of any of the above method embodiments when executed.

在一个示例性实施例中,上述计算机可读存储介质可以包括但不限于:U盘、只读存储器、随机存取存储器、移动硬盘、磁碟或者光盘等各种可以存储计算机程序的介质。In an exemplary embodiment, the computer-readable storage medium may include, but is not limited to, various media that can store computer programs, such as a USB flash drive, a read-only memory, a random access memory, a mobile hard disk, a magnetic disk, or an optical disk.

本申请的实施例还提供了一种计算机程序产品,上述计算机程序产品包括计算机程序,计算机程序被处理器执行时实现上述任一项方法实施例中的步骤。An embodiment of the present application further provides a computer program product, which includes a computer program. When the computer program is executed by a processor, the steps in any of the above method embodiments are implemented.

本申请的实施例还提供了另一种计算机程序产品,包括非易失性计算机可读存储介质,非易失性计算机可读存储介质存储计算机程序,计算机程序被处理器执行时实现上述任一项方法实施例中的步骤。An embodiment of the present application further provides another computer program product, including a non-volatile computer-readable storage medium, wherein the non-volatile computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the steps in any of the above method embodiments are implemented.

本申请的实施例还提供了又一种计算机程序，该计算机程序包括计算机指令，该计算机指令存储在计算机可读存储介质中；计算机设备的处理器从计算机可读存储介质读取该计算机指令，处理器执行该计算机指令，使得该计算机设备执行上述任一项方法实施例中的步骤。The embodiments of the present application also provide another computer program, which includes computer instructions stored in a computer-readable storage medium; a processor of a computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the steps in any one of the above method embodiments.

根据本申请的一个方面,提供了一种计算机程序产品,该计算机程序产品包括计算机程序/指令,该计算机程序/指令包含用于执行流程图所示的方法的程序代码。在这样的实施例中,参考图15,该计算机程序可以通过通信部分1509从网络上被下载和安装,和/或从可拆卸介质1511被安装。在该计算机程序被中央处理器1501执行时,执行本申请实施例提供的各种功能。上述本申请实施例序号仅仅为了描述,不代表实施例的优劣。According to one aspect of the present application, a computer program product is provided, the computer program product comprising a computer program/instruction, the computer program/instruction comprising a program code for executing the method shown in the flowchart. In such an embodiment, with reference to FIG. 15 , the computer program can be downloaded and installed from a network via a communication portion 1509, and/or installed from a removable medium 1511. When the computer program is executed by a central processing unit 1501, various functions provided by the embodiments of the present application are executed. The above serial numbers of the embodiments of the present application are for description only and do not represent the advantages and disadvantages of the embodiments.

参考图15,图15是本申请实施例提供的一种可选的电子设备的计算机系统的结构框图。Refer to Figure 15, which is a structural block diagram of a computer system of an optional electronic device provided in an embodiment of the present application.

图15示意性地示出了用于实现本申请实施例的电子设备的计算机系统结构框图。如图15所示，计算机系统1500包括中央处理器1501(Central Processing Unit,CPU)，其可以根据存储在只读存储器1502(Read-Only Memory,ROM)中的程序或者从存储部分1508加载到随机访问存储器1503(Random Access Memory,RAM)中的程序而执行各种适当的动作和处理。在随机访问存储器1503中，还存储有系统操作所需的各种程序和数据。中央处理器1501、只读存储器1502以及随机访问存储器1503通过总线1504彼此相连。输入/输出接口1505(Input/Output接口,简称为I/O接口)也连接至总线1504。Figure 15 schematically shows a block diagram of a computer system for implementing an electronic device of an embodiment of the present application. As shown in Figure 15, computer system 1500 includes a central processing unit 1501 (Central Processing Unit, CPU), which can perform various appropriate actions and processes according to the program stored in a read-only memory 1502 (Read-Only Memory, ROM) or the program loaded from a storage part 1508 into a random access memory 1503 (Random Access Memory, RAM). In the random access memory 1503, various programs and data required for system operation are also stored. The central processing unit 1501, the read-only memory 1502 and the random access memory 1503 are connected to each other through a bus 1504. An input/output interface 1505 (Input/Output interface, referred to as an I/O interface) is also connected to the bus 1504.

以下部件连接至输入/输出接口1505:包括键盘、鼠标等的输入部分1506;包括诸如阴极射线管(Cathode Ray Tube,CRT)、液晶显示器(Liquid Crystal Display,LCD)等以及扬声器等的输出部分1507;包括硬盘等的存储部分1508;以及包括诸如局域网卡、调制解调器等的网络接口卡的通信部分1509。通信部分1509经由诸如因特网的网络执行通信处理。驱动器1510也根据需要连接至输入/输出接口1505。可拆卸介质1511,诸如磁盘、光盘、磁光盘、半导体存储器等等,根据需要安装在驱动器1510上,以便于从其上读出的计算机程序根据需要被安装入存储部分1508。The following components are connected to the input/output interface 1505: an input section 1506 including a keyboard, a mouse, etc.; an output section 1507 including a cathode ray tube (CRT), a liquid crystal display (LCD), etc., and a speaker, etc.; a storage section 1508 including a hard disk, etc.; and a communication section 1509 including a network interface card such as a LAN card, a modem, etc. The communication section 1509 performs communication processing via a network such as the Internet. A drive 1510 is also connected to the input/output interface 1505 as needed. A removable medium 1511, such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, etc., is installed on the drive 1510 as needed so that a computer program read therefrom is installed into the storage section 1508 as needed.

特别地,根据本申请的实施例,各个方法流程图中所描述的过程可以被实现为计算机软件程序。例如,本申请的实施例包括一种计算机程序产品,其包括承载在计算机可读介质上的计算机程序,该计算机程序包含用于执行流程图所示的方法的程序代码。在这样的实施例中,该计算机程序可以通过通信部分1509从网络上被下载和安装,和/或从可拆卸介质1511被安装。在该计算机程序被中央处理器1501执行时,执行本申请的系统中限定的各种功能。In particular, according to an embodiment of the present application, the process described in each method flow chart can be implemented as a computer software program. For example, an embodiment of the present application includes a computer program product, which includes a computer program carried on a computer readable medium, and the computer program contains a program code for executing the method shown in the flow chart. In such an embodiment, the computer program can be downloaded and installed from the network through the communication part 1509, and/or installed from the removable medium 1511. When the computer program is executed by the central processor 1501, various functions defined in the system of the present application are executed.

需要说明的是,图15示出的电子设备的计算机系统1500仅是一个示例,不应对本申请实施例的功能和使用范围带来任何限制。It should be noted that the computer system 1500 of the electronic device shown in FIG. 15 is merely an example and should not bring any limitation to the functions and scope of use of the embodiments of the present application.

根据本申请实施例的又一方面,还提供了一种电子设备,包括存储器和处理器,该存储器中存储有计算机程序,该处理器被设置为运行计算机程序以执行上述任一项方法实施例中的步骤。According to another aspect of the embodiments of the present application, an electronic device is provided, including a memory and a processor, wherein a computer program is stored in the memory, and the processor is configured to run the computer program to execute the steps in any one of the above method embodiments.

在一个示例性实施例中,上述电子设备还可以包括传输设备以及输入输出设备,其中,该传输设备和上述处理器连接,该输入输出设备和上述处理器连接。In an exemplary embodiment, the electronic device may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.

本实施例中的具体示例可以参考上述实施例及示例性实施方式中所描述的示例,本实施例在此不再赘述。For specific examples in this embodiment, reference may be made to the examples described in the above embodiments and exemplary implementation modes, and this embodiment will not be described in detail herein.

显然,本领域的技术人员应该明白,上述的本申请实施例的各模块或各步骤可以用通用的计算装置来实现,它们可以集中在单个的计算装置上,或者分布在多个计算装置所组成的网络上,它们可以用计算装置可执行的程序代码来实现,从而,可以将它们存储在存储装置中由计算装置来执行,并且在某些情况下,可以以不同于此处的顺序执行所示出或描述的步骤,或者将它们分别制作成各个集成电路模块,或者将它们中的多个模块或步骤制作成单个集成电路模块来实现。这样,本申请实施例不限制于任何特定的硬件和软件结合。Obviously, those skilled in the art should understand that the modules or steps of the above-mentioned embodiments of the present application can be implemented by a general computing device, they can be concentrated on a single computing device, or distributed on a network composed of multiple computing devices, they can be implemented by a program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device, and in some cases, the steps shown or described can be executed in a different order from that herein, or they can be made into individual integrated circuit modules, or multiple modules or steps therein can be made into a single integrated circuit module for implementation. In this way, the embodiments of the present application are not limited to any specific combination of hardware and software.

以上仅为本申请的优选实施例而已,并不用于限制本申请实施例,对于本领域的技术人员来说,本申请实施例可以有各种更改和变化。凡在本申请实施例的原则之内,所作的任何修改、等同替换、改进等,均应包含在本申请实施例的保护范围之内。The above are only preferred embodiments of the present application and are not intended to limit the embodiments of the present application. For those skilled in the art, the embodiments of the present application may have various modifications and variations. Any modification, equivalent replacement, improvement, etc. made within the principles of the embodiments of the present application shall be included in the protection scope of the embodiments of the present application.

Claims (10)

And performing a first matrix operation on the read compressed feature matrix and the weight matrix to obtain an intermediate matrix, wherein the first matrix operation comprises the following steps: the read first target row vector is used as a first current row vector to execute the following first operation, so as to obtain a row vector corresponding to the first target row vector in the intermediate matrix: multiplying the m-th element of the first current row vector with the m-th row vector of the weight matrix to obtain n first reference row vectors, wherein m is smaller than or equal to the total column number of the appointed feature matrix, and n is the number of elements contained in the first current row vector; performing column-wise addition operation on the elements of the n first reference row vectors to obtain row vectors corresponding to the first current row vector in the intermediate matrix, wherein the column-wise addition operation is operation of adding elements on the same column respectively;
And performing a second matrix operation on the read compressed relation matrix and the intermediate matrix to obtain a target matrix, wherein the method comprises the following steps: and taking the read second target row vector as a second current row vector to execute the following second operation to obtain a row vector corresponding to the second target row vector in the target matrix: multiplying the p-th element of the second current row vector with the p-th row vector of the intermediate matrix to obtain q second reference row vectors, wherein p is less than or equal to the total column number of the appointed relation matrix, and q is the number of elements contained in the second current row vector; and performing column-wise addition operation on the elements of the q second reference row vectors to obtain row vectors corresponding to the second current row vectors in the target matrix, wherein the column-wise addition operation is operation of adding elements on the same column respectively.
Performing column-wise addition on the elements of the n first reference row vectors to obtain a row vector corresponding to the first current row vector in the intermediate matrix, where the column-wise addition includes: after each first reference row vector is obtained, performing a column-wise addition operation on the currently obtained first reference row vector and a cached first intermediate row vector to obtain a first vector addition result, wherein the first intermediate row vector is used for recording a row vector obtained by performing the column-wise addition operation on the obtained first reference row vector; updating the cached first intermediate row vector by using the first vector addition result to obtain the updated first intermediate row vector, wherein the row vector corresponding to the first current row vector in the intermediate matrix is the first intermediate row vector updated last time;
Performing column-wise addition on the elements of the q second reference row vectors to obtain a row vector corresponding to the second current row vector in the target matrix, where the column-wise addition includes: after each second reference row vector is obtained, performing a column-wise addition operation on the currently obtained second reference row vector and a cached second intermediate row vector to obtain a second vector addition result, wherein the second intermediate row vector is used for recording a row vector obtained by performing the column-wise addition operation on the obtained second reference row vector; and updating the cached second intermediate row vector by using the second vector addition result to obtain the updated second intermediate row vector, wherein the row vector corresponding to the second current row vector in the target matrix is the second intermediate row vector updated last time.
The graphic neural network accelerator is used for sequentially reading compression feature matrixes from the external memory according to storage addresses, wherein the compression feature matrixes are obtained by carrying out data compression on appointed feature matrixes corresponding to the graphic structure data; performing a first matrix operation on the read compression feature matrix and the weight matrix to obtain an intermediate matrix, wherein the specified matrix operation comprises the first matrix operation; sequentially reading a compression relation matrix from the external memory according to a storage address, wherein the compression relation matrix is obtained by carrying out data compression on a designated relation matrix corresponding to the graph structure data, and the designated relation matrix is used for describing the relation among nodes in the graph structure data; and executing a second matrix operation on the read compressed relation matrix and the intermediate matrix to obtain a target matrix, wherein the appointed matrix operation further comprises the second matrix operation.