CN117876842A

Info

Publication number: CN117876842A
Application number: CN202410041103.6A
Authority: CN
Inventors: 杜玉卓; 贺凯; 焦爱明; 王泽广; 舒兴杰; 王延生; 邹广仁; 孙风伟
Original assignee: Yantai 500 Heating Co ltd; Xian Thermal Power Research Institute Co Ltd
Current assignee: Yantai 500 Heating Co ltd; Xian Thermal Power Research Institute Co Ltd
Priority date: 2024-01-10
Filing date: 2024-01-10
Publication date: 2024-04-12

Abstract

The present invention discloses a method and system for detecting anomalies in industrial products based on a generative adversarial network (GAN), and belongs to the field of artificial intelligence. First, a generator and a multi-scale discriminator are designed: the discriminator aims to distinguish real products from generated products as accurately as possible, while the generator aims to generate images that are as close as possible to real industrial products. Next, the generative adversarial network structure is designed. Finally, experiments are run on datasets, and analysis and comparison verify the effectiveness of the method for anomaly detection in industrial scenarios. The method builds the generator and the discriminator from the Transformer model, which is better at capturing global context information, and optimizes both the computational cost and the ability to capture local features. It requires no annotation of the training data, which makes it well suited to industrial scenarios that lack labeled anomaly data, and it improves detection performance while easing the hardware requirements of practical deployment.

Description

A method and system for detecting anomalies in industrial products based on a generative adversarial network

Technical Field

The present invention belongs to the field of artificial intelligence and relates to a method and system for detecting anomalies in industrial products based on a generative adversarial network.

Background

In product image anomaly detection for industrial scenarios, global context information is usually captured by deepening the network: enlarging the receptive field exposes relationships between non-adjacent regions of the image and so improves global feature extraction. This addresses the difficulty of obtaining global information when the structure of an industrial product image is complex. In addition, generative adversarial networks help with the problem of image quality being degraded by irrelevant background in the samples. First, labeled sample data is no longer required during training; second, the generalization ability of the model can be improved by generating new abnormal samples. However, some problems remain unsolved:

Deepening the network sharply increases the number of parameters. In traditional deep learning models, the more parameters there are, the more resources the computation consumes, with the cost growing exponentially. An overly complex network structure is also prone to overfitting and is hard to optimize.

Traditional deep learning models have difficulty capturing long-range dependencies. Even when the receptive field of a convolution is enlarged step by step, the extracted features are still neighborhood information from a local region; relationships between regions far from the current position cannot be captured by the convolution operation.

Some generated industrial product images are of poor quality. On the one hand, training a generative adversarial network is a dynamic game in which the capabilities of the generator and the discriminator change as training progresses, which makes training unstable. On the other hand, the generator may learn only part of the data distribution of the images, so the generated samples lack diversity, which affects the final detection result.

The Transformer model was first proposed for machine translation in natural language processing and is now widely used in computer vision. Compared with traditional deep learning models, the Transformer is better at capturing global context information, but processing image data with it still has the following problems:

The model is too expensive to compute. When a Transformer model is used to inspect product images, the computational cost rises sharply as the image resolution increases.

Local feature extraction is weak. The Transformer extracts global context information mainly through the self-attention mechanism. For two-dimensional image data, however, the local information around each pixel is also important, and the conventional Transformer is comparatively weak at extracting it.

Summary of the Invention

The purpose of the present invention is to solve the problems of excessive model computation and weak local feature capture in product image anomaly detection for industrial scenarios in the prior art, and to provide a method and system for detecting anomalies in industrial products based on a generative adversarial network.

To achieve this purpose, the present invention adopts the following technical solutions:

The method for detecting anomalies in industrial products based on a generative adversarial network proposed by the present invention comprises the following steps:

constructing a generator and a multi-scale-input discriminator, and building a generative adversarial network model from them;

using random noise as the input of the generative adversarial network model, so that the model continuously learns, starting from the noise, the distribution of the data in normal product images, and obtaining the prediction result of the model;

detecting anomalies in an industrial product by comparing the model's prediction result for that product with an anomaly score threshold.

Preferably, the generator is built from Transformer Blocks.

The generator is divided into multiple stages of Transformer Blocks that extract features at different scales. An upsampling module is placed between consecutive stages; in the low-resolution stages, upsampling is performed with the Bicubic Upsample method.

Preferably, when the resolution of the image exceeds 32×32, a pixel shuffle module is used for upsampling: the low-resolution input is stretched into multiple non-overlapping patches, which are then rearranged into a high-resolution image.

When processing high-resolution images, a window-based Masked Self-attention mechanism masks out the other windows, and the pixel shuffle upsampling method is applied to the high-resolution images.
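The pixel shuffle rearrangement can be illustrated with a minimal pure-Python sketch. It assumes the common sub-pixel layout in which an input of shape (C·r², H, W) becomes (C, H·r, W·r); the function name `pixel_shuffle`, the upscale factor `r`, and the nested-list representation are illustrative choices, not taken from the patent.

```python
def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) nested list into (C, H*r, W*r).

    Resolution grows by r in each direction while the channel count
    shrinks by r*r, so no values are created or discarded.
    """
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    if c_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = c_in // (r * r)
    out = [[[0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(h * r):
            for j in range(w * r):
                # Assumed channel ordering: the offset inside each r x r
                # output cell selects which input channel is read.
                src = c * r * r + (i % r) * r + (j % r)
                out[c][i][j] = x[src][i // r][j // r]
    return out


# Four 2x2 channels become a single 4x4 channel for r = 2.
x = [[[ch * 10 + i * 2 + j for j in range(2)] for i in range(2)] for ch in range(4)]
y = pixel_shuffle(x, 2)
```

Because the operation only moves values around, it adds no learnable parameters, which is consistent with the text's point that resolution is raised while the channel count is reduced.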

Preferably, the multi-scale-input discriminator judges the input image by means of a binary classifier.

The generated images are used to train the multi-scale discriminator, whose structure mirrors that of the generator. A Patch Embedding layer is introduced to process the patches, and a connection layer is added after each Patch Embedding layer of the multi-scale discriminator.

Preferably, the objective function of the generative adversarial network model is as follows:

\[ \min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \]

where z is the random noise input to the generator G; p_z(z) is the probability distribution followed by the random noise z; G(z) is the output generated by the current generator G; p_data is the probability distribution of real product images; D(x) is the probability that the current input image is a real image, with D(x) ∈ [0, 1]; \mathbb{E}_{x \sim p_{data}(x)} is the expectation over the distribution of real product images; \mathbb{E}_{z \sim p_z(z)} is the expectation over the noise distribution; and the min max prefix indicates that the objective is minimized with respect to G and maximized with respect to D.
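As a small numerical illustration, the following sketch estimates the value function from batches of discriminator outputs. It assumes the standard GAN minimax form, E[log D(x)] + E[log(1 − D(G(z)))], which matches the symbol definitions above; the function name `gan_value` and the sample probabilities are hypothetical.

```python
import math


def gan_value(d_real, d_fake):
    """Monte Carlo estimate of V(D, G).

    d_real: discriminator outputs D(x) on real images, each in (0, 1).
    d_fake: discriminator outputs D(G(z)) on generated images, each in (0, 1).
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# A discriminator that cannot tell real from generated outputs 0.5 everywhere.
confused = gan_value([0.5, 0.5], [0.5, 0.5])
# A discriminator that separates the two classes well scores higher.
confident = gan_value([0.9, 0.8], [0.1, 0.2])
```

The discriminator pushes this value up and the generator pushes it down; when the discriminator outputs 0.5 on every input, the estimate equals −log 4.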

Preferably, the encoder in the generative adversarial network model is trained to produce the latent variable z corresponding to a product. Throughout this training, the parameters of the generator G and the discriminator D are fixed and real products are used as the input of the encoder. The latent variable found by the encoder is then fed to the fixed generator G, which maps it back into image space, while the mean squared error between the generated product image and the real product image is minimized as follows:

where E(G(z)) denotes the encoder taking a generated image as input and mapping it back to the latent space, x is the real product image used as input, n is the number of pixels in it, and ||·|| denotes the sum of squared residuals of the grayscale values of the image.
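The formula itself does not survive in this text. One reading that is consistent with the description (a mean squared error over the n pixels between the real image x and its reconstruction through the encoder E and the fixed generator G) is sketched below. The name L_img is an assumption introduced here for illustration, as is the use of G(E(x)) for the reconstruction.

```latex
L_{img}(x) = \frac{1}{n} \, \lVert x - G(E(x)) \rVert^{2}
```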

Preferably, the real product and the generated product are each input into the discriminator, the mean squared error between their feature distributions is computed, and the result is used as the loss function, which is expressed as follows:

where f(·) denotes the features of an intermediate layer of the discriminator computed for a given input, n_d is the dimension of those intermediate features, and k is a weighting factor.
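The expression is missing from this text. A plausible form, assuming the feature term is weighted by k and added to a per-pixel image residual (n pixels, reconstruction G(E(x))), is sketched below. The combined form and the name L are assumptions based on the symbols f, n_d and k defined above, not a transcription of the patent's formula.

```latex
L(x) = \frac{1}{n} \, \lVert x - G(E(x)) \rVert^{2}
     + \frac{k}{n_d} \, \lVert f(x) - f(G(E(x))) \rVert^{2}
```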

The system for detecting anomalies in industrial products based on a generative adversarial network proposed by the present invention comprises:

a model building module, configured to construct a generator and a multi-scale-input discriminator and to build a generative adversarial network model from them;

a prediction result acquisition module, configured to use random noise as the input of the generative adversarial network model, so that the model continuously learns, starting from the noise, the distribution of the data in normal product images, and to obtain the prediction result of the model;

a result comparison module, configured to detect anomalies in an industrial product by comparing the model's prediction result for that product with an anomaly score threshold.

A computer device comprises a memory and a processor, wherein the memory stores a computer program and the processor, when executing the computer program, implements the steps of the method for detecting anomalies in industrial products based on a generative adversarial network.

A computer-readable storage medium stores a computer program which, when executed by a processor, implements the steps of the method for detecting anomalies in industrial products based on a generative adversarial network.

Compared with the prior art, the present invention has the following beneficial effects:

In the method for detecting anomalies in industrial products based on a generative adversarial network proposed by the present invention, a generator and a multi-scale discriminator are designed first. The discriminator aims to distinguish real products from generated products as accurately as possible, while the generator aims to generate images that are as close as possible to real industrial products. Next, a Transformer-based generative adversarial network structure is designed that takes random noise as input; the noise first passes through an MLP layer. On the one hand, applying a series of linear and nonlinear transformations to the noise increases the variety of its forms and strengthens the exploration of data distributions in different spaces during generation, which improves the diversity of the generated product images to some extent. On the other hand, the MLP pre-adjusts the features of the generated product, removing redundant information from the noise and retaining the more important feature information, so that the generator can more easily learn the data distribution of real products from the features. Finally, experiments are run on datasets, and analysis and comparison verify the effectiveness of the method for anomaly detection in industrial scenarios.
The method builds the generator and the discriminator from the Transformer model, which is better at capturing global context information, and optimizes both the computational cost and the ability to capture local features. It requires no annotation of the training data and has a low computational cost, which makes it well suited to industrial scenarios that lack labeled anomaly data, and it improves detection performance while easing the hardware requirements of practical deployment.

Furthermore, an upsampling module is placed between consecutive Transformer Block stages to raise the resolution.

Furthermore, a window-based Masked Transformer module is used in the first input stage. Because the two-dimensional image must be converted into one-dimensional data, a Patch Embedding layer is introduced to process the patches. To match the features extracted at different scales, a connection layer is added after each Patch Embedding layer of the multi-scale discriminator.
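The conversion from a two-dimensional image to a one-dimensional token sequence can be sketched as follows. This is a minimal pure-Python illustration, assuming non-overlapping square patches and a single linear projection; `patchify`, `linear` and the averaging weight matrix are hypothetical stand-ins for a learned Patch Embedding layer.

```python
def patchify(image, p):
    """Split an H x W image (nested lists) into non-overlapping p x p patches.

    Each patch is flattened to a 1-D list; patches are returned in
    row-major order, giving a token sequence.
    """
    h, w = len(image), len(image[0])
    if h % p or w % p:
        raise ValueError("image size must be divisible by the patch size")
    tokens = []
    for top in range(0, h, p):
        for left in range(0, w, p):
            tokens.append(
                [image[top + i][left + j] for i in range(p) for j in range(p)]
            )
    return tokens


def linear(token, weight):
    """Project one flattened patch with a (d_out x d_in) weight matrix."""
    return [sum(wi * ti for wi, ti in zip(row, token)) for row in weight]


image = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = patchify(image, 2)          # 4 tokens of length 4
weight = [[0.25, 0.25, 0.25, 0.25]]  # hypothetical 1-D embedding: patch mean
embedded = [linear(t, weight) for t in tokens]
```

In a trained model the projection weights are learned and the embedding dimension is much larger; the averaging matrix here only keeps the example easy to check by hand.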

Furthermore, when processing high-resolution images, the window-based Masked Self-attention mechanism masks out the other windows, which greatly reduces the memory and compute consumed by the calculation. The pixel shuffle upsampling method is applied to high-resolution images; it raises the resolution while reducing the corresponding number of channels, further reducing the number of parameters involved in the computation.
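The saving from window-based masking can be made concrete with a short sketch that builds the attention mask for non-overlapping windows and counts the permitted token pairs. The helper names and the 8×8 image with 4×4 windows are illustrative assumptions; the patent does not state a window size.

```python
def window_partition(h, w, win):
    """Return, for each non-overlapping win x win window, the flattened
    pixel indices (row * w + col) it contains."""
    if h % win or w % win:
        raise ValueError("image size must be divisible by the window size")
    windows = []
    for top in range(0, h, win):
        for left in range(0, w, win):
            windows.append(
                [(top + i) * w + (left + j) for i in range(win) for j in range(win)]
            )
    return windows


def attention_mask(h, w, win):
    """mask[a][b] is True when tokens a and b share a window and may attend
    to each other; pairs in different windows are masked out."""
    n = h * w
    mask = [[False] * n for _ in range(n)]
    for idx in window_partition(h, w, win):
        for a in idx:
            for b in idx:
                mask[a][b] = True
    return mask


mask = attention_mask(8, 8, 4)
allowed = sum(row.count(True) for row in mask)  # pairs kept by the mask
full = (8 * 8) ** 2                             # pairs in global attention
```

For this toy size the mask keeps 1,024 of the 4,096 token pairs. In general the number of pairs grows with the square of the window size per window rather than with the square of the whole image.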

Furthermore, after the loss value has been computed, the real product and the generated product are each input into the discriminator, and the mean squared error between their feature distributions is computed and used as an additional loss function. It complements the image mean squared error and reduces the impact of possible errors on the accuracy of the overall method.

Furthermore, the min prefix of the objective function indicates that the cost function of the generator is minimized, that is, the generated product is made as close as possible to the real product, while the cost function of the discriminator is maximized, that is, the discriminator can no longer determine whether the current input image is a real product image.

The system for detecting anomalies in industrial products based on a generative adversarial network proposed by the present invention detects anomalies by dividing the system into a model building module, a prediction result acquisition module and a result comparison module. The modular design keeps the modules independent of one another and makes them easy to manage in a unified way.

Brief Description of the Drawings

To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly introduced below. It should be understood that the following drawings show only certain embodiments of the present invention and should therefore not be regarded as limiting its scope. A person of ordinary skill in the art can derive other related drawings from them without creative effort.

FIG. 1 is a flowchart of the method for detecting anomalies in industrial products based on a generative adversarial network according to the present invention.

FIG. 2 is a diagram of the basic architecture of the generative adversarial network used in the method.

FIG. 3 is a diagram of training the encoder to generate latent variables.

FIG. 4 is a diagram of training the encoder with the image residual loss and the feature distribution loss.

FIG. 5 is a diagram of the multi-scale Transformer Block.

FIG. 6 is a flowchart of training the anomaly detection model.

FIG. 7 is a diagram of the basic structure of the Transformer-based generative adversarial network.

FIG. 8 is a structural diagram of the generator that produces high-resolution images.

FIG. 9 is a schematic diagram of the pixel shuffle upsampling process.

FIG. 10 is a diagram of the system for detecting anomalies in industrial products based on a generative adversarial network according to the present invention.

Detailed Description

To make the purpose, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are some, not all, of the embodiments of the present invention. The components of the embodiments described and shown in the drawings can generally be arranged and designed in a variety of configurations.

Therefore, the following detailed description of the embodiments provided in the drawings is not intended to limit the scope of the claimed invention, but merely represents selected embodiments. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the present invention.

Note that similar reference numerals and letters denote similar items in the following drawings, so once an item has been defined in one drawing it need not be defined or explained again in subsequent drawings.

The present invention is described in further detail below with reference to the drawings:

The method for detecting anomalies in industrial products based on a generative adversarial network proposed by the present invention, as shown in FIG. 1, comprises the following steps:

S1. Construct a generator and a multi-scale-input discriminator, and build a generative adversarial network model from them.

S2. Use random noise as the input of the generative adversarial network model, so that the model continuously learns, starting from the noise, the distribution of the data in normal product images, and obtain the prediction result of the model.

S3. Detect anomalies in an industrial product by comparing the model's prediction result for that product with an anomaly score threshold.
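Step S3 reduces to a comparison against a threshold, which the sketch below illustrates. The text does not specify how the threshold is chosen, so `pick_threshold` uses a hypothetical rule (a high quantile of the scores observed on images known to be normal), and all score values are made up for the example.

```python
def pick_threshold(normal_scores, quantile=0.95):
    """Hypothetical rule: take a high quantile of the anomaly scores
    measured on validation images that are known to be normal."""
    ordered = sorted(normal_scores)
    idx = min(len(ordered) - 1, int(quantile * len(ordered)))
    return ordered[idx]


def detect_anomalies(scores, threshold):
    """Flag a product as anomalous when its score exceeds the threshold."""
    return [s > threshold for s in scores]


normal_scores = [0.10, 0.12, 0.08, 0.11, 0.09, 0.13, 0.10, 0.12, 0.11, 0.14]
threshold = pick_threshold(normal_scores)
flags = detect_anomalies([0.09, 0.40], threshold)
```

Since the model is trained only on normal products, a defective product is expected to be reconstructed poorly and to receive a higher score than any of the normal validation images.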

The method is described in detail below with reference to the drawings:

The present invention proposes a Transformer-based generative adversarial network whose structure is shown in FIG. 2. The generator and the discriminator are built from the Transformer model, whose self-attention mechanism computes the weights relating the current input sequence to the other input sequences and thereby improves the model's ability to extract global context information. The generator is built as a pyramid structure and the discriminator as a multi-scale-input structure. For low-resolution images (resolution below 32×32), a conventional Transformer module extracts the contextual relationships. For high-resolution images (resolution above 32×32), a window-based Masked Transformer module improves the model's ability to capture local information and reduces the amount of computation, while the features extracted at different scales provide rich information that supports the diversity of the generated images. The pixel shuffle and patch embedding methods are introduced for upsampling and downsampling respectively; they reduce the number of parameters involved in the computation and ease the excessive cost of applying self-attention to image data.

FIG. 3 and FIG. 4 show the structure of the encoder training modules. The image produced by the generator passes in turn through the encoder's convolutional layers, downsampling layers, fully connected layer and softmax activation layer to produce the latent variable. The loss functions of the two modules are weighted and combined into the metric used to evaluate the anomaly score, which improves the quality of the generated images and thereby strengthens the anomaly detection performance of the method.

The training process of the whole anomaly detection model is as follows:

1) A random noise vector is sampled from the latent space and fed into the MLP layer of the generator. The output vector is then reshaped into a multi-dimensional feature map in which each point corresponds to an embedding. Put simply, the noise is turned into tokens that serve as the initial input of the method.

2) The generator is divided into multiple stages of Transformer Blocks that extract features at different scales. The basic structure of a Transformer Block is shown in FIG. 5. An upsampling module between consecutive stages raises the resolution; in the low-resolution stages, upsampling is performed with the Bicubic Upsample method.

3) When the image resolution exceeds 32×32, a pixel shuffle module is used for upsampling: the low-resolution input is stretched into multiple non-overlapping patches, which are then rearranged into a higher-resolution image. This raises the resolution while reducing the number of channels, which eases the computational load of the model.

4) The high-resolution image is fed into the window-based Masked Transformer module, in which Masked Self-attention replaces the original self-attention mechanism. The image is divided by non-overlapping windows into local sub-images of equal size, and self-attention is computed only within each sub-image, which greatly improves computational efficiency.

5) The generated images are used to train the multi-scale discriminator, whose structure mirrors that of the generator. A window-based Masked Transformer module is used in the first input stage. Because the two-dimensional image must be converted into one-dimensional data, a Patch Embedding layer is introduced to process the patches. To match the features extracted at different scales, a connection layer is added after each Patch Embedding layer of the multi-scale discriminator. The rest of the structure is similar to the generator.

6) The first encoder training module, shown in FIG. 3, comprises the already-trained generator with its parameters fixed and the encoder to be trained. A random variable is fed into the generator, and the image it generates is fed into the encoder. The loss between the input latent variable and the latent variable output by the encoder is computed to train the encoder's ability to map images to latent variables in the latent space.

7) The second encoder training module, shown in FIG. 4, comprises the encoder whose parameters are to be trained and the already-trained generator and discriminator with their parameters fixed. A real image is fed into the encoder to obtain its latent variable in the latent space; the latent variable is then fed into the generator to produce the corresponding image. The residual loss between this generated image and the real image is computed, and the discriminator is used to compute the MSE loss between the feature distributions of the two. The loss functions of steps 6) and 7) are combined with weights to give the anomaly score, for which a suitable threshold is selected.
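The weighted combination in step 7) can be sketched as follows. This minimal example assumes the score is the image residual MSE plus k times the discriminator-feature MSE; the flattened pixel lists and feature vectors stand in for the outputs of the trained encoder, generator and discriminator, which are not modeled here, and the weight k = 0.5 is an arbitrary illustration.

```python
def mse(a, b):
    """Mean squared error between two equal-length flat lists."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def anomaly_score(x, x_rec, f_x, f_rec, k=1.0):
    """Weighted sum of the image residual and the feature residual.

    x, x_rec:   flattened real image and its reconstruction G(E(x)).
    f_x, f_rec: intermediate discriminator features of the two images.
    k:          weighting factor for the feature term (assumed value).
    """
    return mse(x, x_rec) + k * mse(f_x, f_rec)


x = [0.0, 1.0, 0.5, 0.5]
features = [1.0, 2.0]
# A normal product is reconstructed well, so both residuals vanish.
good = anomaly_score(x, [0.0, 1.0, 0.5, 0.5], features, [1.0, 2.0], k=0.5)
# A defective product is reconstructed poorly in pixels and in features.
bad = anomaly_score(x, [0.0, 0.0, 0.5, 0.5], features, [1.0, 0.0], k=0.5)
```

The feature term compares the images in the discriminator's learned representation, so it can react to structural differences that a per-pixel comparison weights only lightly.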

Based on the training process described above, the overall sequence of the method is shown in FIG. 6.

As the Transformer model is increasingly applied to computer vision tasks, some previously hard problems can be eased or even fully solved with it. First, the self-attention mechanism of the Transformer extracts the weight relationships between pieces of information well. Second, the multi-head attention mechanism, together with different embedding layers and decoders, handles information in multi-scale space better. In addition, the generative adversarial network model addresses the difficulty of obtaining abnormal data in industrial scenarios by generating product images. Both the Transformer model and the generative adversarial network can improve anomaly detection performance in industrial scenarios. Combining the two, the present invention proposes a model that performs industrial anomaly detection under multi-scale conditions. The anomaly detection method consists of the following four parts: the Transformer-based generative adversarial network; the generator with optimized computation; the multi-scale-input discriminator; and the loss function together with the anomaly detection algorithm.

Part 1: Transformer-based generative adversarial network

The defining features of a generative adversarial model are that it consists of two parts, a generator and a discriminator, and that it does not require annotated training data, which makes it well suited to industrial scenarios that lack annotated anomaly data. The basic framework of the generative adversarial network model is shown in Figure 7.

The model takes random noise as its input. During training, the generator G, which is built from Transformer Blocks, learns from randomly sampled noise the distribution of the data in normal product images. After G has been trained, its parameters are fixed and the discriminator D is trained. The structure of D roughly mirrors that of G, and its main task is to decide whether an input product image is a fake image produced by G or a real image from the training set. D aims to distinguish real products from generated ones as accurately as possible, while G aims to generate images as close as possible to real industrial products. In effect, each pixel is treated as a variable: a product image in the training set is the result of combining many pixels according to some regularity, and this regularity can be expressed through the weight computation of the attention mechanism:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \tag{1}$$

where Q denotes the queries and K and V are the keys and values of the key-value pairs; the queries, keys, and values here all have dimension $d_k$. Attention(Q, K, V) maps a query and a set of key-value pairs to an output. The output is a weighted sum of the values, and the weight of each value is obtained from the similarity between its key and the query. This captures long-range dependencies and extracts regularities across all pixels of the image.
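The weighted-sum computation described above can be sketched in a few lines. This is a minimal illustration in plain Python (standard library only), not the patent's implementation; the function names `softmax` and `attention` and the toy token values are assumptions made for the example.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention on nested lists of row vectors.

    Q: n_q x d_k, K: n_k x d_k, V: n_k x d_v.
    Returns (output, weights), where output is n_q x d_v.
    """
    d_k = len(K[0])
    scale = math.sqrt(d_k)
    weights = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        weights.append(softmax(scores))
    # Each output row is the weighted sum of the value rows.
    d_v = len(V[0])
    output = [
        [sum(w * v[j] for w, v in zip(row, V)) for j in range(d_v)]
        for row in weights
    ]
    return output, weights

# Toy example (hypothetical values): 3 tokens, each a 2-dimensional vector,
# used as queries, keys, and values at once (self-attention).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = attention(tokens, tokens, tokens)
```

Each row of `attn` sums to 1, and the first token gives more weight to keys it is similar to (itself and the third token) than to the orthogonal second token.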

The generated product images must not be too uniform. Multi-head attention computes the weights differently in each attention head and so captures diverse feature information in spaces of different scales, which strengthens the expressive power of the model and the diversity of the generated images:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O}, \qquad \mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V}) \tag{2}$$

where $\mathrm{head}_i$ is the result of projecting Q, K, and V into the space of the i-th head, $W^{O}$ projects the concatenated outputs of the heads into the output space, $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$ are the learnable parameters that project Q, K, and V in each head, and the number of heads is h = 8.

Image data is in effect a set of variables that follow a specific distribution. The model is trained so that, starting from random noise, it imitates the functional relationships found in real product images and thus generates images that resemble real products from random variables. The discriminator judges an input image by acting as a binary classifier. Through adversarial training, the model as a whole obtains a generator with a strong ability to produce convincing images. The objective function is:

$$\min_{G}\max_{D} V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \tag{3}$$

where z is the random noise fed to the generator G, $p_z(z)$ is the probability distribution of the noise z, G(z) is the output of the current generator, $p_{data}$ is the probability distribution of real product images, and D(x) is the probability that the current input image is real, with D(x) ∈ [0, 1]. In effect, the generator produces images by learning a distribution $p_g$ that approximates the distribution of the real product images x. The term $\mathbb{E}_{x \sim p_{data}(x)}$ is the expectation over the distribution of real product images and can also be read as a measure of the current discriminator's discriminative ability; the term $\mathbb{E}_{z \sim p_z(z)}$ is the expectation over the noise distribution. G and D are trained together, and G needs to minimize the term log(1 − D(G(z))), which in practice means making the discriminator unable to tell whether its input is a sample from the training set or a generated product image. The min over G in the formula means that the cost of G is minimized, i.e., generated products are as close as possible to real ones. The max over D means that the value for D is maximized, i.e., the discriminator separates real and generated images as accurately as it can. The objective can therefore be split into the following two parts:

$$\max_{D} V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \tag{4}$$

$$\min_{G} V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \tag{5}$$

Formula (4) shows that, with the parameters of G fixed, maximizing the discriminator's objective requires both $\mathbb{E}_{x \sim p_{data}(x)}[\log D(x)]$ and $\mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$ to be as large as possible. The discriminator can then accurately identify whether an input sample is a real image or a generated one, which yields a well-performing discriminator. In formula (5), the parameters of D are fixed and the generator's objective is minimized, which requires the same two terms to be as small as possible. Because D is fixed, the first term does not depend on G, so driving the value down amounts to generating fake products that fool the discriminator, which is when the generator performs best. Training should end with a strong generator and a strong discriminator.
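To make the two directions of the game concrete, the sketch below estimates the value V(D, G) from a handful of discriminator outputs. It is a minimal illustration under stated assumptions: the function name `value_function` and the sample probabilities are hypothetical, and expectations are replaced by simple sample means.

```python
import math

def value_function(d_real, d_fake):
    """Sample-mean estimate of V(D, G).

    d_real: discriminator outputs D(x) on real images, each in (0, 1).
    d_fake: discriminator outputs D(G(z)) on generated images, each in (0, 1).
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# A confident, correct discriminator: V is close to its upper bound of 0.
v_strong_d = value_function([0.99, 0.98], [0.01, 0.02])
# A fooled discriminator that outputs 0.5 everywhere: V = -2 * log(2).
v_fooled_d = value_function([0.5, 0.5], [0.5, 0.5])
```

The discriminator pushes the value toward 0, while a generator that fools the discriminator pulls it down; the fooled case gives exactly −2 log 2.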

Part 2: Generator based on optimized computation

To address the high computational cost of Transformer models on high-resolution images, a generator based on optimized computation is proposed. Following the design of CNN-based generative adversarial networks, a Transformer-based generator with a pooling pyramid structure is proposed. When processing high-resolution images, it uses a window-based Masked Self-attention mechanism that masks out the other windows, which greatly reduces the memory and compute required. For high-resolution images it uses pixel shuffle upsampling, which raises the resolution while reducing the number of channels and so further reduces the number of parameters involved in the computation.

Figure 8 shows the Transformer-based generator module with optimized computation that is used for high-resolution images. A low-resolution product image is defined as one with resolution at most 32×32, and any image above this resolution is defined as high-resolution. A high-resolution image produces a sequence longer than 1024 tokens, which greatly increases the computational cost of the model and makes training on a single machine infeasible. The generator G of the overall model is a stack of several such modules that raises the resolution stage by stage, in pooling pyramid form, until it reaches the target resolution of 256×256.

The anomaly detection method takes random noise as input, and the noise first passes through an MLP layer. On the one hand, applying a series of linear and nonlinear transformations to the noise adds variety and strengthens the exploration of data distributions in different spaces during generation, which improves the diversity of the generated product images to some extent. On the other hand, the MLP pre-adjusts the features of the generated product, removing redundant information from the noise and keeping the more important features, so that the generator can more easily learn the data distribution of real products. The MLP converts the noise into a vector of length h₀ × w₀ × c, with h₀ = w₀ = 8. The resulting one-dimensional sequence of 64 tokens is then combined with its positional encoding. In the figure, the boxes shaded with downward diagonals represent the pixel embedding and the boxes shaded with upward diagonals represent the positional encoding. Because a Transformer cannot process a two-dimensional feature map directly, the map is treated as a sequence of length h₀ × w₀, that is, a sequence of h₀ × w₀ words in which each word is a c-dimensional vector.

At this point the product image has a resolution of 8×8, which is below the high-resolution limit, so it only needs to be processed by a conventional Transformer Block. As shown in Figure 2, the basic structure of this block resembles a single encoder layer: two layer norms, one multi-head self-attention, one MLP, and residual connections. After this processing, the data enters the upsampling module, which consists of a reshape operation and an upsampling step. The one-dimensional features are first reshaped into a two-dimensional feature map, conventional bicubic upscaling is applied, and the result is reshaped back into a one-dimensional sequence. These operations are repeated so that the image resolution is raised, and the image generated, stage by stage.

When the input has size 32×32×c, the generator module shown in Figure 8 is used. Inspired by the window mechanism of the Swin Transformer, the global feature map of the current high-resolution image is divided into multiple non-overlapping windows, and a window-based Masked Self-attention mechanism is proposed. The weights between every pair of tokens no longer need to be computed; only the weights between tokens within the same window are computed, which greatly reduces the resources consumed when processing high-resolution images. Attention weights are computed in each window, the feature vectors in the window are combined as a weighted average to give a window-level feature vector, and the window-level feature vectors are concatenated as the input to the next layer.
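The saving from restricting attention to windows can be checked with a small sketch. The window size of 8 used here is an assumption for illustration (the text does not state one), and `window_partition` and `attention_pairs` are hypothetical helper names.

```python
def window_partition(height, width, window):
    """Group token indices of a height x width grid into non-overlapping windows.

    Tokens are numbered row-major; height and width must be multiples of window.
    """
    if height % window or width % window:
        raise ValueError("grid size must be divisible by the window size")
    windows = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            windows.append([
                (top + dy) * width + (left + dx)
                for dy in range(window)
                for dx in range(window)
            ])
    return windows

def attention_pairs(height, width, window=None):
    """Number of query-key pairs scored by full or windowed self-attention."""
    n = height * width
    if window is None:
        return n * n  # every token attends to every token
    return n * window * window  # each token attends only within its window

wins = window_partition(32, 32, 8)           # 16 windows of 64 tokens each
full_cost = attention_pairs(32, 32)          # 1024 * 1024 pairs
windowed_cost = attention_pairs(32, 32, 8)   # 1024 * 64 pairs
```

Under these assumptions, a 32×32 token grid needs 16 times fewer query-key scores with windowed attention than with full self-attention, and the ratio grows with resolution.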

For the upsampling layer on high-resolution images, the pixel shuffle method is introduced to reduce computation. Raising the resolution of a high-resolution image with a traditional upsampling method consumes a large amount of memory and compute, and it also tends to blur and distort the image. Pixel shuffle upsamples by rearranging pixels, which avoids these problems and also retains more spatial information. The whole process is shown in Figure 9.

Pixel shuffle upsamples only at the very end of the network. Here $I^{HR}$ denotes a high-resolution image, $I^{LR}$ denotes the input low-resolution image, and $I^{SR}$ denotes the high-resolution image recovered from the low-resolution image. The process has two parts. First, an l-layer convolutional neural network is applied to the LR (low-resolution) image. Then a sub-pixel convolution layer upsamples the LR feature maps to produce the SR image. For a network of L layers, the first L−1 layers are defined as:

$$f^{1}(I^{LR}; W_1, b_1) = \phi(W_1 * I^{LR} + b_1) \tag{6}$$

$$f^{l}(I^{LR}; W_{1:l}, b_{1:l}) = \phi(W_l * f^{l-1}(I^{LR}) + b_l) \tag{7}$$

where W and b are the weights and biases, l ∈ [1, L−1] indexes one of the first L−1 layers, $W_l$ is a 2D convolution tensor of size $n_{l-1} \times n_l \times k_l \times k_l$, $n_l$ is the number of features at layer l, $k_l$ is the convolution kernel size at layer l, $b_l$ is a one-dimensional vector of length $n_l$, and $\phi$ is a nonlinear activation function.

The last layer $f^{L}$ converts the LR feature maps into the HR image $I^{SR}$:

$$I^{SR} = f^{L}(I^{LR}) = PS(W_L * f^{L-1}(I^{LR}) + b_L) \tag{8}$$

where PS is a periodic shuffling operator that rearranges the elements of an H × W × C·r² tensor into a tensor of shape rH × rW × C; the effect matches the final step shown in Figure 9. In mathematical form, the operator is defined as:

$$PS(T)_{x,y,c} = T_{\lfloor x/r \rfloor,\ \lfloor y/r \rfloor,\ C \cdot r \cdot \mathrm{mod}(y, r) + C \cdot \mathrm{mod}(x, r) + c} \tag{9}$$

Formulas (6) and (7) show that the convolution operator $W_L$ has shape $n_{L-1} \times r^{2}C \times k_L \times k_L$, where r is the upsampling factor and C is the number of channels of the output image. When $k_L = k_s / r$ and $\mathrm{mod}(k_s, r) = 0$, this is equivalent to a sub-pixel convolution in LR space with the filter $W_s$. When C > 1, the output has multiple channels: each group of C consecutive channels of the feature map is treated as a unit and its pixels are rearranged, which gives a multi-channel upsampled image. Put simply, the channels of the current feature map are tiled into the spatial plane and the tiled pixels are reordered according to the periodic operator, so the resulting image has a higher resolution and fewer channels.
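The rearrangement performed by the periodic shuffling operator can be sketched directly on nested lists. This is a minimal sketch that assumes the index rule commonly given for sub-pixel convolution, stated in the docstring; the function name `pixel_shuffle` and the toy tensor are hypothetical, and deep learning libraries may order the channels differently.

```python
def pixel_shuffle(t, r):
    """Rearrange an H x W x (C*r*r) nested list into rH x rW x C.

    Assumed index rule:
    out[x][y][c] = t[x // r][y // r][C*r*(y % r) + C*(x % r) + c].
    """
    h, w, channels_in = len(t), len(t[0]), len(t[0][0])
    if channels_in % (r * r):
        raise ValueError("channel count must be divisible by r*r")
    c_out = channels_in // (r * r)
    return [
        [
            [
                t[x // r][y // r][c_out * r * (y % r) + c_out * (x % r) + c]
                for c in range(c_out)
            ]
            for y in range(r * w)
        ]
        for x in range(r * h)
    ]

# 2 x 2 feature map with 4 channels, upsampled by r = 2 to 4 x 4 x 1.
# The value 100*i + 10*j + k encodes the source position (i, j) and channel k.
lr = [[[100 * i + 10 * j + k for k in range(4)] for j in range(2)] for i in range(2)]
sr = pixel_shuffle(lr, 2)
```

No value is created or lost: the 16 input values are only moved, so the spatial size doubles in each direction while the channel count drops by a factor of r² = 4.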

These operations are repeated: several computation-optimized Transformer Blocks raise the resolution of the generated product image stage by stage until the final output is a product image with a resolution of 256×256.

Part 3: Discriminator based on multi-scale input

During GAN training, the discriminator classifies the current input product image, deciding whether it is a real product image or a generated one. A discriminator built from Transformer Blocks handles global context features well because of its self-attention mechanism. Capturing global information well is not enough, however: if fine-grained local features are not extracted well, the generated images will not reproduce details faithfully.

To address this, the invention proposes a discriminator based on multi-scale input, which splits the input image into patches of different sizes at different scales. The discriminator's task differs from the generator's: it is essentially a classifier that separates generated images from real ones, and the patch size strongly affects the result. Large patches carry little local information, while very small patches make the computation expensive. When the patches are small, each patch carries more low-level detail, so the window-based Masked Self-attention mechanism is used, which lets the discriminator extract local features within the patches while reducing computation. When the patches are large, a standard Transformer Block is used so that the discriminator better captures the global context between different patches of the product. The multi-scale discriminator is designed around this difference in emphasis across input scales.

As shown in Figure 3, the multi-scale discriminator extracts global and local features at three different scales. Because the generated image has a resolution of 256×256, the patch is defined as P = 16×16, and the patch sizes at the three scales are (P, 2P, 4P). The input generated image $Y \in R^{H \times W \times 3}$ is split into three different sequences. The initial input is embedded into the vector space by the Embedding layer of ViT, which flattens the image into a sequence, and a CLS token is prepended to the sequence, with a preset value that defines the type of the current product image. In the end, the discriminator can keep only the CLS token to represent the classification result.
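The sequence lengths at the three scales follow directly from the image and patch sizes. The sketch below counts tokens for the 256×256 image and the patch sizes P, 2P, 4P with P = 16; the helper name `patch_sequence_length` is hypothetical, and whether the CLS token is added at every scale is not stated in the text, so it is an optional flag here.

```python
def patch_sequence_length(image_size, patch_size, with_cls=False):
    """Number of tokens when a square image is split into square patches."""
    if image_size % patch_size:
        raise ValueError("image size must be divisible by the patch size")
    per_side = image_size // patch_size
    return per_side * per_side + (1 if with_cls else 0)

P = 16
# Patch side length -> number of patch tokens for a 256 x 256 image.
lengths = {scale * P: patch_sequence_length(256, scale * P) for scale in (1, 2, 4)}
# Finest scale with the CLS token prepended.
finest_with_cls = patch_sequence_length(256, P, with_cls=True)
```

The finest scale yields 256 patch tokens, the middle scale 64, and the coarsest 16, which is why the finest scale is the one that benefits most from windowed attention.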

In this process, the first input sequence, of size (H/P × W/P) × 3, is transformed by the Embedding layer into (H/P × W/P) × C/4 and processed with the window-based Masked Self-attention mechanism. At this scale the model mainly extracts local feature information from the product image, which improves the applicability of the model through reduced computation and also avoids the weak local feature extraction that plain self-attention would bring. The output of this stage is downsampled by the patch merging layer of the Swin Transformer, which can be understood as a reverse pixel shuffle applied to the output. The result is concatenated with the second-stage input, of size (H/2P × W/2P) × C/2, to form the input of the next stage. At this second scale, a standard Transformer Block extracts global context information, and its output passes through an average pooling layer before being concatenated with the third-stage input, of size (H/4P × W/4P) × C/2. Fusing the features of the three scales ensures that the model extracts both local feature information and global context information well. Finally, the CLS token, which stores the anomaly classification information, is retained and the classification head outputs the classification result.

Part 4: Loss function and anomaly detection algorithm

The previous three parts describe the training of the Transformer-based generative adversarial network, which is the first training stage of the method. Its main purpose is to obtain a trained generator and discriminator, with whose help the second training stage, and thus the training of the whole anomaly detection model, is completed.

When a generative adversarial network is used for anomaly detection, the generator produces product images by learning the distribution of the latent variables. The AnoGAN model observes that the trained discriminator can be used directly for anomaly detection but that the results are not very good, so it uses latent variables to improve detection accuracy: it iteratively searches the latent space for the optimal latent variable and combines this search with the generative adversarial network to perform anomaly detection. However, it recognizes data from different anomaly categories poorly. The main reason is that traditional deep learning models do not extract multi-scale global information well, which leads to poor recognition of different categories of anomalies during detection.

To address this, the anomaly detection model is optimized to some extent and combined with the Transformer-based generative adversarial network. Training is divided into three modules, and the anomaly detection task is completed by a final weighted combination of the loss functions. The first module obtains a strong generator G and discriminator D through training; the rest of this part describes the remaining two modules. Throughout, the model is trained only on normal industrial products. During the training of the generative adversarial network, the generator G is in effect a mapping $G(z): z \mapsto x$, where $x \in I^{H \times W \times C}$ denotes the generated industrial product image and $z \in Z$ denotes a latent variable in the d-dimensional latent space. Mapping an image back to its latent variable is the inverse of what a generative adversarial network does, so obtaining the latent variable of an image directly from the network is hard to achieve. However, the trained generator can be combined with an encoder into a new module: a random noise vector is initialized as the generator input, and the generated image is used as the encoder input, so that the encoder is trained to find the corresponding latent variable in the latent space. The design of this module is shown in Figure 4.

The left half of the module is simply the mapping of random noise into the industrial product image space with the generator parameters fixed; the encoder is what needs to be trained. The encoder's main task is to map the generated product image back into the latent space and produce the latent variable z that corresponds to the product. Throughout training, the parameters of the generator G are fixed, so the mapping from a randomly initialized latent variable z to the generated image G(z) is deterministic. The training objective is to minimize the mean squared error (MSE) between the initial latent variable z and the latent variable reconstructed by the encoder:

$$\mathcal{L}(z) = \frac{1}{d}\,\lVert z - E(G(z)) \rVert^{2} \tag{10}$$

where E(G(z)) denotes the encoder taking the generated image as input and mapping it back into the latent space, and d is the number of dimensions of the latent space Z. Iterative training in this module reduces the distance between the approximate latent variable of a generated product and its true latent variable, but a problem remains: even if the generated industrial product images fool a discriminator with good discriminative ability, the latent variables of generated products still differ somewhat from those of real products. A third module is therefore introduced to make up for this shortcoming of the training process.
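The objective of this module, the mean squared error between a latent vector and its reconstruction through the frozen generator and the encoder, can be sketched with toy stand-ins. Everything named here (`toy_generator`, `perfect_encoder`, `biased_encoder`, `latent_reconstruction_loss`) is a hypothetical placeholder; in the method itself G and E are Transformer-based networks.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def toy_generator(z):
    """Hypothetical frozen generator: maps a d-dim latent to a 2d-pixel 'image'."""
    return [2.0 * v for v in z] + [v + 1.0 for v in z]

def perfect_encoder(image):
    """Hypothetical encoder that exactly inverts toy_generator."""
    d = len(image) // 2
    return [v / 2.0 for v in image[:d]]

def biased_encoder(image):
    """Hypothetical encoder with a constant error of 0.5 per latent dimension."""
    return [v + 0.5 for v in perfect_encoder(image)]

def latent_reconstruction_loss(z, generator, encoder):
    """(1/d) * ||z - E(G(z))||^2 for one latent sample z."""
    return mse(z, encoder(generator(z)))

z = [0.5, -1.0, 2.0]
loss_perfect = latent_reconstruction_loss(z, toy_generator, perfect_encoder)
loss_biased = latent_reconstruction_loss(z, toy_generator, biased_encoder)
```

Only the encoder would receive gradient updates from this loss, since the generator is frozen; the biased encoder illustrates the residual error that training is meant to remove.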

The third module is shown in Figure 5. In this module the parameters of the generator G and the discriminator D are again fixed, and real products are used as the encoder input, which remedies the shortcoming of the second module, where the encoder is trained only on generated images. The latent variable found by the encoder is then fed to the generator G, whose parameters are fixed, and mapped back into image space, while the mean squared error (MSE) between the generated product image and the real product image is minimized:

$$\mathcal{L}(x) = \frac{1}{n}\,\lVert x - G(E(x)) \rVert^{2} \tag{11}$$

where x is the real product used as input, n is its number of pixels, and $\lVert \cdot \rVert^{2}$ denotes the residual sum of squares of the grayscale values in the image. Using this loss value alone as the indicator of anomalies may affect the final detection result. The position to which a real product maps in the latent space Z is not known exactly; it is obtained indirectly through the residual, so information lost in the mapping or errors in computing the residual may reduce the final accuracy. To limit this effect, after the loss value in formula (11) has been computed, the real product and the generated product are each fed to the discriminator, and the mean squared error between their feature distributions is computed and used as an additional loss function. This assists formula (11) and reduces the impact of possible errors on the accuracy of the whole method.
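The weighted combination of the image residual loss and the discriminator feature loss, followed by thresholding, can be sketched as follows. The weight `kappa`, the threshold value, and all sample vectors are assumptions chosen for illustration; the text states only that the losses are weighted and compared against a suitably chosen threshold.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def anomaly_score(x, x_rec, feat_x, feat_rec, kappa=1.0):
    """Image residual loss plus kappa times the discriminator feature loss.

    x, x_rec: flattened real image and its reconstruction G(E(x)).
    feat_x, feat_rec: discriminator features of the two images.
    """
    return mse(x, x_rec) + kappa * mse(feat_x, feat_rec)

def is_anomalous(score, threshold):
    """Flag a product as anomalous when its score exceeds the threshold."""
    return score > threshold

# Hypothetical normal product: the reconstruction is close to the input.
score_normal = anomaly_score(
    [0.0, 0.5, 1.0, 0.5], [0.0, 0.5, 1.0, 0.25], [1.0, 2.0], [1.0, 1.5], kappa=0.5
)
# Hypothetical defective product: the generator, trained only on normal
# products, cannot reproduce the defect, so both residuals are large.
score_defect = anomaly_score(
    [1.0, 0.5, 0.0, 0.5], [0.0, 0.5, 1.0, 0.5], [3.0, 0.0], [1.0, 2.0], kappa=0.5
)
threshold = 0.5
```

In practice the threshold would be chosen from the scores of held-out normal products; the fixed value here only demonstrates the comparison.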

The invention also provides an industrial product anomaly detection system based on a generative adversarial network. As shown in Figure 10, it comprises a model building module, a prediction result acquisition module, and a result comparison module.

The model building module builds the generator and the multi-scale input discriminator, and builds the generative adversarial network model from them.

The prediction result acquisition module feeds random noise to the generative adversarial network model, so that the distribution of the data in normal product images is learned from the noise, and obtains the prediction result of the model.

The result comparison module detects anomalies in industrial products by comparing the model's prediction result for an industrial product with the anomaly score threshold.

本发明实施例提供的终端设备，该实施例的终端设备包括：处理器、存储器以及存储在所述存储器中并可在所述处理器上运行的计算机程序。所述处理器执行所述计算机程序时实现上述各个方法实施例中的步骤。或者，所述处理器执行所述计算机程序时实现上述各装置实施例中各模块/单元的功能。The terminal device provided in an embodiment of the present invention comprises: a processor, a memory, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, the steps in the above-mentioned method embodiments are implemented. Alternatively, when the processor executes the computer program, the functions of the modules/units in the above-mentioned device embodiments are implemented.

所述计算机程序可以被分割成一个或多个模块/单元，所述一个或者多个模块/单元被存储在所述存储器中，并由所述处理器执行，以完成本发明。The computer program may be divided into one or more modules/units, and the one or more modules/units are stored in the memory and executed by the processor to accomplish the present invention.

The terminal device may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a cloud server. The terminal device may include, but is not limited to, a processor and a memory.

The processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.

The memory may be used to store the computer program and/or modules. The processor implements the various functions of the terminal device by running or executing the computer program and/or modules stored in the memory and by calling data stored in the memory.

If the modules/units integrated in the terminal device are implemented as software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. On this understanding, all or part of the processes of the method embodiments described above may also be carried out by a computer program instructing the relevant hardware. The computer program may be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of the method embodiments described above. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, or some intermediate form. The computer-readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and the like. It should be noted that the content covered by the computer-readable medium may be expanded or narrowed as required by legislation and patent practice in a given jurisdiction. For example, in some jurisdictions, according to legislation and patent practice, computer-readable media do not include electrical carrier signals and telecommunication signals.

The above are merely preferred embodiments of the present invention and are not intended to limit it. Those skilled in the art may make various modifications and variations to the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.