Technical Field
The present application relates to the field of image detection, and in particular to an image recognition method, apparatus, device, and computer storage medium.
Background
Identifying target objects in images is one of the important research directions in computer vision and plays an important role in public safety, road traffic, video surveillance, and other fields. In the prior art, a target object may be identified using the spatial relationship features of the target objects in an image; the image feature weights in a neural network may also be appropriately matched to improve the network's recognition accuracy for target objects.
However, in the prior art, due to the complex diversity of the scenes contained in images and the uncertainty of the positions of the targets to be detected, existing methods cannot adapt to a wider range of scenes and therefore cannot improve the accuracy of image recognition.
Summary of the Invention
Embodiments of the present application provide an image recognition method, apparatus, device, and computer storage medium, so as to improve the accuracy of image recognition.
In a first aspect, an embodiment of the present application provides an image recognition method, the method comprising:
acquiring an image to be recognized, the image to be recognized containing at least one object to be recognized;
inputting the image to be recognized into a first network of a pre-trained image recognition model to determine text features of the image to be recognized;
inputting the image to be recognized into a second network of the image recognition model to determine a pooled feature map and spatial relationship features of the at least one object to be recognized;
performing feature fusion on the text features of the image to be recognized and the pooled feature map and spatial relationship features of the at least one object to be recognized, to determine a shared feature image corresponding to the image to be recognized;
inputting the shared feature image into a third network of the image recognition model to determine recognition information of the image to be recognized, the recognition information including category information and position information of each object to be recognized.
In a second aspect, an embodiment of the present application provides an image recognition apparatus, the apparatus comprising:
a first acquisition module, configured to acquire an image to be recognized, the image to be recognized containing at least one object to be recognized;
a first determination module, configured to input the image to be recognized into a first network of a pre-trained image recognition model to determine text features of the image to be recognized;
a second determination module, configured to input the image to be recognized into a second network of the image recognition model to determine a pooled feature map and spatial relationship features of the at least one object to be recognized;
a fusion module, configured to perform feature fusion on the text features of the image to be recognized and the pooled feature map and spatial relationship features of the at least one object to be recognized, to determine a shared feature image corresponding to the image to be recognized;
a recognition module, configured to input the shared feature image into a third network of the image recognition model to determine recognition information of the image to be recognized, the recognition information including category information and position information of each object to be recognized.
In a third aspect, an embodiment of the present application provides an image recognition device, the device comprising:
a processor, and a memory storing computer program instructions; the processor reads and executes the computer program instructions to implement the image recognition method provided in the first aspect of the embodiments of the present application.
In a fourth aspect, an embodiment of the present application provides a computer storage medium having computer program instructions stored thereon; when executed by a processor, the computer program instructions implement the image recognition method provided in the first aspect of the embodiments of the present application.
The image recognition method provided in the embodiments of the present application extracts the text features of the image to be recognized as well as the pooled feature map and spatial relationship features of at least one object to be recognized in that image, fuses these three kinds of features, and inputs the fused shared feature map into a third network of the image recognition model to determine the recognition information of the image to be recognized, the recognition information including the category information and position information of each object to be recognized. Compared with the prior art, the feature fusion makes the different kinds of image information complementary, avoiding redundant noise while compensating for the deficiencies of image feature information in details and scenes; at the same time, the extracted text features can reflect the differences and commonalities of images across scenes, so the method can be applied to more complex scenes and improves the accuracy of image recognition.
Brief Description of the Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required by the embodiments are briefly introduced below. For those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a schematic flow chart of a method for training an image recognition model provided in an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a multimodal feature fusion module provided in an embodiment of the present application;
FIG. 3 is a schematic flow chart of an image recognition method provided in an embodiment of the present application;
FIG. 4 is a schematic structural diagram of an image recognition apparatus provided in an embodiment of the present application;
FIG. 5 is a schematic structural diagram of an image recognition device provided in an embodiment of the present application.
Detailed Description
Features and exemplary embodiments of various aspects of the present application are described in detail below. To make the purpose, technical solutions, and advantages of the present application clearer, the application is further described in detail with reference to the drawings and specific embodiments. It should be understood that the specific embodiments described herein are intended only to explain the present application, not to limit it. For those skilled in the art, the present application can be practiced without some of these specific details. The following description of the embodiments is provided merely to give a better understanding of the present application by way of example.
It should be noted that, herein, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes that element.
Image recognition algorithms are one of the important research directions in computer vision and play an important role in public safety, road traffic, video surveillance, and other fields. In recent years, with the development of deep-learning-based image recognition algorithms, the accuracy of image recognition has continuously improved.
In the prior art, image recognition is performed in the following two ways:
1. Multi-view image object detection based on visual saliency
For scenes in which the foreground target is not occluded, the saliency maps of multiple view images are computed. Using the spatial relationships between views, the saliency maps of the two side views are projected onto the middle target view, and the projected saliency maps are fused with the saliency map of the middle view to obtain a fused saliency map. Regions occluded by foreground objects cannot be truly mapped onto the target view during projection, so projection holes appear around foreground targets in the projected saliency map; these projection-hole regions are treated as background in the fused saliency map. The multi-view projection holes are used to partition the image: the regions between a projection hole and the image edge, and the regions between the projection holes of different foreground objects, are all treated as background. In the fused saliency map, the saliency values of the background regions obtained above are set to zero, and after binarization a target object with clear edges and no background interference is obtained.
2. Small object detection under complex backgrounds
Drawing on the idea of the feature pyramid algorithm, the features of the Conv4-3 layer are fused with the features of the Conv7 and Conv3-3 layers, and the number of default boxes at each position of the fused feature map is increased. A squeeze-and-excitation network (SENet) is added to the network structure to assign weights to the feature channels of each layer, boosting useful feature weights and suppressing ineffective ones. To enhance the generalization ability of the network, a series of augmentations is also applied to the training dataset.
The above two algorithms are both common techniques for detecting and recognizing target objects in images. However, due to the complex diversity of the scenes contained in images and the uncertainty of the positions of the targets to be detected, conventional object detection methods are not robust across different application scenarios. The multi-view image object detection method based on visual saliency considers only the spatial relationship features of the targets to be detected, without fully exploiting the various kinds of feature information in the image to supplement one another and improve the accuracy of the final recognition. The small object detection algorithm for complex backgrounds ignores the contextual information of the complex background and the spatial relationships of the targets to be detected; its scope of application is narrow, as it mainly improves the detection and recognition accuracy of small targets in an image while neglecting the use of the algorithm in more complex scenes.
Based on this, an embodiment of the present application provides an image recognition method that makes image information complementary through feature fusion, avoiding redundant noise while compensating for the deficiencies of image feature information in details and scenes; at the same time, the extracted text features can reflect the differences and commonalities of images across scenes, so the method can be applied to more complex scenes and improves the accuracy of image recognition.
It should be noted that the image recognition method provided in the embodiments of the present application recognizes images with a pre-trained image recognition model; therefore, the model must be trained before it is used for image recognition. Accordingly, a specific implementation of the method for training the image recognition model is described first with reference to the drawings.
As shown in FIG. 1, an embodiment of the present application provides a method for training an image recognition model. First, sample images are acquired; the pooled feature maps, text features, and spatial relationship features extracted from the sample images are fused to form a shared feature map carrying richer information; and a preset image recognition model is trained iteratively with classification and regression detection algorithms until a training stop condition is met. The method can be implemented through the following steps:
1. Acquire multiple images to be annotated.
In some embodiments, multiple images to be annotated can be acquired by a vehicle-mounted camera, or obtained by extracting frames from an acquired video.
2. Manually annotate the images to be annotated. The content to be annotated is the label recognition information of the target objects, which includes the classification information and position information of each target recognition object, where the position information is the coordinates of the bounding box enclosing the target object.
In some embodiments, the images captured by a vehicle-mounted camera mainly depict road traffic scenes, so the annotated objects may include target objects such as pedestrians, riders, bicycles, motorcycles, cars, trucks, buses, trains, traffic signs, and traffic lights; the annotation result is the category of each target object and the coordinates of the bounding box enclosing it. At the same time, each image to be annotated is given text annotations from three perspectives: time, location, and weather.
Specifically, for each image to be annotated, the time annotation can take the values daytime, dusk/dawn, or night; the location annotation can take the values highway, city street, residential area, parking lot, gas station, or tunnel; and the weather annotation can take the values snowy, cloudy, sunny, overcast, rainy, or foggy.
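For illustration, one plausible way to serialize such an annotation record is sketched below; the field names and file path are assumptions made for this example and are not a format prescribed by the application.

```python
# A hypothetical annotation record for one image; field names are illustrative only.
annotation = {
    "image": "frames/000123.jpg",  # path to the image to be annotated (assumed)
    "objects": [
        # category label plus bounding-box corner coordinates (x1, y1, x2, y2)
        {"category": "car", "bbox": [412, 305, 638, 470]},
        {"category": "traffic light", "bbox": [255, 80, 280, 140]},
    ],
    # context annotations from the three perspectives described above
    "context": {"time": "night", "location": "city street", "weather": "rainy"},
}
```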
3. Integrate the manually annotated images and the annotation information corresponding to each image into a training sample set, which includes multiple sample image groups.
It should be noted that the image recognition model must be trained iteratively to adjust its loss function value until that value satisfies the training stop condition, yielding the trained model. If only one sample image were input per training iteration, the sample size would be too small for effective training and adjustment; the training sample set is therefore divided into multiple sample image groups, each containing multiple sample images, and the model is trained iteratively on these groups.
4. Train the image recognition model with the sample image groups in the training sample set until the training stop condition is met, obtaining the trained image recognition model. Specifically, this may involve the following steps:
4.1. Use the second network of the preset image recognition model to extract sample pooled feature maps and sample spatial relationship features of the recognizable objects in the sample images.
In some embodiments, the second network of the preset image recognition model may be a faster region-based convolutional neural network (Faster R-CNN), which is not limited in this application.
Specifically, the sample pooled feature maps and sample spatial relationship features of the recognizable objects in a sample image can be obtained through the following steps:
4.1.1. Uniformly resize the sample images in the training set to a fixed size of 1000×600 pixels to obtain resized sample images.
4.1.2. Input the resized sample image group into the deep residual network ResNet, the region proposal network RPN, and the fast region-based convolutional neural network to extract image features and obtain pooled feature maps.
1) First, input a resized sample image into the 7×7×64 convolutional layer conv1, then pass it sequentially through the convolutional layers conv2_x, conv3_x, conv4_x, and conv5_x and a fully connected layer fc to extract the original feature map of the sample image;
2) Input the original feature map output by conv4_x of the ResNet structure into the region proposal network RPN, and select the 300 anchor boxes (anchors) with the highest prediction scores together with their corresponding candidate boxes;
3) With reference to the original feature map output by conv4_x, input the position maps of the 300 candidate boxes into the region-of-interest pooling layer (ROI Pooling) of the fast region-based convolutional neural network to obtain fixed-size pooled feature maps of the recognizable objects.
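A minimal sketch of step 3) using the ROI pooling operator from torchvision follows; the feature-map shape, the spatial scale of 1/16 for conv4_x, and the 7×7 output size are assumptions chosen for illustration.

```python
import torch
from torchvision.ops import roi_pool

# Feature map from conv4_x for one image: (batch, channels, height, width); shape assumed.
feature_map = torch.randn(1, 1024, 38, 63)

# Candidate boxes in (batch_index, x1, y1, x2, y2) format, in input-image coordinates.
boxes = torch.tensor([
    [0, 100.0, 150.0, 400.0, 380.0],
    [0, 500.0, 200.0, 700.0, 450.0],
])

# spatial_scale maps image coordinates onto the downsampled feature map (1/16 assumed for conv4_x).
pooled = roi_pool(feature_map, boxes, output_size=(7, 7), spatial_scale=1.0 / 16)
print(pooled.shape)  # torch.Size([2, 1024, 7, 7]) -- one fixed-size map per candidate box
```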
4.1.3. Using the coordinates of the 300 anchors and their corresponding candidate boxes, calculate the intersection over union (IoU) between candidate boxes, and compute the spatial relationship features between recognizable objects according to Formula 1 below:
F_r = f(w, h, area, d_x, d_y, IoU)   (Formula 1)
where w and h denote the width and height of a candidate box, area denotes the area of the candidate box, d_x and d_y are the horizontal and vertical distances between the geometric centers of two candidate boxes, IoU is the intersection over union between the candidate boxes, f(·) denotes an activation function, and F_r denotes the predicted spatial relationship features between recognizable objects.
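The sketch below computes the quantities entering Formula 1 for one pair of candidate boxes; since the application does not specify f(·), using a ReLU-activated linear projection (with a hypothetical weight matrix W) is an assumption.

```python
import numpy as np

def pairwise_spatial_features(box_a, box_b):
    """Compute (w, h, area, dx, dy, IoU) for two boxes given as (x1, y1, x2, y2)."""
    w, h = box_a[2] - box_a[0], box_a[3] - box_a[1]
    area = w * h
    # horizontal/vertical distance between the geometric centers of the two boxes
    dx = abs((box_a[0] + box_a[2]) / 2 - (box_b[0] + box_b[2]) / 2)
    dy = abs((box_a[1] + box_a[3]) / 2 - (box_b[1] + box_b[3]) / 2)
    # intersection over union of the two boxes
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    iou = inter / (area + area_b - inter)
    return np.array([w, h, area, dx, dy, iou])

# f(.) modeled as a ReLU-activated linear map; W is a hypothetical projection to 64 dimensions.
raw = pairwise_spatial_features((100, 150, 400, 380), (300, 200, 700, 450))
W = np.random.randn(64, 6)
F_r = np.maximum(0.0, W @ raw)  # F_r = f(w, h, area, dx, dy, IoU)
```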
4.2. Input the sample image into the first network of the preset image recognition model, determine at least one text vector according to the context information of the sample image, and concatenate the text vectors to determine the sample text feature F_t corresponding to the sample image.
It should be noted that the first network of the image recognition model may be a pre-trained model such as Word2vec, GloVe, or BERT, and that the text vectors determined from the context information of a sample image may be word vectors converted from the text annotations describing the time, location, and weather of that image; neither is limited in this application.
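As a sketch of this step, the snippet below concatenates per-annotation word vectors into F_t; the toy embedding table merely stands in for a pre-trained Word2vec, GloVe, or BERT encoder, and the 50-dimensional vector size is an assumption.

```python
import numpy as np

# Toy embedding table standing in for a pre-trained text encoder; in practice each
# annotation word would be looked up in the pre-trained model instead.
rng = np.random.default_rng(0)
vocab = ["night", "city street", "rainy"]  # values from the time/location/weather axes
embed = {word: rng.standard_normal(50) for word in vocab}  # 50-d vectors (dimension assumed)

def text_feature(time, location, weather):
    """Concatenate the word vectors of the three context annotations into F_t."""
    return np.concatenate([embed[time], embed[location], embed[weather]])

F_t = text_feature("night", "city street", "rainy")
print(F_t.shape)  # (150,) -- one text feature per image
```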
4.3. As shown in FIG. 2, construct a multimodal feature fusion module that complementarily fuses the sample text features extracted from the sample image context information with the sample spatial relationship features and the sample pooled feature map determined by the second network of the image recognition model, obtaining a sample shared feature image. The fusion is computed according to Formulas 2 and 3:
F_v = ReLU(F_roi, F_r)   (Formula 2)
F_out = F_v * F_t   (Formula 3)
where F_roi denotes the fixed-size feature map output by the ROI pooling layer, F_v denotes the fused visual feature map, and F_out denotes the sample shared feature image obtained by fusing the sample text features, sample spatial relationship features, and sample pooled feature map.
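Formula 2 does not spell out how F_roi and F_r are combined inside the ReLU, so the sketch below assumes channel-wise concatenation followed by a 1×1 convolution, and assumes that F_t is linearly projected and broadcast over the spatial grid for the element-wise product of Formula 3; all dimensions are likewise illustrative.

```python
import torch
import torch.nn as nn

class FusionModule(nn.Module):
    """Sketch of the multimodal fusion of Formulas 2 and 3 (combination scheme assumed)."""
    def __init__(self, roi_ch=1024, rel_dim=64, txt_dim=150):
        super().__init__()
        # project the concatenated visual + relation features back to roi_ch channels
        self.proj = nn.Conv2d(roi_ch + rel_dim, roi_ch, kernel_size=1)
        self.txt_proj = nn.Linear(txt_dim, roi_ch)  # align F_t with the channel dimension

    def forward(self, f_roi, f_r, f_t):
        # tile the relation feature over the pooled grid and concatenate channel-wise
        f_r_map = f_r[:, :, None, None].expand(-1, -1, f_roi.size(2), f_roi.size(3))
        f_v = torch.relu(self.proj(torch.cat([f_roi, f_r_map], dim=1)))  # Formula 2
        f_t_map = self.txt_proj(f_t)[:, :, None, None]                   # broadcast F_t
        return f_v * f_t_map                                             # Formula 3

fusion = FusionModule()
f_out = fusion(torch.randn(2, 1024, 7, 7), torch.randn(2, 64), torch.randn(2, 150))
print(f_out.shape)  # torch.Size([2, 1024, 7, 7])
```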
4.4. Input the sample shared feature image into the third network of the preset image recognition model to determine the reference recognition information of each recognizable object, the reference recognition information including the classification information and reference position information of the recognizable object.
4.5. Perform non-maximum suppression on the reference position information of each recognizable object, filter out reference position information that does not meet preset requirements, and determine the predicted recognition information of each sample image, the predicted recognition information including the classification information and predicted position information of all recognizable objects.
In some embodiments, non-maximum suppression (NMS) is applied separately to the reference position information of each category of recognizable object. NMS takes the list of predictions sorted by score and iterates over it, discarding predictions whose IoU with a higher-scoring prediction exceeds a predefined threshold; here the threshold is set to 0.7, filtering out heavily overlapping candidate boxes, and the position information remaining after suppression is taken as the predicted position information.
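A self-contained sketch of the per-category NMS step described above follows, with boxes in (x1, y1, x2, y2) format and the IoU threshold of 0.7 taken from the description.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.7):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it above the threshold."""
    order = scores.argsort()[::-1]  # prediction list sorted by score, descending
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # IoU of the top box against all remaining boxes
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_threshold]  # discard predictions with IoU > 0.7
    return keep

boxes = np.array([[100, 100, 210, 210], [105, 108, 215, 220], [300, 300, 380, 400]], float)
print(nms(boxes, np.array([0.9, 0.8, 0.7])))  # [0, 2] -- the near-duplicate box is suppressed
```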
4.6. Calculate the loss value between the predicted recognition information and the annotated recognition information, optimize the image recognition model according to the target loss function shown in Formula 4, and update the network parameters backward with the gradient descent algorithm to obtain an updated image recognition model; when the loss function value is smaller than a preset value, stop the optimization training and take the result as the trained image recognition model.
L({p_i}, {t_i}) = (1/N_cls) · Σ_i L_cls(p_i, p_i*) + λ · (1/N_reg) · Σ_i p_i* · L_reg(t_i, t_i*)   (Formula 4)
where i denotes the index of an anchor, p_i denotes the probability that the i-th anchor is predicted to be a target, p_i* denotes the ground-truth label of the i-th anchor, λ is a weighting parameter, L_cls(p_i, p_i*) denotes the log loss over the two classes (target and non-target), i.e., the classification loss, t_i = {t_x, t_y, t_w, t_h} denotes the offsets predicted for the anchor in the RPN training stage (for the RoIs in the Fast R-CNN stage), t_i* denotes the actual offsets of the anchor relative to the ground-truth label in the RPN training stage (RoIs in the Fast R-CNN stage), and L_reg(t_i, t_i*) denotes the regression loss.
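A sketch of the two terms of Formula 4 is given below; using smooth-L1 as the regression loss L_reg is the usual choice in Faster R-CNN and is an assumption here, since the application only names the loss.

```python
import torch
import torch.nn.functional as F

def rpn_loss(p, p_star, t, t_star, lam=1.0):
    """Formula 4: classification log loss plus regression loss on positive anchors only."""
    # L_cls: log loss over the two classes (target / non-target), averaged over the anchors
    l_cls = F.binary_cross_entropy(p, p_star)
    # L_reg: smooth-L1 on the predicted offsets, counted only where p_star == 1
    l_reg = (p_star[:, None] * F.smooth_l1_loss(t, t_star, reduction="none")).sum()
    l_reg = l_reg / p_star.sum().clamp(min=1)
    return l_cls + lam * l_reg

p = torch.tensor([0.9, 0.2, 0.7])       # predicted objectness per anchor
p_star = torch.tensor([1.0, 0.0, 1.0])  # ground-truth anchor labels
t = torch.randn(3, 4)                   # predicted offsets (tx, ty, tw, th)
t_star = torch.randn(3, 4)              # ground-truth offsets
print(rpn_loss(p, p_star, t, t_star))
```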
It should be noted that, to improve the accuracy of the image recognition model, the model can also be trained continuously with new training samples in practical applications, so that the model is kept up to date, its accuracy improves, and the accuracy of image recognition improves in turn.
The above is a specific implementation of the image recognition model training method provided in the embodiments of the present application; the image recognition model obtained through this training can be applied in the image recognition method provided in the following embodiments.
A specific implementation of the image recognition method provided by the present application is described in detail below with reference to FIG. 3.
As shown in FIG. 3, an embodiment of the present application provides an image recognition method, the method comprising:
S301: Acquire an image to be recognized, the image to be recognized containing at least one object to be recognized.
In some embodiments, the image to be recognized can be acquired by a vehicle-mounted camera, or determined by extracting frames from a previously acquired video.
Taking a road traffic scene as an example, the objects to be recognized in the image may be pedestrians, riders, bicycles, motorcycles, cars, trucks, buses, trains, traffic signs, traffic lights, and the like.
S302: Input the image to be recognized into the first network of the pre-trained image recognition model to determine the text features of the image to be recognized.
In some embodiments, the image to be recognized is input into the first network of the pre-trained image recognition model, at least one text vector is determined according to the context information of the image to be recognized, and the text vectors are concatenated to determine the text features of the image to be recognized.
It should be noted that these text vectors are word vectors determined by the first network from the context information of the image to be recognized, converted from the text annotations describing the time, location, and weather of the image. The text features obtained by concatenating multiple text vectors can therefore characterize the environmental information of the image to be recognized and reflect its differences from and commonalities with images of other scenes, enhancing the distinguishability of the objects to be recognized.
S303: Input the image to be recognized into the second network of the image recognition model to determine the pooled feature map and spatial relationship features of the at least one object to be recognized.
It should be noted that, when recognizing objects, the image to be recognized contains a large amount of redundant information, so the image must be convolved; after the image features are determined by convolution, the extracted features could be used directly to train the image recognition model, but the computational cost would be high. The image is therefore pooled to reduce the dimensionality of the image features, reducing the amount of computation and the number of parameters while preventing overfitting and improving the fault tolerance of the model.
On the other hand, spatial relationships refer to the relative spatial positions and relative orientations between the multiple target objects segmented from an image; these relationships can be divided into connection, overlap, and containment relationships. Extracting spatial relationship features can therefore enhance the ability to distinguish image content.
In some embodiments, the pooled feature map and spatial relationship features of the at least one object to be recognized can be determined through the following steps:
1. Adjust the resolution of each sample image in the sample image group to a preset resolution to determine an adjusted sample image group.
In this step, the sample images in the training set can be uniformly resized to a fixed size of 1000×600 pixels.
2. Input the adjusted sample image group into the deep residual network to determine an original image set, the images in the original image set corresponding one-to-one to the images in the adjusted sample image group.
Specifically, a resized sample image can be input into the 7×7×64 convolutional layer conv1 and then passed sequentially through the convolutional layers conv2_x, conv3_x, conv4_x, and conv5_x and a fully connected layer fc to extract the original feature map of the sample image.
3. Input the original image set into the region proposal network to determine N anchor boxes and the position coordinates corresponding to each anchor box, where an anchor box is a bounding box enclosing a recognizable object predicted by the region proposal network and N is an integer greater than 1; based on the confidences of the N anchor boxes, extract the M anchor boxes whose confidence is greater than a preset confidence threshold, where M is a positive integer smaller than N.
As an example, the original feature map output by conv4_x of the ResNet structure can be input into the region proposal network RPN to determine multiple anchor boxes and their corresponding candidate boxes, and, based on the confidence of each anchor box, the 300 anchor boxes with the highest confidence and their corresponding candidate boxes are selected.
4. Input the mapped region images of the M anchor boxes into the region-of-interest pooling layer of the region-based convolutional neural network, adjust the resolution of the mapped region images of the M anchor boxes, and determine M sample pooled feature maps of the same resolution, where each recognizable object corresponds to at least one anchor box.
In this step, following the original feature map output by conv4_x, the position maps of the 300 candidate boxes can be input into the region-of-interest pooling layer of the fast region-based convolutional neural network to obtain fixed-size pooled feature maps of the recognizable objects.
S304: Perform feature fusion on the text features of the image to be recognized and the pooled feature map and spatial relationship features of the at least one object to be recognized, to determine a shared feature image corresponding to the image to be recognized.
In S302 and S303 above, the text features of the image to be recognized and the pooled feature map and spatial relationship features of the at least one object to be recognized were extracted separately. Although spatial relationship features are more sensitive to rotation, flipping, and size changes of the image or of the target objects in it, and pooled feature maps reduce the amount of computation in image recognition, in practical applications spatial relationship features and/or pooled features alone are not sufficient and cannot express scene information effectively and accurately. It is therefore necessary to fuse the text features of the image to be recognized with the pooled feature map and spatial relationship features of the at least one object to be recognized, making full use of the various kinds of feature information in the image to supplement one another, so as to reflect the differences and commonalities of images across scenes, avoiding redundant noise while compensating for the deficiencies of image feature information in details and scenes.
S305: Input the shared feature image into the third network of the image recognition model to determine the recognition information of the image to be recognized, the recognition information including the category information and position information of each object to be recognized.
The image recognition method provided in the embodiments of the present application determines, through the image recognition model, the text features of the image to be recognized and the pooled feature map and spatial relationship features of the at least one object to be recognized. Fusing these complementary kinds of feature information enhances the distinguishability of the objects to be recognized in the image, thereby optimizing the final recognition performance, suiting the method to more complex scenes, and improving the accuracy of image recognition.
To verify that the image recognition method provided in the above embodiments improves recognition accuracy compared with prior-art image recognition methods, an embodiment of the present application further provides an image recognition test method for testing the image recognition model used in the method. Specifically, the test may include the following steps:
1. Input sample images into the trained image recognition model for testing.
Specifically, calculate the average detection precision of the target objects of all categories according to Formulas 5 and 6, and output the classification and prediction precision of each predicted box:
AP = ∫ P(R) dR, integrated over recall R from 0 to 1   (Formula 5)
mAP = (1/N) · Σ_{i=1..N} AP_i   (Formula 6)
where N denotes the number of target categories to be detected, AP denotes the average precision of one category, and mAP denotes the mean of the average precisions over all categories.
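As a sketch, the snippet below computes AP by all-point interpolation of a precision-recall curve and averages the per-category APs for mAP; the toy curves are assumptions used only for illustration.

```python
import numpy as np

def average_precision(recall, precision):
    """AP as the area under the precision-recall curve (Formula 5), all-point interpolation."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # make precision monotonically non-increasing before integrating
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# Toy per-category precision-recall curves (values assumed for illustration)
curves = {
    "car":        (np.array([0.2, 0.5, 0.8]), np.array([0.95, 0.85, 0.60])),
    "pedestrian": (np.array([0.1, 0.4, 0.7]), np.array([0.90, 0.75, 0.50])),
}
aps = {cls: average_precision(r, p) for cls, (r, p) in curves.items()}
m_ap = sum(aps.values()) / len(aps)  # Formula 6: mean of the per-category APs
print(aps, m_ap)
```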
2. Obtain the detection results according to the above AP and mAP formulas, and compare the prior-art Faster R-CNN algorithm with the image recognition algorithm based on the image recognition model provided in the embodiments of the present application, reaching the following conclusion:
Applying the image recognition method provided in the embodiments of the present application to a classic image recognition network significantly improves the recognition results; even when the backgrounds of the images differ greatly, the recognition precision for the target objects in the images remains at a stable level and is better than that of the original algorithm.
Specifically, the testing method of the image recognition model provided in the embodiments of the present application is further illustrated by the following simulation experiment.
The prior art adopted in the simulation experiment is the faster region-based convolutional neural network, Faster R-CNN. The image recognition model uses the ResNet101 structure to extract image features, with an initial learning rate of 0.005, a learning rate decay coefficient of 0.1, 15 epochs, and SGD as the default optimizer.
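The stated hyperparameters map onto a standard PyTorch training setup as sketched below; the decay interval (step_size) is not given in the application and is an assumption, as is the momentum value.

```python
import torch

model = torch.nn.Conv2d(3, 64, 3)  # stands in for the ResNet101-based detector
optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)  # momentum assumed
# learning-rate decay coefficient 0.1; the decay interval of 5 epochs is an assumption
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)

for epoch in range(15):  # epoch count of 15 as stated above
    # ... one training pass over the sample image groups would go here ...
    scheduler.step()
```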
1. Simulation conditions: the hardware environment is an Intel Core i7-7700 @ 3.60 GHz with 8 GB of memory; the software environment is Ubuntu 16.04, Python 3.7, and PyCharm 2019.
2. Simulation content and result analysis:
First, the sample image set is taken as input. On the basis of the traditional Faster R-CNN algorithm, context-based text feature extraction, spatial relationship feature extraction, and pooled feature map acquisition are introduced, and the three kinds of features are fused, which is the basic idea of the fusion detection method described above. With this method the image recognition model is trained, the test sample set is input into the trained improved model, and the average precision of each category and the mean average precision over all categories are evaluated with the AP metric.
The experiments in this application are conducted on the BDD100k public driving dataset. The simulation results are shown in Table 1, which compares the classic Faster R-CNN algorithm and the context-based multimodal feature fusion detection method on the same test dataset.
Table 1. Performance comparison of the image recognition methods
As the experimental results in Table 1 show, compared with the detection precision of the classic Faster R-CNN algorithm on the test dataset, the image recognition method provided in the embodiments of the present application improves the average detection precision on five of the object categories by nearly 4.3% across tasks in different scenes. Repeated experiments demonstrate that multimodal feature fusion exploits the complementarity between kinds of information to strengthen the representation of the input features, effectively improving the performance of the object detection algorithm, with clearly higher average precision for most categories across different recognition scenarios. Since image/video data in real-life scenes is difficult to acquire and often incomplete, traditional image- and video-based object detection methods are not applicable in such cases, whereas the image recognition method provided in the embodiments of the present application strengthens the complementarity between kinds of information and is of real significance for detection tasks in different scenes.
Based on the same inventive concept as the above image recognition method, an embodiment of the present application further provides an image recognition apparatus.
As shown in FIG. 4, an embodiment of the present application provides an image recognition apparatus, which may include:
a first acquisition module 401, configured to acquire an image to be recognized, the image to be recognized containing at least one object to be recognized;
a first determination module 402, configured to input the image to be recognized into a first network of a pre-trained image recognition model to determine text features of the image to be recognized;
a second determination module 403, configured to input the image to be recognized into a second network of the image recognition model to determine a pooled feature map and spatial relationship features of the at least one object to be recognized;
a fusion module 404, configured to perform feature fusion on the text features of the image to be recognized and the pooled feature map and spatial relationship features of the at least one object to be recognized, to determine a shared feature image corresponding to the image to be recognized;
a recognition module 405, configured to input the shared feature image into a third network of the image recognition model to determine recognition information of the image to be recognized, the recognition information including category information and position information of each object to be recognized.
In some embodiments, the apparatus may further include:
a second acquisition module, configured to acquire a training sample set, the training sample set including multiple sample image groups, each sample image group including sample images and their corresponding label images, each label image being annotated with the label recognition information of the target recognition objects and the scene information of the sample image, the label recognition information including the category information and position information of the target recognition objects;
a training module, configured to train a preset image recognition model with the sample image groups in the training sample set until a training stop condition is met, obtaining a trained image recognition model.
In some embodiments, the training module may be specifically configured to:
perform the following steps for each sample image group:
inputting the sample image group into a first network of the preset image recognition model to determine the sample text features corresponding to each sample image;
inputting the sample image group into a second network of the preset image recognition model to determine the sample pooled feature map and sample spatial relationship features of each recognizable object;
performing feature fusion on each sample image according to the sample text features corresponding to that image and the sample pooled feature map and sample spatial relationship features of each recognizable object, to determine the sample shared feature image corresponding to each sample image;
inputting the sample shared feature image into a third network of the preset image recognition model to determine the reference recognition information of each recognizable object, the reference recognition information including the classification information and reference position information of the recognizable object;
performing non-maximum suppression on the reference position information of each recognizable object, filtering out reference position information that does not meet preset requirements, and determining the predicted recognition information of each sample image, the predicted recognition information including the classification information and predicted position information of all recognizable objects;
determining the loss function value of the preset image recognition model according to the predicted recognition information of a target sample image and the label recognition information of all the target recognition objects on the target sample image, the target sample image being any image in the sample image group;
when the loss function value does not satisfy the training stop condition, adjusting the model parameters of the image recognition model and training the parameter-adjusted image recognition model with the sample image groups until the loss function value satisfies the training stop condition, obtaining the trained image recognition model.
In some embodiments, the training module may be specifically configured to:
perform the following steps for each sample image:
inputting the sample image into the first network of the preset image recognition model and determining at least one text vector according to the context information of the sample image;
concatenating the at least one text vector to determine the sample text feature corresponding to the sample image.
In some embodiments, the second network of the preset image recognition model includes at least a deep residual network, a region proposal network, and a region-based convolutional neural network, and
the training module may be specifically configured to:
adjust the resolution of each sample image in the sample image group to a preset resolution to determine an adjusted sample image group;
input the adjusted sample image group into the deep residual network to determine an original image set, the images in the original image set corresponding one-to-one to the images in the adjusted sample image group;
input the original image set into the region proposal network to determine N anchor boxes and the position coordinates corresponding to each anchor box, where an anchor box is a bounding box enclosing a recognizable object predicted by the region proposal network and N is an integer greater than 1;
based on the confidences of the N anchor boxes, extract the M anchor boxes whose confidence is greater than a preset confidence threshold, where M is a positive integer smaller than N;
input the mapped region images of the M anchor boxes into the region-of-interest pooling layer of the region-based convolutional neural network, adjust the resolution of the mapped region images of the M anchor boxes, and determine M sample pooled feature maps of the same resolution, where each recognizable object corresponds to at least one anchor box;
determine the sample spatial relationship features of each recognizable object according to the intersection over union and relative positions between the at least one anchor box corresponding to each recognizable object.
In some embodiments, the training module may be specifically configured to:
divide all recognizable objects into multiple groups based on the classification information of each recognizable object, and determine the reference position information of the multiple groups of recognizable objects of different categories;
filter the reference position information of each category of recognizable object;
determine the predicted recognition information of each sample image according to the filtered reference position information and the filtered classification information of the recognizable objects.
In some embodiments, the training module may be specifically configured to:
calculate in turn the intersection over union between a target frame and the other reference frames, the target frame being any one of multiple reference frames, a reference frame being a bounding box enclosing a recognizable object determined by the reference position information;
filter out reference frames whose intersection over union is greater than a preset threshold, until the intersection over union between any two reference frames is smaller than the preset threshold;
determine the reference frames remaining after filtering as the predicted position information of the recognizable objects.
Other details of the image recognition apparatus provided in the embodiments of the present application are similar to those of the image recognition method described above in conjunction with FIG. 3 and are not repeated here.
FIG. 5 shows a schematic diagram of the hardware structure of an image recognition device provided in an embodiment of the present application.
The image recognition method and apparatus described in conjunction with FIG. 1 and FIG. 4 can be implemented by an image recognition device. FIG. 5 is a schematic diagram of the hardware structure 500 of an image recognition device according to an embodiment of the invention.
The image recognition device may include a processor 501 and a memory 502 storing computer program instructions.
Specifically, the processor 501 may include a central processing unit (CPU), or an application-specific integrated circuit (ASIC), or may be configured as one or more integrated circuits implementing the embodiments of the present application.
The memory 502 may include mass storage for data or instructions. By way of example and not limitation, the memory 502 may include a hard disk drive (HDD), a floppy disk drive, flash memory, an optical disc, a magneto-optical disc, magnetic tape, or a universal serial bus (USB) drive, or a combination of two or more of these. In one example, the memory 502 may include removable or non-removable (or fixed) media, or the memory 502 may be a non-volatile solid-state memory. The memory 502 may be internal or external to the image recognition device.
In one example, the memory 502 may be a read-only memory (ROM). In one example, the ROM may be a mask-programmed ROM, a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), an electrically alterable ROM (EAROM), or flash memory, or a combination of two or more of these.
The processor 501 reads and executes the computer program instructions stored in the memory 502 to implement the method/steps S301 to S305 of the embodiment shown in FIG. 3 and to achieve the corresponding technical effects of that embodiment; for brevity, these are not repeated here.
In one example, the image recognition device may further include a communication interface 503 and a bus 510. As shown in FIG. 5, the processor 501, the memory 502, and the communication interface 503 are connected via the bus 510 and communicate with one another.
The communication interface 503 is mainly used to implement communication between the modules, apparatuses, units, and/or devices in the embodiments of the present application.
The bus 510 includes hardware, software, or both, and couples the components of the image recognition device to one another. By way of example and not limitation, the bus may include an Accelerated Graphics Port (AGP) or other graphics bus, an Extended Industry Standard Architecture (EISA) bus, a Front Side Bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a Low Pin Count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCIe) bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association Local Bus (VLB), another suitable bus, or a combination of two or more of these. Where appropriate, the bus 510 may include one or more buses. Although specific buses are described and shown in the embodiments of the present application, any suitable bus or interconnect is contemplated.
The image recognition device provided in the embodiments of the present application achieves complementarity of image information through feature fusion: while avoiding redundant noise, the fusion compensates for the deficiencies of image feature information in detail and scene coverage and makes full use of the multiple kinds of feature information in the image. At the same time, the extracted text features reflect the differences and commonalities of images across scenes, so the device can be applied to more complex scenes and improves the accuracy of image recognition.
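For illustration only, the following is a minimal sketch of one plausible form of the feature fusion summarized above, assuming the text features and spatial relationship features are projected to the channel width of the pooled feature map and fused by channel-wise concatenation followed by a 1×1 convolution. The embodiments do not specify the fusion operator; the module name `SharedFeatureFusion`, the projection layers, and all dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class SharedFeatureFusion(nn.Module):
    """Hypothetical fusion head producing the shared feature image: broadcasts the
    projected text and spatial-relationship features over the pooled feature map,
    concatenates along the channel axis, and mixes with a 1x1 convolution."""
    def __init__(self, text_dim: int, spatial_dim: int, channels: int):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, channels)
        self.spatial_proj = nn.Linear(spatial_dim, channels)
        self.fuse = nn.Conv2d(3 * channels, channels, kernel_size=1)

    def forward(self, pooled_map, text_feat, spatial_feat):
        # pooled_map: (N, C, H, W); text_feat: (N, text_dim); spatial_feat: (N, spatial_dim)
        n, c, h, w = pooled_map.shape
        t = self.text_proj(text_feat).view(n, c, 1, 1).expand(n, c, h, w)
        s = self.spatial_proj(spatial_feat).view(n, c, 1, 1).expand(n, c, h, w)
        return self.fuse(torch.cat([pooled_map, t, s], dim=1))  # shared feature image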
In addition, in combination with the image recognition methods in the above embodiments, an embodiment of the present application may provide a computer storage medium for implementation. The computer storage medium stores computer program instructions; when the computer program instructions are executed by a processor, any one of the image recognition methods in the above embodiments is implemented.
It should be clear that the present application is not limited to the specific configurations and processing described above and shown in the figures. For brevity, detailed descriptions of known methods are omitted here. In the above embodiments, several specific steps are described and shown as examples; however, the method processes of the present application are not limited to those specific steps, and those skilled in the art, having grasped the spirit of the present application, may make various changes, modifications, and additions, or change the order of the steps.
The functional blocks shown in the structural block diagrams described above may be implemented as hardware, software, firmware, or a combination thereof. When implemented in hardware, they may be, for example, electronic circuits, application-specific integrated circuits (ASICs), appropriate firmware, plug-ins, function cards, and so on. When implemented in software, the elements of the present application are the programs or code segments used to perform the required tasks. The programs or code segments may be stored in a machine-readable medium or transmitted over a transmission medium or communication link by a data signal carried in a carrier wave. A "machine-readable medium" may include any medium capable of storing or transmitting information. Examples of machine-readable media include electronic circuits, semiconductor memory devices, ROM, flash memory, erasable ROM (EROM), floppy disks, CD-ROMs, optical discs, hard disks, fiber-optic media, radio-frequency (RF) links, and so on. The code segments may be downloaded via a computer network such as the Internet or an intranet.
It should also be noted that the exemplary embodiments mentioned in the present application describe some methods or systems on the basis of a series of steps or apparatuses. However, the present application is not limited to the order of the steps described above; that is, the steps may be performed in the order mentioned in the embodiments, in an order different from that in the embodiments, or with several steps performed simultaneously.
Aspects of the present application are described above with reference to flowcharts and/or block diagrams of methods, apparatuses (systems), and computer program products according to embodiments of the present disclosure. It should be understood that each block of the flowcharts and/or block diagrams, and combinations of blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus to produce a machine, such that the instructions, executed via the processor of the computer or other programmable data processing apparatus, enable the implementation of the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams. Such a processor may be, but is not limited to, a general-purpose processor, a special-purpose processor, an application-specific processor, or a field-programmable logic circuit. It should also be understood that each block of the block diagrams and/or flowcharts, and combinations of such blocks, may also be implemented by dedicated hardware that performs the specified functions or acts, or by a combination of dedicated hardware and computer instructions.
The above are only specific implementations of the present application. Those skilled in the art will clearly understand that, for convenience and brevity of description, for the specific working processes of the systems, modules, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated here. It should be understood that the protection scope of the present application is not limited thereto; any person skilled in the art can readily conceive of various equivalent modifications or replacements within the technical scope disclosed in the present application, and such modifications or replacements shall all fall within the protection scope of the present application.