Identification generation method, device, electronic device and storage medium

Info

Publication number
CN118629042A
CN118629042A (application CN202411111303.0A; granted as CN118629042B)
Authority
CN
China
Prior art keywords
mask
image
target
features
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202411111303.0A
Other languages
Chinese (zh)
Other versions
CN118629042B (en)
Inventor
徐正斐
刘庆斌
李丽丽
郝彦超
李博
陈曦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202411111303.0A
Publication of CN118629042A
Application granted
Publication of CN118629042B
Status: Active
Anticipated expiration

Abstract

Translated from Chinese

The embodiments of the present disclosure disclose an identification generation method, device, electronic device and storage medium. The method includes: obtaining a local mask corresponding to each candidate object in a target image, and determining a query mask among the multiple local masks, where the query mask is used to indicate a target object selected from the multiple candidate objects; performing feature extraction on the target image based on the query mask and on each local mask to obtain multiple mask visual features, and extracting the mask position features corresponding to the query mask and to each local mask; concatenating each mask visual feature with its corresponding mask position feature to obtain multiple region features; extracting a first image feature of the target image, and concatenating the first image feature with the multiple region features to obtain a target concatenated feature; and performing text prediction based on the target concatenated feature to generate a target entity identifier of the target object. The embodiments of the present disclosure can improve the prediction accuracy of the target entity identifier.

Description

Translated from Chinese
Identification generation method, device, electronic device and storage medium

Technical Field

The present disclosure relates to the field of computer technology, and in particular to an identification generation method, device, electronic device and storage medium.

Background Art

Visual entity linking refers to matching objects in an image with corresponding entities in a knowledge base. In the related art, the image to be recognized is typically encoded into a global image feature, while a query text describing the target object in the image serves as a visual cue; the entity identifier corresponding to the image is then predicted from the global image feature and the text features of the query text. However, global image features tend to overlook local details in the image, causing information loss and thereby reducing the prediction accuracy of the entity identifier.

Summary of the Invention

The following is an overview of the subject matter described in detail in the present disclosure. This overview is not intended to limit the scope of protection of the claims.

The embodiments of the present disclosure provide an identification generation method, device, electronic device and storage medium, which can improve the prediction accuracy of a target entity identifier.

In one aspect, an embodiment of the present disclosure provides an identification generation method, comprising:

acquiring a local mask corresponding to each candidate object in a target image, and determining a query mask among the multiple local masks, wherein the query mask is used to indicate a target object selected from the multiple candidate objects;

performing feature extraction on the target image based on the query mask and on each of the local masks, respectively, to obtain a plurality of mask visual features, and extracting mask position features corresponding to the query mask and to each of the local masks;

concatenating each of the mask visual features with the corresponding mask position feature to obtain a plurality of region features;

extracting a first image feature of the target image, and concatenating the first image feature with the plurality of region features to obtain a target concatenated feature;

performing text prediction based on the target concatenated feature to generate a target entity identifier of the target object.

In another aspect, an embodiment of the present disclosure further provides an identification generation apparatus, comprising:

an acquisition module, configured to acquire a local mask corresponding to each candidate object in a target image, and determine a query mask among the multiple local masks, wherein the query mask is used to indicate a target object selected from the multiple candidate objects;

a feature extraction module, configured to perform feature extraction on the target image based on the query mask and on each of the local masks, respectively, to obtain a plurality of mask visual features, and extract mask position features corresponding to the query mask and to each of the local masks;

a first concatenation module, configured to concatenate each of the mask visual features with the corresponding mask position feature to obtain a plurality of region features;

a second concatenation module, configured to extract a first image feature of the target image, and concatenate the first image feature with the plurality of region features to obtain a target concatenated feature;

a generation module, configured to perform text prediction based on the target concatenated feature to generate a target entity identifier of the target object.

Furthermore, the second concatenation module is specifically configured to:

determine the mask area of each of the local masks, and concatenate the region features corresponding to the local masks in order of mask area to obtain a first concatenated feature;

concatenate the first image feature, the first concatenated feature, and the region feature corresponding to the query mask to obtain the target concatenated feature.

Furthermore, the second concatenation module is specifically configured to:

construct a prompt text for prompting the first large language model to generate an entity identifier;

extract text features of the prompt text, and concatenate the first image feature, the text features, the first concatenated feature, and the region feature corresponding to the query mask to obtain the target concatenated feature.

Furthermore, the feature extraction module is specifically configured to:

perform multi-level feature extraction on the target image to obtain multi-level visual features of the target image;

perform mask pooling on the multi-level visual features based on the query mask and on each of the local masks, respectively, to obtain a plurality of multi-level pooled features;

perform feature fusion on each of the multi-level pooled features to obtain a plurality of mask visual features.
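
As a rough illustration of the mask pooling described above, the following Python sketch average-pools each feature level over the mask's foreground pixels; the function name, tensor shapes, and PyTorch implementation are illustrative assumptions rather than the patent's exact design.

    import torch
    import torch.nn.functional as F

    def mask_pool(feature_maps, mask):
        # feature_maps: list of per-level tensors, each (C_i, H_i, W_i).
        # mask: binary (H, W) tensor, 1 inside the object region.
        pooled = []
        for fm in feature_maps:
            # Resize the mask to this level's spatial resolution.
            m = F.interpolate(mask[None, None].float(), size=fm.shape[-2:],
                              mode="nearest")[0, 0]
            denom = m.sum().clamp(min=1.0)  # guard against empty masks
            # Average the features over the mask's foreground pixels.
            pooled.append((fm * m).sum(dim=(-2, -1)) / denom)
        return pooled  # one (C_i,) sub-feature per level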

Furthermore, the feature extraction module is specifically configured to:

for any one of the multi-level pooled features, map the sub-features of each level in the multi-level pooled feature to obtain a plurality of intermediate features of the same dimension, and perform feature fusion on the intermediate features to obtain a fused feature;

perform multi-layer perceptron processing on each of the fused features to obtain a plurality of mask visual features.
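
Continuing the illustration, a minimal sketch of the mapping, fusion, and multi-layer perceptron steps might look as follows; the shared width, summation-based fusion, and layer choices are assumptions.

    import torch
    import torch.nn as nn

    class MaskFeatureFusion(nn.Module):
        def __init__(self, level_dims, hidden_dim=1024):
            super().__init__()
            # One projection per level so all sub-features share a dimension.
            self.proj = nn.ModuleList([nn.Linear(d, hidden_dim)
                                       for d in level_dims])
            self.mlp = nn.Sequential(nn.Linear(hidden_dim, hidden_dim),
                                     nn.GELU(),
                                     nn.Linear(hidden_dim, hidden_dim))

        def forward(self, pooled_per_level):
            # Map each level's sub-feature to the shared dimension,
            # then fuse the intermediate features by summation.
            fused = sum(p(f) for p, f in zip(self.proj, pooled_per_level))
            return self.mlp(fused)  # the mask visual feature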

Furthermore, the acquisition module is specifically configured to:

acquire a target image and a prompt mark of the target image, wherein the target image includes a plurality of candidate objects, and the prompt mark is used to indicate a target object selected from the plurality of candidate objects;

segment the target image to obtain the local mask corresponding to each of the candidate objects, and determine the query mask among the local masks based on the prompt mark.

Furthermore, the acquisition module is specifically configured to:

when the prompt mark is a mark point, determine the query mask among the local masks based on the positional relationship between the mark point and each of the local masks;

or, when the prompt mark is a mark box, determine the query mask among the local masks based on the degree of matching between the mark box and the mask boundary of each of the local masks;

or, when the prompt mark is a marked region, determine the query mask among the local masks based on the degree of matching between the marked region and each of the local masks.
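
A minimal sketch of these three selection rules follows; the containment and IoU criteria below are plausible readings of "positional relationship" and "degree of matching", not the patent's exact definitions.

    import numpy as np

    def iou(a, b):
        inter = np.logical_and(a, b).sum()
        union = np.logical_or(a, b).sum()
        return inter / union if union else 0.0

    def select_query_mask(masks, mark):
        # masks: list of binary (H, W) arrays, one per candidate object.
        if mark["type"] == "point":            # mark point: (row, col)
            r, c = mark["point"]
            hits = [m for m in masks if m[r, c]]
            # If several masks contain the point, prefer the smallest.
            return min(hits, key=lambda m: m.sum()) if hits else None
        if mark["type"] == "box":              # mark box: (r0, c0, r1, c1)
            r0, c0, r1, c1 = mark["box"]
            box = np.zeros_like(masks[0])
            box[r0:r1, c0:c1] = 1
            return max(masks, key=lambda m: iou(m, box))
        if mark["type"] == "region":           # mark region: binary array
            return max(masks, key=lambda m: iou(m, mark["region"]))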

Furthermore, the target entity identifier is generated by a first large language model, and the identification generation apparatus further includes a training module, the training module being specifically configured to:

acquire a sample image and a second mask corresponding to a sample object in the sample image, and segment the sample image to obtain a first mask corresponding to each visual object in the sample image, wherein the sample object is one of the multiple visual objects;

extract a second image feature of the sample image, perform feature extraction on the sample image based on the second mask and on each of the first masks, respectively, to obtain a plurality of sample visual features, and extract sample position features corresponding to the second mask and to each of the first masks;

concatenate the second image feature, the sample visual features, and the sample position features and input the result into the first large language model for text prediction to generate a predicted probability distribution, wherein the predicted probability distribution is used to determine the entity identifier of the sample object;

acquire a first entity identifier of the sample entity linked to the sample image, determine a model loss based on the predicted probability distribution and the first entity identifier, and train the first large language model based on the model loss, wherein the sample entity is used to indicate the sample object.
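
One training step consistent with this description might look like the sketch below; the Hugging Face-style model call, the token alignment, and the optimizer are placeholder assumptions.

    import torch
    import torch.nn.functional as F

    def training_step(llm, multimodal_feature, target_token_ids, optimizer):
        # multimodal_feature: (seq_len, dim) concatenation of the second
        # image feature, sample visual features, and sample position
        # features; target_token_ids: (T,) long tensor of identifier tokens.
        logits = llm(inputs_embeds=multimodal_feature[None]).logits[0]
        # Cross-entropy between the predicted distribution and the first
        # entity identifier's tokens (assumed last-T position alignment).
        loss = F.cross_entropy(logits[-len(target_token_ids):],
                               target_token_ids)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()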

Furthermore, the sample image, the second mask, and the first entity identifier are all obtained from a data set, and the training module is further configured to:

acquire multiple original images and a query text corresponding to each of the original images, and determine identification information corresponding to each of the original images according to the original image and its corresponding query text, wherein the query text is used to prompt recognition of the object of interest in the corresponding original image;

acquire multiple candidate entities, and determine, based on each piece of identification information, the linked entity corresponding to each of the original images from among the multiple candidate entities;

determine an annotation mask for each of the original images based on the original image and its corresponding query text, wherein the annotation mask is used to indicate the corresponding object of interest;

store each of the original images, the corresponding annotation mask, and the corresponding linked entity in association in the data set, wherein the sample image is sampled from the multiple original images, the second mask is the annotation mask corresponding to the sample image, and the sample entity is the linked entity to which the sample image is linked.

Furthermore, the training module is specifically configured to:

input each of the query texts into a second large language model for text prediction to generate a summary text corresponding to each of the original images;

perform object detection based on each of the original images and the corresponding summary text to generate an original bounding box corresponding to each of the original images, wherein the original bounding box is used to indicate the corresponding object of interest;

input each of the original images and the corresponding original bounding box into a first mask generation model for mask prediction to generate the annotation mask corresponding to each of the original images.

Furthermore, the training module is further configured to:

acquire a reference name text corresponding to each of the linked entities, wherein the reference name text is used to indicate the name of a reference entity, and the level of the reference entity in the knowledge base is higher than the level of the linked entity in the knowledge base;

input each of the original images and the corresponding reference name text into a second mask generation model for mask prediction to generate a reference mask corresponding to each of the original images;

determine the degree of matching between each of the annotation masks and the corresponding reference mask to obtain a target matching degree corresponding to each of the annotation masks;

when the target matching degree is less than a preset matching degree threshold, discard the original image corresponding to that target matching degree.
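
An illustrative filtering pass for this step is sketched below, using IoU as the matching degree; the threshold value and the IoU choice are assumptions.

    import numpy as np

    def filter_by_reference(samples, threshold=0.5):
        kept = []
        for s in samples:  # each s: {"image": ..., "ann": mask, "ref": mask}
            inter = np.logical_and(s["ann"], s["ref"]).sum()
            union = np.logical_or(s["ann"], s["ref"]).sum()
            match = inter / union if union else 0.0
            if match >= threshold:  # keep only well-matched annotations
                kept.append(s)
        return kept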

Furthermore, the training module is further configured to:

count the number of connected regions in each of the annotation masks to obtain a region count corresponding to each of the annotation masks;

when the region count is greater than a preset count threshold, discard the original image corresponding to that region count.
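
Counting connected regions can be done with standard image-processing tooling, as in the sketch below; the threshold of 3 is an assumed value.

    import numpy as np
    from scipy.ndimage import label

    def too_fragmented(mask, max_regions=3):
        # label() returns (labeled_array, number_of_connected_regions).
        _, num_regions = label(mask.astype(np.uint8))
        return num_regions > max_regions  # mark the image for discarding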

Furthermore, the training module is specifically configured to:

acquire a sample name text of the sample entity linked to the sample image, and tokenize the sample name text to obtain a plurality of first tokens;

determine the frequency of occurrence of each of the first tokens in the knowledge base, sort the first tokens in ascending order of frequency, and determine the first tokens ranked in the top L positions as second tokens, where L is a positive integer;

determine the first entity identifier of the sample entity based on the second tokens.
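
A sketch of this rarest-token selection follows; the whitespace tokenizer and the frequency table are illustrative assumptions.

    from collections import Counter

    def entity_identifier_tokens(name, kb_token_counts: Counter, L=3):
        tokens = name.lower().split()  # first tokens from the name text
        # Ascending knowledge-base frequency: rarest (most distinctive)
        # tokens come first.
        ranked = sorted(tokens, key=lambda t: kb_token_counts[t])
        return ranked[:L]  # the second tokens used to build the identifier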

In another aspect, an embodiment of the present disclosure further provides an electronic device, including a memory and a processor, wherein the memory stores a computer program, and the processor implements the above identification generation method when executing the computer program.

In another aspect, an embodiment of the present disclosure further provides a computer-readable storage medium, wherein the storage medium stores a computer program, and the computer program, when executed by a processor, implements the above identification generation method.

In another aspect, an embodiment of the present disclosure further provides a computer program product, which includes a computer program stored in a computer-readable storage medium. A processor of a computer device reads the computer program from the computer-readable storage medium and executes it, causing the computer device to perform the above identification generation method.

The embodiments of the present disclosure include at least the following beneficial effects. Local masks corresponding to the candidate objects in the target image are acquired, and a query mask indicating the target object is determined. Feature extraction then yields the mask visual features corresponding to the query mask and to each local mask, and the mask position features corresponding to the query mask and to each local mask are extracted; the mask visual features capture pixel-level visual information of the local region where the corresponding object is located, and the mask position features capture pixel-level position information of that region. Each mask visual feature is then concatenated with its corresponding mask position feature to obtain the region feature of each local region, which is equivalent to combining the pixel-level visual information of a local region with the corresponding pixel-level position information into pixel-level region information. The first image feature of the target image is then extracted, the first image feature and the multiple region features are concatenated into the target concatenated feature, and text prediction is performed based on the target concatenated feature to generate the target entity identifier of the target object. During text prediction, the interaction among the features within the target concatenated feature attends to the global visual information captured by the first image feature, to the pixel-level region information corresponding to each candidate object, and to the pixel-level region information corresponding to the target object. This effectively improves the understanding of both the global image features and the pixel-level details of the target image, thereby improving the prediction accuracy of the target entity identifier. In addition, using the pixel-level query mask as a visual cue refers to the target object efficiently, flexibly, and accurately, further improving the prediction accuracy of the target entity identifier.

Other features and advantages of the present disclosure will be set forth in the following description and, in part, will become apparent from the description or may be learned by practicing the present disclosure.

BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying drawings are provided for a further understanding of the technical solution of the present disclosure and constitute a part of the specification. Together with the embodiments of the present disclosure, they serve to explain the technical solution of the present disclosure and do not constitute a limitation on it.

FIG. 1 is a schematic diagram of an optional implementation environment provided by an embodiment of the present disclosure;

FIG. 2 is an optional schematic flowchart of the identification generation method provided by an embodiment of the present disclosure;

FIG. 3 is an optional schematic flowchart of determining mask visual features provided by an embodiment of the present disclosure;

FIG. 4 is an optional schematic flowchart of generating masks provided by an embodiment of the present disclosure;

FIG. 5 is an optional schematic flowchart of updating a data set provided by an embodiment of the present disclosure;

FIG. 6 is an optional schematic architecture diagram of discarding original images provided by an embodiment of the present disclosure;

FIG. 7 is an optional schematic distribution diagram of entity categories in multiple sample sets provided by an embodiment of the present disclosure;

FIG. 8 is an optional pie chart of entity categories in an optimized sample set provided by an embodiment of the present disclosure;

FIG. 9 is an optional schematic distribution diagram of the area ratios of annotation masks provided by an embodiment of the present disclosure;

FIG. 10 is an optional schematic architecture diagram of the training phase provided by an embodiment of the present disclosure;

FIG. 11 is an optional schematic structural diagram of the identification generation apparatus provided by an embodiment of the present disclosure;

FIG. 12 is a partial structural block diagram of a terminal provided by an embodiment of the present disclosure;

FIG. 13 is a partial structural block diagram of a server provided by an embodiment of the present disclosure.

DETAILED DESCRIPTION

In order to make the purpose, technical solution, and advantages of the present disclosure clearer, the present disclosure is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are intended only to explain the present disclosure and not to limit it.

It should be noted that, in the specific embodiments of the present disclosure, whenever processing needs to be performed based on data related to the characteristics of a target object, such as the target object's attribute information or attribute information sets, the permission or consent of the target object is obtained first, and the collection, use, and processing of such data comply with relevant laws, regulations, and standards. The target object may be a user. In addition, when an embodiment of the present disclosure needs to obtain the attribute information of a target object, the separate permission or consent of the target object is obtained through a pop-up window or by jumping to a confirmation page; only after the separate permission or consent is explicitly obtained is the target-object-related data necessary for the normal operation of the embodiment acquired.

In the embodiments of the present disclosure, the term "module" or "unit" refers to a computer program, or part of a computer program, that has a predetermined function and works together with other related parts to achieve a predetermined goal, and it may be implemented in whole or in part by software, hardware (such as a processing circuit or a memory), or a combination thereof. Likewise, one processor (or multiple processors or memories) may be used to implement one or more modules or units. In addition, each module or unit may be part of an overall module or unit that includes the function of that module or unit.

To facilitate understanding of the technical solution provided by the embodiments of the present disclosure, some key terms used in the embodiments are explained first:

Visual Entity Linking (VEL) refers to matching objects in an image with corresponding entities in a knowledge base.

In the related art, the image to be recognized is typically encoded into a global image feature, while a query text describing the target object in the image serves as a visual cue, and the entity identifier corresponding to the image is then predicted from the global image feature and the text features of the query text. However, global image features tend to overlook local details in the image, causing information loss and thereby reducing the prediction accuracy of the entity identifier.

On this basis, the embodiments of the present disclosure provide an identification generation method, device, electronic device and storage medium, which can improve the prediction accuracy of the target entity identifier.

Referring to FIG. 1, FIG. 1 is a schematic diagram of an optional implementation environment provided by an embodiment of the present disclosure. The implementation environment includes a terminal 101 and a server 102, where the terminal 101 and the server 102 are connected via a communication network.

For example, the server 102 may receive a target image sent by the terminal, acquire the local mask corresponding to each candidate object in the target image, and determine a query mask among the multiple local masks, where the query mask indicates a target object selected from the multiple candidate objects; perform feature extraction on the target image based on the query mask and on each local mask, respectively, to obtain multiple mask visual features, and extract the mask position features corresponding to the query mask and to each local mask; concatenate each mask visual feature with its corresponding mask position feature to obtain multiple region features; extract a first image feature of the target image and concatenate it with the multiple region features to obtain a target concatenated feature; and perform text prediction based on the target concatenated feature to generate a target entity identifier of the target object, which the server 102 then sends to the terminal 101.

By acquiring the local masks corresponding to the candidate objects in the target image and determining the query mask that indicates the target object, the server 102 obtains, through feature extraction, the mask visual features corresponding to the query mask and to each local mask, and extracts the corresponding mask position features. The mask visual features capture pixel-level visual information of the local region where the corresponding object is located, and the mask position features capture pixel-level position information of that region. The server then concatenates each mask visual feature with its corresponding mask position feature to obtain the region feature of each local region, which is equivalent to combining the pixel-level visual information of a local region with the corresponding pixel-level position information into pixel-level region information. It then extracts the first image feature of the target image, concatenates the first image feature with the multiple region features into the target concatenated feature, and performs text prediction based on the target concatenated feature to generate the target entity identifier of the target object. During text prediction, the interaction among the features within the target concatenated feature attends to the global visual information captured by the first image feature, to the pixel-level region information corresponding to each candidate object, and to the pixel-level region information corresponding to the target object. This effectively improves the understanding of both the global image features and the pixel-level details of the target image, thereby improving the prediction accuracy of the target entity identifier. In addition, using the pixel-level query mask as a visual cue refers to the target object efficiently, flexibly, and accurately, further improving the prediction accuracy of the target entity identifier.

The server 102 may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN (Content Delivery Network), and big data and artificial intelligence platforms. In addition, the server 102 may also be a node server in a blockchain network.

The terminal 101 may be, but is not limited to, a mobile phone, a computer, an intelligent voice interaction device, a smart home appliance, or a vehicle-mounted terminal. The terminal 101 and the server 102 may be connected directly or indirectly via wired or wireless communication, which is not limited in the embodiments of the present disclosure.

Referring to FIG. 2, FIG. 2 is an optional schematic flowchart of the identification generation method provided by an embodiment of the present disclosure. The method may be executed by a server, by a terminal, or by a server in cooperation with a terminal, and includes, but is not limited to, the following steps 201 to 205.

Step 201: Acquire a local mask corresponding to each candidate object in a target image, and determine a query mask among the multiple local masks.

Here, the target image is an image on which visual entity linking is to be performed, and it may contain multiple candidate objects. For example, the candidate objects may include main objects such as animals, plants, and items, as well as background objects such as natural landscapes, buildings, and city streets. Each local mask indicates its corresponding candidate object and specifies the region of the target image in which that candidate object is located; in effect, a local mask refers to its corresponding candidate object. The query mask indicates the target object selected from the multiple candidate objects and specifies the local region of the target image in which the target object is located; in effect, the query mask refers to the target object.

Specifically, each local mask may be a binary image of the same size as the target image. In any local mask, the region of interest formed by all pixels with a value of 1 indicates the local region of the target image in which the corresponding candidate object is located, and the non-interest region formed by all pixels with a value of 0 indicates the remaining region of the target image. Optionally, a local mask may also be an image of another size, as long as it can indicate the corresponding candidate object, which is not limited in the embodiments of the present disclosure.

Similarly, the query mask may be a binary image of the same size as the target image. In the query mask, the region of interest formed by all pixels with a value of 1 indicates the local region of the target image in which the target object is located, and the non-interest region formed by all pixels with a value of 0 indicates the remaining region of the target image. Optionally, the query mask may also be an image of another size, as long as it can indicate the target object, which is not limited in the embodiments of the present disclosure.
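
As a toy example of this mask convention, the snippet below builds a binary array the size of the image, with 1 inside the object region and 0 elsewhere; the image size and region coordinates are arbitrary.

    import numpy as np

    h, w = 480, 640                    # assumed target-image size
    query_mask = np.zeros((h, w), dtype=np.uint8)
    query_mask[100:220, 300:420] = 1   # region occupied by the target object
    assert query_mask.shape == (h, w)
    assert set(np.unique(query_mask)) <= {0, 1}  # strictly binary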

On this basis, since the query mask is determined from among the multiple local masks, the query mask is specifically determined based on a selection made by the relevant personnel, and it can express the referential intent of that selection. The target object is the object selected by the relevant personnel and may be any one of the multiple candidate objects.

Step 202: Perform feature extraction on the target image based on the query mask and on each local mask, respectively, to obtain multiple mask visual features, and extract the mask position features corresponding to the query mask and to each local mask.

Here, performing feature extraction on the target image based on the query mask and on each local mask, respectively, means performing feature extraction on the target image based on the query mask, and performing feature extraction on the target image based on each of the local masks. Assuming there are two local masks, performing feature extraction based on each local mask means performing feature extraction on the target image based on the first local mask and performing feature extraction on the target image based on the second local mask; each such extraction yields the corresponding mask visual feature.

On this basis, performing mask-based feature extraction on the target image to obtain mask visual features amounts to extracting more abstract and more informative local visual features from the target image. Since the query mask refers to the target object and each local mask refers to its corresponding candidate object, the mask visual features can capture pixel-level visual information of the local region where the corresponding object is located. In addition, extracting the mask position features corresponding to the query mask and to each local mask amounts to determining the spatial positional relationships among the masks; again because the query mask refers to the target object and each local mask refers to its corresponding candidate object, the mask position features can capture pixel-level position information of the local region where the corresponding object is located.

In one possible implementation, extracting the mask position features corresponding to the query mask and to each local mask may specifically involve flattening the query mask and each local mask and then inputting them into a position encoder for mapping, thereby obtaining the mask position features corresponding to the query mask and to each local mask.
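
A minimal sketch of such a position encoder follows: the binary mask is downsampled, flattened, and passed through a learned projection; the downsampling resolution and output width are assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MaskPositionEncoder(nn.Module):
        def __init__(self, grid=32, dim=1024):
            super().__init__()
            self.grid = grid
            self.fc = nn.Linear(grid * grid, dim)

        def forward(self, mask):  # mask: (H, W) binary tensor
            # Downsample so the flattened vector has a fixed length.
            m = F.interpolate(mask[None, None].float(),
                              size=(self.grid, self.grid), mode="nearest")
            return self.fc(m.flatten(1))  # (1, dim) mask position feature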

Step 203: Concatenate each mask visual feature with its corresponding mask position feature to obtain multiple region features.

Here, concatenating each mask visual feature with its corresponding mask position feature means concatenating the mask visual feature corresponding to each local mask with that mask's position feature, and concatenating the mask visual feature corresponding to the query mask with the query mask's position feature.

For example, for any one of the query mask and the local masks, the mask visual feature corresponding to that mask may be concatenated at the head or the tail of the corresponding mask position feature to obtain the region feature corresponding to that mask. Thus, the query mask and each of the local masks all have corresponding region features, and a region feature is the combination of a mask visual feature and its corresponding mask position feature.

On this basis, concatenation yields the region feature corresponding to each of the query mask and the local masks, that is, the region feature of the local region where each object is located; the region feature of a local region amounts to a local feature of the target image. Specifically, concatenating a mask visual feature with a mask position feature is equivalent to combining the pixel-level visual information and the corresponding pixel-level position information of the local region where each object, the target object or a candidate object, is located, giving the pixel-level region information of that local region. Each region feature can therefore represent the regional details of the local region where the corresponding object is located, and the pixel-level region information can subsequently improve the understanding of the pixel-level details of the target image.

Step 204: Extract a first image feature of the target image, and concatenate the first image feature with the multiple region features to obtain a target concatenated feature.

Here, concatenating the first image feature with the multiple region features may specifically involve concatenating the multiple region features in sequence to obtain an intermediate result and then prepending the first image feature to the head of that result to obtain the target concatenated feature; other concatenation orders may also be used, which is not limited in the embodiments of the present disclosure.
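
One possible assembly following this order is sketched below; the feature width and the number of regions are illustrative.

    import torch

    image_feat = torch.randn(1, 1024)    # first image feature (global)
    region_feats = [torch.randn(1, 1024) for _ in range(4)]  # per-mask
    # Concatenate the region features in sequence, then prepend the
    # first image feature at the head.
    target_feat = torch.cat([image_feat] + region_feats, dim=0)  # (5, 1024)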

On this basis, extracting the first image feature of the target image amounts to extracting a more abstract and more informative global image feature from the target image; the first image feature can capture the global visual information of the target image. Concatenating the first image feature with the multiple region features into the target concatenated feature makes it possible to subsequently attend to both the global visual information and the pixel-level region information.

Step 205: Perform text prediction based on the target concatenated feature to generate a target entity identifier of the target object.

Here, the target entity identifier is used to indicate a target entity in the knowledge base and is unique: different entities in the knowledge base correspond to different target entity identifiers. After the target entity identifier is generated, the target object can be linked, based on the identifier, to the target entity that it indicates.

Specifically, the target entity identifier may be a single identifier or a sequence of multiple identifiers, which is not limited in the embodiments of the present disclosure. For example, the target entity identifier may be the sequence [50, 10, 3], which includes three identifiers: 50, 10, and 3.

It is worth noting that the target entity identifier indicates an entity in the knowledge base. In the knowledge base, each entity is usually designed to represent a unique object and corresponds to a globally unique label, and different labels indicate different entities. For example, if an entity in the knowledge base is named "golf course", the object corresponding to that entity is a golf course, and the entity may be represented as e=Q1048XXX, where e is the entity and Q1048XXX is the label.

On this basis, during text prediction, the interaction among the features within the target concatenated feature attends to the global visual information captured by the first image feature, to the pixel-level region information corresponding to each candidate object, and to the pixel-level region information corresponding to the target object. This effectively improves the understanding of both the global image features and the pixel-level details of the target image, thereby improving the prediction accuracy of the target entity identifier. In addition, using the pixel-level query mask as a visual cue refers to the target object efficiently, flexibly, and accurately, further improving the prediction accuracy of the target entity identifier.

In one possible implementation, concatenating the first image feature with the multiple region features to obtain the target concatenated feature may specifically involve determining the mask area of each local mask, concatenating the region features corresponding to the local masks in order of mask area to obtain a first concatenated feature, and then concatenating the first image feature, the first concatenated feature, and the region feature corresponding to the query mask to obtain the target concatenated feature.

Here, the mask area of a local mask is the area of its region of interest, which is usually formed by all pixels with a value of 1 and indicates the local region of the target image in which the corresponding candidate object is located. A larger mask area therefore means a larger local region for the corresponding candidate object, i.e., the candidate object occupies more space in the target image and has higher visual saliency; conversely, a smaller mask area means the candidate object occupies less space and has lower visual saliency.

On this basis, concatenating the region features corresponding to the local masks in order of mask area may specifically involve sorting the local masks by mask area, which is equivalent to sorting them by visual saliency, and then concatenating the region features corresponding to the sorted local masks in sequence to obtain the first concatenated feature. During text prediction, the pixel-level region information of the local region of each candidate object can then be attended to in a fixed order of visual attention, effectively improving the understanding of the pixel-level details of the target image.

In addition, the first image feature, the first concatenated feature, and the region feature corresponding to the query mask are concatenated to obtain the target concatenated feature. During text prediction, the global visual information captured by the first image feature can therefore be attended to, effectively improving the understanding of the global image features of the target image, and the pixel-level region information of the local region where the target object is located can also be attended to, so that the query mask refers to the target object efficiently, flexibly, and accurately, further improving the prediction accuracy of the target entity identifier.

For example, if the local masks are sorted by mask area in descending order, then in the first concatenated feature the candidate objects corresponding to the successively concatenated region features occupy less and less space. During text prediction, the region features of candidate objects occupying more space are attended to first and those occupying less space later, which simulates the order of human visual attention and ensures that candidate objects occupying more space receive broader attention, usually helping to improve the prediction accuracy of the target entity identifier.

As another example, if the local masks are sorted by mask area in ascending order, then in the first concatenated feature the candidate objects corresponding to the successively concatenated region features occupy more and more space. During text prediction, the region features of candidate objects occupying less space are attended to first and those occupying more space later, which helps improve the prediction accuracy of the target entity identifier in certain scenarios.

Specifically, since the first image feature can be regarded as a global feature of the target image and the region features as local features of the target image, the feature obtained by concatenating the first image feature with the first concatenated feature can be regarded as a comprehensive image feature. Taking the case where the local masks are sorted by mask area in descending order as an example, the comprehensive image feature is determined as follows (the notation is reconstructed from the surrounding definitions):

F = [f_g; F_l], with F_l = [r_1, r_2, ..., r_N]

where F is the comprehensive image feature, f_g is the first image feature, m_1 is the first local mask, m_2 is the second local mask, and m_N is the N-th local mask; r_1, r_2, and r_N are the region features corresponding to the first, second, and N-th local masks, respectively. F_l is the first concatenated feature formed by concatenating the multiple region features; it can be regarded as a local feature sequence containing multiple region features, i.e., r_i (1 ≤ i ≤ N) is any region feature in the sequence, and N is the length of the local feature sequence, that is, the number of local features in it. It can be seen that the first concatenated feature can also be written as F_l = [r_1, ..., r_N].

In addition, the mask areas of the local masks satisfy the following relation:

A(m_i) ≥ A(m_{i+1}), for 1 ≤ i ≤ N−1

where A(m_i) is the mask area of the i-th local mask and A(m_{i+1}) is the mask area of the (i+1)-th local mask; that is, the mask area of each local mask is greater than or equal to that of the next one, which ensures the local masks are sorted by mask area in descending order. In other words, F_l is obtained by sorting the region features in descending order of mask area.
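
A small sketch of this descending-area ordering is given below; the tensor shapes are illustrative.

    import torch

    def first_concatenated_feature(masks, region_feats):
        # masks: list of binary (H, W) tensors; region_feats: list of
        # (1, D) tensors, index-aligned with masks.
        order = sorted(range(len(masks)),
                       key=lambda i: masks[i].sum().item(), reverse=True)
        # Concatenate region features in descending order of mask area.
        return torch.cat([region_feats[i] for i in order], dim=0)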

It is worth noting that, assuming the mask visual features and the mask position features are extracted by a mask-aware visual extractor, and each local mask is produced by a semantic segmentation model, the first concatenated feature is determined as follows:

F_l = { E(I, m) | m ∈ S(I) }, with E(I, m) = [v_m; p_m]

where F_l is the first concatenated feature, I is the target image, m is any local mask, v_m is the mask visual feature corresponding to the local mask m, p_m is the mask position feature corresponding to the local mask m, E is the mask-aware visual extractor, and S is the semantic segmentation model.

It should be noted that in the formulas above I denotes the target image; the subscript g in f_g stands for "global" and indicates that the first image feature is a global feature of the target image, while the subscript l in F_l stands for "local" and indicates that the region features in the first concatenated feature are local features of the target image.

In one possible implementation, concatenating the first image feature, the first concatenated feature, and the region feature corresponding to the query mask to obtain the target concatenated feature may specifically involve concatenating them in that order. On this basis, with the first concatenated feature appended to the tail of the first image feature, text prediction can first attend to the global features of the target image that capture overall information and then to the local features that capture fine details, which usually helps improve the prediction accuracy of the target entity identifier.

在一种可能的实现方式中,目标实体标识由第一大语言模型生成,将第一图像特征、第一拼接特征以及查询掩模对应的区域特征进行拼接,得到目标拼接特征,具体可以是构建用于提示第一大语言模型生成实体标识的提示文本;提取提示文本的文本特征,将第一图像特征、文本特征、第一拼接特征以及查询掩模对应的区域特征进行拼接,得到目标拼接特征。In a possible implementation, the target entity identifier is generated by the first large language model, and the first image feature, the first splicing feature, and the regional feature corresponding to the query mask are spliced to obtain the target splicing feature. Specifically, a prompt text is constructed to prompt the first large language model to generate the entity identifier; the text feature of the prompt text is extracted, and the first image feature, the text feature, the first splicing feature, and the regional feature corresponding to the query mask are spliced to obtain the target splicing feature.
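A sketch of the concatenation order just described, with the prompt-text features included for the case where the first large language model is used (all names here are illustrative assumptions):

```python
import numpy as np

def build_target_feature(image_feat, text_feat, local_seq, query_region_feat):
    """Assemble the target splicing feature in the stated order:
    global image feature, prompt-text feature, local feature sequence,
    then the region feature corresponding to the query mask."""
    return np.concatenate([image_feat, text_feat, local_seq, query_region_feat])
```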

The first large language model is a large language model (LLM). Large language models are deep learning models trained on large amounts of text data that can generate natural language text or understand the meaning of language text; they commonly employ recurrent neural networks (RNN) or variants such as long short-term memory networks (LSTM) and gated recurrent units (GRU) to capture contextual information in text sequences, enabling tasks such as natural language text generation, language model evaluation, text classification, and sentiment analysis. In the field of natural language processing, large language models have been widely applied, for example to speech recognition, machine translation, automatic summarization, dialogue systems, and intelligent question answering. Here, the target entity identifier is natural language text, and the first large language model handles the task of generating the target entity identifier.

基于此,提示文本可视为提示指令(Prompt),提示指令可以理解为一种启动大语言模型的方式,提示指令能够指导大语言模型生成特定类型、主题或格式的内容,所以通过构建提示第一大语言模型生成实体标识的提示文本,目标拼接特征除了包含第一图像特征以及第一拼接特征以及查询掩模对应的区域特征以外,目标拼接特征还包含了提示文本的文本特征,后续将目标拼接特征作为第一大语言模型的输入,在提示文本的指导下,能够提升第一大语言模型对目标实体标识的生成质量,另外,通过在第一大语言模型内引入像素级区域特征的交叉注意力交互,还能够有效提高第一大语言模型对目标图像的全局特征以及像素级细节的理解,从而提高目标实体标识的预测准确率。Based on this, the prompt text can be regarded as a prompt instruction (Prompt), which can be understood as a way to start the large language model. The prompt instruction can guide the large language model to generate content of a specific type, theme or format. Therefore, by constructing a prompt text that prompts the first large language model to generate an entity identifier, the target splicing feature includes not only the first image feature, the first splicing feature and the regional feature corresponding to the query mask, but also the text feature of the prompt text. The target splicing feature is subsequently used as the input of the first large language model. Under the guidance of the prompt text, the generation quality of the target entity identifier by the first large language model can be improved. In addition, by introducing the cross-attention interaction of pixel-level regional features in the first large language model, the first large language model can effectively improve its understanding of the global features and pixel-level details of the target image, thereby improving the prediction accuracy of the target entity identifier.

具体地,第一图像特征、文本特征以及区域特征构成了大语言模型的多模态输入,因此第一大语言模型可采用多模态大语言模型(Multimodal Large Language Model,MLLM),多模态大语言模型对多模态输入的处理效果较好,能够提高目标实体标识的预测准确率;其中,多模态通常指的是来自不同感官或来源的信息,例如视觉、听觉、触觉等,多模态大语言模型是大语言模型的扩展形式,相较于仅能处理对应于文本模态的输入数据的大语言模型,多模态大语言模型不仅能够处理对应于文本模态的输入数据,还能够处理对应于除了文本模态之外的其他模态的输入数据,例如,其他模态包括视觉模态、音频模态或者多种模态组合结果等等。Specifically, the first image features, text features and regional features constitute the multimodal input of the large language model. Therefore, the first large language model can adopt a multimodal large language model (Multimodal Large Language Model, MLLM). The multimodal large language model has a better processing effect on multimodal input and can improve the prediction accuracy of the target entity identification. Among them, multimodality usually refers to information from different senses or sources, such as vision, hearing, touch, etc. The multimodal large language model is an extended form of the large language model. Compared with the large language model that can only process input data corresponding to the text modality, the multimodal large language model can not only process input data corresponding to the text modality, but also process input data corresponding to other modalities except the text modality. For example, other modalities include visual modalities, audio modalities or the results of a combination of multiple modalities, etc.

在一种可能的实现方式中,提示文本可包含指令文本和指代文本,指令文本为提示指令,而指代文本用于指代目标对象,将指代文本作为文本提示,能够进一步准确指代目标对象,从而进一步提高目标实体标识的预测准确率。In one possible implementation, the prompt text may include instruction text and reference text, the instruction text is the prompt instruction, and the reference text is used to refer to the target object. Using the reference text as a text prompt can further accurately refer to the target object, thereby further improving the prediction accuracy of the target entity identification.

在一种可能的实现方式中,对目标图像分别进行基于查询掩模以及各个局部掩模的特征提取,得到多个掩模视觉特征,具体可以是对目标图像进行多层级特征提取,得到目标图像的多层级视觉特征;分别基于查询掩模以及各个局部掩模,对多层级视觉特征进行掩模池化,得到多个多层级池化特征;分别对各个多层级池化特征进行特征融合,得到多个掩模视觉特征。In a possible implementation, feature extraction is performed on the target image based on the query mask and each local mask to obtain multiple mask visual features. Specifically, multi-level feature extraction is performed on the target image to obtain multi-level visual features of the target image; mask pooling is performed on the multi-level visual features based on the query mask and each local mask to obtain multiple multi-level pooling features; and feature fusion is performed on each multi-level pooling feature to obtain multiple mask visual features.

其中,对目标图像分别进行基于查询掩模以及各个局部掩模的特征提取,具体是指对目标图像进行基于查询掩模的特征提取,以及对目标图像分别进行基于各个局部掩模的特征提取;假设局部掩模的数量为两个,那么对目标图像分别进行基于各个局部掩模的特征提取,具体是指对目标图像进行基于第一个局部掩模的特征提取,以及对目标图像进行基于第二个局部掩模的特征提取;每当对目标图像进行特征提取时,都能得到对应的掩模视觉特征。Among them, feature extraction is performed on the target image based on the query mask and each local mask respectively, specifically referring to feature extraction based on the query mask and feature extraction based on each local mask respectively; assuming that the number of local masks is two, feature extraction is performed on the target image based on each local mask respectively, specifically referring to feature extraction based on the first local mask and feature extraction based on the second local mask; whenever feature extraction is performed on the target image, the corresponding mask visual features can be obtained.

具体地,掩模池化具体是指在目标图像中,对于查询掩模和各个局部掩模中的任意一个掩模,对于该掩模指代对象所在的局部区域,通过池化操作聚合该局部区域内所有像素点的多层级视觉特征,得到多层级池化特征;掩模池化可分为掩模操作和池化操作,下面先对掩模操作进行详细描述。Specifically, mask pooling refers to that in the target image, for any one of the query mask and each local mask, for the local area where the mask refers to the object, the multi-level visual features of all pixels in the local area are aggregated through the pooling operation to obtain the multi-level pooling features; mask pooling can be divided into mask operation and pooling operation. The mask operation is described in detail below.

首先,对于查询掩模和各个局部掩模中的任意一个掩模,基于多层级视觉特征的层级数量,对该掩模的维度进行多次调整,得到各个层级的视觉特征匹配的目标掩模,其中,该掩模是二值图像,该掩模中的关注区域由像素值为1的像素点构成,而该掩模中的其他区域由像素值为0的像素点构成。例如,假设某个层级的视觉特征的维度为128×128×16,掩模的维度为512×512,在调整维度时,先对掩模的维度缩小至128×128,然后对缩小后的掩模进行通道复制,得到具有16个通道的多通道图像,将该多通道图像定义为目标掩模,该目标掩模的维度为128×128×16,由于该目标掩模的维度与该层级的视觉特征的维度相同,所以该目标掩模与该层级的视觉特征匹配;因此,当多层级视觉特征的层级数量为五个时,可对掩模的维度进行五次调整,能够得到五个层级的视觉特征各自匹配的目标掩模。First, for any one of the query mask and each local mask, the dimension of the mask is adjusted multiple times based on the number of levels of the multi-level visual features to obtain a target mask that matches the visual features of each level, wherein the mask is a binary image, and the focus area in the mask is composed of pixel points with a pixel value of 1, while other areas in the mask are composed of pixel points with a pixel value of 0. For example, assuming that the dimension of the visual features of a certain level is 128×128×16 and the dimension of the mask is 512×512, when adjusting the dimension, the dimension of the mask is first reduced to 128×128, and then the reduced mask is channel-copied to obtain a multi-channel image with 16 channels. The multi-channel image is defined as the target mask, and the dimension of the target mask is 128×128×16. Since the dimension of the target mask is the same as the dimension of the visual features of the level, the target mask matches the visual features of the level; therefore, when the number of levels of the multi-level visual features is five, the dimension of the mask can be adjusted five times to obtain target masks that match the visual features of the five levels.

然后,将该层级的视觉特征与目标掩模进行逐元素相乘,保留落在目标掩模中关注区域内的视觉特征,而落在目标掩模中其他区域内的视觉特征变成0,得到维度为128×128×16的关注特征。Then, the visual features of this level are multiplied element-by-element with the target mask, and the visual features falling in the focus area of the target mask are retained, while the visual features falling in other areas of the target mask become 0, resulting in a focus feature with a dimension of 128×128×16.

下面对池化操作进行详细描述,池化操作可采用平均池化或者最大池化,也可以采用其他池化方式,本公开实施例在此不作限定。The pooling operation is described in detail below. The pooling operation may adopt average pooling or maximum pooling, or may adopt other pooling methods, which are not limited in the embodiments of the present disclosure.

Taking average pooling as an example: applying the mask operation to each level's visual features yields the corresponding attention features. Suppose one attention feature has dimensions 128×128×16, i.e. it contains 16 channels of 128×128 initial feature maps. For each position in the initial feature maps, the average of that position's values across the 16 channels is taken as the feature value at the corresponding position of a new feature map, producing a single-channel 128×128 pooled feature map. After the pooling operation is applied to the attention features corresponding to each level's visual features, a pooled feature map is obtained for each level's visual features, and these pooled feature maps together form the multi-level pooling features.
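The mask and pooling operations above can be sketched end-to-end as follows; the nearest-neighbor resize and the helper names are assumptions consistent with the 128×128×16 example:

```python
import numpy as np
from PIL import Image

def mask_pool_level(level_feat, mask):
    """Masked average pooling for one level's visual features.

    level_feat: (H, W, C) feature map, e.g. 128 x 128 x 16.
    mask: (H0, W0) binary mask, e.g. 512 x 512.
    Returns a single-channel (H, W) pooled feature map.
    """
    h, w, c = level_feat.shape
    # 1) Shrink the mask to the feature resolution, then replicate it across channels.
    small = np.array(Image.fromarray(mask.astype(np.uint8)).resize((w, h), Image.NEAREST))
    target_mask = np.repeat(small[:, :, None], c, axis=2)   # (H, W, C) target mask
    # 2) Element-wise multiplication keeps features inside the region of interest.
    attended = level_feat * target_mask
    # 3) Average across the C channels to obtain one pooled feature map for this level.
    return attended.mean(axis=2)
```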

其中,分别对各个多层级池化特征进行特征融合,具体可指对于任意一个多层级池化特征,对该多层级池化特征中的所有池化特征图进行特征融合,得到其中一个掩模视觉特征。Among them, feature fusion is performed on each multi-level pooling feature respectively, which specifically refers to that for any multi-level pooling feature, all pooling feature maps in the multi-level pooling feature are feature fused to obtain one of the mask visual features.

基于此,在多层级特征提取中能够提取到目标图像在不同层级的视觉特征,高层级的视觉特征通常比低层级的视觉特征更抽象且更具信息量,多层级视觉特征提供了更丰富和多样的数据表示,能够更全面和有效地学习和表达目标图像,得到合适的多层级视觉特征,然后在掩模池化过程中能够精确捕捉到目标图像中特定区域的像素级视觉信息,实现从目标对象以及各个候选对象所在局部区域中精确提取视觉特征,得到合适的多层级池化特征,然后在特征融合过程中能够整合不同层级的池化特征图,得到合适的掩模视觉特征,实现细粒度视觉理解,能够增强特定区域的像素级视觉信息的完整性,后续在文本预测过程中,能够有效提高对目标图像的像素级细节的理解,从而提高目标实体标识的预测准确率。Based on this, in the multi-level feature extraction, the visual features of the target image at different levels can be extracted. High-level visual features are usually more abstract and more informative than low-level visual features. Multi-level visual features provide richer and more diverse data representations, which can learn and express the target image more comprehensively and effectively, and obtain appropriate multi-level visual features. Then, in the mask pooling process, the pixel-level visual information of the specific area in the target image can be accurately captured, and the visual features can be accurately extracted from the local area where the target object and each candidate object are located, and the appropriate multi-level pooling features can be obtained. Then, in the feature fusion process, the pooling feature maps of different levels can be integrated to obtain the appropriate mask visual features, and fine-grained visual understanding can be achieved. The integrity of the pixel-level visual information in the specific area can be enhanced. In the subsequent text prediction process, the understanding of the pixel-level details of the target image can be effectively improved, thereby improving the prediction accuracy of the target entity identification.

在一种可能的实现方式中,提取目标图像的第一图像特征,具体可以是对目标图像进行多层级特征提取,得到目标图像的多层级视觉特征;将多层级视觉特征输入至第一多层感知器进行映射,得到第一图像特征。In a possible implementation, extracting the first image feature of the target image may specifically involve performing multi-level feature extraction on the target image to obtain multi-level visual features of the target image; and inputting the multi-level visual features into a first multi-layer perceptron for mapping to obtain the first image feature.

基于此,第一多层感知器用于进行多层感知处理,通过多层感知处理对多层级视觉特征进行进一步抽象,能够学习到更高级的数据表示,从而提高目标实体标识的预测准确率;第一多层感知器与第一大语言模型可联合训练。另外,在提取掩模视觉特征以及第一图像特征时,可利用同一个视觉编码器进行多层级特征提取,通过共享视觉编码器的特征映射,能够减小额外的计算和参数开销。Based on this, the first multi-layer perceptron is used to perform multi-layer perception processing, and the multi-level visual features are further abstracted through the multi-layer perception processing, so that a higher-level data representation can be learned, thereby improving the prediction accuracy of the target entity identification; the first multi-layer perceptron and the first large language model can be jointly trained. In addition, when extracting the mask visual features and the first image features, the same visual encoder can be used for multi-level feature extraction. By sharing the feature mapping of the visual encoder, the additional calculation and parameter overhead can be reduced.
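A minimal sketch of the first multi-layer perceptron, assuming the multi-level visual features have already been aggregated into one vector per image (the class name and dimensions are assumptions):

```python
import torch
import torch.nn as nn

class FirstProjector(nn.Module):
    """Maps aggregated multi-level visual features to the first image feature."""
    def __init__(self, vis_dim: int, out_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vis_dim, out_dim),
            nn.GELU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, multi_level_feats: torch.Tensor) -> torch.Tensor:
        # multi_level_feats: (batch, vis_dim) aggregated multi-level features
        return self.mlp(multi_level_feats)
```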

In a possible implementation, referring to FIG. 3, which is an optional flow chart of determining mask visual features provided in an embodiment of the present disclosure, feature fusion is performed on each multi-level pooling feature to obtain multiple mask visual features. Specifically, for any multi-level pooling feature, the sub-features of each level in that multi-level pooling feature are mapped to obtain multiple intermediate features of the same dimension, and the intermediate features are fused to obtain a fused feature; multi-layer perception processing is then performed on each fused feature to obtain the multiple mask visual features.

其中,将各个中间特征进行特征融合,具体可将各个中间特征进行求和,也可将各个中间特征进行拼接,本公开实施例在此不作限定。Among them, the various intermediate features are subjected to feature fusion. Specifically, the various intermediate features may be summed up, or the various intermediate features may be spliced, which is not limited in the embodiments of the present disclosure.

基于此,多层级池化特征中各个层级的子特征为上述的池化特征图,由于不同层级的子特征的维度通常是不同的,所以在特征融合之前,需要先将各个子特征映射为维度相同的中间特征,再对维度相同的中间特征进行特征融合,得到融合特征,能够有效整合不同层级的子特征,通过多层感知处理对融合特征进行进一步抽象,能够学习到更高级的数据表示,最后生成合适的掩模视觉特征,为文本预测提供更为丰富和精确的视觉信息,有助于提高对目标图像的像素级细节的理解,从而提高目标实体标识的预测准确率。Based on this, the sub-features of each level in the multi-level pooling features are the above-mentioned pooling feature maps. Since the dimensions of sub-features at different levels are usually different, before feature fusion, it is necessary to map each sub-feature to an intermediate feature of the same dimension, and then perform feature fusion on the intermediate features of the same dimension to obtain fused features, which can effectively integrate sub-features of different levels. The fused features are further abstracted through multi-layer perception processing, and more advanced data representations can be learned. Finally, suitable mask visual features are generated to provide richer and more accurate visual information for text prediction, which helps to improve the understanding of pixel-level details of the target image, thereby improving the prediction accuracy of the target entity identification.

具体地,当目标实体标识由第一大语言模型生成时,多层感知处理能够将具有视觉信息的融合特征映射为具有语言信息的掩模视觉特征,相当于将融合特征映射到一个与文本嵌入空间相匹配的特征空间中,实现特征在不同表示空间之间的转换,能够有效整合输入第一大语言模型的视觉信息和文本信息,有助于提高第一大语言模型对目标图像的像素级细节的理解,从而提高目标实体标识的预测准确率。Specifically, when the target entity identification is generated by the first largest language model, the multi-layer perception processing can map the fused features with visual information into masked visual features with language information, which is equivalent to mapping the fused features to a feature space that matches the text embedding space, thereby realizing the conversion of features between different representation spaces, and can effectively integrate the visual information and text information input into the first largest language model, which helps to improve the first largest language model's understanding of the pixel-level details of the target image, thereby improving the prediction accuracy of the target entity identification.

另外,各个中间特征可由对应的第一线性层映射得到;而分别对各个融合特征进行多层感知处理,具体可将各个融合特征分别输入至第二多层感知器进行映射,能够得到各个融合特征对应的掩模视觉特征,第一线性层、第二多层感知器以及第一大语言模型可联合训练。In addition, each intermediate feature can be obtained by mapping the corresponding first linear layer; and each fused feature can be processed by multi-layer perception respectively. Specifically, each fused feature can be input into the second multi-layer perceptron for mapping, so as to obtain the mask visual features corresponding to each fused feature. The first linear layer, the second multi-layer perceptron and the first large language model can be trained jointly.
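The per-level first linear layers, the fusion step, and the second multi-layer perceptron can be sketched as below; summation is used here as one of the two fusion options named above, and all dimensions and names are assumptions:

```python
import torch
import torch.nn as nn

class MaskFeatureFusion(nn.Module):
    """Per-level linear mapping, summation fusion, then the second MLP."""
    def __init__(self, level_dims, hidden_dim, out_dim):
        super().__init__()
        self.level_linears = nn.ModuleList(
            [nn.Linear(d, hidden_dim) for d in level_dims])
        self.second_mlp = nn.Sequential(
            nn.Linear(hidden_dim, out_dim), nn.GELU(), nn.Linear(out_dim, out_dim))

    def forward(self, level_feats):
        # level_feats: one pooled sub-feature per level, with differing dimensions
        intermediates = [lin(f) for lin, f in zip(self.level_linears, level_feats)]
        fused = torch.stack(intermediates, dim=0).sum(dim=0)   # summation fusion
        return self.second_mlp(fused)                          # one mask visual feature
```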

在一种可能的实现方式中,获取目标图像中各个候选对象对应的局部掩模,在多个局部掩模中确定查询掩模,具体可以是获取目标图像以及目标图像的提示标记;对目标图像进行分割,得到各个候选对象对应的局部掩模,基于提示标记在各个局部掩模中确定查询掩模。In a possible implementation, a local mask corresponding to each candidate object in a target image is obtained, and a query mask is determined from multiple local masks. Specifically, a target image and a prompt mark of the target image are obtained; the target image is segmented to obtain a local mask corresponding to each candidate object, and a query mask is determined from each local mask based on the prompt mark.

其中,目标图像包括多个候选对象,提示标记用于指示多个候选对象中被选择的目标对象,提示标记是指目标图像上绘制的标记,例如,提示标记可为目标图像上绘制的点、线、框、涂抹区等等。Among them, the target image includes multiple candidate objects, and the prompt mark is used to indicate the target object selected from the multiple candidate objects. The prompt mark refers to a mark drawn on the target image. For example, the prompt mark can be a point, line, box, smear area, etc. drawn on the target image.

具体地,对目标图像进行分割,具体可先预测出目标图像中各个像素点的类别分布概率,类别分布概率包括像素点属于各个候选对象的匹配概率值,对于任意一个像素点,将最高匹配概率值的候选对象确定为该像素点匹配的候选对象,各个候选对象所在的局部区域分别由所有匹配的像素点构成,然后能够基于各个像素点所匹配的候选对象,在目标图像中准确分割出各个候选对象所在的局部区域,然后能够基于各个候选对象所在的局部区域,准确确定各个候选对象对应的局部掩模。Specifically, the target image is segmented. Specifically, the category distribution probability of each pixel in the target image can be predicted first. The category distribution probability includes the matching probability value of the pixel belonging to each candidate object. For any pixel, the candidate object with the highest matching probability value is determined as the candidate object matched by the pixel. The local area where each candidate object is located is composed of all the matching pixels, respectively. Then, based on the candidate object matched by each pixel, the local area where each candidate object is located can be accurately segmented in the target image. Then, based on the local area where each candidate object is located, the local mask corresponding to each candidate object can be accurately determined.

例如,对于任意一个候选对象,创建一个与目标图像相同的初始图像,在初始图像中,将该候选对象所在的局部区域内的像素点的像素值赋值为1,将在初始图像中的其他区域内的像素点的像素值赋值为0,得到该候选对象对应的局部掩模;也可以通过其他方式确定候选对象对应的局部掩模,本公开实施例在此不作限定。For example, for any candidate object, an initial image identical to the target image is created. In the initial image, the pixel values of the pixels in the local area where the candidate object is located are assigned to 1, and the pixel values of the pixels in other areas of the initial image are assigned to 0, to obtain the local mask corresponding to the candidate object. The local mask corresponding to the candidate object can also be determined by other methods, which are not limited to the embodiments of the present disclosure.
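A sketch of deriving a local mask from the per-pixel class distribution probabilities, following the assignment rule described above (the array layout is an assumption):

```python
import numpy as np

def local_mask_for(prob_map: np.ndarray, candidate_id: int) -> np.ndarray:
    """prob_map: (H, W, K) matching probabilities over K candidate objects.

    Each pixel is assigned to its highest-probability candidate; the local
    mask is 1 on pixels assigned to `candidate_id` and 0 elsewhere.
    """
    assignment = prob_map.argmax(axis=2)
    return (assignment == candidate_id).astype(np.uint8)
```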

基于此,通过对目标图像进行分割,能够准确得到各个候选对象对应的局部掩模,提示标记可由相关人员绘制,提示标记能够表征相关人员的指代意图,所以提示标记指示的目标对象具体是被相关人员选择的对象,基于提示标记在局部掩模中确定查询掩模,使得查询掩模也能够表征相关人员的指代意图,通过将像素级的查询掩模作为视觉提示,能够高效、灵活且准确地表征指代目标对象,从而进一步提高目标实体标识的预测准确率。Based on this, by segmenting the target image, the local mask corresponding to each candidate object can be accurately obtained. The prompt mark can be drawn by the relevant personnel, and the prompt mark can represent the reference intention of the relevant personnel. Therefore, the target object indicated by the prompt mark is specifically the object selected by the relevant personnel. The query mask is determined in the local mask based on the prompt mark, so that the query mask can also represent the reference intention of the relevant personnel. By using the pixel-level query mask as a visual cue, the reference target object can be efficiently, flexibly and accurately represented, thereby further improving the prediction accuracy of the target entity identification.

Specifically, referring to FIG. 4, which is an optional flow chart of generating masks provided in an embodiment of the present disclosure, the target image and the prompt marker can be input into a target mask generation model for mask prediction to generate the local masks and the query mask; for example, the target mask generation model may adopt the Segment Anything Model (SAM), the Fast Segment Anything Model (FastSAM), or other models, which are not limited in the embodiments of the present disclosure.

在一种可能的实现方式中,基于提示标记在各个局部掩模中确定查询掩模,具体可以是当提示标记为标记点时,基于标记点与各个局部掩模之间的位置关系,在各个局部掩模中确定查询掩模;In a possible implementation, the query mask is determined in each local mask based on the hint mark. Specifically, when the hint mark is a mark point, the query mask is determined in each local mask based on a positional relationship between the mark point and each local mask.

或者,当提示标记为标记框时,基于标记框与各个局部掩模的掩模边界之间的匹配程度,在各个局部掩模中确定查询掩模;Alternatively, when the hint is marked as a marker box, the query mask is determined in each local mask based on a degree of matching between the marker box and a mask boundary of each local mask;

或者,当提示标记为标记区域时,基于标记区域与各个局部掩模之间的匹配程度,在各个局部掩模中确定查询掩模。Alternatively, when the hint is marked as a labeled region, the query mask is determined in each local mask based on the matching degree between the labeled region and each local mask.

其中,标记点的数量可以是一个或者多个,当标记点的数量是一个时,标记点位于查询掩模的掩模边界内;当标记点的数量是多个时,可将关注区域内包含最多标记点的局部掩模确定为查询掩模,也可通过其他方式确定查询掩模,本公开实施例在此不作限定。Among them, the number of marking points can be one or more. When the number of marking points is one, the marking point is located within the mask boundary of the query mask; when the number of marking points is multiple, the local mask containing the most marking points in the focus area can be determined as the query mask, and the query mask can also be determined by other methods, which is not limited in the embodiments of the present disclosure.

基于此,在处理标记点时,由于标记点为目标图像上目标对象的所在区域内的像素点,通过确定标记点与各个局部掩模之间的位置关系,能够确定标记点是否位于局部掩模的掩模边界内,即确定局部掩模中的关注区域是否包含标记点,然后可将关注区域内包含标记点的局部掩模确定为查询掩模;在处理标记框时,由于标记框的框选区域通常能够覆盖目标对象的所在区域,所以可计算标记框与各个局部掩模的掩模边界之间的匹配程度,然后将匹配程度最高的局部掩模确定为查询掩模;类似地,在处理标记区域时,标记区域相当于涂鸦,由于标记区域通常能够覆盖目标对象的所在区域,所以也可计算匹配程度,并将匹配程度最高的局部掩模确定为查询掩模;因此,上述三种处理方式都使得查询掩模能够有效覆盖目标对象的所在区域,进而确保查询掩模能够高效、灵活且准确地表征指代目标对象,从而进一步提高目标实体标识的预测准确率。Based on this, when processing the marking point, since the marking point is a pixel point in the area where the target object is located on the target image, by determining the positional relationship between the marking point and each local mask, it is possible to determine whether the marking point is located within the mask boundary of the local mask, that is, to determine whether the focus area in the local mask contains the marking point, and then the local mask containing the marking point in the focus area can be determined as the query mask; when processing the marking box, since the box selection area of the marking box can usually cover the area where the target object is located, the matching degree between the marking box and the mask boundary of each local mask can be calculated, and then the local mask with the highest matching degree can be determined as the query mask; similarly, when processing the marking area, the marking area is equivalent to graffiti, and since the marking area can usually cover the area where the target object is located, the matching degree can also be calculated, and the local mask with the highest matching degree can be determined as the query mask; therefore, the above three processing methods all enable the query mask to effectively cover the area where the target object is located, thereby ensuring that the query mask can efficiently, flexibly and accurately represent the target object, thereby further improving the prediction accuracy of the target entity identification.

需要说明的是,提示标记可通过多种方式确定,例如,在显示界面中显示了目标图像以及图像标记控件,响应于图像标记控件的交互,确定提示标记;图像标记控件具体可包括图像点击控件、图像框选控件、图像涂抹控件等等;示例性地,当图像标记控件为图像点击控件时,响应于图像点击控件的交互,能够检测相关人员在目标图像上点击的像素点,进而将该像素点确定为标记点;当图像标记控件为图像框选控件时,响应于图像框选控件的交互,能够检测相关人员在目标图像上绘制的边界框,进而将该边界框确定为标记框;当图像标记控件为图像涂抹控件时,响应于图像涂抹控件的交互,能够检测相关人员在目标图像上绘制的涂抹区域,进而将该涂抹区域确定为标记区域。It should be noted that the prompt mark can be determined in a variety of ways. For example, the target image and the image marking control are displayed in the display interface, and the prompt mark is determined in response to the interaction of the image marking control. The image marking control may specifically include an image click control, an image frame selection control, an image smear control, and the like. Exemplarily, when the image marking control is an image click control, in response to the interaction of the image click control, it can detect the pixel point clicked by the relevant person on the target image, and then determine the pixel point as the marking point; when the image marking control is an image frame selection control, in response to the interaction of the image frame selection control, it can detect the bounding box drawn by the relevant person on the target image, and then determine the bounding box as the marking box; when the image marking control is an image smear control, in response to the interaction of the image smear control, it can detect the smear area drawn by the relevant person on the target image, and then determine the smear area as the marking area.

又例如,在显示界面中显示了文本输入框,响应于文本输入框的交互,能够获取相关人员在文本输入框输入的内容,进而基于文本输入框输入的内容确定提示标记。示例性地,假设目标图像表示为一个x行y列的二维矩阵,那么文本输入框输入的内容可为“目标图像中位于第x'行第y'列的像素点”,进而基于文本输入框输入的内容,将目标图像中位于第x'行第y'列的像素点确定为标记点,其中,x'≤x,y'≤y。For another example, a text input box is displayed in the display interface. In response to the interaction of the text input box, the content input by the relevant personnel in the text input box can be obtained, and then the prompt mark is determined based on the content input in the text input box. Exemplarily, assuming that the target image is represented as a two-dimensional matrix of x rows and y columns, the content input in the text input box can be "the pixel point located at the x'th row and y'th column in the target image", and then based on the content input in the text input box, the pixel point located at the x'th row and y'th column in the target image is determined as the mark point, where x'≤x, y'≤y.

In a possible implementation, determining the query mask among the local masks based on the matching degree between the marked region and each local mask may specifically be: determining the intersection-over-union between the marked region and each local mask, i.e. the matching degree is the intersection-over-union, and then determining the local mask with the largest intersection-over-union as the query mask; or determining the center distance between the marked region and each local mask, i.e. the matching degree is the center distance, and then determining the local mask with the smallest center distance as the query mask; the query mask may also be determined in other ways, which are not limited in the embodiments of the present disclosure.

具体地,与标记区域的处理方式类似,标记框与各个局部掩模的掩模边界之间的匹配程度也可通过计算交并比或者中心距离等方式确定。Specifically, similar to the processing method of the marked area, the matching degree between the marked box and the mask boundary of each local mask can also be determined by calculating the intersection-over-union ratio or the center distance.
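The three marker types can be handled as in the sketch below, using intersection-over-union as the matching degree for boxes and smeared regions (center distance is the alternative named above; all helper names are assumptions):

```python
import numpy as np

def iou(a: np.ndarray, b: np.ndarray) -> float:
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / union if union else 0.0

def box_to_mask(box, shape):
    x0, y0, x1, y1 = box
    m = np.zeros(shape, dtype=np.uint8)
    m[y0:y1, x0:x1] = 1
    return m

def select_query_mask(local_masks, points=None, box=None, region=None):
    """Pick the query mask for a point, box, or smeared-region prompt marker."""
    if points is not None:  # mask whose region of interest contains the most points
        counts = [sum(int(m[y, x]) for x, y in points) for m in local_masks]
        return local_masks[int(np.argmax(counts))]
    prompt = box_to_mask(box, local_masks[0].shape) if box is not None else region
    scores = [iou(prompt, m) for m in local_masks]   # matching degree per mask
    return local_masks[int(np.argmax(scores))]
```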

在一种可能的实现方式中,目标实体标识由第一大语言模型生成,第一大语言模型通过以下步骤训练得到:获取样本图像以及样本图像中样本对象对应的第二掩模,对样本图像进行分割,得到样本图像中各个视觉对象对应的第一掩模,其中,样本对象为多个视觉对象中的一个对象;提取样本图像的第二图像特征,对样本图像分别进行基于第二掩模以及各个第一掩模的特征提取,得到多个样本视觉特征,提取第二掩模以及各个第一掩模对应的样本位置特征;将第二图像特征、样本视觉特征以及样本位置特征拼接后输入至第一大语言模型进行文本预测,生成预测概率分布,其中,预测概率分布用于确定样本对象的实体标识;获取样本图像所链接的样本实体的第一实体标识,基于预测概率分布与第一实体标识确定模型损失,基于模型损失训练第一大语言模型。In a possible implementation, the target entity identifier is generated by a first large language model, and the first large language model is trained through the following steps: obtaining a sample image and a second mask corresponding to a sample object in the sample image, segmenting the sample image, and obtaining a first mask corresponding to each visual object in the sample image, wherein the sample object is one of multiple visual objects; extracting a second image feature of the sample image, performing feature extraction based on the second mask and each first mask on the sample image to obtain multiple sample visual features, and extracting sample position features corresponding to the second mask and each first mask; splicing the second image feature, the sample visual feature, and the sample position feature and inputting them into the first large language model for text prediction to generate a predicted probability distribution, wherein the predicted probability distribution is used to determine the entity identifier of the sample object; obtaining a first entity identifier of a sample entity linked to the sample image, determining a model loss based on the predicted probability distribution and the first entity identifier, and training the first large language model based on the model loss.

值得注意的是,与目标图像类似,样本图像是指需要进行视觉实体链接的图像;与局部掩模类似,各个第一掩模分别用于指示对应的视觉对象,第一掩模能够在样本图像中指定相应视觉对象所在的局部区域;与查询掩模类似,第二掩模用于指示样本对象,第二掩模能够在样本图像中指定样本对象所在的局部区域;与目标对象类似,样本对象相当于从多个视觉对象中选择的对象;与掩模视觉特征类似,样本视觉特征能够捕捉到相应对象所在局部区域的像素级视觉信息;与掩模位置特征类似,样本位置特征能够捕捉到相应对象所在局部区域的像素级位置信息。It is worth noting that, similar to the target image, the sample image refers to the image that needs to be visually linked; similar to the local mask, each first mask is used to indicate the corresponding visual object, and the first mask can specify the local area where the corresponding visual object is located in the sample image; similar to the query mask, the second mask is used to indicate the sample object, and the second mask can specify the local area where the sample object is located in the sample image; similar to the target object, the sample object is equivalent to an object selected from multiple visual objects; similar to the mask visual feature, the sample visual feature can capture the pixel-level visual information of the local area where the corresponding object is located; similar to the mask position feature, the sample position feature can capture the pixel-level position information of the local area where the corresponding object is located.

其中,样本实体用于指示样本对象,样本实体可为知识库中实体的标签,在知识库中,每个实体通常被设计为表示一个独一无二的对象,并对应一个全局唯一的标签,不同的标签能够指示不同的实体,例如,样本实体可为e=Q10000XX,e为实体(entity),Q10000XX为标签。Among them, the sample entity is used to indicate the sample object. The sample entity can be the label of the entity in the knowledge base. In the knowledge base, each entity is usually designed to represent a unique object and corresponds to a globally unique label. Different labels can indicate different entities. For example, the sample entity can be e=Q10000XX, where e is the entity and Q10000XX is the label.

其中,第一实体标识可包括一个样本标识符或者多个样本标识符;当第一实体标识为由多个样本标识符组成的序列时,大语言模型会生成各个样本标识符对应的预测概率分布,各个预测概率分布均包括各个候选标识符的预测概率值,可基于最高预测概率值的候选标识符确定样本对象的实体标识;当第一实体标识为单个样本标识符时,预测概率分布包括各个候选标识符的预测概率值。Among them, the first entity identifier may include one sample identifier or multiple sample identifiers; when the first entity identifier is a sequence composed of multiple sample identifiers, the large language model will generate a prediction probability distribution corresponding to each sample identifier, each prediction probability distribution includes a prediction probability value of each candidate identifier, and the entity identifier of the sample object can be determined based on the candidate identifier with the highest prediction probability value; when the first entity identifier is a single sample identifier, the prediction probability distribution includes the prediction probability value of each candidate identifier.

具体地,第一实体标识用于指示样本对象的样本实体,样本标识符可为分词或者分词的索引,样本标识符也可为其他形式,只要确保第一实体标识能够指示样本对象的样本实体即可,本公开实施例在此不作限定。Specifically, the first entity identifier is used to indicate the sample entity of the sample object. The sample identifier may be a word segment or an index of a word segment. The sample identifier may also be in other forms, as long as the first entity identifier can indicate the sample entity of the sample object. The embodiments of the present disclosure are not limited here.

For example, when the sample identifiers are tokens (word segments), the candidate identifiers are also tokens, and all tokens are recorded in a vocabulary; the predicted probability distribution includes a predicted probability value for each token in the vocabulary. For example, if the first entity identifier is [_course][olf][_G], it includes three sample identifiers: the first sample identifier is [_course], the second sample identifier is [olf], and the third sample identifier is [_G].

又例如,当样本标识符为分词的索引时,候选标识符也为分词的索引,所有分词以及对应的索引均可记录在词表中,预测概率分布包括词表中的各个分词的索引的预测概率值,假设词表包括100个分词以及对应的索引,则预测概率分布包括100个预测概率值;其中,索引用于指示对应分词在词嵌入矩阵中的位置,词嵌入矩阵包括词表中的各个分词在词嵌入空间中的词嵌入向量,假设第一实体标识为[10,40,5],该第一实体标识包括3个样本标识符,分别为10、40和5,样本标识符10用于指示词嵌入矩阵中的第10个位置,样本标识符40用于指示词嵌入矩阵中的第40个位置,样本标识符5用于指示词嵌入矩阵中的第5个位置,此时,第一实体标识能够以整数序列的形式存在,该整数序列又可称为整数代码,实现了通过紧凑的整数代码来表示知识库中的每个实体。For another example, when the sample identifier is the index of a word segment, the candidate identifier is also the index of the word segment, all word segments and the corresponding indexes can be recorded in the vocabulary, and the predicted probability distribution includes the predicted probability values of the indexes of each word segment in the vocabulary. Assuming that the vocabulary includes 100 word segments and corresponding indexes, the predicted probability distribution includes 100 predicted probability values; wherein the index is used to indicate the position of the corresponding word segment in the word embedding matrix, and the word embedding matrix includes the word embedding vectors of each word segment in the word embedding space. Assuming that the first entity identifier is [10,40,5], the first entity identifier includes 3 sample identifiers, namely 10, 40 and 5, respectively. Sample identifier 10 is used to indicate the 10th position in the word embedding matrix, sample identifier 40 is used to indicate the 40th position in the word embedding matrix, and sample identifier 5 is used to indicate the 5th position in the word embedding matrix. At this time, the first entity identifier can exist in the form of an integer sequence, which can also be called an integer code, thereby realizing the representation of each entity in the knowledge base by a compact integer code.
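When the identifiers are vocabulary indices, greedy decoding over the per-position predicted probability distributions recovers the integer code, as in this sketch:

```python
import numpy as np

def decode_entity_identifier(prob_dists):
    """prob_dists: one (vocab_size,) predicted distribution per identifier
    position; the highest-probability candidate is taken at each step."""
    return [int(np.argmax(p)) for p in prob_dists]   # e.g. [10, 40, 5]
```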

基于此,通过获取样本图像,然后通过对样本图像进行分割得到样本图像中各个视觉对象对应的第一掩模,还获取样本图像中样本对象对应的第二掩模,进而通过特征提取得到第二掩模以及各个第一掩模对应的样本视觉特征,并提取第二掩模以及各个第一掩模对应的样本位置特征,通过第一大语言模型进行文本预测,能够生成用于确定实体标识的预测概率分布,然后基于样本实体的第一实体标识确定样本概率分布,将样本概率分布作为标签数据,然后根据预测概率分布和样本概率分布之间的差异确定模型损失,基于模型损失对第一大语言模型进行监督学习,通过迭代训练缩小模型损失,使得第一大语言模型能够提高对目标图像的全局特征以及像素级细节的理解,从而提高目标实体标识的预测准确率。Based on this, by obtaining a sample image, and then obtaining a first mask corresponding to each visual object in the sample image by segmenting the sample image, and also obtaining a second mask corresponding to the sample object in the sample image, and then obtaining the second mask and the sample visual features corresponding to each first mask by feature extraction, and extracting the second mask and the sample position features corresponding to each first mask, and performing text prediction through the first large language model, it is possible to generate a predicted probability distribution for determining an entity identifier, and then determine the sample probability distribution based on the first entity identifier of the sample entity, use the sample probability distribution as label data, and then determine the model loss based on the difference between the predicted probability distribution and the sample probability distribution, and perform supervised learning on the first large language model based on the model loss, and reduce the model loss through iterative training, so that the first large language model can improve the understanding of the global features and pixel-level details of the target image, thereby improving the prediction accuracy of the target entity identifier.
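A sketch of the supervised objective, assuming cross-entropy against the sample identifiers; the disclosure only requires a loss measuring the gap between the predicted and sample distributions, and cross-entropy is one conventional choice:

```python
import torch
import torch.nn.functional as F

def model_loss(logits: torch.Tensor, target_ids: torch.Tensor) -> torch.Tensor:
    """logits: (seq_len, vocab_size) per-position scores from the first LLM.
    target_ids: (seq_len,) sample identifiers of the first entity identifier."""
    return F.cross_entropy(logits, target_ids)
```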

在一种可能的实现方式中,样本图像、第二掩模以及第一实体标识均从数据集中获取,获取样本图像以及样本图像中样本对象对应的第二掩模之前,标识生成方法还包括:获取多个原始图像以及各个原始图像对应的查询文本,根据各个原始图像以及对应的查询文本,分别确定各个原始图像对应的识别信息;获取多个候选实体,基于各个识别信息,分别在多个候选实体中确定各个原始图像对应的链接实体;基于各个原始图像以及对应的查询文本,分别确定各个原始图像的标注掩模;将各个原始图像、对应的标注掩模以及对应的链接实体关联存储至数据集。In a possible implementation, the sample image, the second mask and the first entity identifier are all obtained from a data set. Before obtaining the sample image and the second mask corresponding to the sample object in the sample image, the identifier generation method also includes: obtaining multiple original images and query texts corresponding to each original image, and determining the identification information corresponding to each original image according to each original image and the corresponding query text; obtaining multiple candidate entities, and determining the linked entity corresponding to each original image in the multiple candidate entities based on each identification information; determining the annotation mask of each original image based on each original image and the corresponding query text; and associating and storing each original image, the corresponding annotation mask and the corresponding linked entity in the data set.

其中,原始图像是指需要进行视觉实体链接的图像,原始图像通常包括多个对象,但只需要对原始图像中的关注对象进行视觉实体链接,与目标对象类似,关注对象相当于从原始图像中的多个对象中选择的对象,查询文本用于提示识别出对应的原始图像中的关注对象,查询文本能够表征指代关注对象的意图,标注掩模用于指示对应的关注对象,标注掩模能够在原始图像中指定关注对象所在的局部区域。Among them, the original image refers to the image that needs to be visually linked. The original image usually includes multiple objects, but only the focus object in the original image needs to be visually linked. Similar to the target object, the focus object is equivalent to an object selected from multiple objects in the original image. The query text is used to prompt the identification of the corresponding focus object in the original image. The query text can represent the intention of referring to the focus object. The annotation mask is used to indicate the corresponding focus object. The annotation mask can specify the local area where the focus object is located in the original image.

具体地,各个候选实体均可为知识库中实体的标签,在知识库中,每个实体通常被设计为表示一个独一无二的对象,并对应一个全局唯一的标签,例如,某个候选实体可为e=Q10000XX,另一个候选实体可为e=Q20000XX,e为实体(entity),Q10000XX和Q20000XX均为标签;因此,链接实体也可为知识库中实体的标签。Specifically, each candidate entity can be a label of an entity in the knowledge base. In the knowledge base, each entity is usually designed to represent a unique object and corresponds to a globally unique label. For example, a candidate entity can be e=Q10000XX, and another candidate entity can be e=Q20000XX, where e is the entity, and Q10000XX and Q20000XX are both labels; therefore, the linked entity can also be a label of an entity in the knowledge base.

其中,样本图像从多个原始图像中采样得到,第二掩模为样本图像对应的标注掩模,样本实体为样本图像所链接的链接实体。The sample image is sampled from a plurality of original images, the second mask is a labeling mask corresponding to the sample image, and the sample entity is a link entity linked to the sample image.

基于此,根据原始图像以及对应的查询文本所提供额外的语义上下文,能够确定准确的识别信息,进而基于识别信息,能够确定准确的链接实体,原始图像、对应的标注掩模以及对应的链接实体之间存在关联性,因此可将原始图像、对应的标注掩模以及对应的链接实体作为一组训练样本,并将原始图像、对应的标注掩模以及对应的链接实体关联存储至数据集,实现将原始图像与对应的链接实体进行链接,使得标注掩模所指示的关注对象与对应的链接实体存在链接关系;由于数据集通常包括多组训练样本,在训练过程中可从数据集中采样出训练样本,然后将被采样得到的训练样本中的原始图像确定为样本图像,并将被采样得到的训练样本中的标注掩模确定为第二掩模,以及将被采样得到的训练样本中的链接实体确定为样本实体,确保采样得到的样本图像、第二掩模以及样本实体之间存在关联性,从而确保训练过程的有效进行。Based on this, according to the additional semantic context provided by the original image and the corresponding query text, accurate recognition information can be determined, and then based on the recognition information, accurate linked entities can be determined. There is a correlation between the original image, the corresponding annotation mask and the corresponding linked entity. Therefore, the original image, the corresponding annotation mask and the corresponding linked entity can be used as a group of training samples, and the original image, the corresponding annotation mask and the corresponding linked entity are associated and stored in the data set to achieve linking the original image with the corresponding linked entity, so that the object of interest indicated by the annotation mask has a link relationship with the corresponding linked entity; since the data set usually includes multiple groups of training samples, training samples can be sampled from the data set during the training process, and then the original image in the sampled training sample is determined as the sample image, the annotation mask in the sampled training sample is determined as the second mask, and the linked entity in the sampled training sample is determined as the sample entity, ensuring that there is a correlation between the sampled sample image, the second mask and the sample entity, thereby ensuring the effective progress of the training process.

具体地,从数据集中采样出训练样本的方式可为随机采样、均匀采样等等,本公开实施例在此不作限定;在训练过程中,可以先利用从数据集采集得到的训练样本对第一大语言模型进行预训练,使得第一大语言模型学习实体以及实体标识的知识,有助于在推理过程中生成有效的目标实体标识。然后利用下游场景的训练样本对第一大语言模型进行微调,以提高第一大语言模型进行细粒度视觉实体链接的能力。Specifically, the method of sampling training samples from the data set can be random sampling, uniform sampling, etc., which is not limited in the embodiments of the present disclosure; during the training process, the first language model can be pre-trained using the training samples collected from the data set, so that the first language model learns the knowledge of entities and entity identifiers, which helps to generate effective target entity identifiers during the reasoning process. Then, the first language model is fine-tuned using the training samples of the downstream scene to improve the ability of the first language model to perform fine-grained visual entity linking.

具体地,链接实体可通过多种方式确定,下面对链接实体的第一种确定方式进行详细描述。Specifically, the link entity may be determined in a variety of ways, and the first way of determining the link entity is described in detail below.

In a possible implementation, the recognition information is a recognition encoding result. Determining the recognition information corresponding to each original image from each original image and its query text may specifically be: inputting each original image together with its query text into a multimodal encoding model for encoding to obtain the recognition encoding result corresponding to that original image; the multimodal encoding model may be a Contrastive Language–Image Pre-training (CLIP) model, a Pathways Language and Image model (PaLI), or the like, which is not limited in the embodiments of the present disclosure.

然后,基于各个识别信息,分别在多个候选实体中确定各个原始图像对应的链接实体,具体可以是将知识库中各个候选实体的候选名称文本以及对应的候选图像,输入至多模态编码模型进行编码,得到各个候选实体对应的候选编码结果;对于任意一个原始图像,分别确定原始图像对应的识别编码结果与各个候选编码结果之间的相似度,将相似度最高的候选编码结果确定为目标编码结果,将目标编码结果对应的候选实体确定为原始图像对应的链接实体;基于此,能够快速且准确确定原始图像对应的链接实体。Then, based on each piece of recognition information, the linked entity corresponding to each original image is determined from multiple candidate entities. Specifically, the candidate name text of each candidate entity in the knowledge base and the corresponding candidate image are input into the multimodal coding model for encoding to obtain the candidate coding result corresponding to each candidate entity. For any original image, the similarity between the recognition coding result corresponding to the original image and each candidate coding result is determined respectively, and the candidate coding result with the highest similarity is determined as the target coding result, and the candidate entity corresponding to the target coding result is determined as the linked entity corresponding to the original image. Based on this, the linked entity corresponding to the original image can be quickly and accurately determined.
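A sketch of the similarity-based linking step, assuming cosine similarity between the recognition encoding result and the candidate encoding results (names are illustrative):

```python
import numpy as np

def link_entity(recognition_vec, candidate_vecs, candidate_entities):
    """Return the candidate entity whose encoding is most similar.

    recognition_vec: (D,) encoding of the original image plus query text.
    candidate_vecs: (K, D) encodings of the K candidate entities.
    """
    sims = candidate_vecs @ recognition_vec / (
        np.linalg.norm(candidate_vecs, axis=1) * np.linalg.norm(recognition_vec))
    return candidate_entities[int(np.argmax(sims))]
```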

需要说明的是,知识库除了存储候选实体的标签以外,还会存储候选实体的相关文本以及对应的候选图像,候选图像用于展示候选实体,候选实体的相关文本包括候选实体的名称文本以及描述文本,名称文本用于指示实体且具有可读性,描述文本用于描述候选实体。It should be noted that in addition to storing the labels of candidate entities, the knowledge base also stores the relevant text of the candidate entities and the corresponding candidate images. The candidate images are used to display the candidate entities. The relevant text of the candidate entities includes the name text and description text of the candidate entities. The name text is used to indicate the entity and is readable, and the description text is used to describe the candidate entity.

下面对链接实体的第二种确定方式进行详细描述。The second method for determining the link entity is described in detail below.

在另一种可能的实现方式中,识别信息为识别名称文本,根据各个原始图像以及对应的查询文本,分别确定各个原始图像对应的识别信息,具体可以是分别将各个原始图像以及对应的查询文本输入至多模态编码模型进行编码,得到各个原始图像对应的识别编码结果;对识别编码结果进行解码,得到原始图像对应的识别名称文本。In another possible implementation, the identification information is an identification name text. According to each original image and the corresponding query text, the identification information corresponding to each original image is determined respectively. Specifically, each original image and the corresponding query text can be input into a multimodal coding model for encoding to obtain an identification coding result corresponding to each original image; the identification coding result is decoded to obtain the identification name text corresponding to the original image.

然后,基于各个识别信息,分别在多个候选实体中确定各个原始图像对应的链接实体,具体可以是对于任意一个原始图像,基于原始图像对应的识别名称文本,在知识库中各个候选实体的候选名称文本中检索出一致的目标名称文本,将目标名称文本对应的候选实体确定为原始图像对应的链接实体;基于此,能够准确确定原始图像对应的链接实体。Then, based on each piece of recognition information, the linked entity corresponding to each original image is determined among multiple candidate entities. Specifically, for any original image, based on the recognition name text corresponding to the original image, a consistent target name text is retrieved from the candidate name texts of each candidate entity in the knowledge base, and the candidate entity corresponding to the target name text is determined as the linked entity corresponding to the original image. Based on this, the linked entity corresponding to the original image can be accurately determined.

需要说明的是,链接实体还可以通过其他方式确定,本公开实施例在此不作限定。It should be noted that the link entity may also be determined in other ways, which are not limited in the embodiments of the present disclosure.

在一种可能的实现方式中,基于各个原始图像以及对应的查询文本,分别确定各个原始图像的标注掩模,具体可以是将各个查询文本分别输入至第二大语言模型进行文本预测,生成各个原始图像对应的概括文本;分别基于各个原始图像和对应的概括文本进行对象检测,生成各个原始图像对应的原始边界框,其中,原始边界框用于指示对应的关注对象;分别将各个原始图像和对应的原始边界框输入至第一掩模生成模型进行掩模预测,生成各个原始图像对应的标注掩模。In a possible implementation, based on each original image and the corresponding query text, the annotation mask of each original image is determined respectively. Specifically, each query text can be input into the second largest language model for text prediction to generate a summary text corresponding to each original image; object detection is performed based on each original image and the corresponding summary text to generate an original bounding box corresponding to each original image, wherein the original bounding box is used to indicate the corresponding object of interest; each original image and the corresponding original bounding box is input into the first mask generation model for mask prediction to generate an annotation mask corresponding to each original image.

其中,第二大语言模型属于大语言模型,第二大语言模型用于处理提取文本的指代表达式的任务,第二大语言模型能够基于查询文本生成对应的概括文本,概括文本能够描述关注对象的位置或者与其他对象之间的关系,例如,假设查询文本为“椅子上放置的棕色物品是什么”,对应的概括文本可为“椅子上的棕色物品”。Among them, the second largest language model belongs to the large language model. The second largest language model is used to process the task of extracting referential expressions from text. The second largest language model can generate corresponding summary text based on the query text. The summary text can describe the location of the object of interest or the relationship with other objects. For example, assuming the query text is "What is the brown object placed on the chair", the corresponding summary text can be "the brown object on the chair".

基于此,基于原始图像和对应的概括文本进行对象检测,能够生成准确的原始边界框,使得原始边界框能够覆盖关注对象的所在区域,然后将原始图像和对应的原始边界框输入至第一掩模生成模型进行掩模预测,通过原始边界框提供准确的位置提示,能够提高第一掩模生成模型生成标注掩模的质量,从而有效提高标注掩模的标注成功率。Based on this, object detection is performed based on the original image and the corresponding summarized text, and an accurate original bounding box can be generated so that the original bounding box can cover the area where the object of interest is located. The original image and the corresponding original bounding box are then input into the first mask generation model for mask prediction. The original bounding box provides accurate position hints, which can improve the quality of the annotation mask generated by the first mask generation model, thereby effectively improving the annotation success rate of the annotation mask.

Specifically, referring to FIG. 5, which is an optional flow chart of updating the data set provided in an embodiment of the present disclosure.

首先,根据原始图像和对应的查询文本确定识别信息,基于识别信息确定对应的链接实体;First, identification information is determined according to the original image and the corresponding query text, and the corresponding link entity is determined based on the identification information;

Then, the query text is input into the second large language model for text prediction to generate the summary text, and the original image together with the corresponding summary text is input into an object detection model for object detection to generate an accurate original bounding box; the object detection model may adopt the Grounding DINO model or other models, which are not limited in the embodiments of the present disclosure.

Then, the original image and the corresponding original bounding box are input into the first mask generation model for mask prediction to generate the annotation mask, where the first mask generation model may adopt the Segment Anything Model (SAM), the Fast Segment Anything Model (FastSAM), or other models, which are not limited in the embodiments of the present disclosure.

然后,将原始图像、对应的标注掩模以及对应的链接实体关联存储至数据集。Then, the original image, the corresponding annotation mask, and the corresponding linked entity are associated and stored in the dataset.

在一种可能的实现方式中,参照图6,图6为本公开实施例提供的剔除原始图像的一种可选的架构示意图,分别将各个原始图像和对应的原始边界框输入至第一掩模生成模型进行掩模预测,生成各个原始图像对应的标注掩模之后,标识生成方法还包括:获取各个链接实体对应的参考名称文本,其中,参考名称文本用于指示参考实体的名称,参考实体在知识库中的层级高于链接实体在知识库中的层级;分别将各个原始图像和对应的参考名称文本输入至第二掩模生成模型进行掩模预测,生成各个原始图像对应的参考掩模;确定各个标注掩模与对应的参考掩模之间的匹配程度,得到各个标注掩模对应的目标匹配度;当目标匹配度小于预设的匹配度阈值时,剔除目标匹配度对应的原始图像。In a possible implementation, referring to FIG. 6 , FIG. 6 is a schematic diagram of an optional architecture for removing original images provided in an embodiment of the present disclosure. After each original image and the corresponding original bounding box are respectively input into a first mask generation model for mask prediction and an annotation mask corresponding to each original image is generated, the identification generation method further includes: obtaining a reference name text corresponding to each linked entity, wherein the reference name text is used to indicate the name of the reference entity, and the level of the reference entity in the knowledge base is higher than the level of the linked entity in the knowledge base; each original image and the corresponding reference name text are respectively input into a second mask generation model for mask prediction to generate a reference mask corresponding to each original image; determining the degree of matching between each annotation mask and the corresponding reference mask to obtain a target matching degree corresponding to each annotation mask; when the target matching degree is less than a preset matching degree threshold, removing the original image corresponding to the target matching degree.

其中,知识库的类型可为知识图谱,知识库中不同的实体可存在层级关系,假设知识库包括名称为“哺乳动物”的实体e=Q73XX,知识库还包括名称为“猫”的实体e=Q3009XX,由于“猫”属于“哺乳动物”,所以可在知识库中设定实体e=Q73XX与实体e=Q3009XX存在层级关系,且实体e=Q73XX的层级高于实体e=Q3009XX;因此,假设链接实体为实体e=Q3009XX,由于实体e=Q73XX的层级高于实体e=Q3009XX,可将实体e=Q73XX确定为参考实体,以及将“哺乳动物”确定为参考名称文本,也可将更高层级的实体确定为参考实体,本公开实施例在此不作限定。Among them, the type of the knowledge base may be a knowledge graph, and different entities in the knowledge base may have a hierarchical relationship. Suppose the knowledge base includes an entity e=Q73XX named "mammal", and the knowledge base also includes an entity e=Q3009XX named "cat". Since "cat" belongs to "mammal", it can be set in the knowledge base that entity e=Q73XX and entity e=Q3009XX have a hierarchical relationship, and the level of entity e=Q73XX is higher than that of entity e=Q3009XX; therefore, assuming that the linked entity is entity e=Q3009XX, since the level of entity e=Q73XX is higher than that of entity e=Q3009XX, entity e=Q73XX can be determined as a reference entity, and "mammal" can be determined as a reference name text. Entities at a higher level can also be determined as reference entities, and the embodiments of the present disclosure are not limited here.

The second mask generation model may adopt the Segment Everything Everywhere All at Once (SEEM) model, or other models, which are not limited in the embodiments of the present disclosure.

需要说明的是,匹配度阈值可通过训练后的第一回归模型的预测得到或者通过多次试验确定,本公开实施例在此不作限定。It should be noted that the matching degree threshold can be obtained through the prediction of the trained first regression model or determined through multiple experiments, and the embodiments of the present disclosure are not limited thereto.

基于此,将原始图像和对应的参考名称文本输入至第二掩模生成模型进行掩模预测,通过参考名称文本扩大关注对象的语义范围,能够提高第二掩模生成模型生成参考掩模的质量,然后确定标注掩模与对应的参考掩模之间的匹配程度,当目标匹配度小于预设的匹配度阈值时,代表第一掩模生成模型可能生成了低质量的标注掩模,通过剔除该目标匹配度对应的原始图像,相当于对原始图像进行过滤,能够避免将低质量的训练样本存储至数据集,从而有效提高第一大语言模型的训练质量;通过将第二掩模生成模型生成的参考掩模作为补充策略,过滤不合适的标注掩模,以应对第一掩模生成模型在生成过程中可能出现的错误传播问题。Based on this, the original image and the corresponding reference name text are input into the second mask generation model for mask prediction. By expanding the semantic scope of the object of interest through the reference name text, the quality of the reference mask generated by the second mask generation model can be improved. Then, the degree of matching between the annotation mask and the corresponding reference mask is determined. When the target matching degree is less than the preset matching degree threshold, it means that the first mask generation model may have generated a low-quality annotation mask. By eliminating the original image corresponding to the target matching degree, it is equivalent to filtering the original image, which can avoid storing low-quality training samples in the data set, thereby effectively improving the training quality of the first language model. By using the reference mask generated by the second mask generation model as a supplementary strategy, inappropriate annotation masks are filtered to deal with the error propagation problem that may occur in the generation process of the first mask generation model.

具体地,与标记区域的处理方式类似,各个标注掩模与对应的参考掩模之间的匹配程度也可通过计算交并比或者中心距离等方式确定。Specifically, similar to the processing method of the marked area, the matching degree between each annotation mask and the corresponding reference mask can also be determined by calculating the intersection-over-union ratio or the center distance.
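As an illustration of the filtering step above, the following is a minimal Python sketch of IoU-based mask matching and threshold filtering; the dictionary keys, the helper names and the threshold value 0.5 are assumptions for illustration, not values fixed by the present disclosure:

import numpy as np

def mask_iou(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    # Intersection-over-union between two binary masks of the same shape.
    intersection = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return float(intersection) / float(union) if union > 0 else 0.0

def filter_by_mask_match(samples, iou_threshold=0.5):
    # Keep a sample only when its annotation mask (from the first mask
    # generation model) sufficiently matches its reference mask (from the
    # second mask generation model); otherwise the original image is removed.
    return [s for s in samples
            if mask_iou(s["annotation_mask"], s["reference_mask"]) >= iou_threshold]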

在一种可能的实现方式中,除了将参考名称文本输入至第二掩模生成模型之外,还可将链接实体的名称文本、描述文本、类别文本等内容输入至第二掩模生成模型,能够进一步提高第二掩模生成模型生成参考掩模的质量。In one possible implementation, in addition to inputting the reference name text into the second mask generation model, the name text, description text, category text and other contents of the linked entity may also be input into the second mask generation model, which can further improve the quality of the reference mask generated by the second mask generation model.

在一种可能的实现方式中,针对不同的任务目标、数据处理需求或者应用场景,可将各个原始图像以及对应查询文本划分至不同的原始子集,假设原始子集包括实体子集和查询子集,实体子集的任务目标是让模型能够从原始图像中识别出具体的实体,实体子集的查询文本通常用于描述应从图像中识别的实体类型,例如,原始图像展示了一只猫,查询文本可为“识别图中的动物”;而查询子集的任务目标是让模型既能识别图像中的实体,还能理解查询文本的语境和意图,例如,原始图像展示了一个戴着发带的女孩,查询文本可为“小女孩的头发上戴着什么”。In one possible implementation, each original image and the corresponding query text can be divided into different original subsets according to different task objectives, data processing requirements or application scenarios. Assume that the original subset includes an entity subset and a query subset. The task goal of the entity subset is to enable the model to recognize specific entities from the original image. The query text of the entity subset is usually used to describe the type of entity that should be recognized from the image. For example, the original image shows a cat, and the query text can be "identify the animal in the image"; and the task goal of the query subset is to enable the model to not only recognize the entities in the image, but also understand the context and intent of the query text. For example, the original image shows a girl wearing a headband, and the query text can be "what is the little girl wearing in her hair".

具体地,原始子集还可包括人类标注子集,人类标注子集的任务目标可以与实体子集或者查询子集的任务目标相同,人类标注子集内的原始图像经过人工筛选,人类标注子集内的查询文本由人工设定,能够确保人类标注子集的质量。Specifically, the original subset may also include a human-annotated subset. The task objective of the human-annotated subset may be the same as the task objective of the entity subset or the query subset. The original images in the human-annotated subset are manually screened, and the query text in the human-annotated subset is manually set, which can ensure the quality of the human-annotated subset.

在一种可能的实现方式中,在确定各个标注掩模与对应的参考掩模之间的匹配程度,得到各个标注掩模对应的目标匹配度之后,标识生成方法还包括:当目标匹配度大于或者等于匹配度阈值,且原始图像以及对应的查询文本来自实体子集时,无需调整标注掩模;或者,当目标匹配度大于或者等于匹配度阈值,且原始图像以及对应的查询文本来自查询子集时,将各个标注掩模分别替换为对应的参考掩模。In one possible implementation, after determining the degree of matching between each annotation mask and the corresponding reference mask and obtaining the target matching degree corresponding to each annotation mask, the identification generation method also includes: when the target matching degree is greater than or equal to the matching degree threshold, and the original image and the corresponding query text are from the entity subset, there is no need to adjust the annotation mask; or, when the target matching degree is greater than or equal to the matching degree threshold, and the original image and the corresponding query text are from the query subset, each annotation mask is replaced with the corresponding reference mask.

基于此,能够提高实体子集中各个原始图像的标注掩模的准确性,还能够提高查询子集中各个原始图像的标注掩模的准确性。Based on this, the accuracy of the annotation mask of each original image in the entity subset can be improved, and the accuracy of the annotation mask of each original image in the query subset can also be improved.

在一种可能的实现方式中,将各个原始图像、对应的标注掩模以及对应的链接实体关联存储至数据集之前,标识生成方法还包括:统计各个标注掩模中连通区域的数量,得到各个标注掩模对应的区域数量;当区域数量大于预设的数量阈值时,剔除区域数量对应的原始图像。In a possible implementation, before each original image, the corresponding annotation mask, and the corresponding link entity are associated and stored in the data set, the identification generation method also includes: counting the number of connected areas in each annotation mask to obtain the number of areas corresponding to each annotation mask; when the number of areas is greater than a preset number threshold, discarding the original image corresponding to the number of areas.

其中,连通区域是由标注掩模中具有相同像素值并彼此相邻的像素点所构成的区域,标注掩模可为二值图像,连通区域内像素点的像素值为1,连通区域之外的区域内像素点的像素值为0,连通区域相当于标注掩模中的关注区域。Among them, a connected area is an area composed of pixels with the same pixel value that are adjacent to each other in the annotation mask. The annotation mask can be a binary image, where pixels inside a connected area have a value of 1 and pixels outside connected areas have a value of 0. A connected area is equivalent to a region of interest in the annotation mask.

通常情况下,一个对象在图像中所占的区域大部分或者全部是连通的,因此,当连通区域的数量过多时,代表标注掩模可能指示了原始图像中的多个对象,即标注掩模无法准确地仅指示原始图像中的关注对象。Usually, most or all of the area occupied by an object in an image is connected. Therefore, when the number of connected areas is too large, it means that the annotation mask may indicate multiple objects in the original image, that is, the annotation mask cannot accurately indicate only the object of interest in the original image.

基于此,当区域数量大于数量阈值时,可认为连通区域的数量过多,将该标注掩模定义为低质量的掩模,通过剔除对应的原始图像,能够避免将低质量的训练样本存储至数据集,从而有效提高第一大语言模型的训练质量。Based on this, when the number of regions is greater than the quantity threshold, it can be considered that the number of connected regions is too large, and the annotation mask is defined as a low-quality mask. By eliminating the corresponding original image, it is possible to avoid storing low-quality training samples in the data set, thereby effectively improving the training quality of the first language model.

需要说明的是,数量阈值可通过训练后第二回归模型的预测得到或者通过多次试验确定,本公开实施例在此不作限定。It should be noted that the quantity threshold can be obtained through the prediction of the second regression model after training or determined through multiple experiments, and the embodiments of the present disclosure are not limited thereto.
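The connected-region check described above can be sketched as follows, assuming binary NumPy masks and using SciPy's connected-component labeling; 8-connectivity and the threshold of 5 regions are illustrative assumptions:

import numpy as np
from scipy import ndimage

def count_connected_regions(mask: np.ndarray) -> int:
    # Label 8-connected foreground regions and return how many were found.
    structure = np.ones((3, 3), dtype=int)
    _, num_regions = ndimage.label(mask, structure=structure)
    return num_regions

def filter_by_region_count(samples, region_threshold=5):
    # Discard original images whose annotation mask splits into too many
    # connected regions, since such masks likely cover several objects.
    return [s for s in samples
            if count_connected_regions(s["annotation_mask"]) <= region_threshold]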

在一种可能的实现方式中,剔除原始图像相当于过滤训练样本,由于不属于视觉实体的关注对象无法有效地被标注掩模指示,所以除了基于区域数量过滤训练样本之外,可利用过滤器对训练样本进行过滤,例如,通过文本过滤器对查询文本进行过滤,能够剔除关注对象不属于视觉实体的原始图像,比如原始图像的关注对象是某个会议,就剔除该原始图像。In one possible implementation, eliminating original images is equivalent to filtering training samples. Since the objects of interest that do not belong to visual entities cannot be effectively indicated by the annotation mask, in addition to filtering training samples based on the number of regions, filters can be used to filter training samples. For example, by filtering the query text through a text filter, original images whose objects of interest do not belong to visual entities can be eliminated. For example, if the object of interest of the original image is a certain meeting, the original image is eliminated.

具体地,参考图7,图7为本公开实施例提供的多种样本集中实体类别的一种可选的分布示意图。Specifically, referring to FIG. 7 , FIG. 7 is a schematic diagram of an optional distribution of entity categories in various sample sets provided by an embodiment of the present disclosure.

其中,将由过滤前且未携带标注掩模的所有训练样本构成的样本集定义为初始样本集,以及将过滤后的所有训练样本构成的样本集定义为优化样本集,可见,在各个实体类别中,优化样本集的样本数量小于初始样本集的样本数量,特别是在地点、建筑和体育类别中,原始图像中的关注对象通常不属于视觉实体,利用过滤器进行过滤,使得优化样本集的样本数量远小于初始样本集的样本数量。Among them, the sample set composed of all training samples before filtering and without annotation masks is defined as the initial sample set, and the sample set composed of all training samples after filtering is defined as the optimized sample set. It can be seen that in each entity category, the number of samples in the optimized sample set is smaller than the number of samples in the initial sample set, especially in the categories of place, building and sports. The objects of interest in the original image usually do not belong to visual entities. Filtering is performed using filters, so that the number of samples in the optimized sample set is much smaller than the number of samples in the initial sample set.

参考图8,图8为本公开实施例提供的优化样本集中实体类别的一种可选的饼状示意图。Refer to FIG8 , which is an optional pie-shaped diagram of entity categories in the optimized sample set provided by an embodiment of the present disclosure.

可见,在优化样本集中,其他类别、动物类别以及植物类别的占比是相对较大的,这些类别的关注对象通常属于视觉实体,属于视觉实体的关注对象能够有效地被标注掩模指示,因此,通过优化样本集中的训练样本,能够对第一大语言模型进行有效训练。It can be seen that in the optimized sample set, the proportion of other categories, animal categories and plant categories is relatively large. The objects of interest in these categories usually belong to visual entities, and the objects of interest belonging to visual entities can be effectively indicated by the annotation mask. Therefore, by optimizing the training samples in the sample set, the first language model can be effectively trained.

另外,参考图9,图9为本公开实施例提供的标注掩模的面积比的一种可选的分布示意图。In addition, refer to FIG. 9 , which is a schematic diagram of an optional distribution of area ratios of the annotation masks provided in an embodiment of the present disclosure.

其中,标注掩模的面积比具体为标注掩模的面积与对应的原始图像的面积之间的比值,可见,面积比的分布曲线整体平滑,而当面积比超过95%时,频率会略有上升,这是由于部分原始图像中存在密集的物体群,例如,原始图像包括成群的植被,在标注过程中,会将其视为一个连贯的对象,利用标注掩模指示成群的植被。Among them, the area ratio of the annotation mask is specifically the ratio between the area of the annotation mask and the area of the corresponding original image. It can be seen that the distribution curve of the area ratio is smooth as a whole, and when the area ratio exceeds 95%, the frequency will increase slightly. This is because there are dense groups of objects in some original images. For example, the original image includes groups of vegetation. During the annotation process, it will be regarded as a coherent object, and the annotation mask will be used to indicate the groups of vegetation.

下面对初始样本集以及优化样本集的一种可选的数据统计情况进行详细描述。An optional data statistics of the initial sample set and the optimized sample set is described in detail below.

对于初始样本集,数据统计情况如下表1所示:For the initial sample set, the data statistics are shown in Table 1 below:

表1Table 1

由表1可知,初始样本集可包括训练集、验证集、测试集以及人工标注集,初始样本集包含了5214965张图像的5245421个标注,共覆盖了20077个实体,其中,人工标注集是指人为标注的数据集,“可见”是指图像中的关注对象通常属于视觉实体,“不可见”是指图像中的关注对象通常不属于视觉实体。As can be seen from Table 1, the initial sample set may include a training set, a validation set, a test set, and a manually annotated set. The initial sample set contains 5,214,965 images with 5,245,421 annotations, covering a total of 20,077 entities. The manually annotated set refers to a manually annotated data set, “visible” means that the object of interest in the image usually belongs to a visual entity, and “invisible” means that the object of interest in the image usually does not belong to a visual entity.

对于优化样本集,数据统计情况如下表2所示:For the optimized sample set, the data statistics are shown in Table 2 below:

表2Table 2

由表2可知,优化样本集包含了1965145个携带有标注掩模的训练样本,能够对第一大语言模型进行有效训练。As can be seen from Table 2, the optimized sample set contains 1,965,145 training samples with labeled masks, which can effectively train the first language model.

在一种可能的实现方式中,获取样本图像所链接的样本实体的第一实体标识,具体可以是获取样本图像所链接的样本实体的样本名称文本,对样本名称文本进行分词得到多个第一分词;确定各个第一分词在知识库中的出现频率,基于各个出现频率由小至大的顺序,对各个第一分词进行排序,将排列在前L位的第一分词确定为第二分词;基于各个第二分词,确定样本实体的第一实体标识。In a possible implementation, the first entity identifier of the sample entity linked to the sample image is obtained, specifically, the sample name text of the sample entity linked to the sample image is obtained, and the sample name text is segmented to obtain multiple first participles; the frequency of occurrence of each first participle in the knowledge base is determined, and the first participles are sorted in order from small to large based on the order of their occurrence frequencies, and the first participles arranged in the first L positions are determined as second participles; based on each second participle, the first entity identifier of the sample entity is determined.

其中,样本名称文本用于指示样本实体且具有可读性,可通过多种方式对样本名称文本进行分词,例如将样本名称文本输入至文本分词器进行分词,或者按照固定的长度对样本名称文本进行分词,本公开实施例在此不作限定。Among them, the sample name text is used to indicate the sample entity and is readable. The sample name text can be segmented in a variety of ways, such as inputting the sample name text into a text segmenter for segmentation, or segmenting the sample name text according to a fixed length, which is not limited in the embodiments of the present disclosure.

其中,知识库包括多个候选实体以及对应的候选名称文本,候选名称文本为知识库中的文本语料,在确定第一分词在知识库中的出现频率时,可先对知识库中所有文本语料进行分词,然后通过所有分词结果构建分词并集,再将第一分词在分词并集中的出现频率作为第一分词在知识库中的出现频率。可选地,文本语料除了包含候选名称文本以外,也可包含候选实体的描述文本,或者其他文本,本公开实施例在此不作限定。Wherein, the knowledge base includes multiple candidate entities and corresponding candidate name texts, and the candidate name texts are text corpora in the knowledge base. When determining the frequency of occurrence of the first participle in the knowledge base, all text corpora in the knowledge base can first be segmented, a participle union is then constructed from all segmentation results, and the frequency of occurrence of the first participle in the participle union is used as its frequency of occurrence in the knowledge base. Optionally, in addition to the candidate name texts, the text corpora may also include description texts of the candidate entities, or other texts, which are not limited in the embodiments of the present disclosure.

其中,L为正整数,基于各个出现频率由小至大的顺序,对各个第一分词进行排序,将排列在前L位的第一分词确定为第二分词,相当于将各个第一分词按照在分词并集中的出现频率升序后取前L个分词作为第二分词;通常情况下,L小于或者等于第一分词的数量,当L大于第一分词的数量时,将所有排序后的第一分词确定为第二分词即可,L的取值通常较小,例如,L可取4,能够提高第一大语言模型的解码效率。Wherein, L is a positive integer, and the first participles are sorted in ascending order of their occurrence frequencies, with the first L participles determined as second participles; this is equivalent to sorting the first participles in ascending order of their occurrence frequency in the participle union and taking the first L participles as second participles. Usually, L is less than or equal to the number of first participles; when L is greater than the number of first participles, all sorted first participles are simply determined as second participles. The value of L is usually small, for example, L can be 4, which can improve the decoding efficiency of the first language model.

基于此,对样本名称文本进行分词得到第二分词,能够将长文本的样本名称文本分解为更小的单元,使得第一大语言模型能够更有效地学习和理解语言的结构和意义,还能够提高处理效率,通过将各个第一分词按照在分词并集中的出现频率升序后取前L个分词,相当于选择出现频率最低的分词,使得由所有第二分词确定的第一实体标识更具独特性,能够减少混淆,而且出现频率越低的第二分词排在越前面,使得第一大语言模型能够先解码出更具区分性的结果,从而提高实体标识的预测准确率;另外,由于第一实体标识最多由L个第二分词确定,所以实体标识的长度有限,能够提高第一大语言模型的解码效率。Based on this, the sample name text is segmented to obtain the second segmentation, which can decompose the long text sample name text into smaller units, so that the first language model can learn and understand the structure and meaning of the language more effectively, and can also improve the processing efficiency. By taking the first L segmentations after sorting the first segmentations in ascending order of the frequency of occurrence in the segmentation union, it is equivalent to selecting the segmentation with the lowest frequency of occurrence, so that the first entity identifier determined by all the second segmentations is more unique, which can reduce confusion, and the second segmentation with a lower frequency of occurrence is ranked in the front, so that the first language model can decode a more distinctive result first, thereby improving the prediction accuracy of the entity identifier; in addition, since the first entity identifier is determined by at most L second segmentations, the length of the entity identifier is limited, which can improve the decoding efficiency of the first language model.

具体地,第一实体标识可包括一个样本标识符或者多个样本标识符;样本标识符可为分词或者分词的索引,样本标识符也可为其他形式,本公开实施例在此不作限定。Specifically, the first entity identifier may include one sample identifier or multiple sample identifiers; the sample identifier may be a word segment or an index of a word segment, and the sample identifier may also be in other forms, which is not limited in the embodiments of the present disclosure.

以样本标识符具体为分词为例,当第一实体标识为由多个样本标识符组成的序列时,第一实体标识的确定公式如下:Taking the sample identifier as a word segment as an example, when the first entity identifier is a sequence composed of multiple sample identifiers, the formula for determining the first entity identifier is as follows:

$$I_e=\mathrm{Asc}_L\big(\mathcal{T}(n_e);\,\mathcal{V}\big),\qquad \mathcal{V}=\bigcup_{e_i\in\mathcal{E}}\mathcal{T}\big(n_{e_i}\big)$$

其中,$I_e$为第一实体标识,$\mathcal{T}$为文本分词器,$n_e$为样本名称文本,$\mathcal{T}(n_e)$为多个第一分词,$\mathcal{E}$为知识库,$e_i$为知识库中的第$i$个候选实体,$\mathcal{T}(n_{e_i})$是指对第$i$个候选实体的候选名称文本进行分词,$\mathcal{V}$是指知识库中所有候选实体的候选名称文本的分词结果的分词并集,$\mathrm{Asc}_L$用于按照词频升序后取前$L$个分词,即$\mathrm{Asc}_L(\mathcal{T}(n_e);\mathcal{V})$是指各个第一分词按照在分词并集$\mathcal{V}$中的出现频率升序后取前$L$个分词,即包括各个第二分词,各个第二分词均可作为样本标识符,因此,第一实体标识$I_e$由所有样本标识符组成。Wherein, $I_e$ is the first entity identifier, $\mathcal{T}$ is the text tokenizer, $n_e$ is the sample name text, $\mathcal{T}(n_e)$ denotes the multiple first participles, $\mathcal{E}$ is the knowledge base, $e_i$ is the $i$-th candidate entity in the knowledge base, $\mathcal{T}(n_{e_i})$ denotes segmenting the candidate name text of the $i$-th candidate entity, $\mathcal{V}$ is the participle union of the segmentation results of the candidate name texts of all candidate entities in the knowledge base, and $\mathrm{Asc}_L$ takes the first $L$ participles in ascending order of word frequency; that is, $\mathrm{Asc}_L(\mathcal{T}(n_e);\mathcal{V})$ takes the first $L$ first participles in ascending order of their occurrence frequency in the participle union $\mathcal{V}$, i.e., the second participles, each of which can serve as a sample identifier, so the first entity identifier $I_e$ consists of all sample identifiers.

示例性地,假设样本名称文本为Golf course,对样本名称文本进行分词得到三个第一分词,分别为[_G]、[olf]和[_course],基于各个出现频率由小至大的顺序,对各个第一分词进行排序,得到[_course][olf][_G],假设L等于3,那么第一实体标识为[_course][olf][_G]。For example, assuming that the sample name text is Golf course, the sample name text is segmented to obtain three first participles, namely [_G], [olf] and [_course]. Based on the order of their occurrence frequency from small to large, the first participles are sorted to obtain [_course][olf][_G]. Assuming that L is equal to 3, the first entity identifier is [_course][olf][_G].
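A minimal sketch of this identifier construction, assuming a tokenizer callable and a list of candidate name texts from the knowledge base; the toy frequencies in the usage example are invented to reproduce the Golf course ordering above:

from collections import Counter

def build_token_frequencies(kb_name_texts, tokenize):
    # Token frequencies over the tokenized candidate name texts (the union corpus).
    frequencies = Counter()
    for text in kb_name_texts:
        frequencies.update(tokenize(text))
    return frequencies

def entity_identifier(name_text, frequencies, tokenize, L=4):
    # Sort the first participles by ascending corpus frequency and keep the
    # first L, so the rarest (most distinctive) tokens lead the identifier.
    tokens = tokenize(name_text)
    return sorted(tokens, key=lambda t: frequencies[t])[:L]

# Usage with assumed frequencies, matching the example above:
toy_frequencies = Counter({"_G": 900, "olf": 40, "_course": 7})
print(entity_identifier("Golf course", toy_frequencies,
                        tokenize=lambda s: ["_G", "olf", "_course"], L=3))
# -> ['_course', 'olf', '_G']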

在一种可能的实现方式中,将第二图像特征、样本视觉特征以及样本位置特征拼接后输入至第一大语言模型进行文本预测,生成预测概率分布,具体可以是构建用于提示第一大语言模型生成实体标识的提示文本;提取提示文本的文本特征,将第二图像特征、文本特征、样本视觉特征以及样本位置特征拼接后输入至第一大语言模型进行文本预测,生成预测概率分布。In a possible implementation, the second image features, sample visual features, and sample position features are spliced and input into the first large language model for text prediction to generate a predicted probability distribution. Specifically, a prompt text is constructed to prompt the first large language model to generate an entity identifier; text features of the prompt text are extracted, and the second image features, text features, sample visual features, and sample position features are spliced and input into the first large language model for text prediction to generate a predicted probability distribution.

具体地,基于第一大语言模型生成的预测概率分布能够确定每个样本标识符对应的预测标识符,所有预测标识符能够组成预测实体标识,预测标识符的确定公式如下:Specifically, the predicted probability distribution generated based on the first language model can determine the predicted identifier corresponding to each sample identifier, and all predicted identifiers can form a predicted entity identifier. The formula for determining the predicted identifier is as follows:

$$\hat{y}_t=\arg\max_{y}\;P_{\Phi}\big(y\,\big|\,X_q,\,X_v,\,X_m,\,\mathbf{W}(\hat{y}_{<t})\big)$$

其中,$\hat{y}_t$为预测实体标识中第$t$个位置的预测标识符,$\hat{y}_{<t}$为预测实体标识中第$t$个位置之前的预测标识符,$X_q$为文本特征,$X_v$为第二图像特征,$X_m$为样本视觉特征以及样本位置特征,$\mathbf{W}$用于指示词嵌入矩阵,$\mathbf{W}(\hat{y}_{<t})$用于指示基于词嵌入矩阵$\mathbf{W}$对$\hat{y}_{<t}$进行词嵌入处理,即预测实体标识中第$t$个位置之前的预测标识符在词嵌入矩阵中对应的词嵌入向量,$P_{\Phi}$用于指示第一大语言模型,第一大语言模型能够基于输入的$X_q$、$X_v$、$X_m$以及$\mathbf{W}(\hat{y}_{<t})$,生成预测实体标识中第$t$个位置的概率分布,进而基于概率分布预测出预测实体标识中第$t$个位置的预测标识符,然后再将预测得到的预测标识符对应的词嵌入向量输入第一大语言模型,以供第一大语言模型预测出下一个位置的预测标识符,直至满足预设的结束条件,所有预测得到的预测标识符能够组成预测实体标识。Wherein, $\hat{y}_t$ is the predicted identifier at the $t$-th position of the predicted entity identifier, $\hat{y}_{<t}$ denotes the predicted identifiers before the $t$-th position, $X_q$ is the text feature, $X_v$ is the second image feature, $X_m$ denotes the sample visual features and sample position features, $\mathbf{W}$ denotes the word embedding matrix, and $\mathbf{W}(\hat{y}_{<t})$ denotes performing word embedding on $\hat{y}_{<t}$ based on $\mathbf{W}$, i.e., the word embedding vectors corresponding to the predicted identifiers before the $t$-th position; $P_{\Phi}$ denotes the first large language model, which generates the probability distribution of the $t$-th position of the predicted entity identifier based on the input $X_q$, $X_v$, $X_m$ and $\mathbf{W}(\hat{y}_{<t})$, and then predicts the identifier at the $t$-th position based on this probability distribution; the word embedding vector corresponding to the newly predicted identifier is then fed back into the first large language model to predict the identifier at the next position, until a preset end condition is met, and all predicted identifiers form the predicted entity identifier.
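The autoregressive decoding loop implied by the formula can be sketched as follows; the llm and embed callables, the end-of-sequence identifier and greedy (argmax) decoding are assumptions for illustration:

import numpy as np

def greedy_decode(llm, embed, text_feat, image_feat, region_feats,
                  eos_id, max_len=4):
    # Predict identifiers position by position: at each step the model
    # receives the text feature, the second image feature, the sample visual
    # and position features, and the embeddings of identifiers predicted so far.
    predicted = []
    while len(predicted) < max_len:
        probs = llm(text_feat, image_feat, region_feats, embed(predicted))
        next_id = int(np.argmax(probs))  # most probable next identifier
        if next_id == eos_id:            # assumed preset end condition
            break
        predicted.append(next_id)
    return predicted  # the predicted identifiers form the predicted entity identifier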

下面详细说明标识生成方法的完整过程,完整过程包括训练阶段和推理阶段。The complete process of the identification generation method is described in detail below. The complete process includes a training phase and an inference phase.

下面对训练阶段进行详细描述。The training phase is described in detail below.

参照图10,图10为本公开实施例提供的训练阶段的一种可选的构架示意图。Refer to FIG. 10 , which is a schematic diagram of an optional architecture of the training phase provided in an embodiment of the present disclosure.

首先,获取多个原始图像以及各个原始图像对应的查询文本,根据各个原始图像以及对应的查询文本,分别确定各个原始图像对应的识别信息,其中,查询文本用于提示识别出对应的原始图像中的关注对象;获取多个候选实体,基于各个识别信息,分别在多个候选实体中确定各个原始图像对应的链接实体。First, a plurality of original images and query texts corresponding to the respective original images are obtained, and identification information corresponding to the respective original images is determined respectively according to the respective original images and the corresponding query texts, wherein the query texts are used to prompt the identification of the objects of interest in the corresponding original images; a plurality of candidate entities are obtained, and based on the respective identification information, the link entities corresponding to the respective original images are determined respectively among the plurality of candidate entities.

然后,将各个查询文本分别输入至第二大语言模型进行文本预测,生成各个原始图像对应的概括文本;分别基于各个原始图像和对应的概括文本进行对象检测,生成各个原始图像对应的原始边界框,其中,原始边界框用于指示对应的关注对象;分别将各个原始图像和对应的原始边界框输入至第一掩模生成模型进行掩模预测,生成各个原始图像对应的标注掩模,其中,标注掩模用于指示对应的关注对象。Then, each query text is input into the second largest language model for text prediction to generate a summary text corresponding to each original image; object detection is performed based on each original image and the corresponding summary text to generate an original bounding box corresponding to each original image, wherein the original bounding box is used to indicate the corresponding object of interest; each original image and the corresponding original bounding box is input into the first mask generation model for mask prediction to generate a labeled mask corresponding to each original image, wherein the labeled mask is used to indicate the corresponding object of interest.

然后,获取各个链接实体对应的参考名称文本,其中,参考名称文本用于指示参考实体的名称,参考实体在知识库中的层级高于链接实体在知识库中的层级;分别将各个原始图像和对应的参考名称文本输入至第二掩模生成模型进行掩模预测,生成各个原始图像对应的参考掩模;确定各个标注掩模与对应的参考掩模之间的匹配程度,得到各个标注掩模对应的目标匹配度;当目标匹配度小于预设的匹配度阈值时,剔除目标匹配度对应的原始图像。Then, obtain the reference name text corresponding to each linked entity, wherein the reference name text is used to indicate the name of the reference entity, and the level of the reference entity in the knowledge base is higher than the level of the linked entity in the knowledge base; input each original image and the corresponding reference name text into the second mask generation model for mask prediction, and generate a reference mask corresponding to each original image; determine the degree of matching between each annotated mask and the corresponding reference mask, and obtain the target matching degree corresponding to each annotated mask; when the target matching degree is less than the preset matching degree threshold, eliminate the original image corresponding to the target matching degree.

然后,统计各个标注掩模中连通区域的数量,得到各个标注掩模对应的区域数量;当区域数量大于预设的数量阈值时,剔除区域数量对应的原始图像;然后,将各个原始图像、对应的标注掩模以及对应的链接实体关联存储至数据集,其中,样本图像从多个原始图像中采样得到,第二掩模为样本图像对应的标注掩模,样本实体为样本图像所链接的链接实体。Then, the number of connected areas in each annotation mask is counted to obtain the number of areas corresponding to each annotation mask; when the number of areas is greater than a preset threshold, the original image corresponding to the number of areas is eliminated; then, each original image, the corresponding annotation mask and the corresponding linked entity are associated and stored in a data set, wherein the sample image is sampled from multiple original images, the second mask is the annotation mask corresponding to the sample image, and the sample entity is the linked entity linked to the sample image.

然后,获取样本图像以及样本图像中样本对象对应的第二掩模,对样本图像进行分割,得到样本图像中各个视觉对象对应的第一掩模,其中,样本对象为多个视觉对象中的一个对象,例如,可将样本图像输入至目标掩模生成模型进行全景分割,得到多个第一掩模;Then, a sample image and a second mask corresponding to a sample object in the sample image are obtained, and the sample image is segmented to obtain a first mask corresponding to each visual object in the sample image, wherein the sample object is one of the multiple visual objects. For example, the sample image can be input into a target mask generation model for panoramic segmentation to obtain multiple first masks;

然后,提取样本图像的第二图像特征,例如,可将样本图像输入至视觉编码器进行多层级特征提取,得到多层级编码特征;将多层级编码特征输入至第一多层感知器进行映射,得到第二图像特征;Then, extract the second image feature of the sample image. For example, the sample image can be input into a visual encoder for multi-level feature extraction to obtain multi-level coding features; the multi-level coding features are input into the first multi-layer perceptron for mapping to obtain the second image feature;

然后,对样本图像分别进行基于第二掩模以及各个第一掩模的特征提取,得到多个样本视觉特征,提取第二掩模以及各个第一掩模对应的样本位置特征,例如,可将多层级编码特征、第二掩模以及各个第一掩模输入至掩模感知视觉提取器,基于掩模感知视觉提取器对多层级编码特征进行特征提取,得到多个样本视觉特征,以及通过位置编码得到第二掩模以及各个第一掩模对应的多个样本位置特征。Then, feature extraction is performed on the sample image based on the second mask and each first mask to obtain a plurality of sample visual features, and sample position features corresponding to the second mask and each first mask are extracted. For example, the multi-level coding features, the second mask and each first mask can be input into a mask-aware visual extractor, and feature extraction is performed on the multi-level coding features based on the mask-aware visual extractor to obtain a plurality of sample visual features, and a plurality of sample position features corresponding to the second mask and each first mask are obtained through position coding.

然后,构建用于提示第一大语言模型生成实体标识的提示文本;提取提示文本的文本特征;将第二图像特征、提示文本的文本特征、样本视觉特征以及样本位置特征拼接后输入至第一大语言模型进行文本预测,生成预测概率分布,其中,预测概率分布用于确定样本对象的实体标识;Then, a prompt text is constructed to prompt the first language model to generate an entity identifier; text features of the prompt text are extracted; the second image features, the text features of the prompt text, the sample visual features, and the sample position features are spliced and input into the first language model for text prediction to generate a prediction probability distribution, wherein the prediction probability distribution is used to determine the entity identifier of the sample object;

然后,获取样本图像所链接的样本实体的样本名称文本,对样本名称文本进行分词得到多个第一分词;确定各个第一分词在知识库中的出现频率,基于各个出现频率由小至大的顺序,对各个第一分词进行排序,将排列在前L位的第一分词确定为第二分词,其中,L为正整数;基于各个第二分词,确定样本实体的第一实体标识;基于预测概率分布与第一实体标识确定模型损失,基于模型损失训练第一大语言模型,其中,样本实体用于指示样本对象。Then, a sample name text of a sample entity linked to the sample image is obtained, and the sample name text is segmented to obtain a plurality of first segmentations; the occurrence frequency of each first segmentation in the knowledge base is determined, and each first segmentation is sorted in order of the occurrence frequency from small to large, and the first segmentations arranged in the first L positions are determined as second segmentations, where L is a positive integer; based on each second segmentation, a first entity identifier of the sample entity is determined; a model loss is determined based on the predicted probability distribution and the first entity identifier, and a first language model is trained based on the model loss, where the sample entity is used to indicate the sample object.

下面对推理阶段进行详细描述。The inference phase is described in detail below.

首先,获取目标图像以及目标图像的提示标记;对目标图像进行分割,得到各个候选对象对应的局部掩模。其中,目标图像包括多个候选对象,提示标记用于指示多个候选对象中被选择的目标对象。First, a target image and a prompt mark of the target image are obtained, and the target image is segmented to obtain a local mask corresponding to each candidate object. The target image includes multiple candidate objects, and the prompt mark is used to indicate the selected target object among the multiple candidate objects.

然后,当提示标记为标记点时,基于标记点与各个局部掩模之间的位置关系,在各个局部掩模中确定查询掩模;或者,当提示标记为标记框时,基于标记框与各个局部掩模的掩模边界之间的匹配程度,在各个局部掩模中确定查询掩模;或者,当提示标记为标记区域时,基于标记区域与各个局部掩模之间的匹配程度,在各个局部掩模中确定查询掩模。其中,查询掩模用于指示多个候选对象中被选择的目标对象。Then, when the hint mark is a mark point, the query mask is determined in each local mask based on the positional relationship between the mark point and each local mask; or, when the hint mark is a mark box, the query mask is determined in each local mask based on the matching degree between the mark box and the mask boundary of each local mask; or, when the hint mark is a mark area, the query mask is determined in each local mask based on the matching degree between the mark area and each local mask. The query mask is used to indicate the target object selected from multiple candidate objects.
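One way to realize the three prompt cases above is sketched below; the prompt encoding, the containment test for points, the bounding-box IoU for boxes and the mask IoU for regions are illustrative assumptions:

import numpy as np

def mask_iou(a, b):
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0

def box_iou(a, b):
    # IoU of two (x0, y0, x1, y1) boxes.
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def mask_bbox(mask):
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1

def select_query_mask(local_masks, prompt):
    # prompt is assumed to be ("point", (x, y)), ("box", (x0, y0, x1, y1)),
    # or ("region", binary_mask).
    kind, value = prompt
    if kind == "point":
        x, y = value
        hits = [m for m in local_masks if m[y, x]]  # masks containing the point
        return min(hits, key=np.count_nonzero) if hits else None
    if kind == "box":
        return max(local_masks, key=lambda m: box_iou(mask_bbox(m), value))
    if kind == "region":
        return max(local_masks, key=lambda m: mask_iou(m, value))
    raise ValueError(f"unknown prompt type: {kind}")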

然后,对目标图像进行多层级特征提取,得到目标图像的多层级视觉特征;分别基于查询掩模以及各个局部掩模,对多层级视觉特征进行掩模池化,得到多个多层级池化特征。Then, multi-level feature extraction is performed on the target image to obtain multi-level visual features of the target image; mask pooling is performed on the multi-level visual features based on the query mask and each local mask to obtain multiple multi-level pooling features.

然后,对于任意一个多层级池化特征,分别对多层级池化特征中各个层级的子特征进行映射,得到多个维度相同的中间特征,将各个中间特征进行特征融合,得到融合特征;分别对各个融合特征进行多层感知处理,得到多个掩模视觉特征;提取查询掩模以及各个局部掩模对应的掩模位置特征。Then, for any multi-level pooling feature, the sub-features of each level in the multi-level pooling feature are mapped respectively to obtain multiple intermediate features of the same dimension, and the intermediate features are fused to obtain fused features; multi-layer perception processing is performed on each fused feature to obtain multiple mask visual features; and the mask position features corresponding to the query mask and each local mask are extracted.
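One possible PyTorch realization of the mask pooling, per-level mapping, fusion and multi-layer perception steps above; the hidden dimension, the GELU activation and summation-based fusion are assumptions, not details fixed by the present disclosure:

import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskVisualFeatureExtractor(nn.Module):
    def __init__(self, level_dims, hidden_dim=256):
        super().__init__()
        # One linear mapping per feature level, so the intermediate features
        # of all levels share the same dimension before fusion.
        self.proj = nn.ModuleList([nn.Linear(d, hidden_dim) for d in level_dims])
        self.mlp = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.GELU(),
                                 nn.Linear(hidden_dim, hidden_dim))

    def forward(self, level_feats, mask):
        # level_feats: list of (C_l, H_l, W_l) tensors; mask: (H, W) binary tensor.
        fused = 0.0
        for proj, feat in zip(self.proj, level_feats):
            m = F.interpolate(mask[None, None].float(),
                              size=feat.shape[-2:], mode="nearest")[0, 0]
            pooled = (feat * m).flatten(1).sum(dim=1) / m.sum().clamp(min=1.0)
            fused = fused + proj(pooled)  # map to a shared dimension and fuse
        return self.mlp(fused)            # the mask visual feature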

然后,将各个掩模视觉特征分别与对应的掩模位置特征进行拼接,得到多个区域特征;提取目标图像的第一图像特征,分别确定各个局部掩模的掩模面积,按照掩模面积的大小顺序,将各个局部掩模对应的区域特征进行拼接,得到第一拼接特征。Then, each mask visual feature is spliced with the corresponding mask position feature to obtain multiple regional features; the first image feature of the target image is extracted, the mask area of each local mask is determined respectively, and the regional features corresponding to each local mask are spliced in order of the size of the mask area to obtain the first splicing feature.
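A short sketch of the area-ordered concatenation, assuming each region feature is a 1-D tensor; descending mask area is assumed as the ordering, and the prompt-text feature added in the next step is omitted here:

import torch

def build_first_concat(local_masks, region_feats):
    # Order the local-region features by mask area (largest first, assumed)
    # and concatenate them into the first concatenated feature.
    order = sorted(range(len(local_masks)),
                   key=lambda i: int(local_masks[i].sum()), reverse=True)
    return torch.cat([region_feats[i] for i in order], dim=0)

def build_target_concat(image_feat, first_concat, query_region_feat):
    # Target concatenated feature: the first image feature, then the ordered
    # local-region features, then the region feature of the query mask.
    return torch.cat([image_feat, first_concat, query_region_feat], dim=0)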

然后,构建用于提示第一大语言模型生成实体标识的提示文本;提取提示文本的文本特征,将第一图像特征、文本特征、第一拼接特征以及查询掩模对应的区域特征进行拼接,得到目标拼接特征;基于目标拼接特征进行文本预测,生成目标对象的目标实体标识。Then, a prompt text is constructed to prompt the first language model to generate an entity identifier; text features of the prompt text are extracted, and the first image features, text features, first splicing features, and regional features corresponding to the query mask are spliced to obtain target splicing features; text prediction is performed based on the target splicing features to generate a target entity identifier of the target object.

基于此,通过获取目标图像中各个候选对象对应的局部掩模,以及确定用于指示目标对象的查询掩模,进而通过特征提取得到查询掩模以及各个局部掩模对应的掩模视觉特征,并提取查询掩模以及各个局部掩模对应的掩模位置特征,掩模视觉特征能够捕捉到相应对象所在局部区域的像素级视觉信息,掩模位置特征能够捕捉到相应对象所在局部区域的像素级位置信息,然后将各个掩模视觉特征分别与对应的掩模位置特征进行拼接,得到各个局部区域的区域特征,相当于将局部区域的像素级视觉信息和对应的像素级位置信息组合为像素级区域信息,然后提取目标图像的第一图像特征,并将第一图像特征以及多个区域特征拼接为目标拼接特征,然后基于目标拼接特征进行文本预测,生成目标对象的目标实体标识,在文本预测过程中,通过对目标拼接特征中的各个特征进行交互,既能关注第一图像特征所捕捉的全局视觉信息,又能关注各个候选对象对应的像素级区域信息,还能关注目标对象对应的像素级区域信息,因此,能够有效提高对目标图像的全局图像特征以及像素级细节的理解,从而提高目标实体标识的预测准确率,另外,将像素级的查询掩模作为视觉提示,能够高效、灵活且准确地指代目标对象,从而进一步提高目标实体标识的预测准确率。Based on this, by obtaining the local masks corresponding to each candidate object in the target image, and determining the query mask used to indicate the target object, and then obtaining the query mask and the mask visual features corresponding to each local mask through feature extraction, and extracting the query mask and the mask position features corresponding to each local mask, the mask visual features can capture the pixel-level visual information of the local area where the corresponding object is located, and the mask position features can capture the pixel-level position information of the local area where the corresponding object is located, and then each mask visual feature is spliced with the corresponding mask position features respectively to obtain the regional features of each local area, which is equivalent to combining the pixel-level visual information of the local area and the corresponding pixel-level position information into pixel-level regional information, and then extracting the first image of the target image. Features, and splice the first image features and multiple regional features into target splicing features, and then perform text prediction based on the target splicing features to generate a target entity identifier of the target object. In the text prediction process, by interacting with each feature in the target splicing feature, it is possible to pay attention to both the global visual information captured by the first image feature and the pixel-level regional information corresponding to each candidate object and the pixel-level regional information corresponding to the target object. Therefore, it is possible to effectively improve the understanding of the global image features and pixel-level details of the target image, thereby improving the prediction accuracy of the target entity identifier. In addition, using the pixel-level query mask as a visual cue can efficiently, flexibly and accurately refer to the target object, thereby further improving the prediction accuracy of the target entity identifier.

下面先对本公开提供的模型的训练过程进行详细描述。The training process of the model provided by the present invention is first described in detail below.

本公开提供的第一大语言模型可经过两阶段的训练,在第一训练阶段中,通过包含200万样本的初始样本集对第一大语言模型进行预训练,在第二训练阶段中,通过优化样本集中的实体子集以及查询子集对预训练后的第一大语言模型进行微调。The first large language model provided by the present disclosure can be trained in two stages. In the first training stage, the first large language model is pre-trained using an initial sample set containing 2 million samples. In the second training stage, the pre-trained first large language model is fine-tuned by optimizing the entity subset and query subset in the sample set.

由于优化数据集包含约450万个训练样本,所以在微调过程中,考虑到计算资源的限制和优化数据集的巨大规模,可将每个实体对应的训练样本的数量限制在50个以内,使得在实体子集以及查询子集中仅使用大约7%的训练样本,总数量约30万个,能够有效节省计算资源,提高训练效率。此外,所有输入图像的尺寸都统一预处理为512×512,能够确保训练的一致性,将实体标识的长度限制为4,避免实体标识的长度过长,从而提高训练效率。Since the optimized dataset contains about 4.5 million training samples, during fine-tuning, considering the limitation of computing resources and the huge size of the optimized dataset, the number of training samples per entity can be capped at 50, so that only about 7% of the training samples in the entity subset and query subset are used, about 300,000 in total, which effectively saves computing resources and improves training efficiency. In addition, all input images are uniformly preprocessed to 512×512 to ensure training consistency, and the length of entity identifiers is limited to 4 to avoid overly long identifiers, thereby improving training efficiency.
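The per-entity sample cap mentioned above could be implemented along these lines; random selection within each entity and the entity_id key are assumptions for illustration:

import random
from collections import defaultdict

def cap_samples_per_entity(samples, cap=50, seed=0):
    # Group training samples by their linked entity and keep at most `cap`
    # randomly chosen samples per entity.
    rng = random.Random(seed)
    groups = defaultdict(list)
    for sample in samples:
        groups[sample["entity_id"]].append(sample)
    capped = []
    for group in groups.values():
        rng.shuffle(group)
        capped.extend(group[:cap])
    return capped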

然后,下面对本公开提供的模型与其他模型的评估过程进行详细描述。Then, the evaluation process of the model provided by the present disclosure and other models is described in detail below.

评估所用的数据集可分为验证集和测试集,评估结果如下表3所示:The data set used for evaluation can be divided into a validation set and a test set. The evaluation results are shown in Table 3 below:

表3Table 3

其中,“无”表示未使用提示来引用视觉信息,是指基于检索的判别模型,是指生成模型,是指无需微调的零样本模型,“全部”所在列的准确率是由对应的实体子集以及查询子集所构成的数据集所确定的准确率;相较于本公开提供的模型,参考模型的区别在于将图像中的关注对象所对应的掩模的视觉特征以及位置特征输入至第一大语言模型,而未将图像中的其他对象所对应的掩模的视觉特征以及位置特征输入至第一大语言模型,参考模型-微调是指微调后的参考模型。Among them, "None" means that no cue is used to refer to visual information, refers to the retrieval-based discriminant model, is the generative model, It refers to a zero-sample model that does not require fine-tuning. The accuracy of the column where "all" is located is the accuracy determined by the data set composed of the corresponding entity subset and the query subset. Compared with the model provided in the present disclosure, the reference model is different in that the visual features and position features of the mask corresponding to the object of interest in the image are input into the first largest language model, while the visual features and position features of the masks corresponding to other objects in the image are not input into the first largest language model. The reference model-fine-tuning refers to the reference model after fine-tuning.

具体地,表3展示了不同提示类型的视觉语言(Vision-Language)模型在验证集和测试集上的准确率结果,将准确率作为衡量验证集和测试集上模型表现的关键指标,对于每个数据集,评估模型在实体子集以及查询子集上的表现,并计算各个子集中所有样本的总体准确率作为最终评估依据。Specifically, Table 3 shows the accuracy results of the vision-language model with different prompt types on the validation set and the test set. The accuracy is used as the key indicator to measure the performance of the model on the validation set and the test set. For each data set, the performance of the model on the entity subset and the query subset is evaluated, and the overall accuracy of all samples in each subset is calculated as the final evaluation basis.

另外,考虑到零样本推理模型在生成实体标识以及处理特定领域的实体名称文本等方面存在挑战,需要使用BM25检索来处理生成的结果,具体来说,需要搜索知识库中600万条实体名称文本,然后选择最相近的搜索结果作为计算准确率的基础。In addition, considering that zero-shot inference models face challenges in generating entity identifiers and processing entity name texts in specific domains, BM25 retrieval is needed to process the generated results. Specifically, the 6 million entity name texts in the knowledge base are searched, and the most similar search result is selected as the basis for calculating accuracy.
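The BM25 post-processing step can be sketched with the rank_bm25 package; whitespace tokenization and lowercasing are simplifying assumptions:

from rank_bm25 import BM25Okapi  # pip install rank-bm25

def nearest_entity_name(generated_text, entity_names):
    # Retrieve the entity name text closest to the generated result, which
    # is then used as the basis for computing accuracy.
    corpus_tokens = [name.lower().split() for name in entity_names]
    bm25 = BM25Okapi(corpus_tokens)
    query_tokens = generated_text.lower().split()
    return bm25.get_top_n(query_tokens, entity_names, n=1)[0]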

由表3可知,相较于基于文本提示的视觉语言模型,本公开提供的模型在实体子集上取得了显著的提升,性能差异在-2.0%至11.3%,这个评估结果表明,通过精细化的视觉特征建模,可以缓解缺乏文本先验所带来的挑战。As can be seen from Table 3, compared with the visual language model based on text cues, the model provided by the present disclosure has achieved significant improvement on the entity subset, with the performance difference ranging from -2.0% to 11.3%. This evaluation result shows that the challenges brought about by the lack of text priors can be alleviated through refined visual feature modeling.

然后,相较于部分基于文本提示的对比模型,本公开提供的模型在查询子集上的性能存在22%至42%的差距,可认为这是由于查询子集主要来源于视觉问答(VQA)而造成的,视觉问答的问题通常不仅涉及对视觉信息的引用,而且通常包含额外的查询意图,例如,"由……制成"、"产于"、"需要多少水"等,这些情况需要文本来表达用户意图和进行进一步推理,超出了VEL的范围。本公开提供的模型是基于视觉掩模的参考提示,所以无法较好地覆盖这些问题。基于此,对于查询子集中的数据,需要采用第二大语言模型提取其引用表达式,并将其替换为视觉掩模标记,以保留查询意图以外的参考信息。Then, compared with some text-prompt-based baseline models, the performance of the model provided in the present disclosure on the query subset shows a gap of 22% to 42%. This can be attributed to the fact that the query subset mainly comes from visual question answering (VQA): VQA questions usually involve not only references to visual information but also additional query intents, such as "made of...", "produced in", "how much water is needed", etc. These cases require text to express user intent and perform further reasoning, which is beyond the scope of VEL. The model provided in the present disclosure uses reference prompts based on visual masks, so it cannot cover these questions well. Based on this, for the data in the query subset, the second large language model needs to be used to extract the referring expressions and replace them with visual mask tokens, so as to retain the reference information other than the query intent.

然后,下面对本公开提供的模型的消融实验进行详细描述。Then, the ablation experiment of the model provided by the present disclosure is described in detail below.

消融实验结果如下表4所示:The ablation experiment results are shown in Table 4 below:

表4Table 4

其中,PT是指对模型进行预训练,FT是指对模型进行微调;相较于本公开提供的模型,参考模型的区别在于将图像中的关注对象所对应的掩模的视觉特征以及位置特征输入至第一大语言模型,而未将图像中的其他对象所对应的掩模的视觉特征以及位置特征输入至第一大语言模型,因此,参考模型可以视为在本公开提供的模型中去除了处理其他对象所对应的掩模的相应组件的模型。Among them, PT refers to pre-training the model, and FT refers to fine-tuning the model. Compared with the model provided in the present disclosure, the reference model differs in that the visual features and position features of the mask corresponding to the object of interest in the image are input into the first large language model, while the visual features and position features of the masks corresponding to other objects in the image are not input into the first large language model. Therefore, the reference model can be regarded as a model in which the corresponding components of processing masks corresponding to other objects are removed from the model provided in the present disclosure.

具体地,表4展示了本公开提供的模型的消融实验结果,能够评估视觉语义标记和训练的有效性,实验结果显示,引入细粒度的局部视觉特征,能够显著提高模型的准确率,在实体子集上的准确率增加了3.7%至5.0%,在查询子集上的准确率增加了3.5%至5.5%。此外,微调也显著提高了模型的整体准确率,而预训练的影响相对较小,预训练的改进幅度在0.1%至1.6%。Specifically, Table 4 shows the ablation experiment results of the model provided by the present disclosure, which can evaluate the effectiveness of visual semantic labeling and training. The experimental results show that the introduction of fine-grained local visual features can significantly improve the accuracy of the model, with the accuracy on the entity subset increased by 3.7% to 5.0%, and the accuracy on the query subset increased by 3.5% to 5.5%. In addition, fine-tuning also significantly improves the overall accuracy of the model, while the impact of pre-training is relatively small, with the improvement of pre-training ranging from 0.1% to 1.6%.

基于此,能够确定本公开提供的模型在预训练阶段中的成功主要是由于:构建了较大规模的预训练数据集,例如,预训练数据集包含5500万个样本;以及使用Generative Image-to-text Transformer(GIT)作为骨干(具有4亿参数),并结合随机初始化的文本解码器。通过结合有限的模型参数和原始预训练策略,有助于提高本公开提供的模型的预训练效果。Based on this, it can be determined that the success of the model provided in the present disclosure in the pre-training stage is mainly due to: the construction of a large-scale pre-training dataset, for example, a pre-training dataset containing 55 million samples; and the use of the Generative Image-to-text Transformer (GIT) as the backbone (with 400 million parameters), combined with a randomly initialized text decoder. Combining limited model parameters with the original pre-training strategy helps to improve the pre-training performance of the model provided in the present disclosure.

然后,下面对本公开提供的模型的泛化能力进行详细描述。Then, the generalization ability of the model provided by the present disclosure is described in detail below.

表5Table 5

其中,表中的对比模型分别涵盖基于检索的判别模型、生成模型以及无需微调的零样本模型;"可见"是指图像中的关注对象通常属于视觉实体,"不可见"是指图像中的关注对象通常不属于视觉实体;相较于本公开提供的模型,参考模型的区别在于将图像中的关注对象所对应的掩模的视觉特征以及位置特征输入至第一大语言模型,而未将图像中的其他对象所对应的掩模的视觉特征以及位置特征输入至第一大语言模型;参考模型-微调是指微调后的参考模型。Among them, the compared models in the table cover retrieval-based discriminative models, generative models, and zero-shot models that require no fine-tuning; "visible" means that the object of interest in the image usually belongs to a visual entity, and "invisible" means that it usually does not; compared with the model provided in the present disclosure, the reference model differs in that only the visual features and position features of the mask corresponding to the object of interest in the image are input into the first large language model, while those of the masks corresponding to other objects are not; "reference model-fine-tuned" refers to the reference model after fine-tuning.

具体地,表5展示了不同模型在可见数据子集以及不可见数据子集上的准确率结果,由于本公开提供的模型缺乏文本先验,所以在确定泛化能力时是将本公开提供的模型与基于文本提示的模型进行比较。Specifically, Table 5 shows the accuracy results of different models on the visible data subset and the invisible data subset. Since the model provided by the present disclosure lacks text priors, its generalization ability is determined by comparing it with models based on text prompts.

其中,作为对比的基于文本提示的VEL模型使用基于CLIP的编码器检索候选实体,然后通过基于MLLM的候选前缀树约束解码生成最终实体,这两个阶段通过多任务目标进行端到端优化。由于结合了检索增强和解码生成,所以该模型在可见实体子集(约30%)以及不可见实体子集(约10%)上的准确率都相对较高。Among them, the compared text-prompt-based VEL model retrieves candidate entities using a CLIP-based encoder, and then generates the final entity through MLLM-based constrained decoding over a candidate prefix tree; the two stages are optimized end-to-end through a multi-task objective. By combining retrieval augmentation and decoding-based generation, this model achieves relatively high accuracy on both the visible entity subset (about 30%) and the invisible entity subset (about 10%).

值得注意的是,该模型在去除检索增强以及约束解码生成后的变体,其性能与本公开提供的模型非常接近,准确率的差异范围在-0.2%至+0.5%,这表明检索增强可能是提高模型泛化能力的有效途径,因此,在第一大语言模型的训练过程中,需要从多个候选实体中确定各个原始图像对应的链接实体时,可引入检索增强的方式,从而提高第一大语言模型的泛化能力。It is worth noting that the variant of this model without retrieval augmentation and constrained decoding performs very close to the model provided in the present disclosure, with accuracy differences ranging from -0.2% to +0.5%. This indicates that retrieval augmentation may be an effective way to improve the generalization ability of the model. Therefore, in the training process of the first large language model, when the linked entity corresponding to each original image needs to be determined from multiple candidate entities, retrieval augmentation can be introduced to improve the generalization ability of the first large language model.

可见,本公开实施例提供的标识生成方法可以应用于多种领域。It can be seen that the identification generation method provided by the embodiment of the present disclosure can be applied to various fields.

例如,在智慧交通领域中,能够通过车辆传感器采集到道路图像,可将道路图像作为目标图像,然后,获取目标图像中各个候选对象对应的局部掩模,在多个局部掩模中确定查询掩模,其中,候选对象可为行人、车辆、交通标志等等,查询掩模用于指示多个候选对象中被选择的目标对象,例如,司机或者乘客能够与车载系统进行交互,以供车载系统获取目标图像的提示标记,提示标记用于指示多个候选对象中被选择的目标对象;对目标图像分别进行基于查询掩模以及各个局部掩模的特征提取,得到多个掩模视觉特征,提取查询掩模以及各个局部掩模对应的掩模位置特征;将各个掩模视觉特征分别与对应的掩模位置特征进行拼接,得到多个区域特征;提取目标图像的第一图像特征,将第一图像特征以及多个区域特征进行拼接,得到目标拼接特征;基于目标拼接特征进行文本预测,生成目标对象的目标实体标识;能够有效提高对目标图像的全局图像特征以及像素级细节的理解,从而提高目标实体标识的预测准确率;然后,可准确地将目标对象与目标实体标识所指示的实体进行链接,实现视觉实体链接,进而能够查询到被链接实体在知识库中的关联信息,从而帮助车载系统更充分地理解目标对象的信息。For example, in the field of smart transportation, road images can be collected through vehicle sensors, and the road images can be used as target images. Then, local masks corresponding to each candidate object in the target image are obtained, and a query mask is determined from multiple local masks, wherein the candidate objects may be pedestrians, vehicles, traffic signs, etc., and the query mask is used to indicate the target object selected from multiple candidate objects. For example, the driver or passenger can interact with the on-board system so that the on-board system can obtain a prompt mark of the target image, and the prompt mark is used to indicate the target object selected from multiple candidate objects; feature extraction based on the query mask and each local mask is performed on the target image to obtain multiple mask visual features, and the query mask and each local mask are extracted. corresponding mask position features; splicing each mask visual feature with the corresponding mask position feature to obtain multiple regional features; extracting the first image feature of the target image, splicing the first image feature and the multiple regional features to obtain the target splicing feature; performing text prediction based on the target splicing feature to generate the target entity identification of the target object; being able to effectively improve the understanding of the global image features and pixel-level details of the target image, thereby improving the prediction accuracy of the target entity identification; then, being able to accurately link the target object with the entity indicated by the target entity identification to realize visual entity linking, and then being able to query the associated information of the linked entity in the knowledge base, thereby helping the vehicle-mounted system to more fully understand the information of the target object.

又例如,在医学领域中,能够通过医学仪器采集到医学图像,例如,医学图像包括磁共振成像图像、病理图像和计算机断层扫描图像等等,然后,获取目标图像中各个候选对象对应的局部掩模,在多个局部掩模中确定查询掩模,其中,候选对象可为器官、组织等等,查询掩模用于指示多个候选对象中被选择的目标对象,例如,医生能够与医学系统进行交互,以供医学系统获取目标图像的提示标记,提示标记用于指示多个候选对象中被选择的目标对象;对目标图像分别进行基于查询掩模以及各个局部掩模的特征提取,得到多个掩模视觉特征,提取查询掩模以及各个局部掩模对应的掩模位置特征;将各个掩模视觉特征分别与对应的掩模位置特征进行拼接,得到多个区域特征;提取目标图像的第一图像特征,将第一图像特征以及多个区域特征进行拼接,得到目标拼接特征;基于目标拼接特征进行文本预测,生成目标对象的目标实体标识;能够有效提高对目标图像的全局图像特征以及像素级细节的理解,从而提高目标实体标识的预测准确率;然后,可准确地将目标对象与目标实体标识所指示的实体进行链接,实现视觉实体链接,进而能够查询到被链接实体在知识库中的关联信息,从而帮助医学系统更充分地理解目标对象的信息。For another example, in the medical field, medical images can be collected by medical instruments. For example, medical images include magnetic resonance imaging images, pathological images, and computer tomography images, etc. Then, local masks corresponding to each candidate object in the target image are obtained, and a query mask is determined from multiple local masks, wherein the candidate objects may be organs, tissues, etc., and the query mask is used to indicate the target object selected from multiple candidate objects. For example, a doctor can interact with a medical system so that the medical system can obtain a prompt mark of the target image, and the prompt mark is used to indicate the target object selected from multiple candidate objects; feature extraction is performed on the target image based on the query mask and each local mask, respectively, to obtain multiple mask visual features, and the query mask and each local mask are extracted. mask position features corresponding to the local mask; splicing each mask visual feature with the corresponding mask position feature to obtain multiple regional features; extracting the first image feature of the target image, splicing the first image feature and multiple regional features to obtain the target splicing feature; performing text prediction based on the target splicing feature to generate the target entity identification of the target object; being able to effectively improve the understanding of the global image features and pixel-level details of the target image, thereby improving the prediction accuracy of the target entity identification; then, being able to accurately link the target object with the entity indicated by the target entity identification to achieve visual entity linking, and then being able to query the associated information of the linked entity in the knowledge base, thereby helping the medical system to more fully understand the information of the target object.

可以理解的是,虽然上述各个流程图中的各个步骤按照箭头的指示依次显示,但是这些步骤并不是必然按照箭头指示的顺序依次执行。除非本实施例中有明确的说明,这些步骤的执行并没有严格的顺序限制,这些步骤可以以其它的顺序执行。而且,上述流程图中的至少一部分步骤可以包括多个步骤或者多个阶段,这些步骤或者阶段并不必然是在同一时间执行完成,而是可以在不同的时间执行,这些步骤或者阶段的执行顺序也不必然是依次进行,而是可以与其它步骤或者其它步骤中的步骤或者阶段的至少一部分轮流或者交替地执行。It is to be understood that, although each step in the above-mentioned each flow chart is shown in turn according to the indication of the arrow, these steps are not necessarily performed in turn according to the order indicated by the arrow. Unless there is a clear explanation in the present embodiment, the execution of these steps does not have a strict order restriction, and these steps can be performed in other orders. Moreover, at least a portion of the steps in the above-mentioned flow chart can include a plurality of steps or a plurality of stages, and these steps or stages are not necessarily performed at the same time, but can be performed at different times, and the execution order of these steps or stages is not necessarily performed in turn, but can be performed in turn or alternately with at least a portion of the steps or stages in other steps or other steps.

参照图11,图11为本公开实施例提供的标识生成装置的一种可选的结构示意图,该标识生成装置1100包括:Referring to FIG. 11 , FIG. 11 is a schematic diagram of an optional structure of an identification generating device provided in an embodiment of the present disclosure, and the identification generating device 1100 includes:

获取模块1101,用于获取目标图像中各个候选对象对应的局部掩模,在多个局部掩模中确定查询掩模,其中,查询掩模用于指示多个候选对象中被选择的目标对象;An acquisition module 1101 is used to acquire a local mask corresponding to each candidate object in a target image, and determine a query mask from a plurality of local masks, wherein the query mask is used to indicate a target object selected from a plurality of candidate objects;

特征提取模块1102,用于对目标图像分别进行基于查询掩模以及各个局部掩模的特征提取,得到多个掩模视觉特征,提取查询掩模以及各个局部掩模对应的掩模位置特征;The feature extraction module 1102 is used to perform feature extraction on the target image based on the query mask and each local mask, obtain multiple mask visual features, and extract mask position features corresponding to the query mask and each local mask;

第一拼接模块1103,用于将各个掩模视觉特征分别与对应的掩模位置特征进行拼接,得到多个区域特征;A first splicing module 1103 is used to splice each mask visual feature with the corresponding mask position feature to obtain multiple regional features;

第二拼接模块1104,用于提取目标图像的第一图像特征,将第一图像特征以及多个区域特征进行拼接,得到目标拼接特征;The second stitching module 1104 is used to extract the first image feature of the target image, and stitch the first image feature and multiple regional features to obtain a target stitching feature;

生成模块1105,用于基于目标拼接特征进行文本预测,生成目标对象的目标实体标识。The generating module 1105 is used to perform text prediction based on the target concatenation feature and generate a target entity identifier of the target object.

进一步,上述第二拼接模块1104具体用于:Furthermore, the second splicing module 1104 is specifically used for:

分别确定各个局部掩模的掩模面积,按照掩模面积的大小顺序,将各个局部掩模对应的区域特征进行拼接,得到第一拼接特征;Determine the mask area of each local mask respectively, and splice the regional features corresponding to each local mask in order of the size of the mask area to obtain a first splicing feature;

将第一图像特征、第一拼接特征以及查询掩模对应的区域特征进行拼接,得到目标拼接特征。The first image feature, the first stitching feature, and the regional feature corresponding to the query mask are stitched together to obtain a target stitching feature.

进一步,上述第二拼接模块1104具体用于:Furthermore, the second splicing module 1104 is specifically used for:

构建用于提示第一大语言模型生成实体标识的提示文本;Constructing a prompt text for prompting the first language model to generate entity identification;

提取提示文本的文本特征,将第一图像特征、文本特征、第一拼接特征以及查询掩模对应的区域特征进行拼接,得到目标拼接特征。The text feature of the prompt text is extracted, and the first image feature, the text feature, the first splicing feature and the regional feature corresponding to the query mask are spliced to obtain the target splicing feature.

进一步,上述特征提取模块1102具体用于:Furthermore, the feature extraction module 1102 is specifically used for:

对目标图像进行多层级特征提取,得到目标图像的多层级视觉特征;Perform multi-level feature extraction on the target image to obtain multi-level visual features of the target image;

分别基于查询掩模以及各个局部掩模,对多层级视觉特征进行掩模池化,得到多个多层级池化特征;Based on the query mask and each local mask, the multi-level visual features are masked and pooled to obtain multiple multi-level pooled features;

分别对各个多层级池化特征进行特征融合,得到多个掩模视觉特征。Each multi-level pooling feature is fused separately to obtain multiple mask visual features.

进一步,上述特征提取模块1102具体用于:Furthermore, the feature extraction module 1102 is specifically used for:

对于任意一个多层级池化特征,分别对多层级池化特征中各个层级的子特征进行映射,得到多个维度相同的中间特征,将各个中间特征进行特征融合,得到融合特征;For any multi-level pooling feature, the sub-features of each level in the multi-level pooling feature are mapped respectively to obtain multiple intermediate features with the same dimensions, and the intermediate features are fused to obtain the fused feature;

分别对各个融合特征进行多层感知处理,得到多个掩模视觉特征。Multi-layer perception processing is performed on each fusion feature to obtain multiple mask visual features.

进一步,上述获取模块1101具体用于:Furthermore, the acquisition module 1101 is specifically used for:

获取目标图像以及目标图像的提示标记,其中,目标图像包括多个候选对象,提示标记用于指示多个候选对象中被选择的目标对象;Acquire a target image and a prompt mark of the target image, wherein the target image includes a plurality of candidate objects, and the prompt mark is used to indicate a target object selected from the plurality of candidate objects;

对目标图像进行分割,得到各个候选对象对应的局部掩模,基于提示标记在各个局部掩模中确定查询掩模。The target image is segmented to obtain local masks corresponding to each candidate object, and the query mask is determined in each local mask based on the prompt mark.

进一步,上述获取模块1101具体用于:Furthermore, the acquisition module 1101 is specifically used for:

当提示标记为标记点时,基于标记点与各个局部掩模之间的位置关系,在各个局部掩模中确定查询掩模;When the hint mark is a marked point, the query mask is determined in each local mask based on the positional relationship between the marked point and each local mask;

或者,当提示标记为标记框时,基于标记框与各个局部掩模的掩模边界之间的匹配程度,在各个局部掩模中确定查询掩模;Alternatively, when the hint is marked as a marker box, the query mask is determined in each local mask based on a degree of matching between the marker box and a mask boundary of each local mask;

或者,当提示标记为标记区域时,基于标记区域与各个局部掩模之间的匹配程度,在各个局部掩模中确定查询掩模。Alternatively, when the hint is marked as a labeled region, the query mask is determined in each local mask based on the matching degree between the labeled region and each local mask.

Further, the target entity identifier is generated by a first large language model, and the identifier generation apparatus further includes a training module (not shown in the figure), the training module being specifically configured to:

acquire a sample image and a second mask corresponding to a sample object in the sample image, and segment the sample image to obtain a first mask corresponding to each visual object in the sample image, wherein the sample object is one of the plurality of visual objects;

extract a second image feature of the sample image, perform feature extraction on the sample image based on the second mask and on each first mask respectively to obtain a plurality of sample visual features, and extract sample position features corresponding to the second mask and to each first mask;

concatenate the second image feature, the sample visual features, and the sample position features, and input the concatenated result into the first large language model for text prediction to generate a predicted probability distribution, wherein the predicted probability distribution is used to determine the entity identifier of the sample object; and

acquire a first entity identifier of a sample entity linked to the sample image, determine a model loss based on the predicted probability distribution and the first entity identifier, and train the first large language model based on the model loss, wherein the sample entity indicates the sample object.
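For illustration, a common instantiation of the model loss above is token-level cross-entropy between the predicted probability distribution and the tokenized first entity identifier; the embodiment does not fix the loss, so the following is a sketch under that assumption:

```python
import torch.nn.functional as F

def identifier_loss(logits, target_ids, pad_id=0):
    """Token-level cross-entropy against the first entity identifier.

    logits: (T, V) per-position scores from the first large language model.
    target_ids: (T,) token ids of the first entity identifier.
    The padding id and the choice of cross-entropy are assumptions.
    """
    return F.cross_entropy(logits, target_ids, ignore_index=pad_id)
```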

Further, the sample image, the second mask, and the first entity identifier are all obtained from a data set, and the training module is further configured to:

acquire a plurality of original images and a query text corresponding to each original image, and determine recognition information corresponding to each original image according to each original image and its corresponding query text, wherein the query text prompts recognition of an object of interest in the corresponding original image;

acquire a plurality of candidate entities and, based on each piece of recognition information, determine among the plurality of candidate entities a linked entity corresponding to each original image;

determine an annotation mask for each original image based on each original image and its corresponding query text, wherein the annotation mask indicates the corresponding object of interest; and

store each original image, its corresponding annotation mask, and its corresponding linked entity in association in the data set, wherein the sample image is sampled from the plurality of original images, the second mask is the annotation mask corresponding to the sample image, and the sample entity is the linked entity to which the sample image is linked.
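For illustration, the associated storage and sampling described above can be sketched as one record per original image; the record layout below is an assumption:

```python
import random
from dataclasses import dataclass

import numpy as np

@dataclass
class DatasetRecord:
    image: np.ndarray            # original image
    annotation_mask: np.ndarray  # indicates the object of interest
    linked_entity: str           # entity the image is linked to

def draw_training_sample(dataset):
    # The sample image is sampled from the original images; the second mask is
    # its annotation mask, and the sample entity is its linked entity.
    record = random.choice(dataset)
    return record.image, record.annotation_mask, record.linked_entity
```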

Further, the training module is specifically configured to:

input each query text into a second large language model for text prediction to generate a summary text corresponding to each original image;

perform object detection based on each original image and its corresponding summary text to generate an original bounding box corresponding to each original image, wherein the original bounding box indicates the corresponding object of interest; and

input each original image and its corresponding original bounding box into a first mask generation model for mask prediction to generate the annotation mask corresponding to each original image.
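For illustration, the three-stage annotation pipeline above (summarize, detect, segment) can be wired together as follows; summarize, detect_box, and box_to_mask are hypothetical stand-ins for the second large language model, the object detector, and the first mask generation model:

```python
def build_annotation_masks(images, query_texts, summarize, detect_box, box_to_mask):
    """Hypothetical pipeline wiring; the three callables are stand-ins for
    models the disclosure names but does not specify."""
    masks = []
    for image, query in zip(images, query_texts):
        summary = summarize(query)              # summary text from the query text
        bbox = detect_box(image, summary)       # original bounding box of the object of interest
        masks.append(box_to_mask(image, bbox))  # annotation mask predicted from the box
    return masks
```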

Further, the training module is further configured to:

acquire a reference name text corresponding to each linked entity, wherein the reference name text indicates the name of a reference entity whose level in the knowledge base is higher than the level of the linked entity in the knowledge base;

input each original image and its corresponding reference name text into a second mask generation model for mask prediction to generate a reference mask corresponding to each original image;

determine the degree of matching between each annotation mask and its corresponding reference mask to obtain a target matching degree corresponding to each annotation mask; and

when a target matching degree is less than a preset matching degree threshold, discard the original image corresponding to that target matching degree.
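For illustration, a plausible instantiation of the matching degree above is mask IoU; under that assumption, the filter can be sketched as:

```python
import numpy as np

def keep_by_reference(annotation_masks, reference_masks, threshold=0.5):
    """Indices of images whose annotation mask agrees with its reference mask.
    Mask IoU as the matching degree and the 0.5 threshold are assumptions."""
    kept = []
    for i, (ann, ref) in enumerate(zip(annotation_masks, reference_masks)):
        inter = np.logical_and(ann, ref).sum()
        union = np.logical_or(ann, ref).sum()
        if union > 0 and inter / union >= threshold:
            kept.append(i)  # below the threshold, the original image is discarded
    return kept
```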

Further, the training module is further configured to:

count the number of connected regions in each annotation mask to obtain a region count corresponding to each annotation mask; and

when a region count is greater than a preset count threshold, discard the original image corresponding to that region count.
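For illustration, counting connected regions is standard connected-component labeling; a sketch using scipy, where the count threshold is a stand-in for the preset threshold:

```python
from scipy import ndimage

def passes_region_count(annotation_mask, max_regions=3):
    """Keep an image only if its mask has at most max_regions connected regions.
    max_regions is an assumed value; the embodiment only says 'preset'."""
    _, num_regions = ndimage.label(annotation_mask)
    return num_regions <= max_regions
```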

Further, the training module is specifically configured to:

acquire a sample name text of the sample entity linked to the sample image, and tokenize the sample name text to obtain a plurality of first tokens;

determine the frequency of occurrence of each first token in the knowledge base, sort the first tokens in ascending order of frequency, and take the first L tokens in the sorted order as second tokens, where L is a positive integer; and

determine the first entity identifier of the sample entity based on the second tokens.
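For illustration, the rare-token selection above reads directly as: tokenize the name, rank tokens by ascending knowledge-base frequency, and keep the L rarest. In the sketch below, the whitespace tokenizer and the hyphen join are illustrative assumptions:

```python
from collections import Counter

def entity_identifier(name_text, kb_token_frequency: Counter, L=2):
    """Build the first entity identifier from the L rarest tokens of the name."""
    first_tokens = name_text.split()                                    # first tokens
    ranked = sorted(first_tokens, key=lambda t: kb_token_frequency[t])  # ascending frequency
    second_tokens = ranked[:L]                                          # the L rarest tokens
    return "-".join(second_tokens)                                      # first entity identifier
```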

The identifier generation apparatus 1100 described above is based on the same inventive concept as the identifier generation method. The apparatus acquires the local mask corresponding to each candidate object in the target image and determines the query mask that indicates the target object, then obtains, through feature extraction, the mask visual features corresponding to the query mask and to each local mask, and extracts the mask position features corresponding to the query mask and to each local mask. The mask visual features capture pixel-level visual information of the local region in which the corresponding object is located, and the mask position features capture pixel-level position information of that region. Each mask visual feature is then concatenated with the corresponding mask position feature to obtain the region feature of each local region, which amounts to combining the pixel-level visual information of a local region with the corresponding pixel-level position information into pixel-level region information. The first image feature of the target image is then extracted and concatenated with the plurality of region features into the target concatenated feature, and text prediction is performed based on the target concatenated feature to generate the target entity identifier of the target object. During text prediction, the interaction among the features within the target concatenated feature attends to the global visual information captured by the first image feature, to the pixel-level region information corresponding to each candidate object, and to the pixel-level region information corresponding to the target object. This effectively improves the understanding of both the global image features and the pixel-level details of the target image, thereby improving the prediction accuracy of the target entity identifier. In addition, using the pixel-level query mask as a visual prompt refers to the target object efficiently, flexibly, and accurately, further improving the prediction accuracy of the target entity identifier.

The electronic device for performing the above identifier generation method provided in the embodiments of the present disclosure may be a terminal. Referring to FIG. 12, FIG. 12 is a partial structural block diagram of a terminal provided in an embodiment of the present disclosure. The terminal includes components such as a camera assembly 1210, a first memory 1220, an input unit 1230, a display unit 1240, a sensor 1250, an audio circuit 1260, a wireless fidelity (WiFi) module 1270, a first processor 1280, and a first power supply 1290. Those skilled in the art will understand that the terminal structure shown in FIG. 12 does not limit the terminal, which may include more or fewer components than shown, combine certain components, or arrange the components differently.

The camera assembly 1210 may be used to capture images or video. Optionally, the camera assembly 1210 includes a front camera and a rear camera. Typically, the front camera is disposed on the front panel of the terminal and the rear camera is disposed on the back of the terminal. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera, or a telephoto camera, so that the main camera and the depth-of-field camera can be fused to realize a background-blur function, and the main camera and the wide-angle camera can be fused to realize panoramic shooting, VR (Virtual Reality) shooting, or other fused shooting functions.

The first memory 1220 may be used to store software programs and modules. The first processor 1280 performs the various functional applications and data processing of the terminal by running the software programs and modules stored in the first memory 1220.

The input unit 1230 may be used to receive input digit or character information and to generate key signal inputs related to the settings and function control of the terminal. Specifically, the input unit 1230 may include a touch panel 1231 and other input devices 1232.

The display unit 1240 may be used to display input information or provided information as well as the various menus of the terminal. The display unit 1240 may include a display panel 1241.

The audio circuit 1260, a speaker 1261, and a microphone 1262 may provide an audio interface.

The first power supply 1290 may be alternating current, direct current, a disposable battery, or a rechargeable battery.

There may be one or more sensors 1250, including but not limited to an acceleration sensor, a gyroscope sensor, a pressure sensor, an optical sensor, and the like, wherein:

the acceleration sensor can detect the magnitude of acceleration on the three axes of the coordinate system established by the terminal. For example, the acceleration sensor can be used to detect the components of gravitational acceleration on the three axes. The first processor 1280 can control the display unit 1240 to display the user interface in landscape or portrait view according to the gravitational acceleration signal collected by the acceleration sensor. The acceleration sensor can also be used to collect motion data for games or for the user.

The gyroscope sensor can detect the body orientation and rotation angle of the terminal and can cooperate with the acceleration sensor to capture the user's 3D actions on the terminal. Based on the data collected by the gyroscope sensor, the first processor 1280 can implement functions such as motion sensing (for example, changing the UI according to the user's tilt operation), image stabilization during shooting, game control, and inertial navigation.

The pressure sensor may be disposed on the side frame of the terminal and/or beneath the display unit 1240. When the pressure sensor is disposed on the side frame of the terminal, it can detect the user's grip signal on the terminal, and the first processor 1280 performs left/right-hand recognition or shortcut operations according to the grip signal collected by the pressure sensor. When the pressure sensor is disposed beneath the display unit 1240, the first processor 1280 controls the operable controls on the UI according to the user's pressure operations on the display unit 1240. The operable controls include at least one of a button control, a scroll-bar control, an icon control, and a menu control.

The optical sensor is used to collect the ambient light intensity. In one embodiment, the first processor 1280 can control the display brightness of the display unit 1240 according to the ambient light intensity collected by the optical sensor: when the ambient light intensity is high, the display brightness of the display unit 1240 is increased; when the ambient light intensity is low, the display brightness is decreased. In another embodiment, the first processor 1280 can also dynamically adjust the shooting parameters of the camera assembly 1210 according to the ambient light intensity collected by the optical sensor.

In this embodiment, the first processor 1280 included in the terminal can perform the identifier generation method of the foregoing embodiments.

The electronic device for performing the above identifier generation method provided in the embodiments of the present disclosure may also be a server. Referring to FIG. 13, FIG. 13 is a partial structural block diagram of a server provided in an embodiment of the present disclosure. The server may vary considerably in configuration or performance and may include one or more second processors 1310, a second memory 1330, and one or more storage media 1340 (for example, one or more mass storage devices) storing application programs 1343 or data 1342. The second memory 1330 and the storage medium 1340 may be transient storage or persistent storage. The programs stored in the storage medium 1340 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the server. Furthermore, the second processor 1310 may be configured to communicate with the storage medium 1340 and to execute, on the server, the series of instruction operations in the storage medium 1340.

The server may also include one or more second power supplies 1320, one or more wired or wireless network interfaces 1350, one or more input/output interfaces 1360, and/or one or more operating systems 1341, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.

The second processor 1310 in the server may be used to perform the identifier generation method.

An embodiment of the present disclosure further provides a computer-readable storage medium storing a computer program, the computer program being used to perform the identifier generation method of each of the foregoing embodiments.

An embodiment of the present disclosure further provides a computer program product, which includes a computer program stored in a computer-readable storage medium. A processor of a computer device reads the computer program from the computer-readable storage medium and executes it, so that the computer device performs the identifier generation method described above.

The terms "first", "second", "third", "fourth", and the like (if any) in the specification of the present disclosure and in the above drawings are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of the present disclosure described herein can be implemented, for example, in orders other than those illustrated or described herein. Moreover, the terms "comprise" and "have" and any variants thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or apparatus that includes a series of steps or units is not necessarily limited to those steps or units expressly listed, but may include other steps or units not expressly listed or inherent to such a process, method, product, or apparatus.

It should be understood that, in the present disclosure, "at least one (item)" means one or more, and "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may indicate three cases: only A exists, only B exists, and both A and B exist, where A and B may be singular or plural. The character "/" generally indicates an "or" relationship between the associated objects before and after it. "At least one of the following" or similar expressions refers to any combination of these items, including any combination of single items or plural items. For example, at least one of a, b, or c may represent: a; b; c; "a and b"; "a and c"; "b and c"; or "a and b and c", where each of a, b, and c may be singular or plural.

It should be understood that, in the description of the embodiments of the present disclosure, "a plurality of (items)" means two or more, terms such as "greater than", "less than", and "exceeding" are understood to exclude the stated number, and terms such as "above", "below", and "within" are understood to include the stated number.

In the several embodiments provided in the present disclosure, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division into units is merely a division by logical function, and there may be other divisions in actual implementation; for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not performed. Furthermore, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.

Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.

If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present disclosure, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods of the embodiments of the present disclosure. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

It should also be understood that the various implementations provided in the embodiments of the present disclosure may be combined arbitrarily to achieve different technical effects.

The above is a detailed description of the preferred implementations of the present disclosure, but the present disclosure is not limited to the above implementations. Those skilled in the art may make various equivalent modifications or substitutions without departing from the spirit of the present disclosure, and these equivalent modifications or substitutions are all included within the scope defined by the claims of the present disclosure.
