CN118629042B

Movatterモバイル変換

Info

Publication number: CN118629042B
Application number: CN202411111303.0A
Authority: CN
Inventors: 徐正斐; 刘庆斌; 李丽丽; 郝彦超; 李博; 陈曦
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2024-08-14
Filing date: 2024-08-14
Publication date: 2024-12-03
Anticipated expiration: 2044-08-14
Also published as: CN118629042A

Abstract

The embodiment of the disclosure discloses a method, a device, electronic equipment and a storage medium for generating an identifier, wherein the method comprises the steps of obtaining local masks corresponding to candidate objects in a target image, determining query masks in the local masks, wherein the query masks are used for indicating selected target objects in the candidate objects, respectively extracting features based on the query masks and the local masks to the target image to obtain a plurality of mask visual features, respectively extracting mask position features corresponding to the query masks and the local masks, respectively splicing the mask visual features with the corresponding mask position features to obtain a plurality of region features, respectively extracting first image features of the target image, splicing the first image features and the region features to obtain target splicing features, and performing text prediction based on the target splicing features to generate target entity identifiers of the target objects.

Description

Identification generation method and device, electronic equipment and storage medium

Technical Field

The disclosure relates to the field of computer technology, and in particular, to a method and a device for generating an identifier, an electronic device and a storage medium.

Background

Visual entity linking refers to matching objects in an image with corresponding entities in a knowledge base. In the related art, an image to be identified is generally encoded into a global image feature, while query text describing a target object in the image to be identified is used as a visual cue, and then an entity identifier corresponding to the image is predicted based on the global image feature and the text feature of the query text. However, global image features typically ignore local details in the image, resulting in loss of information, thereby reducing the prediction accuracy of entity identification.

Disclosure of Invention

The following is a summary of the subject matter of the detailed description of the present disclosure. This summary is not intended to limit the scope of the claims.

The embodiment of the disclosure provides a method, a device, electronic equipment and a storage medium for generating an identifier, which can improve the prediction accuracy of a target entity identifier.

In one aspect, an embodiment of the present disclosure provides a method for generating an identifier, including:

Obtaining local masks corresponding to each candidate object in a target image, and determining a query mask in a plurality of the local masks, wherein the query mask is used for indicating a selected target object in the plurality of candidate objects;

Extracting the characteristics of the query mask and each local mask based on the target image to obtain a plurality of mask visual characteristics, and extracting mask position characteristics corresponding to the query mask and each local mask;

splicing each mask visual feature with the corresponding mask position feature to obtain a plurality of region features;

Extracting a first image feature of the target image, and splicing the first image feature and a plurality of the region features to obtain a target splicing feature;

and carrying out text prediction based on the target splicing characteristics to generate a target entity identifier of the target object.

On the other hand, the embodiment of the disclosure also provides an identifier generating device, which comprises:

The acquisition module is used for acquiring local masks corresponding to each candidate object in the target image, and determining a query mask in a plurality of the local masks, wherein the query mask is used for indicating a selected target object in the plurality of candidate objects;

The feature extraction module is used for extracting features of the target image based on the query mask and the local masks respectively to obtain a plurality of mask visual features, and extracting mask position features corresponding to the query mask and the local masks;

The first splicing module is used for splicing each mask visual feature with the corresponding mask position feature to obtain a plurality of region features;

The second splicing module is used for extracting first image features of the target image, and splicing the first image features and the plurality of region features to obtain target splicing features;

and the generation module is used for carrying out text prediction based on the target splicing characteristics and generating a target entity identifier of the target object.

Further, the second splicing module is specifically configured to:

Determining the mask area of each local mask respectively, and splicing the region features corresponding to each local mask according to the size sequence of the mask areas to obtain first splicing features;

And splicing the first image feature, the first splicing feature and the region feature corresponding to the query mask to obtain a target splicing feature.

Further, the second splicing module is specifically configured to:

constructing a prompt text for prompting the first large language model to generate entity identification;

And extracting text features of the prompt text, and splicing the first image features, the text features, the first splicing features and the region features corresponding to the query mask to obtain target splicing features.

Further, the feature extraction module is specifically configured to:

extracting multi-level features of the target image to obtain multi-level visual features of the target image;

Mask pooling is carried out on the multi-level visual features based on the query mask and each local mask respectively, so that a plurality of multi-level pooled features are obtained;

And respectively carrying out feature fusion on each multi-level pooling feature to obtain a plurality of mask visual features.

Further, the feature extraction module is specifically configured to:

Mapping sub-features of each level in the multi-level pooling features to obtain a plurality of intermediate features with the same dimension, and carrying out feature fusion on each intermediate feature to obtain fusion features;

And respectively carrying out multi-layer perception processing on each fusion feature to obtain a plurality of mask visual features.

Further, the above-mentioned acquisition module is specifically configured to:

Acquiring a target image and a prompt mark of the target image, wherein the target image comprises a plurality of candidate objects, and the prompt mark is used for indicating a selected target object in the plurality of candidate objects;

and dividing the target image to obtain local masks corresponding to the candidate objects, and determining a query mask in the local masks based on the prompt marks.

Further, the above-mentioned acquisition module is specifically configured to:

When the prompt marks are marked points, determining query masks in the local masks based on the position relation between the marked points and the local masks;

Or when the prompt mark is a mark frame, determining a query mask in each local mask based on the matching degree between the mark frame and the mask boundary of each local mask;

or when the prompt mark is a mark area, determining a query mask in each local mask based on the matching degree between the mark area and each local mask.

Further, the target entity identifier is generated by a first large language model, and the identifier generating device further comprises a training module, wherein the training module is specifically configured to:

Obtaining a sample image and a second mask corresponding to a sample object in the sample image, and dividing the sample image to obtain a first mask corresponding to each visual object in the sample image, wherein the sample object is one object of a plurality of visual objects;

Extracting second image features of the sample image, extracting features of the second mask and the first masks based on the sample image to obtain a plurality of sample visual features, and extracting sample position features corresponding to the second mask and the first masks;

the second image features, the sample visual features and the sample position features are spliced and then input into the first large language model for text prediction, and a prediction probability distribution is generated, wherein the prediction probability distribution is used for determining entity identification of the sample object;

and acquiring a first entity identifier of a sample entity linked with the sample image, determining a model loss based on the prediction probability distribution and the first entity identifier, and training the first large language model based on the model loss, wherein the sample entity is used for indicating the sample object.

Further, the sample image, the second mask, and the first entity identification are all acquired from a dataset, and the training module is further configured to:

Acquiring a plurality of original images and query texts corresponding to the original images, and respectively determining identification information corresponding to the original images according to the original images and the corresponding query texts, wherein the query texts are used for prompting and identifying the concerned objects in the corresponding original images;

Acquiring a plurality of candidate entities, and respectively determining a link entity corresponding to each original image in the plurality of candidate entities based on each piece of identification information;

Determining a labeling mask of each original image based on each original image and the corresponding query text, wherein the labeling mask is used for indicating the corresponding object of interest;

And storing each original image, the corresponding annotation mask and the corresponding link entity in the data set in an associated mode, wherein the sample image is obtained by sampling a plurality of original images, the second mask is the annotation mask corresponding to the sample image, and the sample entity is the link entity linked by the sample image.

Further, the training module is specifically configured to:

Respectively inputting each inquiry text into a second large language model to conduct text prediction, and generating a summary text corresponding to each original image;

Performing object detection based on each original image and the corresponding generalized text respectively, and generating an original boundary box corresponding to each original image, wherein the original boundary box is used for indicating the corresponding object of interest;

And respectively inputting each original image and the corresponding original boundary box into a first mask generation model to perform mask prediction, and generating a labeling mask corresponding to each original image.

Further, the training module is further configured to:

acquiring a reference name text corresponding to each link entity, wherein the reference name text is used for indicating the name of the reference entity, and the hierarchy of the reference entity in a knowledge base is higher than that of the link entity in the knowledge base;

respectively inputting each original image and the corresponding reference name text into a second mask generation model to perform mask prediction, and generating a reference mask corresponding to each original image;

determining the matching degree between each labeling mask and the corresponding reference mask to obtain the corresponding target matching degree of each labeling mask;

And when the target matching degree is smaller than a preset matching degree threshold, eliminating the original image corresponding to the target matching degree.

Further, the training module is further configured to:

counting the number of the connected areas in each marking mask to obtain the number of the areas corresponding to each marking mask;

and when the number of the areas is larger than a preset number threshold, eliminating the original images corresponding to the number of the areas.

Further, the training module is specifically configured to:

Acquiring a sample name text of a sample entity linked with the sample image, and performing word segmentation on the sample name text to obtain a plurality of first word segments;

Determining the occurrence frequency of each first word in a knowledge base, sorting each first word based on the order of the occurrence frequency from small to large, and determining the first word arranged in the previous L bits as a second word, wherein L is a positive integer;

And determining a first entity identification of the sample entity based on each second word.

On the other hand, the embodiment of the disclosure also provides an electronic device, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the identification generation method when executing the computer program.

In another aspect, embodiments of the present disclosure further provide a computer readable storage medium storing a computer program, where the computer program is executed by a processor to implement the above-described identifier generating method.

In another aspect, the disclosed embodiments also provide a computer program product comprising a computer program stored in a computer readable storage medium. The processor of the computer device reads the computer program from the computer-readable storage medium, and the processor executes the computer program to cause the computer device to execute the identification generation method described above.

The embodiment of the disclosure at least comprises the following beneficial effects that the local masks corresponding to all candidate objects in the target image are obtained, the query mask for indicating the target object is determined, further, the query mask and mask visual features corresponding to all the local masks are obtained through feature extraction, the mask visual features can capture the pixel-level visual information of the local area where the corresponding object is located, then each mask visual feature is spliced with the corresponding mask visual features to obtain the area features of all the local areas, the area features are equivalent to combining the pixel-level visual information of all the local areas with the corresponding pixel-level visual information of the corresponding pixel-level areas to form pixel-level area information, then the first image features of the target image are extracted, the first image features and the plurality of area features are spliced to form target splicing features, then, text prediction is carried out based on the target splicing features to generate target entity identifiers of the target object, in the text prediction process, the global visual features can capture the global visual information of the local area where the corresponding to the corresponding pixel-level visual features of the corresponding to the corresponding object is located, and the target object can be further understood as the target entity-level visual information of the target object is further, in addition, the target-level visual information can be accurately understood, and the target-level image-level can be accurately predicted, and the target-level image can be compared to the target-level image is correspondingly, and can be understood, and the target-level-regarded as the target-level visual information is better, thereby further improving the prediction accuracy of the target entity identification.

Additional features and advantages of the disclosure will be set forth in the description which follows, and in part will be apparent from the description, or may be learned by practice of the disclosure.

Drawings

The accompanying drawings are included to provide a further understanding of the disclosed embodiments and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain, without limitation, the disclosed embodiments.

FIG. 1 is a schematic illustration of an alternative implementation environment provided by embodiments of the present disclosure;

FIG. 2 is a schematic flow chart of an alternative method for generating a label according to an embodiment of the disclosure;

FIG. 3 is a schematic diagram of an alternative flow for determining mask visual features provided by embodiments of the present disclosure;

FIG. 4 is a schematic flow chart of an alternative method for generating a mask according to an embodiment of the present disclosure;

FIG. 5 is a schematic flow chart of an alternative method for updating a data set according to an embodiment of the present disclosure;

FIG. 6 is a schematic diagram of an alternative architecture for rejecting original images provided by embodiments of the present disclosure;

FIG. 7 is a schematic diagram of an alternative distribution of entity categories in a plurality of sample sets provided by embodiments of the present disclosure;

FIG. 8 is an alternative pie chart of optimizing entity categories in a sample set provided by embodiments of the present disclosure;

FIG. 9 is a schematic diagram of an alternative distribution of area ratios of a labeling mask provided by embodiments of the present disclosure;

FIG. 10 is a schematic diagram of an alternative architecture for the training phase provided by embodiments of the present disclosure;

FIG. 11 is a schematic diagram of an alternative configuration of a label generating apparatus according to an embodiment of the present disclosure;

Fig. 12 is a partial block diagram of a terminal according to an embodiment of the present disclosure;

fig. 13 is a partial block diagram of a server according to an embodiment of the present disclosure.

Detailed Description

In order to make the objects, technical solutions and advantages of the present disclosure more apparent, the present disclosure will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present disclosure.

In the various embodiments of the present disclosure, when related processing is performed according to data related to characteristics of a target object, such as attribute information or attribute information set of the target object, permission or consent of the target object is obtained first, and related laws and regulations and standards are complied with for collection, use, processing, and the like of the data. Wherein the target object may be a user. In addition, when the embodiment of the present disclosure needs to acquire the attribute information of the target object, the independent permission or independent consent of the target object may be acquired through a popup window or a jump to a confirmation page, and after the independent permission or independent consent of the target object is explicitly acquired, the necessary target object related data for enabling the embodiment of the present disclosure to function normally is acquired.

In the presently disclosed embodiments, the term "module" or "unit" refers to a computer program or a portion of a computer program having a predetermined function and working with other related portions to achieve a predetermined objective, and may be implemented in whole or in part by using software, hardware (such as a processing circuit or a memory), or a combination thereof. Also, a processor (or multiple processors or memories) may be used to implement one or more modules or units. Furthermore, each module or unit may be part of an overall module or unit that incorporates the functionality of the module or unit.

In order to facilitate understanding of the technical solutions provided by the embodiments of the present disclosure, some key terms used in the embodiments of the present disclosure are explained here:

Visual ENTITY LINKING, VEL refers to matching objects in the image with corresponding entities in the knowledge base.

Based on the above, the embodiment of the disclosure provides a method, a device, an electronic device and a storage medium for generating a target entity identifier, which can improve the prediction accuracy of the target entity identifier.

Referring to fig. 1, fig. 1 is a schematic diagram of an alternative implementation environment provided in an embodiment of the present disclosure, where the implementation environment includes a terminal 101 and a server 102, where the terminal 101 and the server 102 are connected through a communication network.

The server 102 may obtain a target image sent by a terminal, then obtain a local mask corresponding to each candidate object in the target image, determine a query mask in a plurality of local masks, where the query mask is used to indicate a selected target object in the plurality of candidate objects, extract a plurality of mask visual features based on the query mask and feature extraction of each local mask, extract mask position features corresponding to the query mask and each local mask, splice each mask visual feature with corresponding mask position features to obtain a plurality of region features, extract a first image feature of the target image, splice the first image feature and the plurality of region features to obtain a target spliced feature, perform text prediction based on the target spliced feature, generate a target entity identifier of the target object, and then send the target entity identifier to the terminal 101.

The server 102 obtains local masks corresponding to each candidate object in the target image, and determines a query mask for indicating the target object, further extracts mask visual features corresponding to the query mask and each local mask through feature extraction, extracts mask position features corresponding to the query mask and each local mask, the mask visual features can capture pixel-level visual information of a local area where the corresponding object is located, the mask position features can capture pixel-level position information of the local area where the corresponding object is located, then each mask visual feature is spliced with the corresponding mask position features respectively to obtain area features of each local area, the area features are equivalent to combining pixel-level visual information and corresponding pixel-level position information of the local area into pixel-level area information, then extracts first image features of the target image, splices the first image features and the plurality of area features into target splicing features, then carries out text prediction based on the target splicing features to generate a target entity identifier of the target object, and in the text prediction process, the global visual information captured by the first image features can be focused on each feature, and then the pixel-level visual information corresponding to each candidate object can be focused on the target object, and the target entity can be further accurately predicted by the target entity, and the target entity can be further improved, and the accuracy of the target entity can be further improved.

The server 102 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs (Content Delivery Network, content delivery networks), basic cloud computing services such as big data and artificial intelligent platforms, and the like. In addition, server 102 may also be a node server in a blockchain network.

The terminal 101 may be, but is not limited to, a mobile phone, a computer, an intelligent voice interaction device, an intelligent home appliance, a vehicle-mounted terminal, and the like. The terminal 101 and the server 102 may be directly or indirectly connected through wired or wireless communication, and embodiments of the present disclosure are not limited herein.

Referring to fig. 2, fig. 2 is a schematic flowchart of an alternative method for generating a identifier, which may be executed by a server, or may be executed by a terminal, or may be executed by a server in conjunction with a terminal, according to an embodiment of the present disclosure, where the method includes, but is not limited to, the following steps 201 to 205.

Step 201, obtaining local masks corresponding to each candidate object in the target image, and determining a query mask from a plurality of local masks.

The target image refers to an image that needs to be linked by a visual entity, and the target image may include a plurality of candidate objects, for example, the candidate objects may include main objects such as animals, plants, articles, and the candidate objects may also include background objects such as natural landscapes, buildings, and city streets. The method comprises the steps of selecting a target object from a plurality of candidate objects, respectively, using each local mask to indicate the corresponding candidate object, using each local mask to indicate the region where the corresponding candidate object is located in the target image, and using the query mask to indicate the selected target object from the plurality of candidate objects, wherein the query mask can indicate the local region where the target object is located in the target image, and using the query mask to indicate the target object.

Specifically, each local mask may be a binary image with the same size as the target image, in any local mask, a region of interest formed by all pixels with pixel values of 1 is used to indicate a local region where a corresponding candidate object in the target image is located, and a non-region of interest formed by all pixels with pixel values of 0 is used to indicate the rest of the region in the target image.

Based on this, since the query mask is determined among a plurality of local masks, the query mask is determined specifically based on the selection of the relevant person, the query mask is capable of characterizing the pointing intent of the relevant person, the target object is an object selected by the relevant person, and the target object may be any one of a plurality of candidate objects.

And 202, respectively extracting features of the target image based on the query mask and each local mask to obtain a plurality of mask visual features, and extracting mask position features corresponding to the query mask and each local mask.

The method comprises the steps of respectively carrying out feature extraction based on a query mask and each local mask on a target image, specifically carrying out feature extraction based on the query mask on the target image and carrying out feature extraction based on each local mask on the target image, assuming that the number of the local masks is two, respectively carrying out feature extraction based on each local mask on the target image, specifically carrying out feature extraction based on a first local mask on the target image and carrying out feature extraction based on a second local mask on the target image, and obtaining corresponding mask visual features each time the feature extraction is carried out on the target image.

Based on the method, mask visual characteristics are obtained by extracting mask-based characteristics from the target image, which is equivalent to extracting more abstract and more informative local visual characteristics from the target image, the mask visual characteristics can capture pixel-level visual information of a local area where a corresponding object is located because the query mask refers to the target object and the local mask refers to a corresponding candidate object, and in addition, the mask position characteristics corresponding to the query mask and each local mask are extracted, which is equivalent to respectively determining spatial position relations among the masks, and the mask position characteristics can capture pixel-level position information of the local area where the corresponding object is located because the query mask refers to the target object and the local mask refers to the corresponding candidate object.

In one possible implementation manner, the mask position features corresponding to the query mask and each local mask are extracted, specifically, the query mask and each local mask are flattened and then input to a position encoder for mapping, so as to obtain the mask position features corresponding to the query mask and each local mask.

And 203, splicing each mask visual feature with the corresponding mask position feature to obtain a plurality of region features.

Each mask visual feature is spliced with a corresponding mask position feature, specifically, the mask visual feature corresponding to each local mask is spliced with a corresponding mask position feature, and the mask visual feature corresponding to the query mask is spliced with a corresponding mask position feature.

For example, for any one of the query mask and each partial mask, the mask visual feature corresponding to the mask can be spliced at the head end or the tail end of the mask position feature corresponding to the mask to obtain the region feature corresponding to the mask, so that any one of the query mask and each partial mask has the region feature corresponding to the mask, and the region feature is the combination result of the mask visual feature and the mask position feature corresponding to the mask.

Based on the above, the region characteristics of the query mask and the region characteristics corresponding to each mask in each local mask can be obtained through splicing, namely, the region characteristics of the local region where each object is located are obtained, the region characteristics of the local region are equivalent to the local characteristics of the target image, specifically, the mask visual characteristics and the mask position characteristics are spliced, namely, the pixel-level visual information and the corresponding pixel-level position information of the local region where each object is located in the target object and each candidate object are equivalent to the combination, and the pixel-level region information of the local region where each object is located can be obtained, therefore, the region characteristics of the local region where each object is located can be represented, and the understanding of the pixel-level details of the target image can be improved based on the pixel-level region information.

And 204, extracting first image features of the target image, and splicing the first image features and the plurality of region features to obtain target splicing features.

The first image feature and the plurality of region features are spliced, specifically, the plurality of region features can be spliced in sequence to obtain a splicing result, and then the first image feature is spliced at the head end of the splicing result to obtain a target splicing feature.

Based on the method, the first image feature of the target image is extracted, which is equivalent to extracting the global image feature which is more abstract and has more information content from the target image, the first image feature can capture the global visual information of the target image, the first image feature and the plurality of region features are spliced into target splicing features, and then the global visual information and the pixel-level region information can be focused based on the target splicing features.

And 205, carrying out text prediction based on the target splicing characteristics to generate a target entity identifier of the target object.

The target entity identifier is used for indicating target entities in the knowledge base, and the target entity identifier has uniqueness, namely different entities in the knowledge base can correspond to different target entity identifiers, and after the target entity identifier is generated, a target object can be linked with the target entity indicated by the target entity identifier based on the target entity identifier.

In particular, the target entity identifier may be a single identifier or may be a sequence composed of a plurality of identifiers, and embodiments of the disclosure are not limited herein, for example, the target entity identifier may be a sequence of [50,10,3], where the target entity identifier includes 3 identifiers, 50,10, and 3, respectively.

Notably, the target entity identification is used to indicate entities in a knowledge base where each entity is typically designed to represent a unique object and corresponds to a globally unique tag, different tags being able to indicate different entities, e.g., assuming that the name of an entity in the knowledge base is "golf course", the object to which the entity corresponds is golf course, which may be represented as e=q 1048XXX, e being entity (entity), Q1048XXX being tag.

Based on the above, in the text prediction process, through interaction of each feature in the target stitching feature, not only global visual information captured by the first image feature, but also pixel-level region information corresponding to each candidate object, and also pixel-level region information corresponding to the target object can be focused, so that understanding of global image features and pixel-level details of the target image can be effectively improved, and accordingly prediction accuracy of target entity identification is improved.

In one possible implementation manner, the first image feature and the plurality of region features are spliced to obtain a target spliced feature, specifically, the mask areas of the local masks are respectively determined, the region features corresponding to the local masks are spliced according to the sequence of the mask areas to obtain the first spliced feature, and the first image feature, the first spliced feature and the region features corresponding to the query mask are spliced to obtain the target spliced feature.

The mask area of the local mask refers to the area of the region of interest in the local mask, and the region of interest in the local mask is generally formed by all pixel points with pixel values of 1, where the region of interest is used for indicating the local area where the corresponding candidate object is located in the target image, so when the mask area is larger, the larger the local area where the corresponding candidate object is located is represented, that is, the larger the space occupied by the candidate object in the target image is, the higher the visual saliency of the candidate object is, whereas when the mask area is smaller, the smaller the local area where the corresponding candidate object is represented, that is, the smaller the space occupied by the candidate object in the target image is, the lower the visual saliency of the candidate object is represented.

Based on the above, the region features corresponding to the local masks are spliced according to the size sequence of the mask areas, specifically, the local masks are sequenced according to the size sequence of the mask areas, which is equivalent to sequencing the local masks according to the high-low sequence of visual saliency, and then the region features corresponding to the local masks after the sequencing are spliced in sequence to obtain the first spliced feature, so that in the text prediction process, pixel-level region information of the local region where each candidate object is located can be sequentially focused according to a fixed visual attention sequence, and the understanding of pixel-level details of the target image is effectively realized.

In addition, the first image feature, the first stitching feature and the region feature corresponding to the query mask are stitched to obtain the target stitching feature, so that in the text prediction process, the global visual information captured by the first image feature can be focused, understanding of the global image feature of the target image can be effectively improved, pixel-level region information of a local region where the target object is located can be focused, efficient, flexible and accurate pointing to the target object through the query mask can be realized, and the prediction accuracy of the target entity identification can be further improved.

For example, the local masks are ordered according to the order of the mask areas from large to small, so that the occupation space of the candidate objects corresponding to the region features spliced in turn is smaller and smaller in the first splicing feature, in the text prediction process, the region features corresponding to the candidate objects with larger occupation space are focused first, then the region features corresponding to the candidate objects with smaller occupation space are focused, the human visual attention order is simulated, and the candidate objects with larger occupation space can be focused more widely in the text prediction process, so that the prediction accuracy of the target entity identification is improved.

For another example, the local masks are ordered according to the order of the mask areas from small to large, so that in the first stitching feature, the occupation space of the candidate object corresponding to each region feature stitched in turn is larger and larger, in the text prediction process, the region feature corresponding to the candidate object with smaller occupation space is focused first, and then the region feature corresponding to the candidate object with larger occupation space is focused, so that the prediction accuracy of the target entity identification is improved in a specific scene.

Specifically, since the first image feature may be regarded as a global feature of the target image, and the region feature may be regarded as a local feature of the target image, the feature obtained by stitching the first image feature and the first stitching feature may be regarded as a composite image feature, and taking the case that each local mask is sequenced from the big to the small according to the mask area, the determination formula of the composite image feature is as follows:

Wherein,In order to integrate the features of the image,As a first image feature of the image,For the first partial mask to be used,For the second partial mask to be used,Is the firstThe number of partial masks to be used,For the region feature corresponding to the first partial mask,For the region feature corresponding to the second partial mask,Is the firstThe features of the region corresponding to the individual partial masks,For a first stitched feature stitched from a plurality of regional features, the first stitched feature can be considered as a local feature sequence comprising a plurality of regional features, i.e,Is thatTo the point ofIs characterized by any one of the region features,For the length of the local feature sequence, i.e.For the number of local features in the sequence of local features, it can be seen that the first stitching feature can also be expressed as。

In addition, the size relationship between the mask areas of the respective partial masks satisfies the following formula:

Wherein,Is the firstThe mask area of the individual partial masks,Is the firstThe mask area of each partial mask, i.e. the mask area of the previous partial mask is greater than or equal to the mask area of the partial mask, ensures that the partial masks are ordered in order of the mask areas, i.eIs obtained by sorting the features of each region based on the descending order of mask area.

Notably, assuming that the mask visual features and mask position features are considered extracted by the mask-aware visual extractor and each local mask is segmented by the semantic segmentation model, the first stitching feature is determined as follows:

Wherein,As a first stitching feature, a second stitching feature,In order to be an image of the object,In the case of any one of the partial masks,Is a local maskThe visual characteristics of the corresponding mask are that,Is a local maskThe corresponding mask position features are used to define a mask position,For the mask-aware visual extractor,Is a semantic segmentation model.

It should be noted that the number of the substrates,、AndIn (a) and (b)In order to be an image of the object,In (a) and (b)Refers to global (global), whichFor indicating that the first image feature is a global feature of the target image,In (a) and (b)Refers to local (local) area, whichFor indicating that the region feature in the first stitching feature is a local feature of the target image.

In one possible implementation manner, the first image feature, the first stitching feature and the region feature corresponding to the query mask are stitched to obtain the target stitching feature, which may specifically be that the first image feature, the first stitching feature and the region feature corresponding to the query mask are stitched in sequence to obtain the target stitching feature. Based on the method, the first splicing feature is spliced at the tail end of the first image feature, in the text prediction process, the global feature used for capturing the whole information in the target image can be focused firstly, and then the local feature used for capturing the fine detail in the target image is focused, so that the prediction accuracy of the target entity identification is improved.

In one possible implementation manner, the target entity identifier is generated by a first large language model, the first image feature, the first splicing feature and the region feature corresponding to the query mask are spliced to obtain the target splicing feature, specifically, a prompt text for prompting the first large language model to generate the entity identifier is constructed, the text feature of the prompt text is extracted, and the first image feature, the text feature, the first splicing feature and the region feature corresponding to the query mask are spliced to obtain the target splicing feature.

The first large language model belongs to a large language model (Large Language Model, LLM), which is a deep learning model trained by using a large amount of text data and can generate natural language text or understand meaning of the language text, and generally adopts a cyclic neural network (RNN) or variant, such as a long and short time memory network (LSTM) and a gate-controlled cyclic unit (GRU), to capture context information in a text sequence, thereby realizing tasks such as generation of the natural language text, language model evaluation, text classification, emotion analysis and the like, and has been widely used in the field of natural language processing, such as speech recognition, machine translation, automatic abstract, dialogue system, intelligent question-answering and the like, wherein a target entity identifier belongs to the natural language text, and the first large language model is used for processing the task of generating the target entity identifier.

Based on the above, the Prompt text can be regarded as a Prompt instruction (Prompt), the Prompt instruction can be understood as a mode of starting a large language model, and the Prompt instruction can instruct the large language model to generate content of a specific type, theme or format, so that the Prompt text of the entity identifier is generated by constructing the Prompt first large language model, the target splicing feature comprises the text feature of the Prompt text besides the first image feature, the first splicing feature and the region feature corresponding to the query mask, the target splicing feature is used as the input of the first large language model, the generation quality of the first large language model to the target entity identifier can be improved under the guidance of the Prompt text, and in addition, the understanding of the first large language model to the global feature and the pixel level detail of the target image can be effectively improved by introducing cross attention interaction of the pixel level region feature into the first large language model, so that the prediction accuracy of the target entity identifier is improved.

In one possible implementation manner, the prompt text may include instruction text and reference text, where the instruction text is a prompt instruction, and the reference text is used to refer to the target object, and the reference text is used as a text prompt, so that the target object can be further accurately referred to, and the prediction accuracy of the target entity identifier is further improved.

In one possible implementation manner, feature extraction based on a query mask and each local mask is performed on the target image to obtain a plurality of mask visual features, specifically, multi-level feature extraction is performed on the target image to obtain multi-level visual features of the target image, mask pooling is performed on the multi-level visual features based on the query mask and each local mask to obtain a plurality of multi-level pooled features, and feature fusion is performed on each multi-level pooled feature to obtain a plurality of mask visual features.

Specifically, mask pooling refers to that in a target image, for any one mask of a query mask and each local mask, for a local area where an object is located, multi-level pooling features are obtained by aggregating multi-level visual features of all pixel points in the local area through pooling operation, and mask pooling can be divided into mask operation and pooling operation, and mask operation is described in detail below.

Then, the visual features of the hierarchy are multiplied by the target mask element by element, the visual features falling within the region of interest in the target mask remain, and the visual features falling within other regions in the target mask become 0, resulting in a feature of interest having dimensions 128×128×16.

The pooling operation is described in detail below, and the pooling operation may use average pooling or maximum pooling, and may also use other pooling methods, and embodiments of the disclosure are not limited herein.

Taking the pooling operation as an example, performing mask operation on visual features of each level to obtain corresponding interesting features, assuming that the dimension of a certain interesting feature is 128×128×16, the interesting feature comprises 16 initial feature maps of which the dimension is 128×128, taking the average value of corresponding positions in 16 channels as the feature value of corresponding positions in a new feature map for each position in the initial feature map, so as to generate a single-channel pooling feature map of which the dimension is 128×128, and after performing pooling operation on the interesting feature corresponding to the visual feature of each level, the pooled feature map corresponding to the visual feature of each level can be formed into multi-level pooled features.

The feature fusion is performed on each multi-level pooling feature, specifically, for any multi-level pooling feature, feature fusion is performed on all pooling feature graphs in the multi-level pooling feature, so as to obtain one mask visual feature.

Based on the method, visual features of the target image in different levels can be extracted in multi-level feature extraction, the visual features of the high level are generally more abstract and have more information than the visual features of the low level, the multi-level visual features provide richer and more diverse data representation, the target image can be learned and expressed more comprehensively and effectively to obtain proper multi-level visual features, then pixel-level visual information of a specific area in the target image can be accurately captured in a mask pooling process, visual features are accurately extracted from the target object and local areas where all candidate objects are located to obtain proper multi-level pooling features, different-level pooling feature graphs can be integrated in a feature fusion process to obtain proper mask visual features, fine-granularity visual understanding can be achieved, pixel-level visual information integrity of the specific area can be enhanced, and then in a text prediction process, understanding of pixel-level details of the target image can be effectively improved, so that prediction accuracy of target entity identification is improved.

In one possible implementation manner, the first image feature of the target image is extracted, specifically, multi-level feature extraction is performed on the target image to obtain multi-level visual features of the target image, and the multi-level visual features are input to a first multi-layer sensor for mapping to obtain the first image feature.

Based on the method, the first multi-layer perceptron is used for carrying out multi-layer perception processing, multi-layer visual characteristics are further abstracted through the multi-layer perception processing, and higher-level data representation can be learned, so that the prediction accuracy of target entity identification is improved, and the first multi-layer perceptron and the first large language model can be trained jointly. In addition, when extracting mask visual features and first image features, the same visual encoder can be used for multi-level feature extraction, and by sharing the feature mapping of the visual encoder, additional computation and parameter overhead can be reduced.

In a possible implementation manner, referring to fig. 3, fig. 3 is a schematic flow chart of an alternative procedure for determining mask visual features provided in an embodiment of the present disclosure, feature fusion is performed on each multi-level pooled feature to obtain a plurality of mask visual features, specifically, for any multi-level pooled feature, sub-features of each level in the multi-level pooled feature are mapped to obtain a plurality of intermediate features with the same dimension, feature fusion is performed on each intermediate feature to obtain fused features, and multi-level perception processing is performed on each fused feature to obtain a plurality of mask visual features.

The feature fusion is performed on each intermediate feature, and specifically, each intermediate feature may be summed, or each intermediate feature may be spliced, which is not limited in this disclosure.

Based on the above, the sub-features of each level in the multi-level pooling feature are the pooling feature map, and because the dimensions of the sub-features of different levels are usually different, before feature fusion, each sub-feature needs to be mapped into an intermediate feature with the same dimension, then feature fusion is performed on the intermediate feature with the same dimension to obtain a fused feature, the fused feature can be effectively integrated, higher-level data representation can be learned through multi-level perception processing, and finally appropriate mask visual features are generated, so that richer and more accurate visual information is provided for text prediction, and understanding of pixel level details of a target image is facilitated, so that the prediction accuracy of target entity identification is improved.

Specifically, when the target entity identifier is generated by the first large language model, the multi-layer perception processing can map the fusion feature with visual information into the mask visual feature with language information, which is equivalent to mapping the fusion feature into a feature space matched with the text embedding space, so that the conversion of the feature between different expression spaces is realized, the visual information and the text information input into the first large language model can be effectively integrated, the understanding of the first large language model on the pixel level detail of the target image is facilitated, and the prediction accuracy of the target entity identifier is improved.

In addition, each intermediate feature can be mapped by a corresponding first linear layer, and each fusion feature is processed in a multi-layer sensing way, and specifically, each fusion feature can be respectively input into a second multi-layer sensor for mapping, so that mask visual features corresponding to each fusion feature can be obtained, and the first linear layer, the second multi-layer sensor and the first large language model can be trained jointly.

In one possible implementation manner, local masks corresponding to each candidate object in the target image are obtained, a query mask is determined in a plurality of local masks, specifically, the target image and a prompt sign of the target image are obtained, the target image is segmented to obtain the local mask corresponding to each candidate object, and the query mask is determined in each local mask based on the prompt sign.

Wherein the target image includes a plurality of candidate objects, the prompt mark is used for indicating the selected target object in the plurality of candidate objects, the prompt mark refers to a mark drawn on the target image, for example, the prompt mark may be a point, a line, a frame, a painted area, or the like drawn on the target image.

Specifically, the target image is segmented, the class distribution probability of each pixel point in the target image can be predicted first, the class distribution probability comprises the matching probability value of each candidate object to which the pixel point belongs, for any one pixel point, the candidate object with the highest matching probability value is determined to be the candidate object matched with the pixel point, the local area where each candidate object is located is respectively formed by all matched pixel points, then the local area where each candidate object is located can be accurately segmented in the target image based on the candidate object matched with each pixel point, and then the local mask corresponding to each candidate object can be accurately determined based on the local area where each candidate object is located.

For example, for any candidate object, an initial image identical to the target image is created, in the initial image, the pixel value of the pixel point in the local area where the candidate object is located is assigned to 1, and the pixel value of the pixel point in the other area in the initial image is assigned to 0, so as to obtain the local mask corresponding to the candidate object, or the local mask corresponding to the candidate object may be determined by other manners, and the embodiment of the disclosure is not limited herein.

Based on the method, the local masks corresponding to the candidate objects can be accurately obtained by dividing the target image, the prompt marks can be drawn by related personnel, and the prompt marks can represent the pointing intention of the related personnel, so that the target objects indicated by the prompt marks are specifically objects selected by the related personnel, the query mask is determined in the local masks based on the prompt marks, the pointing intention of the related personnel can be represented by the query mask, and the pointing target objects can be efficiently, flexibly and accurately represented by taking the query mask at the pixel level as a visual prompt, so that the prediction accuracy of the target entity identification is further improved.

Specifically, referring to fig. 4, fig. 4 is a schematic flow chart of an alternative method for generating a mask according to an embodiment of the disclosure, where a target image and a prompt mark may be input to a target mask generating Model for performing mask prediction, and a local mask and a query mask may be generated, for example, the target mask generating Model may use models such as SEGMENT ANYTHING Model (SAM), FAST SEGMENT ANYTHING Model (FastSAM), etc., and the embodiment of the disclosure is not limited herein.

In one possible implementation manner, the query mask is determined in each local mask based on the prompt mark, specifically, when the prompt mark is a mark point, the query mask is determined in each local mask based on the position relationship between the mark point and each local mask;

or when the hint marks are marked areas, determining a query mask in each local mask based on the degree of matching between the marked areas and each local mask.

Wherein the number of the mark points may be one or more, when the number of the mark points is one, the mark points are located within a mask boundary of the query mask, and when the number of the mark points is multiple, the local mask containing the most mark points in the region of interest may be determined as the query mask, or the query mask may be determined by other means, and embodiments of the disclosure are not limited herein.

In the processing of the marking points, the marking points are pixel points in the area of the target object on the target image, whether the marking points are located in the mask boundaries of the local masks or not can be determined by determining the position relation between the marking points and the local masks, namely, whether the concerned area in the local masks contains the marking points or not can be determined as the query mask, when the marking frame is processed, the matching degree between the marking frame and the mask boundaries of the local masks can be calculated because the frame selection area of the marking frame can generally cover the area of the target object, then the local mask with the highest matching degree is determined as the query mask, similarly, when the marking area is processed, the marking area is equivalent to graffiti, and the local mask with the highest matching degree can be calculated and is determined as the query mask because the marking area can generally cover the area of the target object.

It should be noted that the prompt mark may be determined in various manners, for example, a target image and an image mark control are displayed in a display interface, the prompt mark is determined in response to interaction of the image mark control, the image mark control may specifically include an image clicking control, an image framing control, an image smearing control, and the like, when the image mark control is an image clicking control, a pixel point clicked by a related person on the target image can be detected in response to interaction of the image clicking control, and then the pixel point is determined to be a mark point, when the image mark control is an image framing control, a bounding box drawn by the related person on the target image can be detected in response to interaction of the image framing control, and then the bounding box is determined to be a mark box, and when the image mark control is an image smearing control, a smearing area drawn by the related person on the target image can be detected in response to interaction of the image smearing control, and then the smearing area is determined to be a mark area.

For another example, a text entry box is displayed in the display interface, and in response to interaction of the text entry box, content entered by a person associated with the text entry box can be obtained, and a prompt tag can be determined based on the content entered by the text entry box. For example, assuming that the target image is represented as a two-dimensional matrix of x rows and y columns, the content input by the text input box may be "pixels located in x 'th row and y' th column in the target image", and then the pixels located in x 'th row and y' th column in the target image are determined as marker points based on the content input by the text input box, where x '. Ltoreq.x, y'. Ltoreq.y.

In one possible implementation, the query mask is determined in each local mask based on the matching degree between the mark area and each local mask, specifically, the intersection ratio between the mark area and each local mask, that is, the matching degree is specifically the intersection ratio, then the local mask with the largest intersection ratio is determined as the query mask, or the center distance between the mark area and each local mask, that is, the matching degree is specifically the center distance, then the local mask with the largest center distance is determined as the query mask, or other manners may be adopted to determine the query mask, and the embodiment of the disclosure is not limited herein.

Specifically, similar to the processing manner of the mark region, the degree of matching between the mark frame and the mask boundary of each partial mask may also be determined by calculating the intersection ratio or the center distance, or the like.

In a possible implementation manner, the target entity identifier is generated by a first large language model, the first large language model is obtained through training by acquiring a sample image and a second mask corresponding to a sample object in the sample image, segmenting the sample image to obtain a first mask corresponding to each visual object in the sample image, wherein the sample object is one object in a plurality of visual objects, extracting second image characteristics of the sample image, respectively extracting characteristics of the sample image based on the second mask and each first mask to obtain a plurality of sample visual characteristics, extracting sample position characteristics corresponding to the second mask and each first mask, inputting the second image characteristics, the sample visual characteristics and the sample position characteristics into the first large language model for text prediction, generating a prediction probability distribution, determining model loss based on the prediction probability distribution and the first entity identifier of the sample entity linked with the sample image, and training the first large language model based on the model loss.

It is noted that, similar to the target image, the sample image refers to an image requiring visual entity linking, similar to the local masks, each first mask is used for indicating a corresponding visual object, the first mask is capable of indicating a local area where the corresponding visual object is located in the sample image, similar to the query mask, the second mask is used for indicating a sample object, similar to the target object, the sample object corresponds to an object selected from a plurality of visual objects, similar to the mask visual features, the sample visual features are capable of capturing pixel-level visual information of the local area where the corresponding object is located, and similar to the mask position features, the sample position features are capable of capturing pixel-level position information of the local area where the corresponding object is located.

Wherein the sample entity is used to indicate sample objects, the sample entity may be a label of an entity in a knowledge base, in which each entity is typically designed to represent a unique object and corresponds to a globally unique label, and different labels can indicate different entities, for example, the sample entity may be e=q10000 XX, e is an entity (entity), and Q10000XX is a label.

The first entity identifier may include one sample identifier or a plurality of sample identifiers, when the first entity identifier is a sequence composed of a plurality of sample identifiers, the large language model may generate a prediction probability distribution corresponding to each sample identifier, each prediction probability distribution includes a prediction probability value of each candidate identifier, the entity identifier of the sample object may be determined based on the candidate identifier with the highest prediction probability value, and when the first entity identifier is a single sample identifier, the prediction probability distribution includes a prediction probability value of each candidate identifier.

In particular, the first entity identifies a sample entity for indicating the sample object, the sample identifier may be a word or an index of the word, and the sample identifier may also be in other forms, so long as it is ensured that the first entity identifies a sample entity capable of indicating the sample object, and embodiments of the present disclosure are not limited herein.

For example, when the sample identifier is a word, the candidate identifier is also a word, all of the words are recorded in the vocabulary, and the predictive probability distribution includes a predictive probability value for each word in the vocabulary, e.g., the first entity identification is [ _coarse ] [ olf ] [ G ], the first entity identification includes 3 sample identifiers, the first sample identifier is [ _coarse ], the second sample identifier is [ olf ], the sample identifier is [ _g ].

For another example, when the sample identifier is an index of a word segment, the candidate identifier is also an index of the word segment, all the word segments and the corresponding indexes may be recorded in the vocabulary, the prediction probability distribution includes a prediction probability value of the index of each word segment in the vocabulary, the prediction probability distribution includes 100 prediction probability values assuming that the vocabulary includes 100 word segments and the corresponding indexes, wherein the indexes are used to indicate positions of the corresponding word segments in the word embedding matrix, the word embedding matrix includes word embedding vectors of each word segment in the word embedding space in the vocabulary, assuming that the first entity identifier is [10,40,5], the first entity identifier includes 3 sample identifiers, respectively 10,40, and 5, the sample identifier 10 is used to indicate a 10 th position in the word embedding matrix, the sample identifier 40 is used to indicate a 40 th position in the word embedding matrix, and the sample identifier 5 is used to indicate a 5 th position in the word embedding matrix, at this time, the first entity identifier may exist in the form of an integer sequence, which may be further integer, and each entity in the compact knowledge base is realized.

Based on the method, a sample image is obtained, then a first mask corresponding to each visual object in the sample image is obtained through segmentation of the sample image, a second mask corresponding to each visual object in the sample image is obtained, further, a second mask and sample visual features corresponding to each first mask are obtained through feature extraction, further, the second mask and sample position features corresponding to each first mask are extracted, text prediction is carried out through a first large language model, a prediction probability distribution for determining entity identifiers can be generated, then the sample probability distribution is determined based on the first entity identifiers of sample entities, the sample probability distribution is used as tag data, model loss is determined according to the difference between the prediction probability distribution and the sample probability distribution, supervised learning is carried out on the first large language model based on the model loss, and understanding of global features and pixel level details of the target image can be improved by the first large language model through iterative training, so that the prediction accuracy of the target entity identifiers is improved.

In one possible implementation manner, the sample image, the second mask and the first entity identifier are all obtained from the dataset, before the sample image and the second mask corresponding to the sample object in the sample image are obtained, the identifier generation method further comprises the steps of obtaining a plurality of original images and query texts corresponding to the original images, respectively determining identification information corresponding to the original images according to the original images and the corresponding query texts, obtaining a plurality of candidate entities, respectively determining link entities corresponding to the original images in the candidate entities based on the identification information, respectively determining the mark mask of the original images based on the original images and the corresponding query texts, and storing the original images, the corresponding mark mask and the corresponding link entities in the dataset.

The original image is an image needing to be subjected to visual entity linking, the original image generally comprises a plurality of objects, but only the objects of interest in the original image need to be subjected to visual entity linking, the objects of interest are similar to the target objects, the objects of interest are equivalent to objects selected from the plurality of objects in the original image, query texts are used for prompting and identifying the objects of interest in the corresponding original image, the query texts can represent intentions for referring to the objects of interest, mark masks are used for indicating the corresponding objects of interest, and the mark masks can be used for specifying local areas where the objects of interest are located in the original image.

Specifically, each candidate entity may be a label of an entity in a knowledge base, where each entity is generally designed to represent a unique object and corresponds to a globally unique label, for example, a candidate entity may be e=q10000 XX, another candidate entity may be e=q20000 XX, e is an entity (entity), and Q10000XX and Q20000XX are both labels, so that a link entity may also be a label of an entity in the knowledge base.

The sample images are sampled from a plurality of original images, the second mask is an annotation mask corresponding to the sample images, and the sample entities are link entities linked with the sample images.

Based on the additional semantic context provided by the original image and the corresponding query text, accurate identification information can be determined, and further based on the identification information, accurate link entities can be determined, so that the original image, the corresponding labeling mask and the corresponding link entities can be used as a group of training samples, the original image, the corresponding labeling mask and the corresponding link entities are associated and stored into a data set, the original image and the corresponding link entities are linked, the link relationship between the attention object indicated by the labeling mask and the corresponding link entities is realized, and as the data set generally comprises a plurality of groups of training samples, the training samples can be sampled from the data set in the training process, then the original image in the sampled training samples is determined to be a sample image, the labeling mask in the sampled training samples is determined to be a second mask, and the link entities in the sampled training samples are determined to be a sample entity, and the correlation between the sampled sample image, the second mask and the sample entities is ensured, and the training process is effectively ensured.

In the training process, the training samples acquired from the data set can be used for pre-training the first large language model, so that the first large language model learns knowledge of the entity and the entity identifier, and the first large language model is helpful for generating an effective target entity identifier in the reasoning process. And then, fine tuning the first large language model by using the training sample of the downstream scene so as to improve the capability of the first large language model for fine-grained visual entity linking.

Specifically, the link entity may be determined in a variety of ways, and the first determination of the link entity is described in detail below.

In one possible implementation manner, the identification information is an identification coding result, and the identification information corresponding to each original Image is determined according to each original Image and the corresponding query text, specifically, each original Image and the corresponding query text are input into a multi-modal coding model to be coded, so as to obtain the identification coding result corresponding to each original Image, where the multi-modal coding model may be a contrast language Image Pre-training model (Contrastive Language-Image Pre-training model, CLIP), a path language Image model (Pathways Language and Image model, paLI), and the embodiment of the disclosure is not limited herein.

And determining the candidate coding result with highest similarity between the identification coding result corresponding to the original image and each candidate coding result as a target coding result, and determining the candidate entity corresponding to the target coding result as the link entity corresponding to the original image based on the similarity between the identification coding result corresponding to the original image and each candidate coding result, wherein the candidate coding result is obtained by coding the candidate name text of each candidate entity and the corresponding candidate image in the knowledge base.

It should be noted that, in addition to storing the labels of the candidate entities, the knowledge base stores related texts of the candidate entities and corresponding candidate images, where the candidate images are used for displaying the candidate entities, the related texts of the candidate entities include name texts of the candidate entities and description texts, the name texts are used for indicating the entities and have readability, and the description texts are used for describing the candidate entities.

The second way of determining the linking entity is described in detail below.

In another possible implementation manner, the identification information is identification name text, and the identification information corresponding to each original image is determined according to each original image and the corresponding query text, specifically, each original image and the corresponding query text are respectively input into a multi-modal coding model to be coded, so as to obtain identification coding results corresponding to each original image, and the identification coding results are decoded to obtain the identification name text corresponding to the original image.

Then, based on each piece of identification information, determining a link entity corresponding to each original image in a plurality of candidate entities, specifically, for any one original image, based on the identification name text corresponding to the original image, searching out a consistent target name text in the candidate name text of each candidate entity in a knowledge base, and determining the candidate entity corresponding to the target name text as the link entity corresponding to the original image, wherein based on the determination, the link entity corresponding to the original image can be accurately determined.

It should be noted that the link entity may also be determined by other manners, and embodiments of the present disclosure are not limited herein.

In one possible implementation manner, the labeling mask of each original image is determined based on each original image and the corresponding query text, specifically, each query text is input into a second large language model for text prediction to generate a summary text corresponding to each original image, object detection is performed based on each original image and the corresponding summary text to generate an original bounding box corresponding to each original image, wherein the original bounding box is used for indicating a corresponding object of interest, and each original image and the corresponding original bounding box are input into a first mask generation model for mask prediction to generate the labeling mask corresponding to each original image.

Wherein the second large language model belongs to a large language model, the second large language model is used for processing a task of extracting a finger representation expression of a text, the second large language model can generate a corresponding summarized text based on the query text, the summarized text can describe a position of an object of interest or a relation with other objects, for example, assuming that the query text is "what is a brown article placed on a chair", the corresponding summarized text can be "brown article on a chair".

Based on the method, the object detection is carried out based on the original image and the corresponding generalized text, an accurate original boundary frame can be generated, the original boundary frame can cover the area where the object of interest is located, then the original image and the corresponding original boundary frame are input into the first mask generation model for mask prediction, accurate position prompt is provided through the original boundary frame, the quality of the first mask generation model for generating the labeling mask can be improved, and therefore the labeling success rate of the labeling mask is effectively improved.

In particular, referring to fig. 5, fig. 5 is a schematic flow chart of an alternative method for updating a data set according to an embodiment of the present disclosure.

Firstly, determining identification information according to an original image and a corresponding query text, and determining a corresponding link entity based on the identification information;

And then, inputting the query text into a second large language model for text prediction to generate a generalized text, and inputting the original image and the corresponding generalized text into an object prediction model for object detection to generate an accurate original bounding box, wherein the object detection model can adopt a Grounding DINO model or other models, and the embodiment of the disclosure is not limited herein.

Then, the original image and the corresponding original bounding box are input into a first mask generating Model to perform mask prediction to generate a labeling mask, wherein the first mask generating Model can adopt models such as SEGMENT ANYTHING Model (SAM), FAST SEGMENT ANYTHING Model (FastSAM) and the like, and embodiments of the disclosure are not limited herein.

The original image, the corresponding annotation mask, and the corresponding link entity associations are then stored to a dataset.

In a possible implementation manner, referring to fig. 6, fig. 6 is an optional architecture diagram of removing original images provided by the embodiment of the present disclosure, wherein each original image and a corresponding original bounding box are respectively input into a first mask generation model to perform mask prediction, after labeling masks corresponding to each original image are generated, the identifier generation method further includes obtaining a reference name text corresponding to each link entity, where the reference name text is used to indicate a name of the reference entity, a level of the reference entity in a knowledge base is higher than a level of the link entity in the knowledge base, respectively inputting each original image and a corresponding reference name text into a second mask generation model to perform mask prediction, generating reference masks corresponding to each original image, determining a matching degree between each labeling mask and a corresponding reference mask, and obtaining a target matching degree corresponding to each labeling mask, and removing the original image corresponding to the target matching degree when the target matching degree is smaller than a preset matching degree threshold.

The knowledge base may be a knowledge graph, different entities in the knowledge base may have a hierarchical relationship, and assuming that the knowledge base includes an entity e=Q73XX named "mammal", the knowledge base further includes an entity e=Q3009XX named "cat", since "cat" belongs to "mammal", the knowledge base may be set to have a hierarchical relationship between the entity e=Q73XX and the entity e=Q3009XX, and the level of the entity e=Q73XX is higher than the entity e=Q3009XX, and therefore, assuming that the linked entity is the entity e=Q3009XX, since the level of the entity e=Q73XX is higher than the entity e=Q3009XX, the entity e=Q73XX may be determined as a reference entity, and the "mammal" may be determined as a reference name text, and the entity of the higher level may be determined as a reference entity.

The second mask generating Model may be SEGMENT EVERYTHING EVERYWHERE ALL AT on Model (SEEM) Model, or may be another Model, which is not limited in this disclosure.

It should be noted that the matching degree threshold may be obtained through prediction of the trained first regression model or determined through multiple experiments, and embodiments of the present disclosure are not limited herein.

Based on the method, an original image and a corresponding reference name text are input into a second mask generation model to conduct mask prediction, the semantic range of an object of interest is enlarged through the reference name text, the quality of a reference mask generated by the second mask generation model can be improved, then the matching degree between a labeling mask and the corresponding reference mask is determined, when the target matching degree is smaller than a preset matching degree threshold value, a low-quality labeling mask is possibly generated by the first mask generation model, the original image corresponding to the target matching degree is removed, the filtering is equivalent to the original image, a low-quality training sample can be prevented from being stored in a data set, so that the training quality of a first large language model is effectively improved, and the unsuitable labeling mask is filtered by taking the reference mask generated by the second mask generation model as a supplementary strategy so as to cope with the error propagation problem possibly occurring in the first mask generation model in the generation process.

Specifically, similar to the processing manner of the marking region, the matching degree between each marking mask and the corresponding reference mask can also be determined by calculating the intersection ratio or the center distance.

In one possible implementation manner, besides inputting the reference name text into the second mask generation model, the name text, the description text, the category text and other contents of the link entity can be input into the second mask generation model, so that the quality of the reference mask generated by the second mask generation model can be further improved.

In one possible implementation, each original image and corresponding query text may be partitioned into different original subsets for different task goals, data processing requirements, or application scenarios, assuming that the original subsets include entity subsets and query subsets, the task goal of the entity subsets is to enable the model to identify specific entities from the original image, the query text of the entity subsets is typically used to describe the type of entities that should be identified from the image, e.g., the original image shows a cat, the query text may be "identify animals in the graph", and the task goal of the query subsets is to enable the model to identify both the entities in the image and understand the context and intent of the query text, e.g., the original image shows a girl wearing hair, the query text may be "what is being worn on the hair of a child".

Specifically, the original subset can further include a human labeling subset, the task target of the human labeling subset can be the same as the task target of the entity subset or the query subset, the original image in the human labeling subset is manually screened, and the query text in the human labeling subset is manually set, so that the quality of the human labeling subset can be ensured.

In one possible implementation manner, after determining the matching degree between each labeling mask and the corresponding reference mask to obtain the target matching degree corresponding to each labeling mask, the identification generating method further comprises the step of not adjusting the labeling mask when the target matching degree is greater than or equal to a matching degree threshold and the original image and the corresponding query text come from the entity subset, or replacing each labeling mask with the corresponding reference mask when the target matching degree is greater than or equal to the matching degree threshold and the original image and the corresponding query text come from the query subset.

Based on the method, the accuracy of the labeling mask of each original image in the entity subset can be improved, and the accuracy of the labeling mask of each original image in the query subset can also be improved.

In one possible implementation manner, before the original images, the corresponding labeling masks and the corresponding link entities are stored in the data set in an associated mode, the identification generation method further comprises the steps of counting the number of connected areas in the labeling masks to obtain the number of areas corresponding to the labeling masks, and eliminating the original images corresponding to the number of areas when the number of the areas is larger than a preset number threshold.

The connected region is a region formed by pixel points which have the same pixel value and are adjacent to each other in the mark mask, the mark mask can be a binary image, the pixel value of the pixel point in the connected region is 1, the pixel value of the pixel point in the region outside the connected region is 0, and the connected region corresponds to the concerned region in the mark mask.

Typically, most or all of the area occupied by an object in an image is connected, and therefore, when the number of connected areas is excessive, the representative annotation mask may indicate a plurality of objects in the original image, i.e., the annotation mask cannot accurately indicate only the object of interest in the original image.

Based on the above, when the number of the regions is greater than the number threshold, the number of the connected regions can be considered to be too large, the labeling mask is defined as a low-quality mask, and the low-quality training samples can be prevented from being stored in the data set by eliminating the corresponding original images, so that the training quality of the first large language model is effectively improved.

It should be noted that the number threshold may be obtained through prediction of the second regression model after training or may be determined through multiple experiments, and the embodiments of the present disclosure are not limited herein.

In one possible implementation, the removing of the original image corresponds to filtering the training sample, and since the object of interest not belonging to the visual entity cannot be effectively indicated by the labeling mask, in addition to filtering the training sample based on the number of regions, the training sample may be filtered by a filter, for example, filtering the query text by a text filter, so that the original image of the object of interest not belonging to the visual entity can be removed, for example, if the object of interest of the original image is a meeting, the original image is removed.

Specifically, referring to fig. 7, fig. 7 is an optional distribution diagram of entity classes in multiple sample sets according to an embodiment of the present disclosure.

Wherein a sample set consisting of all training samples before filtering and without a labeling mask is defined as an initial sample set and a sample set consisting of all training samples after filtering is defined as an optimized sample set, it can be seen that in each entity class the number of samples of the optimized sample set is smaller than the number of samples of the initial sample set, in particular in places, buildings and sports classes, objects of interest in the original image usually do not belong to visual entities, filtering is performed with a filter such that the number of samples of the optimized sample set is much smaller than the number of samples of the initial sample set.

Referring to fig. 8, fig. 8 is an alternative pie chart of optimizing entity categories in a sample set provided by an embodiment of the present disclosure.

It can be seen that the ratio of other classes, animal classes and plant classes is relatively large in the optimized sample set, and that these classes of objects of interest typically belong to visual entities, which can be effectively indicated by the labeling mask, and therefore, by optimizing the training samples in the sample set, the first large language model can be effectively trained.

In addition, referring to fig. 9, fig. 9 is an alternative distribution diagram of area ratios of the marking mask provided in the embodiment of the present disclosure.

The area ratio of the labeling mask is specifically the ratio between the area of the labeling mask and the area of the corresponding original image, and it can be seen that the distribution curve of the area ratio is overall smooth, and when the area ratio exceeds 95%, the frequency slightly rises, because a dense object group exists in part of the original image, for example, the original image comprises clustered vegetation, and the clustered vegetation is indicated as a coherent object by the labeling mask in the labeling process.

An alternative data statistics case for the initial sample set and the optimized sample set is described in detail below.

For the initial sample set, the data statistics are shown in table 1 below:

TABLE 1

As can be seen from table 1, the initial sample set may include a training set, a validation set, a test set, and a set of manual annotations, where the initial sample set includes 5245421 annotations of 5214965 images and covers 20077 entities, and the set of manual annotations refers to a set of manually-annotated data, and "visible" refers to that the objects of interest in the images generally belong to visual entities, and "invisible" refers to that the objects of interest in the images generally do not belong to visual entities.

For the optimized sample set, the data statistics are shown in table 2 below:

TABLE 2

As can be seen from Table 2, the optimized sample set contains 1965145 training samples with labeling masks, which enables efficient training of the first large language model.

In one possible implementation manner, the first entity identifier of the sample entity linked with the sample image is obtained, specifically, a sample name text of the sample entity linked with the sample image is obtained, the sample name text is segmented to obtain a plurality of first segmentation words, the occurrence frequency of each first segmentation word in a knowledge base is determined, the first segmentation words are ordered based on the order of the occurrence frequency from small to large, the first segmentation words arranged in the front L bits are determined to be second segmentation words, and the first entity identifier of the sample entity is determined based on each second segmentation word.

The sample name text is used to indicate a sample entity and has readability, and the sample name text may be segmented in various manners, for example, the sample name text is input to a text segmenter for segmentation, or the sample name text is segmented according to a fixed length, which is not limited herein.

The knowledge base comprises a plurality of candidate entities and corresponding candidate name texts, the candidate name texts are text corpus in the knowledge base, when the occurrence frequency of a first word in the knowledge base is determined, word segmentation can be carried out on all text corpus in the knowledge base, then a word segmentation union is constructed through all word segmentation structures, and the occurrence frequency of the first word in the word segmentation union is used as the occurrence frequency of the first word in the knowledge base. Alternatively, the text corpus may include descriptive text of the candidate entity in addition to the candidate name text, or other text, and embodiments of the present disclosure are not limited herein.

Wherein L is a positive integer, the first words are ordered based on the order of the occurrence frequencies from small to large, the first words arranged in the front L bits are determined as second words, the first words are taken as the second words after the occurrence frequencies of the first words in the word combination are increased, generally, L is smaller than or equal to the number of the first words, when the number of L is larger than the number of the first words, all the ordered first words are determined as the second words, the value of L is generally smaller, for example, L can be taken as 4, and the decoding efficiency of the first large language model can be improved.

Based on the method, the second word is obtained by word segmentation of the sample name text, the sample name text of the long text can be decomposed into smaller units, the first large language model can learn and understand the structure and meaning of the language more effectively, the processing efficiency can be improved, the first word is extracted after the occurrence frequency of word segmentation and concentration is increased, the first word with the lowest occurrence frequency is selected, the first entity identification determined by all the second words is more unique, confusion can be reduced, the second word with the lower occurrence frequency is arranged in front, the first large language model can decode more distinguishing results first, and therefore the prediction accuracy of the entity identification is improved, and in addition, the first entity identification is determined by L second words at most, the length of the entity identification is limited, and the decoding efficiency of the first large language model can be improved.

In particular, the first entity identifier may include a sample identifier or a plurality of sample identifiers, the sample identifier may be a word or an index of the word, and the sample identifier may be in other forms, and embodiments of the present disclosure are not limited herein.

Taking the candidate identifier as a word segmentation example, when the first entity identifier is a sequence composed of a plurality of sample identifiers, the determination formula of the first entity identifier is as follows:

Wherein,For the identification of the first entity,Is a word segmentation device of a text,For the sample name text to be used,For a plurality of first word segments,In order to be a knowledge base of the knowledge,Is the first in the knowledge baseThe number of candidate entities is chosen to be,Refers to the pair ofCandidate name text of each candidate entity is segmented,Refers to the word segmentation union of the word segmentation results of the candidate name text of all candidate entities in the knowledge base,For taking the first L segmentation words according to word frequency ascending order,Meaning that each first word is according to the union of the word in the word segmentationThe first L word segments are taken after the occurrence frequency of the Chinese character is increased, namelyIncluding each second word, each second word being a sample identifier, and thus the first entity identifierConsisting of all sample identifiers.

Illustratively, assuming that the sample name text is Golf coarse, the sample name text is segmented to obtain three first segmented words, which are [ _g ], [ olf ] and [ _coarse ] respectively, and the first segmented words are ordered based on the order of the occurrence frequencies from small to large to obtain [ _coarse ] [ olf ] [ G ], and assuming that L is equal to 3, the first entity is identified as [ _coarse ] [ olf ] [ G ].

In one possible implementation manner, the second image feature, the sample visual feature and the sample position feature are spliced and then input into the first large language model to conduct text prediction, so as to generate a prediction probability distribution, specifically, a prompt text for prompting the first large language model to generate entity identification is constructed, the text feature of the prompt text is extracted, and the second image feature, the text feature, the sample visual feature and the sample position feature are spliced and then input into the first large language model to conduct text prediction, so as to generate the prediction probability distribution.

Specifically, the prediction probability distribution generated based on the first large language model can determine a prediction identifier corresponding to each sample identifier, all the prediction identifiers can form a prediction entity identifier, and a determination formula of the prediction identifiers is as follows:

Wherein,To predict the first in entity identificationA predictive identifier of the location of the individual,To predict the first in entity identificationA predictive identifier preceding the location of the one,As a feature of the text it is,As a feature of the second image it is,For the sample visual features and sample location features,For the embedding of the indicator words into the matrix,For indicating word-based embedding matrix pairsThe word embedding process is performed such that,To predict the first in entity identificationThe predictive identifier preceding the individual position corresponds to the word embedding vector in the word embedding matrix,For indicating a first large language model, the first large language model being capable of being input based、、AndGenerating the first of the predicted entity identifiersProbability distribution of each position, and predicting the first position in the predicted entity identifier based on the probability distributionAnd then, inputting word embedded vectors corresponding to the predicted identifiers into a first large language model to enable the first large language model to predict the predicted identifiers of the next positions until the predicted identifiers of the next positions meet the ending condition of ending the preset, wherein all the predicted identifiers can form predicted entity identifiers.

The complete process of the identity generation method, including the training phase and the reasoning phase, is described in detail below.

The training phase is described in detail below.

Referring to fig. 10, fig. 10 is an alternative architecture diagram of a training phase provided by an embodiment of the present disclosure.

Firstly, acquiring a plurality of original images and query texts corresponding to the original images, respectively determining identification information corresponding to the original images according to the original images and the corresponding query texts, wherein the query texts are used for prompting and identifying the concerned objects in the corresponding original images, acquiring a plurality of candidate entities, and respectively determining the link entities corresponding to the original images in the candidate entities based on the identification information.

The method comprises the steps of inputting query texts into a first language model to conduct text prediction, generating generalized texts corresponding to original images, conducting object detection based on the original images and the generalized texts to generate original bounding boxes corresponding to the original images, wherein the original bounding boxes are used for indicating corresponding attention objects, inputting the original images and the original bounding boxes to a first mask generation model to conduct mask prediction, and generating marking masks corresponding to the original images, wherein the marking masks are used for indicating the corresponding attention objects.

The method comprises the steps of obtaining a reference name text corresponding to each link entity, wherein the reference name text is used for indicating the name of the reference entity, the level of the reference entity in a knowledge base is higher than that of the link entity in the knowledge base, inputting each original image and the corresponding reference name text into a second mask generation model to conduct mask prediction to generate a reference mask corresponding to each original image, determining the matching degree between each labeling mask and the corresponding reference mask to obtain target matching degree corresponding to each labeling mask, and eliminating the original image corresponding to the target matching degree when the target matching degree is smaller than a preset matching degree threshold.

The method comprises the steps of obtaining a plurality of original images, obtaining a plurality of connected areas in each marking mask, counting the number of the connected areas in each marking mask to obtain the number of the areas corresponding to each marking mask, eliminating original images corresponding to the number of the areas when the number of the areas is larger than a preset number threshold, and storing each original image, the corresponding marking mask and the corresponding link entity in a data set in an associated mode, wherein a sample image is obtained by sampling a plurality of original images, the second mask is the marking mask corresponding to the sample image, and the sample entity is the link entity linked with the sample image.

Then, a sample image and a second mask corresponding to a sample object in the sample image are obtained, the sample image is segmented to obtain a first mask corresponding to each visual object in the sample image, wherein the sample object is one object in a plurality of visual objects, for example, the sample image can be input into a target mask generation model for panoramic segmentation, and a plurality of first masks are obtained;

Then, extracting second image features of the sample image, for example, inputting the sample image into a visual encoder for multi-level feature extraction to obtain multi-level coding features;

Then, feature extraction based on the second mask and each first mask is performed on the sample image to obtain a plurality of sample visual features, sample position features corresponding to the second mask and each first mask are extracted, for example, the multi-level coded features, the second mask and each first mask may be input to a mask perception visual extractor, feature extraction based on the mask perception visual extractor is performed on the multi-level coded features to obtain a plurality of sample visual features, and a plurality of sample position features corresponding to the second mask and each first mask are obtained through position coding.

The method comprises the steps of establishing a first large language model, establishing a prompt text for prompting the first large language model to generate entity identification, extracting text characteristics of the prompt text, splicing second image characteristics, the text characteristics of the prompt text, sample visual characteristics and sample position characteristics, inputting the spliced second image characteristics, the spliced text characteristics, the sample visual characteristics and the sample position characteristics into the first large language model for text prediction, and generating prediction probability distribution, wherein the prediction probability distribution is used for determining the entity identification of a sample object;

Then, acquiring a sample name text of a sample entity linked with a sample image, segmenting the sample name text to obtain a plurality of first segmented words, determining the occurrence frequency of each first segmented word in a knowledge base, sequencing each first segmented word based on the sequence of the occurrence frequency from small to large, determining the first segmented word arranged in the front L bits as a second segmented word, wherein L is a positive integer, determining a first entity identifier of the sample entity based on each second segmented word, determining model loss based on the prediction probability distribution and the first entity identifier, and training a first large language model based on the model loss, wherein the sample entity is used for indicating a sample object.

The reasoning phase is described in detail below.

Firstly, obtaining a target image and a prompt mark of the target image, and dividing the target image to obtain local masks corresponding to candidate objects. The target image comprises a plurality of candidate objects, and the prompt mark is used for indicating the selected target object in the plurality of candidate objects.

Then, when the hint marks are marker points, a query mask is determined in each local mask based on a positional relationship between the marker points and each local mask, or when the hint marks are marker frames, a query mask is determined in each local mask based on a degree of matching between the marker frames and mask boundaries of each local mask, or when the hint marks are marker areas, a query mask is determined in each local mask based on a degree of matching between the marker areas and each local mask. Wherein the query mask is used to indicate a selected target object of the plurality of candidate objects.

And performing mask pooling on the multi-level visual features based on the query mask and each local mask respectively to obtain a plurality of multi-level pooling features.

Then, mapping sub-features of each level in any multi-level pooling feature to obtain intermediate features with the same multiple dimensions, carrying out feature fusion on each intermediate feature to obtain fusion features, carrying out multi-level perception processing on each fusion feature to obtain multiple mask visual features, and extracting mask position features corresponding to query masks and each local mask.

And extracting first image features of the target image, respectively determining mask areas of the local masks, and splicing the region features corresponding to the local masks according to the size sequence of the mask areas to obtain first spliced features.

Then, constructing a prompt text for prompting the first large language model to generate an entity identifier, extracting text characteristics of the prompt text, splicing the first image characteristics, the text characteristics, the first splicing characteristics and the area characteristics corresponding to the query mask to obtain target splicing characteristics, and performing text prediction based on the target splicing characteristics to generate a target entity identifier of a target object.

Based on the method, local masks corresponding to all candidate objects in a target image are obtained, query masks for indicating the target objects are determined, further, the query masks and mask visual features corresponding to all the local masks are obtained through feature extraction, mask position features corresponding to the query masks and all the local masks are extracted, the mask visual features can capture pixel-level visual information of a local area where the corresponding object is located, the mask position features can capture pixel-level position information of the local area where the corresponding object is located, then each mask visual feature is spliced with the corresponding mask position features respectively to obtain area features of all the local areas, the area features are equivalent to combining pixel-level visual information and corresponding pixel-level position information of the local areas into pixel-level area information, then first image features of the target image are extracted, the first image features and a plurality of area features are spliced into target splicing features, then text prediction is carried out based on the target splicing features to generate a target entity identifier of the target object, in the text prediction process, and through interaction of all the features in the target splicing features, the first image features can capture global visual information, then the pixel-level visual information corresponding to the candidate objects can be focused on the target-level visual features, the pixel-level information corresponding to the target-level visual information of the target object can be effectively predicted, and the target-level visual information can be further improved, and the target-level accuracy of the target image can be further improved, and the target-level accuracy can be further predicted, and the target-level image accuracy can be further improved.

The training process of the model provided in the present disclosure is described in detail below.

The first large language model provided by the present disclosure may undergo two-stage training, in a first training stage, the first large language model is pre-trained by an initial sample set comprising 200 ten thousand samples, and in a second training stage, the pre-trained first large language model is fine-tuned by optimizing a subset of entities in the sample set and a subset of queries.

Because the optimized data set contains about 450 ten thousand training samples, in the fine tuning process, the quantity of training samples corresponding to each entity can be limited to 50 or less in consideration of the limitation of computing resources and the huge scale of the optimized data set, so that only about 7% of training samples are used in the entity subset and the query subset, the total quantity is about 30 ten thousand, the computing resources can be effectively saved, and the training efficiency is improved. In addition, the sizes of all input images are uniformly preprocessed to 512 multiplied by 512, so that the consistency of training can be ensured, the length of the entity identifier is limited to 4, and the overlong length of the entity identifier is avoided, thereby improving the training efficiency.

Then, the evaluation process of the model provided by the present disclosure and other models is described in detail below.

The data set used for the evaluation can be divided into a validation set and a test set, and the evaluation results are shown in table 3 below:

TABLE 3 Table 3

Wherein "none" indicates that no hint is used to reference visual information,Refers to a discrimination model based on the retrieval,It is meant that a model is generated and,The reference model is different from the model provided by the disclosure in that visual features and position features of a mask corresponding to an object of interest in an image are input to a first large language model, and visual features and position features of a mask corresponding to other objects in the image are not input to the first large language model.

Specifically, table 3 shows the accuracy results of visual Language (Vision-Language) models of different prompt types on the verification set and the test set, the accuracy is used as a key index for measuring the model performance on the verification set and the test set, the performance of the model on the entity subset and the query subset is evaluated for each data set, and the overall accuracy of all samples in each subset is calculated as a final evaluation basis.

In addition, considering that the zero sample inference model has challenges in generating entity identifications and processing entity name texts in specific fields, the generated results need to be processed by using BM25 retrieval, specifically, 600 ten thousand entity name texts in a knowledge base need to be searched, and then the closest search result is selected as the basis of calculating accuracy.

As can be seen from table 3, compared with the visual language model based on text prompt, the model provided by the present disclosure achieves a significant improvement on the entity subset, the performance difference is between-2.0% and 11.3%, and the evaluation result shows that the challenges caused by lack of text prior can be alleviated by fine visual feature modeling.

Then, compared withAndThe model provided by the present disclosure has a 22% to 42% gap in performance over the query subset, which is believed to be due to the fact that the query subset is primarily derived from Visual Questions (VQA), which typically involve not only references to visual information, but also additional query intent, e.g., "made by..the.," produced by, "how much water is needed," etc., which require text to express user intent and make further inferences beyond the scope of VEL. The model provided by the present disclosure is based on visual mask reference cues, so these problems are not well covered. Based on this, for data in the query subset, its reference expression needs to be extracted using the second large language model and replaced with a visual mask marker to preserve reference information beyond the query intent.

Then, an ablation experiment of the model provided in the present disclosure is described in detail below.

The results of the ablation experiments are shown in table 4 below:

TABLE 4 Table 4

The reference model is different from the model provided by the present disclosure in that visual features and position features of a mask corresponding to an object of interest in an image are input to the first large language model, while visual features and position features of a mask corresponding to other objects in the image are not input to the first large language model, so that the reference model can be regarded as a model in which respective components of the mask corresponding to other objects are removed from the model provided by the present disclosure.

Specifically, table 4 shows the ablation experimental results of the model provided by the present disclosure, which can evaluate the effectiveness of visual semantic labeling and training, and the experimental results show that introducing fine-grained local visual features can significantly improve the accuracy of the model, increasing the accuracy on the entity subset by 3.7% to 5.0%, and increasing the accuracy on the query subset by 3.5% to 5.5%. In addition, the fine tuning also significantly improves the overall accuracy of the model, while the pre-training effect is relatively small, and the improvement range of the pre-training is 0.1% to 1.6%.

Based on this, it can be determined that the success of the model provided by the present disclosure in the pre-training phase is mainly due to the construction of a larger-scale pre-training dataset, e.g., the pre-training dataset contains 5500 tens of thousands of samples, and the use of GENERATIVE IMAGE-to-text transform (GIT) as backbone (with 4-billion parameters) in combination with a randomly initialized text decoder. By combining limited model parameters with the original pre-training strategy, it is helpful to provide the pre-training effect of the model provided by the present disclosure.

Then, the generalization ability of the model provided by the present disclosure is described in detail below.

TABLE 5

Wherein,Refers to a discrimination model based on the retrieval,It is meant that a model is generated and,The reference model is different from the model provided by the disclosure in that visual features and position features of a mask corresponding to an object of interest in an image are input to a first large language model, visual features and position features of a mask corresponding to other objects in the image are not input to the first large language model, and reference model-fine tuning refers to a fine-tuned reference model.

In particular, table 5 shows the accuracy results of different models on both the visible data subset and the invisible data subset, which are compared to the text hint based model in determining the generic ability due to the lack of text priors of the models provided by the present disclosure.

Wherein,Is a VEL model based on text cues,Candidate entities are retrieved using a CLIP-based encoder and then final entities are generated by decoding based on MLLM's candidate prefix tree constraints, both stages being end-to-end optimized by multitasking targets. Due toCombines the search enhancement) Decoding generation) Therefore, it isThe accuracy is relatively high over both the subset of visible entities (about 30%) and the subset of invisible entities (about 10%).

It is to be noted that,Generated for decoding without search enhancement and constraintThe model is used for the production of the model,The performance of the model provided by the disclosure is very close to that of the model, the difference range of the accuracy is-0.2% to +0.5%, which indicates that the retrieval enhancement may be an effective way to improve the generalization capability of the model, so that in the training process of the first large language model, when the link entity corresponding to each original image needs to be determined from a plurality of candidate entities, the retrieval enhancement mode can be introduced, thereby improving the generalization capability of the first large language model.

It can be seen that the identifier generating method provided by the embodiment of the present disclosure may be applied to various fields.

In the intelligent traffic field, for example, a road image can be acquired through a vehicle sensor, the road image can be used as a target image, then a local mask corresponding to each candidate object in the target image is acquired, a query mask is determined in the local masks, wherein the candidate objects can be pedestrians, vehicles, traffic signs and the like, the query mask is used for indicating the selected target object in the candidate objects, for example, a driver or a passenger can interact with a vehicle-mounted system so as to enable the vehicle-mounted system to acquire a prompt mark of the target image, the prompt mark is used for indicating the selected target object in the candidate objects, the target image is respectively subjected to feature extraction based on the query mask and each local mask, a plurality of mask visual features are acquired, the mask visual features are respectively spliced with the corresponding mask position features, a plurality of region features are acquired, the first image features of the target image are extracted, the first image features and the region features are spliced, the target splicing features are acquired, the target entity identification of the target object is generated based on the target splicing features, the target image is effectively improved, the global image features and the target entity can be accurately linked with the target entity, and the target entity can be accurately linked in the vehicle-mounted system, and the knowledge base can be accurately linked, and the knowledge of the target entity can be accurately linked.

For example, in the medical field, a medical image can be acquired through a medical instrument, for example, the medical image comprises a magnetic resonance imaging image, a pathology image, a computed tomography image and the like, then a local mask corresponding to each candidate object in the target image is acquired, a query mask is determined in the local masks, wherein the candidate objects can be organs, tissues and the like, the query mask is used for indicating a selected target object in the candidate objects, for example, a doctor can interact with a medical system to enable the medical system to acquire a prompt mark of the target image, the prompt mark is used for indicating the selected target object in the candidate objects, the target image is respectively subjected to feature extraction based on the query mask and each local mask to obtain a plurality of mask visual features, the mask visual features and the mask position features corresponding to each local mask are respectively spliced with the corresponding mask position features to obtain a plurality of region features, a first image feature of the target image is extracted, the first image feature and the region features are spliced to obtain target splicing features, a text prediction is performed based on the target splicing features to generate a target entity identifier of the target object, the target entity identifier can be effectively improved, the target entity identifier and the global entity can be more accurately linked with the target entity identifier, and the knowledge information can be more accurately linked to the target entity identifier can be more accurately understood, and the information can be more accurately linked to the target entity identifier can be more accurately understood, and can be more accurately linked to the target object.

It will be appreciated that, although the steps in the flowcharts described above are shown in order as indicated by the arrows, these steps are not necessarily performed in order as indicated by the arrows. The steps are not strictly limited in order unless explicitly stated in the present embodiment, and may be performed in other orders. Moreover, at least some of the steps in the flowcharts described above may include a plurality of steps or stages that are not necessarily performed at the same time but may be performed at different times, and the order of execution of the steps or stages is not necessarily sequential, but may be performed in turn or alternately with at least a portion of the steps or stages in other steps or other steps.

Referring to fig. 11, fig. 11 is a schematic structural diagram of an alternative identifier generating device provided in an embodiment of the present disclosure, where the identifier generating device 1100 includes:

An obtaining module 1101, configured to obtain local masks corresponding to respective candidate objects in the target image, and determine a query mask from the plurality of local masks, where the query mask is used to indicate a selected target object from the plurality of candidate objects;

The feature extraction module 1102 is configured to perform feature extraction based on the query mask and each local mask on the target image, obtain a plurality of mask visual features, and extract mask position features corresponding to the query mask and each local mask;

A first stitching module 1103, configured to stitch each mask visual feature with a corresponding mask position feature to obtain a plurality of region features;

the second stitching module 1104 is configured to extract a first image feature of the target image, stitch the first image feature and the plurality of region features, and obtain a target stitched feature;

a generating module 1105, configured to perform text prediction based on the target stitching feature, and generate a target entity identifier of the target object.

Further, the second splicing module 1104 is specifically configured to:

and splicing the first image features, the first splicing features and the region features corresponding to the query mask to obtain target splicing features.

Further, the second splicing module 1104 is specifically configured to:

Further, the feature extraction module 1102 is specifically configured to:

mapping sub-features of each level in the multi-level pooling features to obtain intermediate features with the same dimensions, and carrying out feature fusion on each intermediate feature to obtain fusion features;

And respectively carrying out multi-layer sensing processing on each fusion feature to obtain a plurality of mask visual features.

Further, the obtaining module 1101 is specifically configured to:

when the prompt marks are marked points, determining inquiry masks in the local masks based on the position relation between the marked points and the local masks;

Further, the target entity identifier is generated by a first large language model, and the identifier generating device further includes a training module (not shown in the figure), where the training module is specifically configured to:

Extracting second image features of the sample image, respectively extracting features of the sample image based on the second mask and each first mask to obtain a plurality of sample visual features, and extracting sample position features corresponding to the second mask and each first mask;

the second image features, the sample visual features and the sample position features are spliced and then input into a first large language model for text prediction, and a prediction probability distribution is generated, wherein the prediction probability distribution is used for determining the entity identification of the sample object;

A first entity identification of a sample entity linked to the sample image is obtained, model loss is determined based on the predictive probability distribution and the first entity identification, and a first large language model is trained based on the model loss, wherein the sample entity is used for indicating a sample object.

Further, the sample image, the second mask, and the first entity identification are all obtained from a dataset, and the training module is further configured to:

acquiring a plurality of candidate entities, and respectively determining link entities corresponding to each original image in the plurality of candidate entities based on each piece of identification information;

Determining a labeling mask of each original image based on each original image and a corresponding query text, wherein the labeling mask is used for indicating a corresponding object of interest;

And storing each original image, the corresponding annotation mask and the corresponding link entity in a data set, wherein the sample image is sampled from a plurality of original images, the second mask is the annotation mask corresponding to the sample image, and the sample entity is the link entity linked by the sample image.

Further, the training module is specifically configured to:

Object detection is carried out based on each original image and the corresponding generalized text respectively, and an original boundary box corresponding to each original image is generated, wherein the original boundary box is used for indicating a corresponding concerned object;

Further, the training module is further configured to:

Acquiring a reference name text corresponding to each link entity, wherein the reference name text is used for indicating the name of the reference entity, and the level of the reference entity in the knowledge base is higher than that of the link entity in the knowledge base;

respectively inputting each original image and corresponding reference name text into a second mask generation model to perform mask prediction, and generating a reference mask corresponding to each original image;

And when the target matching degree is smaller than a preset matching degree threshold value, eliminating the original image corresponding to the target matching degree.

Further, the training module is further configured to:

Further, the training module is specifically configured to:

acquiring a sample name text of a sample entity linked with a sample image, and segmenting the sample name text to obtain a plurality of first segmentation words;

determining the occurrence frequency of each first word in a knowledge base, sequencing each first word based on the sequence of the occurrence frequency from small to large, and determining the first word arranged in the front L bits as a second word, wherein L is a positive integer;

Based on each second word segment, a first entity identification of the sample entity is determined.

The above-mentioned label generating device 1100 and label generating method are based on the same invention conception, through obtaining the local mask corresponding to each candidate object in the target image, and determining the query mask for indicating the target object, and then through feature extraction, obtain the query mask and mask visual features corresponding to each local mask, and extract the mask position features corresponding to the query mask and each local mask, the mask visual features can capture the pixel level visual information of the local area where the corresponding object is located, the mask position features can capture the pixel level visual information of the local area where the corresponding object is located, then splice each mask visual feature with the corresponding mask position features respectively, obtain the area features of each local area, which is equivalent to combining the pixel level visual information of each local area with the corresponding pixel level visual information as pixel level area information, then extract the first image features of the target image, and splice the first image features and the multiple area features as target splice features, then conduct text prediction based on the target splice features, generate the target entity identification of the target object, in the text prediction process, through interaction of each feature in the target splice features, can capture the global visual features and global visual features, and can capture the pixel level visual information of each local area and the corresponding pixel level visual information of each local area as well as the target object, and can also be used as the target entity image, and can improve the accuracy of the target image and the target entity, and the target object, and the target image quality can be predicted by improving the accuracy and the target level visual information, thereby further improving the prediction accuracy of the target entity identification.

The electronic device for performing the above-mentioned identification generation method according to the embodiment of the present disclosure may be a terminal, and referring to fig. 12, fig. 12 is a partial block diagram of the terminal according to the embodiment of the present disclosure, where the terminal includes a camera assembly 1210, a first memory 1220, an input unit 1230, a display unit 1240, a sensor 1250, an audio circuit 1260, a wireless fidelity (WIRELESS FIDELITY, wiFi) module 1270, a first processor 1280, and a first power supply 1290. It will be appreciated by those skilled in the art that the terminal structure shown in fig. 12 is not limiting of the terminal and may include more or fewer components than shown, or may combine certain components, or a different arrangement of components.

The camera assembly 1210 may be used to capture images or video. Optionally, camera assembly 1210 includes a front camera and a rear camera. Typically, the front camera is disposed on the front panel of the terminal and the rear camera is disposed on the rear surface of the terminal. In some embodiments, the at least two rear cameras are any one of a main camera, a depth camera, a wide-angle camera and a tele camera, so as to realize that the main camera and the depth camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize a panoramic shooting and Virtual Reality (VR) shooting function or other fusion shooting functions.

The first memory 1220 may be used to store software programs and modules, and the first processor 1280 performs various functional applications and data processing of the terminal by executing the software programs and modules stored in the first memory 1220.

The input unit 1230 may be used to receive input numerical or character information and generate key signal inputs related to the setting and function control of the terminal. In particular, the input unit 1230 may include a touch panel 1231 and other input devices 1232.

The display unit 1240 may be used to display input information or provided information and various menus of the terminal. The display unit 1240 may include a display panel 1241.

Audio circuitry 1260, speaker 1261, microphone 1262 may provide an audio interface.

The first power source 1290 may be alternating current, direct current, a disposable battery, or a rechargeable battery.

The number of sensors 1250 may be one or more, and the one or more sensors 1250 include, but are not limited to, acceleration sensors, gyroscopic sensors, pressure sensors, optical sensors, and the like. Wherein:

The acceleration sensor may detect the magnitudes of accelerations on three coordinate axes of a coordinate system established with the terminal. For example, an acceleration sensor may be used to detect the components of gravitational acceleration in three coordinate axes. The first processor 1280 may control the display unit 1240 to display the user interface in a lateral view or a longitudinal view according to the gravitational acceleration signal acquired by the acceleration sensor. The acceleration sensor may also be used for the acquisition of motion data of a game or a user.

The gyroscope sensor can detect the body direction and the rotation angle of the terminal, and the gyroscope sensor can be cooperated with the acceleration sensor to collect the 3D action of the user on the terminal. The first processor 1280 may implement functions such as motion sensing (e.g., changing a UI according to a tilting operation of a user), image stabilization at photographing, game control, and inertial navigation according to data collected by a gyro sensor.

The pressure sensor may be disposed at a side frame of the terminal and/or a lower layer of the display unit 1240. When the pressure sensor is disposed at a side frame of the terminal, a grip signal of the terminal by a user may be detected, and left-right hand recognition or shortcut operation is performed by the first processor 1280 according to the grip signal collected by the pressure sensor. When the pressure sensor is disposed at the lower layer of the display unit 1240, the control of the operability control on the UI interface is achieved by the first processor 1280 according to the pressure operation of the user on the display unit 1240. The operability controls include at least one of a button control, a scroll bar control, an icon control, and a menu control.

The optical sensor is used to collect the ambient light intensity. In one embodiment, the first processor 1280 may control the display brightness of the display unit 1240 according to the ambient light intensity collected by the optical sensor. Specifically, the display luminance of the display unit 1240 is turned up when the ambient light intensity is high, and the display luminance of the display unit 1240 is turned down when the ambient light intensity is low. In another embodiment, the first processor 1280 may also dynamically adjust the shooting parameters of the camera assembly 1210 according to the ambient light intensity collected by the optical sensor.

In this embodiment, the first processor 1280 included in the terminal may perform the identification generation method of the previous embodiment.

The electronic device for performing the above-mentioned identifier generating method according to the embodiments of the present disclosure may also be a server, and referring to fig. 13, fig. 13 is a partial block diagram of a server according to the embodiments of the present disclosure, where the server may have a relatively large difference due to different configurations or performances, and may include one or more second processors 1310 and second memories 1330, one or more storage media 1340 (such as one or more mass storage devices) storing application programs 1343 or data 1342. Wherein the second memory 1330 and the storage medium 1340 may be transitory or persistent storage. The program stored on the storage medium 1340 may include one or more modules (not shown), each of which may include a series of instruction operations on a server. Still further, the second processor 1310 may be configured to communicate with a storage medium 1340 and execute a series of instruction operations in the storage medium 1340 on a server.

The server may also include one or more second power supplies 1320, one or more wired or wireless network interfaces 1350, one or more input/output interfaces 1360, and/or one or more operating systems 1341, such as Windows Server, mac OS XTM, unixTM, linuxTM, freeBSDTM, and the like.

A second processor 1310 in the server may be used to perform the identification generation method.

The embodiments of the present disclosure also provide a computer-readable storage medium storing a computer program for executing the identification generation method of the foregoing embodiments.

The disclosed embodiments also provide a computer program product comprising a computer program stored in a computer readable storage medium. The processor of the computer device reads the computer program from the computer-readable storage medium, and the processor executes the computer program to cause the computer device to execute the identification generation method described above.

The terms "first," "second," "third," "fourth," and the like in the description of the present disclosure and in the above-described figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the disclosure described herein may be capable of operation in sequences other than those illustrated or described herein, for example. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.

It should be understood that in this disclosure, "at least one" means one or more, and "a plurality" means two or more. "and/or" is used to describe an association relationship of an associated object, and indicates that three relationships may exist, for example, "a and/or B" may indicate that only a exists, only B exists, and three cases of a and B exist simultaneously, where a and B may be singular or plural. The character "/" generally indicates that the context-dependent object is an "or" relationship. "at least one of" or the like means any combination of these items, including any combination of single item(s) or plural items(s). For example, at least one of a, b or c may represent a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b, c may be single or plural.

It should be understood that in the description of the embodiments of the present disclosure, the meaning of a plurality (or multiple) is two or more, and that greater than, less than, exceeding, etc. is understood to not include the present number, and that greater than, less than, within, etc. is understood to include the present number.

In the several embodiments provided in the present disclosure, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of elements is merely a logical functional division, and there may be additional divisions of actual implementation, e.g., multiple elements or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.

The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in each embodiment of the present disclosure may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.

The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present disclosure may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a storage medium, including several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods of the various embodiments of the present disclosure. The storage medium includes various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory RAM), a magnetic disk, or an optical disk.

It should also be appreciated that the various implementations provided by the embodiments of the present disclosure may be arbitrarily combined to achieve different technical effects.

While the preferred embodiments of the present disclosure have been described in detail, the present disclosure is not limited to the above embodiments, and various equivalent modifications and substitutions can be made by those skilled in the art without departing from the spirit and scope of the present disclosure, and these equivalent modifications and substitutions are intended to be included in the scope of the appended claims.

Claims

Translated fromChinese

1.一种标识生成方法，其特征在于，包括：1. A method for generating an identifier, comprising:

获取目标图像中各个候选对象对应的局部掩模，在多个所述局部掩模中确定查询掩模，其中，所述查询掩模用于指示多个所述候选对象中被选择的目标对象；Acquire a local mask corresponding to each candidate object in the target image, and determine a query mask from the multiple local masks, wherein the query mask is used to indicate a target object selected from the multiple candidate objects;

对所述目标图像分别进行基于所述查询掩模以及各个所述局部掩模的特征提取，得到多个掩模视觉特征，将所述查询掩模以及各个所述局部掩模展平后输入至位置编码器进行映射，得到所述查询掩模以及各个所述局部掩模对应的掩模位置特征；Performing feature extraction based on the query mask and each of the local masks on the target image to obtain a plurality of mask visual features, flattening the query mask and each of the local masks and inputting them into a position encoder for mapping to obtain mask position features corresponding to the query mask and each of the local masks;

将各个所述掩模视觉特征分别与对应的所述掩模位置特征进行拼接，得到多个区域特征；splicing each of the mask visual features with the corresponding mask position features to obtain a plurality of regional features;

提取所述目标图像的第一图像特征，将所述第一图像特征以及多个所述区域特征进行拼接，得到目标拼接特征；Extracting a first image feature of the target image, and splicing the first image feature and a plurality of the regional features to obtain a target splicing feature;

基于所述目标拼接特征进行文本预测，生成所述目标对象的目标实体标识。Text prediction is performed based on the target concatenation features to generate a target entity identifier of the target object.

2.根据权利要求1所述的标识生成方法，其特征在于，所述将所述第一图像特征以及多个所述区域特征进行拼接，得到目标拼接特征，包括：2. The identification generation method according to claim 1, characterized in that the step of splicing the first image feature and the plurality of regional features to obtain a target splicing feature comprises:

分别确定各个所述局部掩模的掩模面积，按照所述掩模面积的大小顺序，将各个所述局部掩模对应的所述区域特征进行拼接，得到第一拼接特征；Determine the mask area of each of the local masks respectively, and splice the regional features corresponding to each of the local masks in order of the size of the mask area to obtain a first splicing feature;

将所述第一图像特征、所述第一拼接特征以及所述查询掩模对应的所述区域特征进行拼接，得到目标拼接特征。The first image feature, the first stitching feature, and the region feature corresponding to the query mask are stitched together to obtain a target stitching feature.

3.根据权利要求2所述的标识生成方法，其特征在于，所述目标实体标识由第一大语言模型生成，所述将所述第一图像特征、所述第一拼接特征以及所述查询掩模对应的所述区域特征进行拼接，得到目标拼接特征，包括：3. The identification generation method according to claim 2, characterized in that the target entity identification is generated by a first large language model, and the first image feature, the first splicing feature and the region feature corresponding to the query mask are spliced to obtain the target splicing feature, comprising:

构建用于提示所述第一大语言模型生成实体标识的提示文本；Constructing a prompt text for prompting the first language model to generate an entity identifier;

提取所述提示文本的文本特征，将所述第一图像特征、所述文本特征、所述第一拼接特征以及所述查询掩模对应的所述区域特征进行拼接，得到目标拼接特征。The text feature of the prompt text is extracted, and the first image feature, the text feature, the first splicing feature and the region feature corresponding to the query mask are spliced to obtain a target splicing feature.

4.根据权利要求1所述的标识生成方法，其特征在于，所述对所述目标图像分别进行基于所述查询掩模以及各个所述局部掩模的特征提取，得到多个掩模视觉特征，包括：4. The identification generation method according to claim 1, characterized in that the feature extraction based on the query mask and each of the local masks is performed on the target image to obtain a plurality of mask visual features, comprising:

对所述目标图像进行多层级特征提取，得到所述目标图像的多层级视觉特征；Performing multi-level feature extraction on the target image to obtain multi-level visual features of the target image;

分别基于所述查询掩模以及各个所述局部掩模，对所述多层级视觉特征进行掩模池化，得到多个多层级池化特征；Based on the query mask and each of the local masks, respectively, mask pooling is performed on the multi-level visual features to obtain a plurality of multi-level pooled features;

分别对各个所述多层级池化特征进行特征融合，得到多个掩模视觉特征。Feature fusion is performed on each of the multi-level pooling features to obtain multiple mask visual features.

5.根据权利要求4所述的标识生成方法，其特征在于，所述分别对各个所述多层级池化特征进行特征融合，得到多个掩模视觉特征，包括：5. The identification generation method according to claim 4, characterized in that the step of fusing the multi-level pooling features to obtain a plurality of mask visual features comprises:

对于任意一个所述多层级池化特征，分别对所述多层级池化特征中各个层级的子特征进行映射，得到多个维度相同的中间特征，将各个所述中间特征进行特征融合，得到融合特征；For any of the multi-level pooling features, sub-features of each level in the multi-level pooling features are mapped respectively to obtain multiple intermediate features of the same dimension, and each of the intermediate features is subjected to feature fusion to obtain a fused feature;

分别对各个所述融合特征进行多层感知处理，得到多个掩模视觉特征。Multi-layer perception processing is performed on each of the fused features to obtain multiple mask visual features.

6.根据权利要求1所述的标识生成方法，其特征在于，所述获取目标图像中各个候选对象对应的局部掩模，在多个所述局部掩模中确定查询掩模，包括：6. The identification generation method according to claim 1, characterized in that the obtaining of local masks corresponding to each candidate object in the target image and determining a query mask from a plurality of the local masks comprises:

获取目标图像以及所述目标图像的提示标记，其中，所述目标图像包括多个候选对象，所述提示标记用于指示所述多个候选对象中被选择的目标对象；Acquire a target image and a prompt mark of the target image, wherein the target image includes a plurality of candidate objects, and the prompt mark is used to indicate a target object selected from the plurality of candidate objects;

对所述目标图像进行分割，得到各个所述候选对象对应的局部掩模，基于所述提示标记在各个所述局部掩模中确定查询掩模。The target image is segmented to obtain local masks corresponding to the candidate objects, and a query mask is determined in each of the local masks based on the prompt mark.

7.根据权利要求6所述的标识生成方法，其特征在于，所述基于所述提示标记在各个所述局部掩模中确定查询掩模，包括：7. The identification generation method according to claim 6, characterized in that the step of determining the query mask in each of the local masks based on the hint mark comprises:

当所述提示标记为标记点时，基于所述标记点与各个所述局部掩模之间的位置关系，在各个所述局部掩模中确定查询掩模；When the hint mark is a mark point, determining a query mask in each of the local masks based on a positional relationship between the mark point and each of the local masks;

或者，当所述提示标记为标记框时，基于所述标记框与各个所述局部掩模的掩模边界之间的匹配程度，在各个所述局部掩模中确定查询掩模；Alternatively, when the hint mark is a mark box, determining the query mask in each of the local masks based on a matching degree between the mark box and a mask boundary of each of the local masks;

或者，当所述提示标记为标记区域时，基于所述标记区域与各个所述局部掩模之间的匹配程度，在各个所述局部掩模中确定查询掩模。Alternatively, when the hint mark is a marked area, a query mask is determined in each of the local masks based on a matching degree between the marked area and each of the local masks.

8.根据权利要求1所述的标识生成方法，其特征在于，所述目标实体标识由第一大语言模型生成，所述第一大语言模型通过以下步骤训练得到：8. The identifier generation method according to claim 1, characterized in that the target entity identifier is generated by a first large language model, and the first large language model is trained by the following steps:

获取样本图像以及所述样本图像中样本对象对应的第二掩模，对所述样本图像进行分割，得到所述样本图像中各个视觉对象对应的第一掩模，其中，所述样本对象为多个所述视觉对象中的一个对象；Acquire a sample image and a second mask corresponding to a sample object in the sample image, segment the sample image, and obtain a first mask corresponding to each visual object in the sample image, wherein the sample object is one of the multiple visual objects;

提取所述样本图像的第二图像特征，对所述样本图像分别进行基于所述第二掩模以及各个所述第一掩模的特征提取，得到多个样本视觉特征，提取所述第二掩模以及各个所述第一掩模对应的样本位置特征；Extracting a second image feature of the sample image, performing feature extraction based on the second mask and each of the first masks on the sample image to obtain a plurality of sample visual features, and extracting sample position features corresponding to the second mask and each of the first masks;

将所述第二图像特征、所述样本视觉特征以及所述样本位置特征拼接后输入至所述第一大语言模型进行文本预测，生成预测概率分布，其中，所述预测概率分布用于确定所述样本对象的实体标识；splicing the second image feature, the sample visual feature, and the sample position feature and inputting them into the first language model for text prediction to generate a prediction probability distribution, wherein the prediction probability distribution is used to determine the entity identifier of the sample object;

获取所述样本图像所链接的样本实体的第一实体标识，基于所述预测概率分布与所述第一实体标识确定模型损失，基于所述模型损失训练所述第一大语言模型，其中，所述样本实体用于指示所述样本对象。Obtain a first entity identifier of a sample entity linked to the sample image, determine a model loss based on the predicted probability distribution and the first entity identifier, and train the first large language model based on the model loss, wherein the sample entity is used to indicate the sample object.

9.根据权利要求8所述的标识生成方法，其特征在于，所述样本图像、所述第二掩模以及所述第一实体标识均从数据集中获取，所述获取样本图像以及所述样本图像中样本对象对应的第二掩模之前，所述标识生成方法还包括：9. The identification generation method according to claim 8, characterized in that the sample image, the second mask and the first entity identification are all obtained from a data set, and before obtaining the sample image and the second mask corresponding to the sample object in the sample image, the identification generation method further comprises:

获取多个原始图像以及各个所述原始图像对应的查询文本，根据各个所述原始图像以及对应的所述查询文本，分别确定各个所述原始图像对应的识别信息，其中，所述查询文本用于提示识别出对应的所述原始图像中的关注对象；Acquire multiple original images and query texts corresponding to each of the original images, and determine identification information corresponding to each of the original images according to each of the original images and the corresponding query texts, wherein the query texts are used to prompt identification of the object of interest in the corresponding original image;

获取多个候选实体，基于各个所述识别信息，分别在多个所述候选实体中确定各个所述原始图像对应的链接实体；Acquire a plurality of candidate entities, and determine, based on the respective identification information, link entities corresponding to the respective original images from the plurality of candidate entities;

基于各个所述原始图像以及对应的所述查询文本，分别确定各个所述原始图像的标注掩模，其中，所述标注掩模用于指示对应的所述关注对象；Based on each of the original images and the corresponding query text, respectively determine a labeling mask for each of the original images, wherein the labeling mask is used to indicate the corresponding object of interest;

将各个所述原始图像、对应的所述标注掩模以及对应的所述链接实体关联存储至所述数据集，其中，所述样本图像从多个所述原始图像中采样得到，所述第二掩模为所述样本图像对应的所述标注掩模，所述样本实体为所述样本图像所链接的所述链接实体。Each of the original images, the corresponding annotation mask, and the corresponding linked entity are associated and stored in the data set, wherein the sample image is sampled from multiple original images, the second mask is the annotation mask corresponding to the sample image, and the sample entity is the linked entity linked to the sample image.

10.根据权利要求9所述的标识生成方法，其特征在于，所述基于各个所述原始图像以及对应的所述查询文本，分别确定各个所述原始图像的标注掩模，包括：10. The identification generation method according to claim 9, characterized in that the step of determining the annotation mask of each of the original images based on each of the original images and the corresponding query text comprises:

将各个所述查询文本分别输入至第二大语言模型进行文本预测，生成各个所述原始图像对应的概括文本；Inputting each of the query texts into the second largest language model for text prediction to generate summary texts corresponding to each of the original images;

分别基于各个所述原始图像和对应的所述概括文本进行对象检测，生成各个所述原始图像对应的原始边界框，其中，所述原始边界框用于指示对应的所述关注对象；Performing object detection based on each of the original images and the corresponding summarized text, respectively, to generate an original bounding box corresponding to each of the original images, wherein the original bounding box is used to indicate the corresponding object of interest;

分别将各个所述原始图像和对应的所述原始边界框输入至第一掩模生成模型进行掩模预测，生成各个所述原始图像对应的标注掩模。Each of the original images and the corresponding original bounding boxes are respectively input into a first mask generation model for mask prediction to generate a labeled mask corresponding to each of the original images.

11.根据权利要求10所述的标识生成方法，其特征在于，所述分别将各个所述原始图像和对应的所述原始边界框输入至第一掩模生成模型进行掩模预测，生成各个所述原始图像对应的标注掩模之后，所述标识生成方法还包括：11. The method for generating a marker according to claim 10, characterized in that, after inputting each of the original images and the corresponding original bounding boxes into a first mask generation model for mask prediction and generating a label mask corresponding to each of the original images, the method for generating a marker further comprises:

获取各个所述链接实体对应的参考名称文本，其中，所述参考名称文本用于指示参考实体的名称，所述参考实体在知识库中的层级高于所述链接实体在所述知识库中的层级；Acquire a reference name text corresponding to each of the link entities, wherein the reference name text is used to indicate the name of the reference entity, and the level of the reference entity in the knowledge base is higher than the level of the link entity in the knowledge base;

分别将各个所述原始图像和对应的所述参考名称文本输入至第二掩模生成模型进行掩模预测，生成各个所述原始图像对应的参考掩模；Inputting each of the original images and the corresponding reference name text into a second mask generation model for mask prediction to generate a reference mask corresponding to each of the original images;

确定各个所述标注掩模与对应的所述参考掩模之间的匹配程度，得到各个所述标注掩模对应的目标匹配度；Determining the matching degree between each of the annotated masks and the corresponding reference mask to obtain the target matching degree corresponding to each of the annotated masks;

当所述目标匹配度小于预设的匹配度阈值时，剔除所述目标匹配度对应的所述原始图像。When the target matching degree is less than a preset matching degree threshold, the original image corresponding to the target matching degree is discarded.

12.根据权利要求9所述的标识生成方法，其特征在于，所述将各个所述原始图像、对应的所述标注掩模以及对应的所述链接实体关联存储至所述数据集之前，所述标识生成方法还包括：12. The identification generation method according to claim 9, characterized in that before associating and storing each of the original images, the corresponding annotation masks, and the corresponding link entities in the data set, the identification generation method further comprises:

统计各个所述标注掩模中连通区域的数量，得到各个所述标注掩模对应的区域数量；Counting the number of connected regions in each of the annotation masks to obtain the number of regions corresponding to each of the annotation masks;

当所述区域数量大于预设的数量阈值时，剔除所述区域数量对应的所述原始图像。When the number of regions is greater than a preset number threshold, the original image corresponding to the number of regions is discarded.

13.根据权利要求8所述的标识生成方法，其特征在于，所述获取所述样本图像所链接的样本实体的第一实体标识，包括：13. The identification generation method according to claim 8, wherein obtaining the first entity identification of the sample entity linked to the sample image comprises:

获取所述样本图像所链接的样本实体的样本名称文本，对所述样本名称文本进行分词得到多个第一分词；Acquire a sample name text of a sample entity linked to the sample image, and segment the sample name text to obtain a plurality of first segmented words;

确定各个所述第一分词在知识库中的出现频率，基于各个所述出现频率由小至大的顺序，对各个所述第一分词进行排序，将排列在前L位的所述第一分词确定为第二分词，其中，所述L为正整数；Determine the occurrence frequency of each of the first participles in the knowledge base, sort each of the first participles based on the order of the occurrence frequency from small to large, and determine the first participles arranged in the first L positions as the second participles, where L is a positive integer;

基于各个所述第二分词，确定所述样本实体的第一实体标识。Based on each of the second participles, a first entity identifier of the sample entity is determined.

14.一种标识生成装置，其特征在于，包括：14. A device for generating an identifier, comprising:

获取模块，用于获取目标图像中各个候选对象对应的局部掩模，在多个所述局部掩模中确定查询掩模，其中，所述查询掩模用于指示多个所述候选对象中被选择的目标对象；An acquisition module, configured to acquire a local mask corresponding to each candidate object in a target image, and determine a query mask from a plurality of the local masks, wherein the query mask is used to indicate a target object selected from the plurality of candidate objects;

特征提取模块，用于对所述目标图像分别进行基于所述查询掩模以及各个所述局部掩模的特征提取，得到多个掩模视觉特征，将所述查询掩模以及各个所述局部掩模展平后输入至位置编码器进行映射，得到所述查询掩模以及各个所述局部掩模对应的掩模位置特征；A feature extraction module, configured to extract features of the target image based on the query mask and each of the local masks, obtain a plurality of mask visual features, flatten the query mask and each of the local masks, and input the flattened features into a position encoder for mapping, to obtain mask position features corresponding to the query mask and each of the local masks;

第一拼接模块，用于将各个所述掩模视觉特征分别与对应的所述掩模位置特征进行拼接，得到多个区域特征；A first splicing module, used for splicing each of the mask visual features with the corresponding mask position features to obtain a plurality of regional features;

第二拼接模块，用于提取所述目标图像的第一图像特征，将所述第一图像特征以及多个所述区域特征进行拼接，得到目标拼接特征；A second splicing module is used to extract a first image feature of the target image, and splice the first image feature and a plurality of the regional features to obtain a target splicing feature;

生成模块，用于基于所述目标拼接特征进行文本预测，生成所述目标对象的目标实体标识。A generation module is used to perform text prediction based on the target splicing features and generate a target entity identifier of the target object.

15.一种电子设备，包括存储器和处理器，所述存储器存储有计算机程序，其特征在于，所述处理器执行所述计算机程序时实现权利要求1至13任意一项所述的标识生成方法。15. An electronic device comprising a memory and a processor, wherein the memory stores a computer program, wherein the processor implements the identification generation method according to any one of claims 1 to 13 when executing the computer program.

16.一种计算机可读存储介质，所述存储介质存储有计算机程序，其特征在于，所述计算机程序被处理器执行时实现权利要求1至13任意一项所述的标识生成方法。16. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the identification generation method according to any one of claims 1 to 13.

17.一种计算机程序产品，包括计算机程序，其特征在于，所述计算机程序被处理器执行时实现权利要求1至13任意一项所述的标识生成方法。17. A computer program product, comprising a computer program, characterized in that when the computer program is executed by a processor, the identification generation method according to any one of claims 1 to 13 is implemented.