
Target area determination method, device, equipment and storage medium

Info

Publication number
CN113705562A
CN113705562A (application CN202110234692.6A)
Authority
CN
China
Prior art keywords
sample, depth, image, data, color
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110234692.6A
Other languages
Chinese (zh)
Inventor
冀炜
余双
马锴
郑冶枫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202110234692.6A
Publication of CN113705562A
Status: Pending

Abstract

The application relates to the technical field of image processing, and discloses a target area determination method, apparatus, device, and storage medium. The method comprises the following steps: acquiring a first image; obtaining first estimated depth data based on the first color image data; obtaining first calibration depth data based on the first estimated depth data and the first depth image data; performing weighting processing based on the first color image data and the first calibration depth data to obtain a first fusion feature map; and determining a target area corresponding to the first image based on the first fusion feature map. According to the scheme, the depth information in the first image is estimated from the color image, the depth image corresponding to the first image is corrected, the target area corresponding to the first image is obtained according to the corrected depth image data and the color image data, and the accuracy of determining the target area is improved.

Description

Target area determination method, device, equipment and storage medium
Technical Field
The present application relates to the field of image processing, and in particular, to a method, an apparatus, a device, and a storage medium for determining a target area.
Background
Salient object detection refers to a family of related techniques that process an image to detect the most important target area in the image and segment that area, so as to improve the recognition accuracy of the image.
In the related art, salient object detection based on RGB images achieves good detection results, but its performance is still limited when an object looks similar to the surrounding environment or the background scene is severely cluttered. As a complement, RGB-D salient object detection adds depth data; because of the rich spatial structure and 3D layout information embedded in the depth map, the performance of the model in challenging scenes is greatly improved.
In the above technical solution, however, the depth map is often of low quality and may contain a large amount of noise and misleading information, which results in poor target detection based on the RGB-D image.
Disclosure of Invention
The embodiment of the application provides a method, a device, equipment and a storage medium for determining a target area, which can improve the accuracy of determining the target area, and the technical scheme is as follows:
in one aspect, a method for determining a target area is provided, where the method includes:
acquiring a first image; the first image comprises first color image data and first depth image data;
obtaining first estimated depth data based on the first color image data; the first estimated depth data is used to indicate depth information to which the first color image data corresponds;
obtaining first calibration depth data based on the first estimated depth data and the first depth image data;
performing weighting processing based on the first color image data and the first calibration depth data to obtain a first fusion feature map;
and determining a target area corresponding to the first image based on the first fusion feature map.
In another aspect, a target area determination method is provided, and the method includes:
acquiring a third sample image; the third sample image includes third sample color image data and third sample depth image data;
obtaining third sample estimated depth data based on the third sample color image data; the third sample estimated depth data is used to indicate depth information to which the third sample color image data corresponds;
obtaining third sample calibration depth data based on the third sample estimated depth data and the third sample depth image data;
based on the third sample color image data and the third sample calibration depth data, performing weighting processing through a feature fusion branch in a target area determination model to obtain a third sample fusion feature map;
training a target area determination model based on the third sample fusion feature map and a target area corresponding to the third sample image;
the trained target area determination model is used for processing color image data corresponding to the first image and calibration depth data corresponding to the first image to obtain a target area corresponding to the first image.
In still another aspect, an apparatus for determining a target area is provided, the apparatus comprising:
the first image acquisition module is used for acquiring a first image; the first image comprises first color image data and first depth image data;
an estimated depth obtaining module, configured to obtain first estimated depth data based on the first color image data; the first estimated depth data is used to indicate depth information to which the first color image data corresponds;
a calibration depth obtaining module, configured to obtain first calibration depth data based on the first estimated depth data and the first depth image data;
a fusion feature obtaining module, configured to perform weighting processing based on the first color image data and the first calibration depth data to obtain a first fusion feature map;
and the target area determining module is used for determining a target area corresponding to the first image based on the first fusion feature map.
In one possible implementation, the apparatus further includes:
the confidence coefficient acquisition module is used for acquiring the confidence coefficient corresponding to the first depth image data based on the first depth image data; the confidence corresponding to the first depth image data is used for indicating the accuracy of the image data corresponding to the target area in the first depth image data;
the calibration depth acquisition module is further configured to,
and performing weighting processing based on the confidence degree corresponding to the first depth image data on the first estimated depth data and the first depth image data to obtain the first calibration depth data.
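As a concrete illustration of this confidence-based weighting, a minimal sketch is given below; the blending formula is an assumption introduced for illustration, since the application does not state an exact expression here.

```python
import torch

def calibrate_depth(raw_depth: torch.Tensor,
                    estimated_depth: torch.Tensor,
                    confidence: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch of the confidence-based weighting: trust the sensor
    depth where its confidence is high, and fall back to the depth estimated
    from the color image elsewhere. Confidence values are assumed in [0, 1]."""
    return confidence * raw_depth + (1.0 - confidence) * estimated_depth
```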
In one possible implementation, the confidence level obtaining module is further configured to,
processing the first depth image data through a confidence coefficient discrimination model based on the first depth image data to obtain a confidence coefficient corresponding to the first depth image data;
the estimated depth obtaining module is further configured to,
based on the first color image data, performing data processing through a depth estimation model to obtain first estimated depth data;
the confidence coefficient distinguishing model is a machine learning model trained by taking a first sample image as a sample and taking a confidence type corresponding to the first sample image as a label;
the depth estimation model is a machine learning model trained by taking a second sample image as a sample and taking depth image data corresponding to the second sample image as a label; the second sample image is a sample image whose confidence satisfies a first specified condition.
In one possible implementation, the apparatus further includes:
the first sample set acquisition module is used for acquiring a first training sample set; the first training sample set comprises a first sample image and a confidence type corresponding to the first sample image;
a first confidence probability obtaining module, configured to perform data processing through the confidence discrimination model based on the first sample image to obtain a confidence probability corresponding to the first sample image; the confidence probability is indicative of a probability that the first sample image is a positive sample;
and the confidence discrimination model training module is used for training the confidence discrimination model based on the confidence probability corresponding to the first sample image and the confidence type corresponding to the first sample image.
In one possible implementation manner, the first sample set obtaining module includes:
a second sample set obtaining submodule for obtaining a second training sample set; the second training sample set comprises training sample images and target areas corresponding to the training sample images; the training sample image comprises training color sample data and training depth sample data;
the color prediction region acquisition submodule is used for determining a color image processing branch in a model through a target region, processing the training color sample data and acquiring a prediction region corresponding to the training color sample data;
the depth prediction region acquisition submodule is used for determining a depth image processing branch in a model through the target region, processing the training depth sample data and acquiring a prediction region corresponding to the training depth sample data;
a confidence score obtaining submodule for determining a confidence score of the training sample image based on the prediction region corresponding to the training color sample data, the prediction region corresponding to the training depth sample data and the target region corresponding to the training sample image;
a first sample image determining submodule, configured to determine a confidence type of the training sample image in response to that a confidence score of the training sample image satisfies a specified condition, and determine the training sample image as the first sample image;
the color image processing branch in the target area determination model is a machine learning model obtained by pre-training by taking a sample color image as a sample and taking a target area corresponding to the sample color image as a label;
the depth image processing branch in the target area determination model is a machine learning model obtained by pre-training with a sample depth image as a sample and a target area corresponding to the sample depth image as an annotation.
In one possible implementation, the confidence scores include a color confidence score and a depth confidence score;
the confidence score obtaining sub-module comprises:
the color confidence score acquisition unit is used for determining the color confidence score corresponding to the training sample image based on the degree of overlap between the prediction region corresponding to the training color sample data and the target region corresponding to the training sample image;
and the depth confidence score acquisition unit is used for determining the depth confidence score corresponding to the training sample image based on the degree of overlap between the prediction region corresponding to the training depth sample data and the target region corresponding to the training sample image.
In a possible implementation manner, the fused feature obtaining module includes:
the attention weighting submodule is used for carrying out weighting processing based on an attention mechanism through a feature fusion branch in a target area determination model based on the first color image data and the first calibration depth data to obtain a first fusion feature map;
the device further comprises:
the color feature map acquisition module is used for carrying out data processing through a color image processing branch in the target area determination model based on the first color image data to obtain a first color feature map;
the depth feature map acquisition module is used for performing data processing through a depth image processing branch in the target area determination model based on the first depth image data to obtain a first depth feature map;
the target area determination module is further configured to,
and determining a target area corresponding to the first image based on the first fusion feature map, the first depth feature map and the first color feature map.
The target area determination model is a machine learning model obtained by training with a third sample image as a sample and a target area corresponding to the third sample image as an annotation.
In one possible implementation, the feature fusion branch includes a first pooling layer, a second pooling layer, a first fully-connected layer, and a second fully-connected layer;
the fusion feature obtaining module includes:
the first pooling submodule is used for carrying out global pooling through a first pooling layer based on the first color image data to obtain first color pooling data;
the first full-connection submodule is used for carrying out data processing through a first full-connection layer based on the first color pooling data to obtain a first color vector;
the second pooling sub-module is used for carrying out global pooling through a second pooling layer based on the first depth image data to obtain first depth pooling data;
the second full-connection submodule is used for carrying out data processing through a second full-connection layer based on the first depth pooling data to obtain a first depth vector;
a fusion feature obtaining sub-module, configured to perform channel attention weighting processing through a first color vector and a first depth vector based on the first color image data and the first calibration depth data, to obtain the first fusion feature map; the first color vector is used for indicating the weight corresponding to the first color image data; the first depth vector is used to indicate a weight to which the first depth image data corresponds.
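As a rough illustration of such a feature fusion branch, the following PyTorch-style sketch pools each modality globally, maps the pooled vectors through fully-connected layers to per-channel weights, and re-weights the two feature maps before fusing them. The layer sizes and exact wiring are our own assumptions, not code from the application.

```python
import torch
import torch.nn as nn

class ChannelAttentionFusion(nn.Module):
    """Sketch of a feature fusion branch: global pooling plus fully-connected
    layers produce per-channel weights for the color and depth features."""

    def __init__(self, channels: int):
        super().__init__()
        self.pool_rgb = nn.AdaptiveAvgPool2d(1)    # first pooling layer
        self.pool_depth = nn.AdaptiveAvgPool2d(1)  # second pooling layer
        self.fc_rgb = nn.Sequential(               # first fully-connected layer
            nn.Linear(channels, channels), nn.Sigmoid())
        self.fc_depth = nn.Sequential(             # second fully-connected layer
            nn.Linear(channels, channels), nn.Sigmoid())

    def forward(self, rgb_feat: torch.Tensor, depth_feat: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = rgb_feat.shape
        # color vector and depth vector: per-channel weights for each modality
        w_rgb = self.fc_rgb(self.pool_rgb(rgb_feat).view(b, c)).view(b, c, 1, 1)
        w_depth = self.fc_depth(self.pool_depth(depth_feat).view(b, c)).view(b, c, 1, 1)
        # channel attention weighting: each modality is re-weighted, then fused
        return w_rgb * rgb_feat + w_depth * depth_feat
```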
In one possible implementation, the apparatus further includes:
the third image acquisition module is used for acquiring a third sample image; the third sample image includes third sample color image data and third sample depth image data;
a third estimation data obtaining module, configured to obtain third sample estimation depth data based on the third sample color image data;
a third calibration data obtaining module, configured to obtain third sample calibration depth data based on the third sample estimated depth data and the third sample depth image data;
a third fused feature obtaining module, configured to perform weighting processing on a feature fusion branch in a target area determination model based on the third sample color image data and the third sample calibration depth data, to obtain a third sample fused feature map;
and the region determination model training module is used for training the target region determination model based on the third sample fusion feature map and the target region corresponding to the third sample image.
In one possible implementation, the apparatus further includes:
the third color characteristic acquisition module is used for carrying out data processing through a color image processing branch in the target area determination model based on the third sample color image data to obtain a third sample color characteristic diagram;
a third depth feature obtaining module, configured to perform data processing through a depth image processing branch in the target region determination model based on the third sample depth image data, so as to obtain a third sample depth feature map;
the region determination model training module is further configured to,
and training the target area determination model based on the third sample color feature map, the third sample depth feature map, the third sample fusion feature map and the target area corresponding to the third sample image.
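One plausible way to combine the three branches with the annotated target region during training is to supervise a prediction decoded from each of the color, depth, and fusion feature maps and sum the losses. This is our own assumption for illustration; the loss type (binary cross-entropy) and the plain sum are not specified by the application.

```python
import torch
import torch.nn.functional as F

def training_loss(color_pred: torch.Tensor,
                  depth_pred: torch.Tensor,
                  fused_pred: torch.Tensor,
                  target_region: torch.Tensor) -> torch.Tensor:
    """Assumed multi-branch supervision: each branch's predicted region
    (decoded from its feature map) is compared with the annotated target
    region, and the three losses are summed."""
    return (F.binary_cross_entropy_with_logits(color_pred, target_region)
            + F.binary_cross_entropy_with_logits(depth_pred, target_region)
            + F.binary_cross_entropy_with_logits(fused_pred, target_region))
```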
In yet another aspect, a target area determination apparatus is provided, the apparatus comprising:
the third sample image acquisition module is used for acquiring a third sample image; the third sample image includes third sample color image data and third sample depth image data;
a third sample estimation obtaining module, configured to obtain third sample estimation depth data based on the third sample color image data; the third sample estimated depth data is used to indicate depth information to which the third sample color image data corresponds;
a third sample calibration acquisition module, configured to obtain third sample calibration depth data based on the third sample estimated depth data and the third sample depth image data;
a third sample fusion feature obtaining module, configured to perform weighting processing on a feature fusion branch in a target area determination model based on the third sample color image data and the third sample calibration depth data, to obtain a third sample fusion feature map;
the region determination model training module is used for training the target region determination model based on the third sample fusion feature map and a target region corresponding to the third sample image;
the trained target area determination model is used for processing color image data corresponding to the first image and calibration depth data corresponding to the first image to obtain a target area corresponding to the first image.
In yet another aspect, a computer-readable storage medium is provided, in which at least one instruction, at least one program, a set of codes, or a set of instructions is stored, which is loaded and executed by a processor to implement the above-mentioned target area determination method.
In yet another aspect, a computer program product or computer program is provided that includes computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to execute the target area determination method.
The beneficial effects brought by the technical scheme provided by the embodiment of the application at least comprise:
obtaining estimated depth data corresponding to the first image through first color image data in the first image, correcting the first depth image data corresponding to the first image according to the estimated depth data to obtain calibrated depth data, fusing the calibrated depth data and the color image data, and determining a target area according to the resulting fusion feature map. According to the scheme, the depth information corresponding to the first image is estimated through the color image, the depth image corresponding to the first image is corrected, the target area corresponding to the first image is obtained according to the corrected depth image data and the color image data, and the accuracy of determining the target area is improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
FIG. 1 illustrates a schematic diagram of a computer system provided by an exemplary embodiment of the present application;
FIG. 2 is a flow diagram illustrating a target area determination method according to an exemplary embodiment;
FIG. 3 is a flow diagram illustrating a target area determination method according to an exemplary embodiment;
FIG. 4 is a method flow diagram illustrating a target area determination method in accordance with an exemplary embodiment;
FIG. 5 is a diagram illustrating an RGB-D image channel according to the embodiment shown in FIG. 4;
FIG. 6 is a schematic view of a channel attention weighting scheme according to the embodiment of FIG. 4;
FIG. 7 is a schematic diagram illustrating cross-modal feature fusion according to the embodiment shown in FIG. 4;
FIG. 8 is a diagram illustrating a model network framework to which the embodiment shown in FIG. 4 relates;
FIG. 9 is a block flow diagram illustrating a target area determination method in accordance with an exemplary embodiment;
FIG. 10 is a block diagram illustrating a structure of a target area determining apparatus according to an exemplary embodiment;
FIG. 11 is a block diagram illustrating a structure of a target area determining apparatus according to an exemplary embodiment;
FIG. 12 is a block diagram illustrating a computer device according to an example embodiment.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
First, terms related to embodiments of the present application will be described.
1) Artificial Intelligence (AI)
Artificial intelligence is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the capabilities of perception, reasoning, and decision making.
Artificial intelligence is a comprehensive discipline that covers a wide range of fields, involving both hardware-level and software-level technologies. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly comprises computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
2) Computer Vision (Computer Vision, CV)
Computer vision is a science that studies how to make a machine "see"; that is, it uses cameras and computers in place of human eyes to perform machine vision tasks such as identification, tracking, and measurement of a target, and further performs image processing so that the result becomes an image more suitable for human observation or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technology generally includes image processing, image recognition, image semantic understanding, image retrieval, Optical Character Recognition (OCR), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping, and other technologies, and also includes common biometric technologies such as face recognition and fingerprint recognition.
3) Machine Learning (Machine Learning, ML)
Machine learning is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specifically studies how a computer can simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve its performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied across all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and teaching learning.
4) Depth image (RGB-D, Red Green Blue-Depth)
In 3D computer graphics, the RGB color scheme is an industry color standard: various colors are obtained by varying the three color channels of red (R), green (G), and blue (B) and superimposing them on each other, where RGB represents the colors of the red, green, and blue channels. This color standard covers almost all colors that can be perceived by human vision and is one of the most widely used color systems. The RGB-D image scheme adds depth map information on top of the RGB color scheme. A Depth Map is an image or image channel containing information about the distance from a viewpoint to the surfaces of scene objects. The Depth Map is similar to a grayscale image, except that each pixel value is the actual distance from the sensor to the object. Usually, the RGB image and the depth image are registered, so that there is a one-to-one correspondence between their pixels.
The target area determination method provided by the embodiments of the application can be applied to computer equipment with strong data processing capability. In one possible implementation, the target area determination method provided by the embodiments of the application can be applied to a personal computer, a workstation, or a server. In one possible implementation, the training of the confidence discrimination model, the depth estimation model, and the target region determination model according to the embodiments of the present application is performed by a personal computer, a workstation, or a server. In a possible implementation manner, the confidence discrimination model, the depth estimation model, and the target region determination model trained by the training method provided by the embodiments of the application can be applied to application scenarios requiring salient object detection, so that the target region corresponding to an RGB-D image is determined from the acquired RGB-D image, and the accuracy of determining the target region in the RGB-D image is improved.
Referring to FIG. 1, a schematic diagram of a computer system provided by an exemplary embodiment of the present application is shown. The computer system includes a terminal 110 and a server 120, wherein the terminal 110 and the server 120 perform data communication through a communication network; optionally, the communication network may be a wired network or a wireless network, and the communication network may be at least one of a local area network, a metropolitan area network, and a wide area network.
The terminal 110 has an application program with an image processing function installed therein, and the application program may be a virtual reality application program, a game application program, an image processing program, or an Artificial Intelligence (AI) application program with an image processing function, which is not limited in this embodiment.
Optionally, the terminal 110 may be a mobile terminal such as a smart phone, a tablet computer, a laptop portable notebook computer, or the like, or may be a terminal such as a desktop computer, a projection computer, or the like, or an intelligent terminal having an RGB-D image acquisition component and a data processing component, which is not limited in this embodiment of the present application.
The server 120 may be implemented as one server, or may be implemented as a server cluster formed by a group of servers, which may be physical servers or cloud servers. In one possible implementation, the server 120 is a backend server for the application in the terminal 110.
In a possible implementation manner of this embodiment, the server 120 trains the target area determination model through a preset training sample set, where the training sample set may include RGB-D sample images. After the server 120 completes the training process of the target area determination model, the trained target area determination model is sent to the terminal 110 through a wired or wireless connection. The terminal 110 receives the trained target area determination model and inputs the data information corresponding to the target area determination model into an application program with an image processing function, so that when a user calls the image processing function using the application program, target area determination can be performed according to the trained target area determination model, so as to implement all or part of the steps of the target area determination function.
Fig. 2 is a flowchart illustrating a target area determination method according to an exemplary embodiment. The method may be performed by a computer device, which may be an image processing device, wherein the image processing device may be the terminal 110 in the embodiment illustrated in fig. 1 described above. As shown in fig. 2, the flow of the target area determination method may include the following steps:
step 201, acquiring a first image; the first image includes first color image data and first depth image data.
Step 202, obtaining first estimated depth data based on the first color image data; the first estimated depth data is used for indicating depth information corresponding to the first color image data.
Step 203, obtaining first calibration depth data based on the first estimated depth data and the first depth image data.
Step 204, performing weighting processing based on the first color image data and the first calibration depth data to obtain a first fusion feature map.
Step 205, determining a target area corresponding to the first image based on the first fusion feature map.
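The five steps above can be pictured as the following high-level sketch. The function names and model interfaces (depth_estimator, fusion_model.fuse, fusion_model.predict) are placeholders introduced for illustration, and the confidence-weighted blend in step 203 is an assumption consistent with the confidence-based weighting described later.

```python
import torch

def determine_target_area(rgb: torch.Tensor,
                          depth: torch.Tensor,
                          confidence: torch.Tensor,
                          depth_estimator,
                          fusion_model) -> torch.Tensor:
    """High-level sketch of steps 201-205; depth_estimator and fusion_model
    are hypothetical callables standing in for the models described later."""
    estimated_depth = depth_estimator(rgb)                        # step 202
    # step 203: calibrate the sensor depth with the estimated depth
    # (the confidence-weighted blend is our assumption, see step 403 below)
    calibrated = confidence * depth + (1 - confidence) * estimated_depth
    fused = fusion_model.fuse(rgb, calibrated)                    # step 204
    return fusion_model.predict(fused)                            # step 205
```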
To sum up, in the scheme shown in the embodiment of the present application, estimated depth data corresponding to a first image is obtained through first color image data in the first image, the first depth image data corresponding to the first image is corrected according to the estimated depth data to obtain calibrated depth data, the calibrated depth data and the color image data are fused, and a target area is determined according to a fused feature map. According to the scheme, the depth information corresponding to the first image is estimated through the color image, the depth image corresponding to the first image is corrected, the target area corresponding to the first image is obtained according to the corrected depth image data and the color image data, and the accuracy of determining the target area is improved.
Fig. 3 is a flowchart illustrating a target area determination method according to an exemplary embodiment. The method may be performed by a computer device, which may be a model training device, wherein the model training device may be the server 120 in the embodiment illustrated in FIG. 1 described above. As shown in fig. 3, the flow of the target area determination method may include the following steps:
step 301, acquiring a third sample image; the third sample image includes third sample color image data and third sample depth image data.
Step 302, obtaining third sample estimated depth data based on the third sample color image data.
Step 303, obtaining third sample calibration depth data based on the third sample estimated depth data and the third sample depth image data.
Step 304, based on the third sample color image data and the third sample calibration depth data, performing weighting processing through a feature fusion branch in the target area determination model to obtain a third sample fusion feature map.
Step 305, training a target area determination model based on the third sample fusion feature map and the target area corresponding to the third sample image.
The trained target area determination model is used for processing color image data corresponding to the first image and calibration depth data corresponding to the first image to obtain a target area corresponding to the first image.
To sum up, in the scheme shown in the embodiment of the present application, estimated depth data corresponding to a first image is obtained through first color image data in the first image, the first depth image data corresponding to the first image is corrected according to the estimated depth data to obtain calibrated depth data, the calibrated depth data and the color image data are fused, and a target area is determined according to a fused feature map. According to the scheme, the depth information corresponding to the first image is estimated through the color image, the depth image corresponding to the first image is corrected, the target area corresponding to the first image is obtained according to the corrected depth image data and the color image data, and the accuracy of determining the target area is improved.
Fig. 4 is a method flow diagram illustrating a target area determination method according to an example embodiment. The method may be performed by a model processing device, which may be the server 120 in the above-described embodiment shown in fig. 1, together with an image processing device, which may be the terminal 110 in the above-described embodiment shown in fig. 1. As shown in fig. 4, the flow of the target area determination method may include the following steps:
step 401, a first image is acquired.
The first image comprises first color image data and first depth image data.
In a possible implementation manner, the first image is an RGB-D image, that is, the first image includes an RGB image and a Depth Map (Depth Map), that is, the first color image data is image data corresponding to the RGB image included in the first image, and the first Depth image data is image data corresponding to the Depth Map included in the first image.
In one possible implementation, at least two image channels exist in the first image, and the at least two image channels include a color (RGB) image channel and a Depth (Depth) image channel.
Please refer to fig. 5, which illustrates an RGB-D image channel diagram according to an embodiment of the present application. As shown in fig. 5, when the first image is an RGB-D image, the color image data corresponding to the RGB image 501 in the first image is shown as part 502 in fig. 5; in the color image data 502, there are three image channels, which correspond to the pixel data of the three primary colors red, green, and blue, respectively. The depth image data corresponding to the depth image 503 in the first image is shown as part 504 in fig. 5; in the depth image data 504, there is one depth image channel, and each value in the depth image channel indicates the pixel data of the corresponding pixel point in the depth image.
In one possible implementation, the first color image data is an RGB image included in the first image; the first depth image data is a depth image included in the first image.
In another possible implementation manner, the first color image data is obtained by performing feature extraction based on an RGB image included in the first image; the first depth image data is obtained by performing feature extraction based on a depth image included in the first image.
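For concreteness, in tensor terms the two inputs described above might look as follows; the resolution and random values are illustrative assumptions, not values from the application.

```python
import torch

# A hypothetical 480x640 RGB-D frame:
rgb = torch.rand(3, 480, 640)     # three color channels (red, green, blue)
depth = torch.rand(1, 480, 640)   # one depth channel, one value per pixel

# The RGB image and depth map are registered, so pixel (i, j) in `rgb`
# and pixel (i, j) in `depth` describe the same scene point.
first_image = torch.cat([rgb, depth], dim=0)  # a 4-channel RGB-D tensor
print(first_image.shape)  # torch.Size([4, 480, 640])
```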
Step 402, obtaining a confidence corresponding to the first depth image data based on the first depth image data.
Wherein the confidence level is used to indicate the accuracy of the image data corresponding to the target region in the first depth image data.
Since the accuracy of depth acquisition devices is generally not high and is easily affected by the natural environment, the accuracy of the first depth image data included in the first image acquired by the depth acquisition device is generally not high. Therefore, a judgment can be made on the accuracy of the first depth image data through the confidence of the first depth image data, so as to determine the authenticity of the first depth image data. When the confidence of the first depth image data is high, the data in the first depth image data has high reliability, so the result obtained by processing the first depth image data is closer to the truth; when the confidence of the first depth image data is low, the data in the first depth image data does not have sufficient reliability, and therefore the result obtained by processing the first depth image data does not have sufficient credibility.
In a possible implementation manner, based on the first depth image data, processing is performed through a confidence coefficient discrimination model, so as to obtain a confidence coefficient corresponding to the first depth image data.
The confidence coefficient distinguishing model is a machine learning model trained by taking a first sample image as a sample and taking a confidence type corresponding to the first sample image as a label.
In a possible implementation manner, the confidence type corresponding to the first sample image may include a positive sample and a negative sample, and when the confidence type of the first sample image is the positive sample, it indicates that the first sample image is the sample image with higher accuracy, and when the confidence type of the first sample image is the negative sample, it indicates that the first sample image is the sample image with lower accuracy.
In a possible implementation manner, the first depth image data is input into the confidence coefficient discrimination model, and a confidence coefficient probability distribution corresponding to the first depth image data is obtained, where the confidence coefficient probability distribution includes a probability that the first depth image data is a positive sample and a probability that the first depth image data is a negative sample; and acquiring the probability that the first depth image data is a positive sample as the confidence corresponding to the first depth image data.
Here, a positive sample represents that the first depth image data is credible image data; a negative sample represents that the first depth image data is not credible image data.
In one possible implementation, a first set of training samples is obtained; the first training sample set comprises a first sample image and a confidence type corresponding to the first sample image; based on the first sample image, performing data processing through the confidence coefficient discrimination model to obtain a confidence probability corresponding to the first sample image; the confidence probability is indicative of a probability that the first sample image is a positive sample; and training the confidence coefficient distinguishing model based on the confidence probability corresponding to the first sample image and the confidence type corresponding to the first sample image.
In the process of training the confidence coefficient discrimination model, the first sample image included in the first training sample set may be input into the confidence coefficient discrimination model, so as to obtain a probability distribution corresponding to the first sample image. The probability distribution corresponding to the first sample image comprises the probability of the first sample image corresponding to the positive sample and the probability of the first sample image corresponding to the negative sample, wherein the confidence probability corresponding to the first sample image is the probability of the first sample image corresponding to the positive sample. And training the confidence coefficient discrimination model according to the confidence type corresponding to the first sample image and the confidence probability corresponding to the first sample image, wherein the trained confidence coefficient discrimination model can perform data processing on the input image data to obtain the confidence probability of the input image data.
In a possible implementation manner, the first sample image is a depth image, and the confidence coefficient discrimination model trained according to the first sample image can realize the judgment of the confidence coefficient of the depth image.
In another possible implementation manner, the first sample image includes a depth image and an RGB color image, so that the confidence level discrimination model trained according to the first sample image can judge the confidence level of the depth image and can also judge the confidence level of the RGB color image.
In one possible implementation, the confidence discrimination model may be a machine learning model that includes convolutional layers and fully-connected layers.
The convolution layer in the confidence judgment model extracts the features of the input first sample image, performs linear conversion on the extracted features through a full connection layer to obtain a two-dimensional confidence vector corresponding to the first sample image, and obtains probability distribution corresponding to the first sample image based on the confidence vector.
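The description above (convolutional feature extraction, a fully-connected layer producing a two-dimensional confidence vector, and a probability distribution over positive/negative) can be sketched as follows; the exact layer counts and channel sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ConfidenceDiscriminator(nn.Module):
    """Sketch of a confidence discrimination model: convolutional feature
    extraction followed by a fully-connected layer that outputs a
    two-class (positive / negative) confidence vector."""

    def __init__(self, in_channels: int = 1):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(32, 2)  # two-dimensional confidence vector

    def forward(self, depth_map: torch.Tensor) -> torch.Tensor:
        x = self.features(depth_map).flatten(1)
        logits = self.fc(x)
        # softmax gives the probability distribution; index 1 is taken here
        # as the probability that the input is a positive (credible) sample
        return torch.softmax(logits, dim=1)
```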
In one possible implementation, in response to the first sample image including a depth image and an RGB color image, the confidence discrimination model may further include a depth confidence discrimination branch and a color confidence discrimination branch.
When the first sample image contains both a depth image and an RGB color image, because the depth image and the RGB color image belong to different modalities, training the confidence of the depth image and the confidence of the RGB color image simultaneously through the same machine learning model structure may in practice lead to a poor confidence discrimination effect for the trained model. Therefore, the confidence discrimination model can include a depth confidence discrimination branch and a color confidence discrimination branch, where the depth confidence discrimination branch is trained based on the depth image in the first sample image and the confidence type corresponding to the depth image in the first sample image, and the color confidence discrimination branch is trained based on the color image in the first sample image and the confidence type corresponding to the color image in the first sample image.
The trained confidence coefficient distinguishing model has a good distinguishing effect on the confidence coefficient of the color image and the confidence coefficient of the depth image.
In a possible implementation manner, when the depth image and the RGB color image are included in the first sample image, the confidence type of the first sample image may include a confidence type corresponding to the depth image in the first sample image and a confidence type corresponding to the color image in the first sample image, respectively.
The confidence type corresponding to the depth image in the first sample image and the confidence type corresponding to the color image in the first sample image may be the same confidence type or different confidence types. For example, when the precision of the color image and the precision of the depth image in the first sample image are both high, the confidence type corresponding to the color image in the first sample image may be a positive sample, and the confidence type corresponding to the depth image in the first sample image is also a positive sample; when the precision of the color image in the first sample image is high but the precision of the depth image in the first sample image is low, the confidence type corresponding to the depth image in the first sample image is a negative sample, while the confidence type corresponding to the color image in the first sample image is a positive sample.
In one possible implementation, a second set of training samples is obtained; the second training sample set comprises training sample images and target areas corresponding to the training sample images; the training sample image comprises training color sample data and training depth sample data; determining a color image processing branch in the model through the target area, and processing the training color sample data to obtain a prediction area corresponding to the training color sample data; determining a depth image processing branch in the model through the target area, and processing the training depth sample data to obtain a prediction area corresponding to the training depth sample data; determining a confidence score of the training sample image based on the prediction region corresponding to the training color sample data, the prediction region corresponding to the training depth sample data and the target region corresponding to the training sample image; in response to the confidence score of the training sample image satisfying a specified condition, a confidence type of the training sample image is determined, and the training sample image is determined to be the first sample image.
The color image processing branch in the target area determination model is a machine learning model obtained by pre-training by taking a sample color image as a sample and taking a target area corresponding to the sample color image as a label; the depth image processing branch in the target area determination model is a machine learning model obtained by pre-training with a sample depth image as a sample and a target area corresponding to the sample depth image as an annotation.
The second training sample set comprises training sample images and target areas corresponding to the training sample images, and for one of the training sample images in the second training sample set, the training sample image comprises training color sample data and training depth sample data, and the training color sample data and the training depth sample data correspond to the target area corresponding to the training sample image.
The training color sample data is subjected to data processing through a color image processing branch in a target area determination model to obtain a prediction area corresponding to the training color sample data; the training depth sample data is subjected to data processing through a depth image processing branch in a target region determination model to obtain a prediction region corresponding to the training depth sample data, the confidence type of the training sample image can be determined according to the prediction region corresponding to the training color sample data, the prediction region corresponding to the training depth sample data and the target region, and the training of the confidence discrimination model is realized through the first sample image determined by the training sample image.
In a possible implementation manner, the sample color image is used as a training sample and input into the color image processing branch in the target area determination model to obtain a prediction area corresponding to the sample color image; the color image processing branch in the target area determination model is then trained based on the prediction area corresponding to the sample color image and the target area corresponding to the sample color image, so as to obtain the pre-trained color image processing branch. The pre-trained color image processing branch can process an input color image to obtain the corresponding prediction region in the color image. At this point, it can be considered that when the quality of the input color image is good, the degree of overlap between the prediction region of the input color image and the target region actually corresponding to the color image should be high. Therefore, according to the degree of overlap between the prediction region of the color image and the target region actually corresponding to the color image, it can be determined whether the color image is of good quality, and the confidence of the color image can be obtained.
The sample depth image is input, as a training sample, into the depth image processing branch in the target area determination model to obtain a prediction area corresponding to the sample depth image; the depth image processing branch in the target area determination model is then trained based on the prediction area corresponding to the sample depth image and the target area corresponding to the sample depth image, so as to obtain the pre-trained depth image processing branch. The pre-trained depth image processing branch can process an input depth image to obtain a prediction region in the depth image. At this point, it can be considered that when the quality of the input depth image is good, the degree of overlap between the prediction region of the input depth image and the target region actually corresponding to the depth image should be high. Therefore, according to the degree of overlap between the prediction region of the depth image and the target region actually corresponding to the depth image, it can be determined whether the depth image is of good quality, and the confidence of the depth image can be obtained.
In one possible implementation, the confidence score includes a color confidence score and a depth confidence score; a color confidence score corresponding to the training sample image is determined based on the degree of overlap between the prediction region corresponding to the training color sample data and the target region corresponding to the training sample image; and a depth confidence score corresponding to the training sample image is determined based on the degree of overlap between the prediction region corresponding to the training depth sample data and the target region corresponding to the training sample image.
That is, the confidence scores comprise color confidence scores and depth confidence scores. The color confidence score is determined based on the degree of overlap between the prediction region corresponding to the training color sample data and the target region corresponding to the training sample image: the higher this degree of overlap, the higher the color confidence score corresponding to the training sample image. The depth confidence score is determined based on the degree of overlap between the prediction region corresponding to the training depth sample data and the target region corresponding to the training sample image: the higher this degree of overlap, the higher the depth confidence score corresponding to the training sample image.
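If the degree of overlap is measured as intersection-over-union between the predicted region and the annotated target region (a common choice; the application does not fix the exact metric), the two scores could be computed as follows.

```python
import torch

def overlap_score(pred_mask: torch.Tensor, target_mask: torch.Tensor) -> float:
    """Degree of overlap between a predicted region and the target region,
    here measured as intersection-over-union (an assumed metric)."""
    pred = pred_mask.bool()
    target = target_mask.bool()
    intersection = (pred & target).float().sum()
    union = (pred | target).float().sum()
    return (intersection / union.clamp(min=1)).item()

def confidence_scores(color_pred, depth_pred, target):
    """Color and depth confidence scores for one training sample image."""
    return overlap_score(color_pred, target), overlap_score(depth_pred, target)
```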
In one possible implementation, in response to the depth confidence score of the training sample image satisfying a specified condition, a confidence type of the training sample image is obtained, and the training sample image is obtained as a first sample image.
In one possible implementation, in response to the depth confidence score of the training sample image being greater than a first confidence threshold, the confidence type of the training sample image is determined to be a positive sample, and the training sample image is acquired as a first sample image.
In one possible implementation, in response to the depth confidence score of the training sample image being less than a second confidence threshold, the confidence type of the training sample image is determined to be a negative sample, and the training sample image is acquired as a first sample image.
When the depth confidence score of the training sample image is greater than the first confidence threshold, the prediction region obtained by processing the training depth sample data of the training sample image through the depth image processing branch in the target region determination model has a high degree of overlap with the target region. The training depth sample data in the training sample image can therefore be regarded as data with high confidence, so the confidence type of the training sample image is determined to be a positive sample, and the training sample image is determined to be a first sample image, so as to train the confidence discrimination model. When the depth confidence score of the training sample image is smaller than the second confidence threshold, the prediction region obtained by processing the training depth sample data of the training sample image through the depth image processing branch in the target region determination model has a low degree of overlap with the target region. The training depth sample data in the training sample image can therefore be regarded as data with low confidence, so the confidence type of the training sample image is determined to be a negative sample, and the training sample image is determined to be a first sample image. After training with a sufficient number of such training sample images, the confidence discrimination model can judge the confidence of an input sample image, so as to determine the quality of the depth image data in the input sample image.
In a possible implementation manner, the training depth sample data of each training sample image in the second training sample set is processed through the depth image processing branch in the target region determination model, and the depth confidence score corresponding to each training sample image is determined; the depth confidence scores corresponding to the training sample images are sorted in descending order, the confidence type of the training sample images whose depth confidence scores rank in the top a% is determined as a positive sample, and these images are acquired as first sample images, where a is greater than 0.
In another possible implementation manner, the confidence type of the training sample images whose depth confidence scores rank in the bottom b% is determined as a negative sample, and these images are acquired as first sample images, where b is greater than 0.
At this time, the training sample images whose depth confidence scores rank in the top a% of the second training sample set can be regarded as training sample images with high accuracy, so the depth image data of these top-a% training sample images is relatively accurate data; these images can therefore be acquired as first sample images and determined to be positive samples. The training sample images whose depth confidence scores rank in the bottom b% of the second training sample set can be regarded as training sample images with low accuracy, so these bottom-b% training sample images can also be acquired as first sample images and determined to be negative samples, so as to train the confidence discrimination model.
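A minimal sketch of this ranking-based selection rule, assuming the depth confidence scores have already been computed for every training sample image; the default fraction values are placeholders, since a and b are not fixed by the application.

```python
def label_samples_by_rank(scores: list[float], a: float = 0.2, b: float = 0.2):
    """Sort depth confidence scores in descending order; the top a fraction
    become positive first sample images, the bottom b fraction negative."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n = len(order)
    positives = set(order[: int(n * a)])        # top a% -> positive samples
    negatives = set(order[n - int(n * b):])     # bottom b% -> negative samples
    labels = {}
    for i in order:
        if i in positives:
            labels[i] = 1   # confidence type: positive
        elif i in negatives:
            labels[i] = 0   # confidence type: negative
        # samples in the middle are not used as first sample images
    return labels
```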
In one possible implementation, in response to the depth confidence score and the color confidence score of the training sample image satisfying a specified condition, a confidence type of the training sample image is obtained, and the training sample image is obtained as a first sample image.
The confidence type of the training sample image may also be determined according to the confidence scores of the training depth sample data and the training color sample data of the training sample image.
In one possible implementation, in response to the depth confidence score of the training sample image being greater than the color confidence score of the training sample image, the confidence type of the training sample image is determined as a positive sample, and the training sample image is acquired as a first sample image for training the confidence discrimination model.
Because the accuracy of the RGB image in the training sample image (i.e., the RGB-D image) is generally higher than that of the depth image, when the depth confidence score corresponding to the depth image is greater than the color confidence score corresponding to the RGB image, it indicates that the training depth sample data provides more accurate information about the target region than the training color sample data in the training sample image, and such a training sample image may also be acquired as a first sample image to train the model.
In step 403, first estimated depth data is obtained based on the first color image data.
The first estimated depth data is obtained based on the first color image data; that is, the first estimated depth data is the depth information of the first image that is estimated from the first color image data.
In one possible implementation, the first estimated depth data is obtained by performing data processing through a depth estimation model based on the first color image data.
The depth estimation model is a machine learning model trained by taking a second sample image as a sample and taking depth image data corresponding to the second sample image as a label; the second sample image is a sample image whose confidence satisfies a first specified condition.
In one possible implementation, the second sample image is a sample image with a confidence level greater than a third confidence threshold.
In another possible implementation manner, an estimation model training sample set is obtained, and the estimation model training sample set comprises at least two estimation sample images; each estimation sample image comprises color image data and depth image data. Based on the depth image data of each estimation sample image in the estimation model training sample set, data processing is performed through the confidence discrimination model to obtain the depth confidence score corresponding to each estimation sample image, and the estimation sample images whose depth confidence scores satisfy the confidence condition are acquired as the second sample images.
In a possible implementation manner, the depth confidence scores corresponding to each estimation sample image in the estimation model training sample set are obtained, the depth confidence scores corresponding to each estimation sample image are ranked from large to small, and the estimation sample image with the confidence score of the top c% in each estimation sample image is obtained as the second sample image.
That is, according to the above-mentioned scheme, an estimated sample image with a high depth confidence score can be screened out and obtained as a second sample image, where the second sample image includes depth image data and color image data, and the depth confidence score is high, which indicates that the quality of the depth image data corresponding to the second sample image is good, so that a depth estimation model trained by using the second sample image as a sample (i.e., using the color image data corresponding to the second sample image as a sample) and using the depth image data corresponding to the second sample image as a label can obtain the depth image data with good quality through data processing based on the input color image data, so as to achieve estimation of the depth image data of the second sample image through the color image data corresponding to the second sample image.
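As an illustration of how such a depth estimation model might be set up, the following PyTorch-style sketch stacks a few convolution blocks and trains them on the samples whose depth confidence scores rank highest according to the confidence discrimination model. The layer widths, the L1 objective and the c = 0.2 cut-off are assumptions made for illustration, not details fixed by this application.

```python
import torch
import torch.nn as nn

# A minimal sketch of a depth estimation model: convolution blocks that regress
# a depth map from the color image.
class DepthEstimator(nn.Module):
    def __init__(self, in_ch=3):
        super().__init__()
        self.blocks = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, 3, padding=1), nn.Sigmoid(),   # depth in [0, 1]
        )

    def forward(self, rgb):
        return self.blocks(rgb)

def train_depth_estimator(estimator, samples, depth_scores, c=0.2, epochs=10):
    """samples: list of (rgb, depth) tensor pairs; depth_scores: depth confidence
    of each sample predicted by the confidence discrimination model."""
    order = sorted(range(len(samples)), key=lambda i: depth_scores[i], reverse=True)
    second_samples = [samples[i] for i in order[: int(len(samples) * c)]]  # top c%
    opt = torch.optim.Adam(estimator.parameters(), lr=1e-4)
    for _ in range(epochs):
        for rgb, depth in second_samples:
            pred = estimator(rgb.unsqueeze(0))
            loss = nn.functional.l1_loss(pred, depth.unsqueeze(0))
            opt.zero_grad(); loss.backward(); opt.step()
    return estimator
```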
Step 404, performing weighting processing on the first estimated depth data and the first depth image data based on the confidence corresponding to the first depth image data, to obtain the first calibration depth data.
In a possible implementation manner, the confidence corresponding to the first depth image data is obtained by performing data processing on the first depth image data through the confidence discrimination model; that is, the confidence corresponding to the first depth image data may be used to indicate the accuracy of the first depth image data in the first image, and the first estimated depth data is depth image data obtained by processing the first color image data corresponding to the first image through the depth estimation model. Therefore, the first estimated depth data and the first depth image data are weighted and summed based on the confidence corresponding to the first depth image data to obtain the first calibration depth data. Compared with the directly acquired first depth image data, the first calibration depth data considers the accuracy of the first depth image data itself and combines the depth information included in the first color image data; therefore, the first calibration depth data can more accurately represent the depth information of the first image.
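A minimal sketch of this weighting step is shown below, assuming the confidence p_pos for the original depth map has already been predicted by the confidence discrimination model.

```python
import torch

# A reliable original depth map (p_pos close to 1) is kept almost unchanged;
# an unreliable one is largely replaced by the estimated depth.
def calibrate_depth(depth_raw: torch.Tensor,
                    depth_est: torch.Tensor,
                    p_pos: float) -> torch.Tensor:
    return p_pos * depth_raw + (1.0 - p_pos) * depth_est
```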
Step 405, performing weighting processing based on the first color image data and the first calibration depth data to obtain a first fused feature map.
In one possible implementation manner, based on the first color image data and the first calibration depth data, the first fused feature map is obtained by performing attention-based weighting processing on a feature fusion branch in a target region determination model.
In one possible implementation, the feature fusion branch includes a first pooling layer, a second pooling layer, a first fully-connected layer, and a second fully-connected layer; global pooling is performed through the first pooling layer based on the first color image data to obtain first color pooling data; data processing is performed through the first fully-connected layer based on the first color pooling data to obtain a first color vector; global pooling is performed through the second pooling layer based on the first depth image data to obtain first depth pooling data; data processing is performed through the second fully-connected layer based on the first depth pooling data to obtain a first depth vector; based on the first color image data and the first calibration depth data, channel attention weighting processing is performed through the first color vector and the first depth vector to obtain the first fusion feature map.
The first color vector is used for indicating the weight corresponding to the first color image data; the first depth vector is used to indicate a corresponding weight of the first depth image data.
In the process of fusing the first color image data and the first calibration depth data to obtain the first fused feature map, the first color image data may be globally pooled through the first pooling layer to obtain the first color pooling data, and the first color pooling data may represent the overall data size of the first color image data, that is, the importance degree of the first color image data to the target area. According to the first color pooling data, data processing is performed through the first fully-connected layer, that is, the data is converted into a first color vector through linear transformation, and the first color vector may be used to indicate the weight ratio of each channel in the first color image data (the weight ratio indicates the importance of the features of each channel). The first depth image data may be globally pooled through the second pooling layer to obtain the first depth pooling data, and the first depth pooling data may represent the overall data size of the first depth image data, that is, the importance degree of the first depth image data to the target area. According to the first depth pooling data, data processing is performed through the second fully-connected layer, that is, the data is linearly transformed into a first depth vector, and the first depth vector may indicate the weight proportion of each image channel corresponding to the first depth image data. Channel attention weighting is then performed according to the first color vector and the first depth vector, and fusion is performed to obtain the first fusion feature map.
Please refer to fig. 6, which illustrates a schematic diagram of channel attention weighting according to an embodiment of the present application. As shown in fig. 6, for a feature map 601 with C channels and a size of W × H, global pooling is first performed on the feature maps of all channels to obtain the average feature of each channel; the averages are transformed by one fully-connected layer to form channel attention values; finally, the channel attention values are multiplied with all the channel feature maps to form a channel attention feature map 602. Because the channel attention mechanism maps the mean values corresponding to the channels into channel attention values through a fully-connected layer, the attention feature map obtained after weighting according to the channel attention values pays more attention to the channels with larger mean values (i.e., the image features of a channel with a larger mean value correspond to a larger weight).
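The channel attention of fig. 6 can be sketched as follows. The global pooling, the single fully-connected layer and the channel-wise multiplication follow the description above; using softmax as the activation and the exact tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

# A minimal sketch of channel attention for a feature map of shape (B, C, H, W).
class ChannelAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.fc = nn.Linear(channels, channels)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = feat.shape
        pooled = feat.mean(dim=(2, 3))              # global average pooling -> (B, C)
        att = torch.softmax(self.fc(pooled), dim=1) # channel attention values
        # Channels with larger attention values receive larger weights.
        return feat * att.view(b, c, 1, 1)
```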
Please refer to fig. 7, which illustrates a cross-modal feature fusion diagram according to an embodiment of the present application. As shown in fig. 7, for first color image data 701 corresponding to a first image, the first color image data 701 may be input into a GAP (Global Average Pooling) first pooling layer 702 for global pooling, and the value obtained after global pooling may be input into an FC (Fully Connected) first fully-connected layer 703 and then processed by an activation function 704 to obtain a color vector 705 corresponding to the first color image data.
For the first depth image data 706 corresponding to the first image, the first depth image data 706 may be input into the GAP second pooling layer 707 to perform global pooling, and the value obtained after global pooling may be input into the FC second fully-connected layer 708 and then processed by an activation function 709, so as to obtain a depth vector 710 corresponding to the first depth image data.
The first color image data may be image features obtained by feature extraction based on an RGB image of the first image; the first depth image data may be an image feature obtained by performing feature extraction based on a depth image of the first image.
The color vector 705 and the depth vector 710 can be simultaneously input into the maximum value obtaining module 711, and the larger value of each dimension in the color vector and the depth vector is taken to obtain the maximum vector 712.
For the first color image data 701, channel attention weighting is performed with the color vector 705 as the channel attention weight to obtain a first color weighted feature map; channel attention weighting is also performed on the first color image data 701 with the maximum vector as the channel attention weight to obtain a second color weighted feature map; then, the first color weighted feature map and the second color weighted feature map are fused to obtain the color feature map 713. For the first depth image data 706, channel attention weighting is performed with the depth vector 710 as the channel attention weight to obtain a first depth weighted feature map; channel attention weighting is also performed on the first depth image data 706 with the maximum vector as the channel attention weight to obtain a second depth weighted feature map; the first depth weighted feature map and the second depth weighted feature map are fused to obtain the depth feature map 714. Finally, the color feature map 713 and the depth feature map 714 are input into a convolution module C to realize the fusion of the color feature map and the depth feature map, so as to obtain the fused feature map 715.
That is, fig. 7 shows a cross-reference module according to an embodiment of the present application. After the depth image data is corrected, the corrected depth image and the RGB image are input together into a dual-stream feature extraction network (namely the pre-trained color image processing branch and depth image processing branch) to generate multi-level features. Features extracted from the RGB channel contain rich semantic information and texture information, while features from the depth channel contain more discriminative scene layout clues and are complementary to the RGB features. The embodiment of the present application proposes the cross-reference module, illustrated in fig. 7, as a fusion strategy for cross-modal features. The proposed cross-reference module aims at mining and combining the most distinctive channels (i.e. feature detectors) of the depth and RGB features and generating more informative features. That is, given the two input features generated by the RGB and depth streams, global statistics of the RGB view and the depth view are first obtained using global average pooling. The two feature vectors are then input into a fully-connected layer and a softmax activation function respectively to obtain channel attention vectors that reflect the importance of the RGB features and the depth features. The attention vectors are then applied to the input features by channel-wise multiplication. In this way, the cross-reference module focuses specifically on important features while suppressing features unnecessary for scene understanding. Based on the re-weighted RGB and depth features, a convolutional layer is used to generate the cross-modal fusion features. In addition, a triplet loss constrains the cross-modal fusion features, so that the fusion features are closer to the foreground and the distance between the foreground features and the background features is enlarged.
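A runnable sketch of this cross-reference fusion is given below. The global pooling, the per-branch fully-connected layers with softmax, the element-wise maximum and the convolutional fusion follow the description above; summing the two weighted maps as the "fusion" of the weighted feature maps, and normalizing the maximum vector to 0-1 by its largest entry, are illustrative assumptions.

```python
import torch
import torch.nn as nn

# A minimal sketch of a cross-reference module for features of shape (B, C, H, W).
class CrossReferenceModule(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.fc_rgb = nn.Linear(channels, channels)
        self.fc_depth = nn.Linear(channels, channels)
        self.fuse = nn.Conv2d(channels * 2, channels, kernel_size=1)

    def forward(self, f_rgb: torch.Tensor, f_depth: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = f_rgb.shape
        att_rgb = torch.softmax(self.fc_rgb(f_rgb.mean(dim=(2, 3))), dim=1)      # color vector
        att_depth = torch.softmax(self.fc_depth(f_depth.mean(dim=(2, 3))), dim=1)  # depth vector
        att_fuse = torch.max(att_rgb, att_depth)                                 # element-wise maximum vector
        att_fuse = att_fuse / (att_fuse.max(dim=1, keepdim=True).values + 1e-8)  # normalize to 0-1 (assumption)

        w_rgb, w_depth, w_fuse = (a.view(b, c, 1, 1) for a in (att_rgb, att_depth, att_fuse))
        enh_rgb = f_rgb * w_rgb + f_rgb * w_fuse      # first + second color weighted maps (fusion by sum, assumption)
        enh_depth = f_depth * w_depth + f_depth * w_fuse
        return self.fuse(torch.cat([enh_rgb, enh_depth], dim=1))  # cross-modal fusion feature
```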
Step 406, determining a target region corresponding to the first image based on the first fused feature map.
In a possible implementation manner, based on the first color image data, data processing is performed through a color image processing branch in the target area determination model to obtain a first color feature map; based on the first depth image data, data processing is performed through a depth image processing branch in the target area determination model to obtain a first depth feature map; and the target area corresponding to the first image is determined based on the first fusion feature map, the first depth feature map and the first color feature map.
The target area determination model is a machine learning model obtained by training with a third sample image as a sample and a target area corresponding to the third sample image as an annotation.
The target region corresponding to the first image may be obtained by simultaneously considering a first depth feature map corresponding to the first image, a first color feature map corresponding to the first image, and a first fusion feature map corresponding to the first image. The first depth feature map corresponding to the first image is obtained by inputting the first depth image data into a depth image processing branch in the target area determination model, so that the first depth feature map can indicate a prediction area corresponding to the first depth image data; the first color feature map corresponding to the first image is obtained by inputting the first color image data into a color image processing branch in the target area determination model, so that the first color feature map can indicate a prediction area corresponding to the first color image data; on the basis of the first fusion feature map, the target region is obtained by considering the first color feature map and the first depth feature map, so that the original features in the color feature map and the original features in the depth feature map are considered while the first fusion feature map is obtained after depth image data are calibrated, and the identification accuracy of the target region is improved.
In one possible implementation, a third sample image is obtained; the third sample image comprises third sample color image data and third sample depth image data; obtaining third sample estimated depth data based on the third sample color image data; obtaining third sample calibration depth data based on the third sample estimated depth data and the third sample depth image data; based on the third sample color image data and the third sample calibration depth data, performing weighting processing through a feature fusion branch in a target area determination model to obtain a third sample fusion feature map; and training the target area determination model based on the third sample fusion feature map and the target area corresponding to the third sample image.
In the process of determining a model for a target area, acquiring the third sample image and a target area corresponding to the third sample image, where the third sample image includes the third sample color image data and the third sample depth image data, so that the third sample depth image data may be calibrated according to the third sample color image data and the third sample depth image data to obtain the third sample calibration depth data, and fusion of image features of different modalities is implemented according to the third sample calibration depth data and the third sample color image data to obtain the third sample fusion feature map, and a prediction area corresponding to the third sample image is determined based on the third sample fusion feature map, and then the prediction area corresponding to the third sample image and the target area corresponding to the third sample image are combined, the target region determination model is trained.
In a possible implementation manner, based on the third sample color image data, data processing is performed through the color image processing branch in the target area determination model to obtain a third sample color feature map; based on the third sample depth image data, data processing is performed through the depth image processing branch in the target area determination model to obtain a third sample depth feature map; and the target area determination model is trained based on the third sample color feature map, the third sample depth feature map, the third sample fusion feature map and the target area corresponding to the third sample image.
Because the target area determination model comprises a color image processing branch, a depth image processing branch and a feature fusion branch, the third sample color image data can be processed through the color image processing branch to obtain a third sample color feature map corresponding to the third sample color image data; the third sample depth image data can be processed through the depth image processing branch to obtain a third sample depth feature map corresponding to the third sample depth image data; and the target area determination model is trained based on the third sample color feature map, the third sample depth feature map, the third sample fusion feature map and the target area corresponding to the third sample image. The trained target area determination model can simultaneously process color image data to obtain a feature map corresponding to the color image data, process depth image data to obtain a feature map corresponding to the depth image data, and fuse the color image data and the calibrated depth image data to obtain a fusion feature map, where each feature map is used to indicate a corresponding prediction area. Therefore, the target area determination model can respectively obtain a prediction area corresponding to the color image data according to the color image data, obtain a prediction area corresponding to the depth image data according to the depth image data, obtain a prediction area corresponding to the fusion feature map according to the color image data and the calibrated depth image data, and acquire the target area of the input image according to the three prediction areas.
In one possible implementation, the overall optimization objective L_total of the network consists of 4 parts, including the cross-entropy losses for the RGB, depth and fusion branches and the triplet losses in the cross-reference modules, as follows:

L_total = L_RGB + L_Depth + L_fuse + α · Σ_{i=1}^{N} L_tri^(i)

where L_RGB, L_Depth and L_fuse are the loss functions corresponding to the outputs of the three decoders described above, L_tri^(i) is the triplet loss function associated with the i-th convolution layer, N = 3 represents the number of convolution layers involved in the triplet loss function, and α may be 0.2.
In a possible implementation manner, the cross-entropy loss function may also be an edge-aware BCE that incorporates boundary information to promote learning of the prediction result on the object boundary.
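As a sketch, the overall objective could be assembled as follows. BCE-with-logits and the function name are illustrative choices; the per-level triplet losses are assumed to be computed as in the triplet loss formula given later in this description.

```python
import torch
import torch.nn.functional as F

# A minimal sketch of the overall training objective: cross-entropy on the three
# decoder outputs plus the triplet losses of the N = 3 convolution blocks.
def total_loss(pred_rgb, pred_depth, pred_fuse, gt, triplet_losses, alpha=0.2):
    l_rgb = F.binary_cross_entropy_with_logits(pred_rgb, gt)
    l_depth = F.binary_cross_entropy_with_logits(pred_depth, gt)
    l_fuse = F.binary_cross_entropy_with_logits(pred_fuse, gt)
    return l_rgb + l_depth + l_fuse + alpha * sum(triplet_losses)
```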
Please refer to fig. 8, which illustrates a model network framework diagram according to an embodiment of the present application. As shown in fig. 8, it shows a solution framework of the embodiment of the present application:
for the input color image data 801 of the first image, processing is performed through a color image processing branch 802 in the target area determination model to obtain a color feature map corresponding to the first image; the input depth image data 803 of the first image is processed through a depth image processing branch 804 in the target area determination model to obtain a depth feature map corresponding to the first image; the features extracted from part of the convolutional layers in the color image processing branch 802 and the features extracted from part of the convolutional layers in the depth image processing branch 804 are input into a feature fusion branch formed by cross-reference modules (CRM) to obtain a fusion feature map; and the color feature map, the depth feature map and the fusion feature map are respectively decoded by three decoders, and the corresponding outputs are summed into a final saliency map.
That is, the above-mentioned fig. 8 is based on the dual-flow feature extraction network, and is composed of two core parts, namely, a deep calibration strategy and a fusion strategy. Firstly, a depth calibration strategy is provided to correct potential noise generated by an unreliable original depth map, and the corrected depth can reflect scene layout and identify a foreground region better than the original depth. Given the corrected RGB-D data, the RGB image and the corrected depth are simultaneously input into a dual-stream feature extraction network to generate multi-level features. A fusion strategy cross-referencing module is then designed to integrate efficient cues from RGB features and depth features to cross-modality fusion features, which allows the three decoding branches to process RGB, depth, and fusion features separately. All features are processed separately and the corresponding outputs are summed into a final saliency map.
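The flow of fig. 8 can be summarized in the following sketch, in which the encoders, the three decoders and the per-level cross-reference modules are assumed to be already-constructed callables; their names are placeholders rather than components defined by this application.

```python
import torch

# A minimal sketch of the dual-stream forward pass.
def forward_pass(rgb, depth_cal, rgb_encoder, depth_encoder, crms,
                 rgb_decoder, depth_decoder, fuse_decoder):
    rgb_feats = rgb_encoder(rgb)              # list of multi-level RGB features
    depth_feats = depth_encoder(depth_cal)    # list of multi-level depth features
    fused_feats = [crm(fr, fd) for crm, fr, fd in zip(crms, rgb_feats, depth_feats)]

    s_rgb = rgb_decoder(rgb_feats)            # saliency map from the RGB branch
    s_depth = depth_decoder(depth_feats)      # saliency map from the depth branch
    s_fuse = fuse_decoder(fused_feats)        # saliency map from the fusion branch
    return s_rgb + s_depth + s_fuse           # final saliency map S_map
```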
Salient Object Detection (SOD) is an important computer vision problem that aims to identify and segment the most salient objects in a scene. It has been successfully applied to various tasks such as object recognition, image retrieval, SLAM (simultaneous localization and mapping) and video analysis. To address the challenges inherent in dealing with difficult scenes with low texture contrast or cluttered backgrounds, depth information is included as a supplemental input source, adding depth information as an additional input to the RGB image, and the localization of salient objects can be achieved in challenging scenes.
In the embodiment of the application, the dual-stream feature extraction network based on fig. 8 includes two core parts, namely a depth calibration part and a fusion strategy. Based on the dual-stream feature extraction network shown in fig. 8, the embodiment of the present application further proposes a Depth Calibration (DC) strategy to correct the potential noise caused by an unreliable original depth map and obtain a calibrated depth. The corrected depth can reflect the scene layout and identify the foreground region better than the original depth. Now, given the corrected RGB-D paired data, the RGB image and the corrected depth image are simultaneously input into the dual-stream feature extraction network to generate hierarchical features F_i^RGB and F_i^Depth. For each stream, an encoder-decoder network is used as the backbone. This is followed by the fusion strategy: a cross-reference module (CRM) is designed to integrate features, from RGB features and depth features to cross-modal fusion features; this results in three decoding branches processing the RGB, depth and fused hierarchical features. These features are processed separately and the corresponding outputs are summed into a final saliency map S_map.
Effective spatial information from the depth map plays a crucial role in helping to locate salient areas in challenging scenes, such as cluttered backgrounds and low-contrast situations. However, due to observation distance, occlusion or reflection, unreliable raw depth and potential depth acquisition errors prevent the model from extracting accurate information from the depth map. In order to solve the performance bottleneck caused by the noise of the depth map, the original depth is calibrated so as to better express the scene layout. The two key problems that this application solves are: 1. how the model learns to distinguish a depth map with poor quality (a negative case) from a depth map with good quality (a positive case); 2. how the corrected depth map can not only retain the useful clues of a high-quality depth map but also correct the unreliable information in a low-quality depth map. Thus, the present application proposes a Depth Calibration (DC) strategy, which is a core component of DCF. Two successive steps are required: selecting representative samples and generating the corrected depth map.
Aiming at the first key problem, a difficulty-aware selection strategy is provided, the purpose of which is to select the most typical positive and negative samples in the training database. These samples are then used to train a discriminator/classifier to predict the quality of a depth map, reflecting the reliability of the depth map. Firstly, the same architecture can be used to pre-train two model branches, with the RGB data and the depth data respectively used as input under the supervision of the saliency map. Then, based on the saliency predictions of the two baseline models, a criterion is devised to gauge whether the depth map can provide reliable information. Specifically, from the saliency results produced by the RGB stream and the depth stream, the intersection-over-union (IoU) between the predicted saliency and the ground-truth saliency is first computed for each training sample, denoted IoU(Depth) and IoU(RGB) for the two streams respectively. Then, the IoU(Depth) scores of all training samples are sorted from large to small. Based on the score ranking, the top 20% of the training samples are considered as the typical positive sample set P_set (that is, the quality of the depth map is acceptable), and the bottom 20% are considered as the typical negative sample set N_set (that is, the quality of the depth map is bad and unacceptable). In addition, when IoU(Depth) > IoU(RGB), these samples are also considered as positive samples, which means that the raw depth data provides richer global cues for identifying the foreground region than the RGB input.
Based on the selected representative positive and negative samples, a binary discriminator/classifier based on the ResNet-18 model structure is trained to evaluate the reliability of the depth map. Thus, the trained discriminator is able to predict a reliability score p_pos, indicating the probability that the depth map is a positive or negative sample; the higher p_pos is, the better the quality of the original depth map.
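A minimal sketch of such a discriminator is shown below; adapting the ResNet-18 stem to a single-channel depth input and using a two-way softmax head are assumptions made for illustration.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

# A sketch of a binary depth-quality discriminator built on ResNet-18.
def build_depth_discriminator() -> nn.Module:
    net = resnet18(weights=None)            # older torchvision uses pretrained=False
    net.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
    net.fc = nn.Linear(net.fc.in_features, 2)   # logits for [negative, positive]
    return net

def reliability_score(net: nn.Module, depth_map: torch.Tensor) -> torch.Tensor:
    # depth_map: (B, 1, H, W); returns p_pos for each depth map in the batch.
    logits = net(depth_map)
    return torch.softmax(logits, dim=1)[:, 1]
```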
Furthermore, a depth estimator comprising a plurality of convolution blocks is established. The depth estimator is trained using the RGB images and the better-quality depth data of the positive samples, so as to mitigate the inherent noise caused by inaccuracies in the original depth data. In the depth calibration module, directly using the original depth map may be unreliable, so the original depth map may be replaced by a weighted sum of the original depth map and an estimated depth, which is obtained based on the depth estimator. Thus, the calibrated depth map Depth_cal is obtained as shown in the following formula:

Depth_cal = Depth_raw * p_pos + Depth_est * (1 - p_pos)

where Depth_est and Depth_raw denote the depth estimated by the depth estimator and the original depth map, respectively.
After the depth calibration process is finished, the calibrated depth map Depth_cal and the RGB image are sent into the dual-stream feature extraction network to generate hierarchical features, denoted F_i^RGB and F_i^Depth respectively.
note that in the embodiment of the present application, the last three volume blocks with rich semantic features are retained, and the first two volume blocks with high resolution are removed to balance the computation cost. Generally, features extracted from RGB channels contain rich semantic information and texture information; meanwhile, the features from the depth channel contain more discriminative scene layout clues, complementary to the RGB features. To integrate the Cross-modal information, a fusion strategy named Cross Reference Module (CRM) was designed, as shown in fig. 7 of the present embodiment.
The proposed CRM aims to mine and combine the most distinctive channels of the depth and RGB features (i.e. feature detectors) and generate more informative features. Specifically, given the two input features F_i^RGB and F_i^Depth generated by the i-th convolution block of the RGB stream and the depth stream, global statistics of the RGB view and the depth view are first obtained using global average pooling. Then, the two feature vectors are respectively fed into a fully connected layer (FC) and a softmax activation function to obtain channel attention vectors Att_i^RGB and Att_i^Depth, reflecting the importance of the RGB features and the depth features, respectively. The attention vectors are then applied to the input features by channel-wise multiplication. Thus, the CRM can explicitly focus on important features and suppress features unnecessary for scene understanding. The whole process can be defined as:

Att_i = δ(w_i * AvgPooling(F_i) + b_i)

where w_i and b_i are the parameters of the fully connected layer corresponding to the i-th level features, δ denotes the softmax activation, and AvgPooling indicates a global average pooling operation. The channel attention weighting is then performed as F_i ⊙ Att_i, where ⊙ represents a channel-wise multiplication operation.
In addition, the attention vectors Att_i^RGB and Att_i^Depth are aggregated through a maximum function to obtain the more prominent feature channels from the RGB stream and the depth stream, and the result is then sent into a normalization operation that normalizes the output to the range 0-1, thereby obtaining the mutually referenced channel attention vector Att_i^fuse. This step may be defined as:

Att_i^fuse = Norm(max(Att_i^RGB, Att_i^Depth))
channel attention vector based on fusion
Figure BDA0002959482490000289
To be inputted
Figure BDA00029594824900002810
And
Figure BDA00029594824900002811
weighting to obtain enhanced features
Figure BDA00029594824900002812
And
Figure BDA00029594824900002813
the enhanced features of the RGB branch and the Depth branch are further connected and fed as a 1 x 1 convolution layer to generate a cross-modal fusion feature FiThe process may be defined as:
Figure BDA00029594824900002814
Figure BDA00029594824900002815
then, the characteristics F are fused across the modes through the triple loss functioniAnd processing to make the fusion feature closer to the foreground and enlarge the distance between the foreground feature and the background feature. By mixing FiThe feature corresponding to the salient region is set as a positive feature, and the feature corresponding to the background region is set as a negative feature, as follows:
Figure BDA00029594824900002816
Figure BDA00029594824900002817
wherein S represents the annotated saliency image region.
The triplet loss function can be calculated by the following formula:
Figure BDA00029594824900002818
where d represents the euclidean distance and m represents the margin parameter, set to 1.0.
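A sketch of this triplet constraint is given below, assuming the anchor is the globally pooled fusion feature and the positive/negative features are obtained by masked average pooling with the annotated region S; these pooling choices are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# A minimal sketch of the triplet constraint on the cross-modal fusion feature.
def triplet_loss(fused, sal_mask, margin=1.0):
    # fused: (B, C, H, W); sal_mask: (B, 1, H, W) with 1 inside the target area.
    eps = 1e-6
    anchor = fused.mean(dim=(2, 3))                                            # (B, C)
    f_pos = (fused * sal_mask).sum(dim=(2, 3)) / (sal_mask.sum(dim=(2, 3)) + eps)
    f_neg = (fused * (1 - sal_mask)).sum(dim=(2, 3)) / ((1 - sal_mask).sum(dim=(2, 3)) + eps)
    d_pos = F.pairwise_distance(anchor, f_pos)     # Euclidean distance to foreground
    d_neg = F.pairwise_distance(anchor, f_neg)     # Euclidean distance to background
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()
```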
On the basis of the CRM, the cross-modal fusion features F_i can be acquired, and at the same time the RGB stream features F_i^RGB and the depth stream features F_i^Depth are decoded by three decoders respectively; finally, the outputs of the three decoders are added to obtain the final saliency region S_map.
The optimization objective of the scheme as a whole can be described as:

L_total = L_RGB + L_Depth + L_fuse + α · Σ_{i=1}^{N} L_tri^(i)

where L_RGB, L_Depth and L_fuse are the loss functions corresponding to the outputs of the three decoders, N = 3 represents the number of convolution layers involved in the triplet loss function, and α may be 0.2 in this application.
In salient object detection, complex backgrounds and the similarity between objects and their surroundings are widely considered to be challenging scenarios. This naturally leads to the introduction of additional depth information besides the traditional RGB image as input, namely depth-induced (RGB-D) salient object detection. At the same time, this emerging research direction is largely hampered by the noise and blur that are prevalent in the original depth image. In order to solve the above problem, the embodiment of the present application proposes a depth calibration and fusion framework, which includes two components: a novel learning strategy to calibrate the potential bias in the original depth image so as to improve detection performance; and an efficient cross-reference module that fuses complementary features from the RGB and depth map modalities. A large number of experiments have shown that the method has better performance than the most advanced methods.
Salient object detection has important value in real life. Salient Object Detection (SOD) aims to identify the most interesting target regions in a scene. Salient object detection is different from fixation-point prediction, which originates from the cognitive and psychological research fields, and is widely applied in different fields. In computer vision, applications of salient object detection include image understanding, image description generation, object detection, unsupervised video object segmentation, semantic segmentation, pedestrian re-identification, and the like. In computer graphics, the salient object detection task is widely applied to tasks such as VR (Virtual Reality) rendering, automatic image cropping, image retargeting, and video summarization. Example applications in the field of robotics, such as human-computer interaction and target discovery, as well as scene understanding for obstacle-avoidance robots, also benefit from salient object detection. However, mainstream salient object detection methods are typically based on a single input RGB image, which limits their performance in some complex scenes. Therefore, the introduction of the depth image greatly improves the localization capability of the salient object detection field in challenging scenes. However, due to the depth acquisition equipment and the effects of natural environmental conditions, portions of the depth map may be significantly noisy. Therefore, it is necessary to introduce a depth calibration strategy into the current RGB-D salient object detection field to improve the utilization efficiency of depth information and further improve the detection accuracy.
The embodiment of the application provides a solution for detecting a salient target based on depth map quality calibration. First, two salient discriminating networks are pre-trained based on RGB and depth maps, respectively, as inputs. And then, a deep calibration learning strategy is designed by comparing the performances of the two pre-training networks, so that the quality of the depth map is improved. And a cross reference module is introduced to effectively fuse information fusion of two complementary features of depth and RGB, so that the utilization of depth information on salient target detection is greatly enhanced.
Meanwhile, the embodiment of the application provides a universal depth map calibration framework. Can be utilized in other advanced RGB-D saliency target detection methods and all bring a huge gain in performance.
Table 1 shows the performance of other schemes and the scheme shown in the embodiments of the present application on the SIP data set, which verifies the excellent performance of the method. In addition, the method provided by the embodiment of the application achieves excellent performance on a plurality of large-scale public salient object detection data sets.
TABLE 1
Table 2 verifies the performance gains from each component in the method proposed by this application, taking RGB data and the original depth map as inputs respectively. It can be seen that the performance of the RGB branch is superior to that of the depth branch using the original depth map, indicating that the RGB input contains more semantic and texture information than the depth input. To evaluate the effectiveness of the depth calibration strategy, the original depth is compared with the baseline network using the calibrated depth. As shown in Table 2, the calibrated depth reduced the MAE error index by 14.51% on average over the four data sets. Furthermore, to verify the generalization capability of the proposed depth calibration module, the generated calibrated depth was also applied to two state-of-the-art models, D3Net and DMRA. As shown in Table 3, training D3Net and DMRA with the corrected depth instead of the original depth map resulted in significant performance improvements on both the DUT-D data set and the NJU2K data set. The MAE indexes of D3Net and DMRA decreased by 12.5% and 9.1%, respectively. Thus, a large number of experiments demonstrate the advantages of the proposed depth calibration strategy.
TABLE 2
Furthermore, for the cross-modal fusion module that integrates RGB and depth features, a simple solution is to use concatenation followed by a convolution operation to fuse the complementary features. In Table 2, it can be seen by comparing (d) and (f) that the cross-reference module proposed herein is able to fuse the complementary information of the RGB features and the depth features better than direct feature fusion. Meanwhile, comparing with (f), after the triplet loss function is removed, the performance of all experiments is reduced, which shows the effectiveness of the triplet loss in enhancing feature representation.
Table 3 shows the effect of the calibration depth scheme according to the embodiment of the present application on the determination of the target area.
TABLE 3
To sum up, in the solution shown in the embodiment of the present application, estimated depth data corresponding to a first image is obtained through first color image data in the first image, the first depth image data corresponding to the first image is corrected according to the estimated depth data to obtain calibrated depth data, the calibrated depth data and the color image data are fused, and a target area is determined according to a fused feature map. According to the scheme, the depth information corresponding to the first image is estimated through the color image, the depth image corresponding to the first image is corrected, the target area corresponding to the first image is obtained according to the corrected depth image data and the color image data, and the accuracy of determining the target area is improved.
Fig. 9 is a block flow diagram illustrating a target area determination method according to an example embodiment. As shown in fig. 9, the flow of the target area determination method in the embodiment of the present application is formed by the parts 900, 910, and 920 shown in fig. 9, where the parts 900, 910, and 920 shown in fig. 9 may be implemented in different devices respectively, or may be implemented in the same device. As shown in fig. 9, the target area determination method includes the following steps.
As shown in part 900 of fig. 9, a training sample set may include a color image set 901 and a depth image set 903, where the color image set 901 includes at least two sample color images, the depth image set 903 includes at least two sample depth images, and the images in the color image set 901 correspond to the images in the depth image set 903 one-to-one. Each sample color image in the color image set is processed through a color image processing branch 902 in the target area determination model, so as to obtain a prediction region corresponding to each sample color image; each sample depth image in the depth image set is processed through a depth image processing branch 904 in the target region determination model, so as to obtain a prediction region 905 corresponding to each sample depth image. Through the prediction region 905 corresponding to each sample depth image and the target region corresponding to each sample depth image, the confidence score corresponding to each sample depth image can be obtained, and the sample depth images are ranked from large to small according to their confidence scores to obtain a ranked sample depth image set 906; the top a% by confidence score are determined as positive samples, and the bottom b% are determined as negative samples.
As shown in part 910 of fig. 9, a confidence discrimination model 912 exists in part 910 of fig. 9. The confidence discrimination model 912 is obtained by training with the positive samples and the negative samples in the ranked sample depth image set 906 in part 900 of fig. 9, and the confidence discrimination model can determine the confidence corresponding to a sample depth image according to the input sample depth image. Therefore, each sample depth image 911 in the training sample set is respectively input into the confidence discrimination model 912 to obtain the confidence corresponding to each sample depth image in the training sample set, and the sample depth images are ranked to obtain the ranked sample depth images 913. Then the sample depth images ranked in the top c% by confidence in the ranked sample depth images 913 and the corresponding sample color images are used to train the depth estimation model 915, so that the trained depth estimation model 915 can process an input color image 914 to obtain estimated depth data 916 corresponding to the color image 914. Then the estimated depth data 916 and the depth image corresponding to the color image 914 are weighted based on the confidence of the depth image corresponding to the color image, so as to obtain a corrected depth image 917, wherein the confidence of the depth image corresponding to the color image may be obtained through the confidence discrimination model 912.
As shown in part 920 of fig. 9, color images are included in the color image set 921, and a corrected depth image corresponding to each color image in the color image set 921 is included in the corrected depth image set 922. The color image corresponding to the first image in the color image set 921 is input into the color image processing branch in the target area determination model, so as to obtain a color feature map corresponding to the color image; the corrected depth image corresponding to the first image in the corrected depth image set 922 is input into the depth image processing branch in the target area determination model, so as to obtain a depth feature map corresponding to the depth image; the data extracted by the N convolutional layers in the color image processing branch and the data extracted by the N convolutional layers in the depth image processing branch are respectively input into the N cross-reference modules (CRM) shown in fig. 7 to realize feature fusion between the depth image and the color image and obtain a fusion feature map, and the target area of the first image is acquired according to the fusion feature map, the depth feature map and the color feature map.
As shown in fig. 9, the goal is to select the most typical positive and negative samples in the training database. These samples are then used to train a discriminator/classifier to predict the quality of the depth map, thereby reflecting the reliability of the depth map. A basic binary classifier is trained based on the screened representative positive and negative samples to evaluate the reliability of the depth map. Thus, the trained discriminator can predict a reliability score for each datum, representing the probability of the depth map being a positive or negative sample. In addition, the embodiment of the application also establishes a depth estimator, and the depth estimator comprises a plurality of convolution operations. The depth estimator is trained using the RGB images and the depth data with better quality, so as to reduce the inherent noise caused by inaccurate original depth data. In the depth calibration module, the possibly unreliable original depth map is not used directly; instead, the result of the weighted summation of the original depth map and the estimated depth map is used as input, thereby improving the utilization of the depth information.
Fig. 10 is a block diagram illustrating a structure of a target area determining apparatus according to an exemplary embodiment. The target area determination apparatus may implement all or part of the steps in the method provided by the embodiment shown in fig. 2 or fig. 4, and the target area determination apparatus includes:
a first image acquisition module 1001, configured to acquire a first image; the first image comprises first color image data and first depth image data;

an estimated depth obtaining module 1002, configured to obtain first estimated depth data based on the first color image data; the first estimated depth data is used to indicate depth information to which the first color image data corresponds;

a calibration depth obtaining module 1003, configured to obtain first calibration depth data based on the first estimated depth data and the first depth image data;

a fusion feature obtaining module 1004, configured to perform weighting processing based on the first color image data and the first calibration depth data to obtain a first fusion feature map;

a target region determining module 1005, configured to determine a target region corresponding to the first image based on the first fusion feature map.
In one possible implementation, the apparatus further includes:
the confidence coefficient acquisition module is used for acquiring the confidence coefficient corresponding to the first depth image data based on the first depth image data; the confidence corresponding to the first depth image data is used for indicating the accuracy of the image data corresponding to the target area in the first depth image data;
the calibration depth obtaining module 1003 is further configured to,
and performing weighting processing based on the confidence degree corresponding to the first depth image data on the first estimated depth data and the first depth image data to obtain the first calibration depth data.
In one possible implementation, the confidence level obtaining module is further configured to,
processing the first depth image data through a confidence coefficient discrimination model based on the first depth image data to obtain a confidence coefficient corresponding to the first depth image data;
the estimated depth obtaining module 1002 is further configured to,
based on the first color image data, performing data processing through a depth estimation model to obtain first estimated depth data;
the confidence coefficient distinguishing model is a machine learning model trained by taking a first sample image as a sample and taking a confidence type corresponding to the first sample image as a label;
the depth estimation model is a machine learning model trained by taking a second sample image as a sample and taking depth image data corresponding to the second sample image as a label; the second sample image is a sample image whose confidence satisfies a first specified condition.
In one possible implementation, the apparatus further includes:
the first sample set acquisition module is used for acquiring a first training sample set; the first training sample set comprises a first sample image and a confidence type corresponding to the first sample image;
a first confidence probability obtaining module, configured to perform data processing through the confidence discrimination model based on the first sample image to obtain a confidence probability corresponding to the first sample image; the confidence probability is indicative of a probability that the first sample image is a positive sample;
and the confidence discrimination model training module is used for training the confidence discrimination model based on the confidence probability corresponding to the first sample image and the confidence type corresponding to the first sample image.
In one possible implementation manner, the first sample set obtaining module includes:
a second sample set obtaining submodule for obtaining a second training sample set; the second training sample set comprises training sample images and target areas corresponding to the training sample images; the training sample image comprises training color sample data and training depth sample data;
the color prediction region acquisition submodule is used for determining a color image processing branch in a model through a target region, processing the training color sample data and acquiring a prediction region corresponding to the training color sample data;
the depth prediction region acquisition submodule is used for determining a depth image processing branch in a model through the target region, processing the training depth sample data and acquiring a prediction region corresponding to the training depth sample data;
a confidence score obtaining submodule for determining a confidence score of the training sample image based on the prediction region corresponding to the training color sample data, the prediction region corresponding to the training depth sample data and the target region corresponding to the training sample image;
a first sample image determining submodule, configured to determine a confidence type of the training sample image in response to that a confidence score of the training sample image satisfies a specified condition, and determine the training sample image as the first sample image;
the color image processing branch in the target area determination model is a machine learning model obtained by pre-training by taking a sample color image as a sample and taking a target area corresponding to the sample color image as a label;
the depth image processing branch in the target area determination model is a machine learning model obtained by pre-training with a sample depth image as a sample and a target area corresponding to the sample depth image as an annotation.
In one possible implementation, the confidence scores include a color confidence score and a depth confidence score;
the confidence score obtaining sub-module comprises:
the color confidence score acquisition unit is used for determining the color confidence score corresponding to the training sample image based on the contact ratio between the prediction region corresponding to the training color sample data and the target region corresponding to the training sample image;
and the depth confidence score acquisition unit is used for determining the depth confidence score corresponding to the training sample image based on the contact ratio between the prediction region corresponding to the training depth sample data and the target region corresponding to the training sample image.
In one possible implementation manner, the fusion feature obtaining module 1004 includes:
the attention weighting submodule is used for carrying out weighting processing based on an attention mechanism through a feature fusion branch in a target area determination model based on the first color image data and the first calibration depth data to obtain a first fusion feature map;
the device further comprises:
the color characteristic map acquisition module is used for carrying out data processing through a depth image processing branch in the target area determination model based on the first color image data to obtain a first color characteristic map;
the depth feature map acquisition module is used for performing data processing through a depth image processing branch in the target area determination model based on the first depth image data to obtain a first depth feature map;
the targetarea determination module 1005, further configured to,
and determining a target area corresponding to the first image based on the first fusion feature map, the first depth feature map and the first color feature map.
The target area determination model is a machine learning model obtained by training with a third sample image as a sample and a target area corresponding to the third sample image as an annotation.
In one possible implementation, the feature fusion branch includes a first pooling layer, a second pooling layer, a first fully-connected layer, and a second fully-connected layer;
the fusion feature obtaining module 1004 includes:
the first pooling submodule is used for carrying out global pooling through a first pooling layer based on the first color image data to obtain first color pooling data;
the first full-connection submodule is used for carrying out data processing through a first full-connection layer based on the first color pooling data to obtain a first color vector;
the second pooling sub-module is used for carrying out global pooling through a second pooling layer based on the first depth image data to obtain first depth pooling data;
the second full-connection submodule is used for carrying out data processing through a second full-connection layer based on the first depth pooling data to obtain a first depth vector;
a fusion feature obtaining sub-module, configured to perform channel attention weighting processing through a first color vector and a first depth vector based on the first color image data and the first calibration depth data, to obtain the first fusion feature map; the first color vector is used for indicating the weight corresponding to the first color image data; the first depth vector is used to indicate a weight to which the first depth image data corresponds.
In one possible implementation, the apparatus further includes:
the third image acquisition module is used for acquiring a third sample image; the third sample image includes third sample color image data and third sample depth image data;
a third estimation data obtaining module, configured to obtain third sample estimation depth data based on the third sample color image data;
a third calibration data obtaining module, configured to obtain third sample calibration depth data based on the third sample estimated depth data and the third sample depth image data;
a third fused feature obtaining module, configured to perform weighting processing through a feature fusion branch in a target area determination model based on the third sample color image data and the third sample calibration depth data, to obtain a third sample fusion feature map;
and a region determination model training module, configured to train the target area determination model based on the third sample fusion feature map and the target area corresponding to the third sample image.
In one possible implementation, the apparatus further includes:
a third color feature obtaining module, configured to perform data processing through a color image processing branch in the target area determination model based on the third sample color image data, to obtain a third sample color feature map;
a third depth feature obtaining module, configured to perform data processing through a depth image processing branch in the target area determination model based on the third sample depth image data, to obtain a third sample depth feature map;
the region determination model training module is further configured to train the target area determination model based on the third sample color feature map, the third sample depth feature map, the third sample fusion feature map and the target area corresponding to the third sample image.
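As a rough illustration of this joint supervision, the sketch below assumes that the color, depth and fusion branches each end in a prediction head whose output is compared with the annotated target region; binary cross-entropy and equal loss weights are assumptions made for illustration and are not specified by the embodiments.

```python
import torch
import torch.nn.functional as F


def joint_training_loss(fusion_pred: torch.Tensor,
                        color_pred: torch.Tensor,
                        depth_pred: torch.Tensor,
                        target_region: torch.Tensor) -> torch.Tensor:
    """Sketch of joint supervision: the fusion, color and depth branches each
    predict a target-region map, and every prediction is supervised by the
    annotated target region. Loss form and equal weights are assumptions."""
    loss_fusion = F.binary_cross_entropy_with_logits(fusion_pred, target_region)
    loss_color = F.binary_cross_entropy_with_logits(color_pred, target_region)
    loss_depth = F.binary_cross_entropy_with_logits(depth_pred, target_region)
    return loss_fusion + loss_color + loss_depth
```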
To sum up, in the solution shown in this embodiment of the present application, estimated depth data corresponding to a first image is obtained from the first color image data in the first image, the first depth image data corresponding to the first image is corrected according to the estimated depth data to obtain calibration depth data, the calibration depth data and the color image data are fused, and a target area is determined according to the fused feature map. Because the depth information corresponding to the first image is estimated from the color image and then used to correct the depth image before fusion with the color image data, the accuracy of determining the target area is improved.
Fig. 11 is a block diagram illustrating a structure of a target area determination apparatus according to an exemplary embodiment. The target area determination apparatus may implement all or part of the steps in the method provided by the embodiment shown in Fig. 2 or Fig. 4, and the target area determination apparatus includes:
a third sampleimage obtaining module 1101, configured to obtain a third sample image; the third sample image includes third sample color image data and third sample depth image data;
a third sampleestimation obtaining module 1102, configured to obtain third sample estimation depth data based on the third sample color image data; the third sample estimated depth data is used to indicate depth information to which the third sample color image data corresponds;
a third samplecalibration obtaining module 1103, configured to obtain third sample calibration depth data based on the third sample estimated depth data and the third sample depth image data;
a third sample fusion feature obtaining module 1104, configured to perform weighting processing through a feature fusion branch in a target area determination model based on the third sample color image data and the third sample calibration depth data, to obtain a third sample fusion feature map;
a region determination model training module 1105, configured to train the target area determination model based on the third sample fusion feature map and a target area corresponding to the third sample image;
the trained target area determination model is used for processing color image data corresponding to the first image and calibration depth data corresponding to the first image to obtain a target area corresponding to the first image.
To sum up, in the solution shown in this embodiment of the present application, estimated depth data corresponding to a first image is obtained from the first color image data in the first image, the first depth image data corresponding to the first image is corrected according to the estimated depth data to obtain calibration depth data, the calibration depth data and the color image data are fused, and a target area is determined according to the fused feature map. Because the depth information corresponding to the first image is estimated from the color image and then used to correct the depth image before fusion with the color image data, the accuracy of determining the target area is improved.
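One plausible way the trained components fit together at inference time is sketched below. The confidence-weighted blend of the raw and estimated depth is an assumption consistent with the confidence-based weighting described in the embodiments, and depth_estimator, confidence_model and fusion_model are placeholder callables rather than names used in the patent.

```python
import torch


def predict_target_region(color_image: torch.Tensor,
                          raw_depth: torch.Tensor,
                          depth_estimator,
                          confidence_model,
                          fusion_model) -> torch.Tensor:
    """Sketch of the overall pipeline: estimate depth from the color image,
    blend the raw depth with the estimate using a confidence weight, and feed
    the color data and the calibrated depth into the fusion model to obtain
    the target region. All three models are placeholders."""
    with torch.no_grad():
        est_depth = depth_estimator(color_image)           # estimated depth data
        conf = confidence_model(raw_depth)                  # confidence of the raw depth map
        calib_depth = conf * raw_depth + (1.0 - conf) * est_depth  # calibrated depth data
        return fusion_model(color_image, calib_depth)       # target area / saliency map
```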
FIG. 12 is a block diagram illustrating a computer device according to an example embodiment. The computer device may be implemented as the target area determination device in the above-described method embodiments. The computer device 1200 includes a Central Processing Unit (CPU) 1201, a system memory 1204 including a Random Access Memory (RAM) 1202 and a Read-Only Memory (ROM) 1203, and a system bus 1205 connecting the system memory 1204 and the Central Processing Unit 1201. The computer device 1200 also includes a basic input/output system 1206, which facilitates transfer of information between components within the computer, and a mass storage device 1207, which stores an operating system 1213, application programs 1214, and other program modules 1215.
The mass storage device 1207 is connected to the central processing unit 1201 through a mass storage controller (not shown) connected to the system bus 1205. The mass storage device 1207 and its associated computer-readable media provide non-volatile storage for the computer device 1200. That is, the mass storage device 1207 may include a computer-readable medium (not shown) such as a hard disk or a Compact Disc Read-Only Memory (CD-ROM) drive.
Without loss of generality, the computer-readable media may comprise computer storage media and communication media. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. Computer storage media include RAM, ROM, flash memory or other solid-state storage technology, CD-ROM or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will appreciate that computer storage media are not limited to the foregoing. The system memory 1204 and the mass storage device 1207 described above may be collectively referred to as memory.
The computer device 1200 may be connected to the Internet or other network devices through a network interface unit 1211 connected to the system bus 1205.
The memory further includes one or more programs, which are stored in the memory; the central processing unit 1201 implements all or part of the steps of the methods shown in Fig. 2, Fig. 3, or Fig. 4 by executing the one or more programs.
In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions, such as a memory including computer programs (instructions), is also provided; the instructions are executable by a processor of a computer device to perform the methods shown in the embodiments of the present application. For example, the non-transitory computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a Compact Disc Read-Only Memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, a computer program product or computer program is also provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the methods shown in the various embodiments described above.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (15)

1. A method for determining a target area, the method comprising:
acquiring a first image; the first image comprises first color image data and first depth image data;
obtaining first estimated depth data based on the first color image data; the first estimated depth data is used to indicate depth information to which the first color image data corresponds;
obtaining first calibration depth data based on the first estimated depth data and the first depth image data;
performing weighting processing based on the first color image data and the first calibration depth data to obtain a first fusion feature map;
and determining a target area corresponding to the first image based on the first fusion feature map.
2. The method of claim 1, wherein before the obtaining first estimated depth data based on the first color image data, the method further comprises:
obtaining a confidence corresponding to the first depth image data based on the first depth image data; the confidence corresponding to the first depth image data is used for indicating the accuracy of the image data corresponding to the target area in the first depth image data;
the obtaining first calibration depth data based on the first estimated depth data and the first depth image data comprises:
performing weighting processing on the first estimated depth data and the first depth image data based on the confidence corresponding to the first depth image data, to obtain the first calibration depth data.
3. The method of claim 2, wherein the obtaining a confidence corresponding to the first depth image data based on the first depth image data comprises:
based on the first depth image data, performing data processing through a confidence discrimination model to obtain the confidence corresponding to the first depth image data;
the obtaining first estimated depth data based on the first color image data comprises:
based on the first color image data, performing data processing through a depth estimation model to obtain first estimated depth data;
the confidence discrimination model is a machine learning model trained by taking a first sample image as a sample and taking a confidence type corresponding to the first sample image as a label;
the depth estimation model is a machine learning model trained by taking a second sample image as a sample and taking depth image data corresponding to the second sample image as a label; the second sample image is a sample image whose confidence satisfies a first specified condition.
4. The method of claim 3, further comprising:
acquiring a first training sample set; the first training sample set comprises a first sample image and a confidence type corresponding to the first sample image;
based on the first sample image, performing data processing through the confidence discrimination model to obtain a confidence probability corresponding to the first sample image; the confidence probability is used to indicate a probability that the first sample image is a positive sample;
and training the confidence discrimination model based on the confidence probability corresponding to the first sample image and the confidence type corresponding to the first sample image.
5. The method of claim 4, wherein the obtaining a first set of training samples comprises:
acquiring a second training sample set; the second training sample set comprises training sample images and target areas corresponding to the training sample images; the training sample image comprises training color sample data and training depth sample data;
processing the training color sample data through a color image processing branch in a target area determination model to obtain a prediction region corresponding to the training color sample data;
processing the training depth sample data through a depth image processing branch in the target area determination model to obtain a prediction region corresponding to the training depth sample data;
determining a confidence score of the training sample image based on a prediction region corresponding to the training color sample data, a prediction region corresponding to the training depth sample data and a target region corresponding to the training sample image;
in response to the confidence score of the training sample image satisfying a specified condition, determining a confidence type of the training sample image and determining the training sample image as the first sample image;
the color image processing branch in the target area determination model is a machine learning model obtained by pre-training by taking a sample color image as a sample and taking a target area corresponding to the sample color image as a label;
the depth image processing branch in the target area determination model is a machine learning model obtained by pre-training with a sample depth image as a sample and a target area corresponding to the sample depth image as an annotation.
6. The method of claim 5, wherein the confidence scores comprise a color confidence score and a depth confidence score;
determining a confidence score of the training sample image based on the prediction region corresponding to the training color sample data, the prediction region corresponding to the training depth sample data, and the target region corresponding to the training sample image, including:
determining a color confidence score corresponding to the training sample image based on the degree of overlap between the prediction region corresponding to the training color sample data and the target region corresponding to the training sample image;
and determining a depth confidence score corresponding to the training sample image based on the degree of overlap between the prediction region corresponding to the training depth sample data and the target region corresponding to the training sample image.
7. The method of claim 1, wherein the performing a weighting process based on the first color image data and the first calibration depth data to obtain a first fused feature map comprises:
based on the first color image data and the first calibration depth data, performing weighting processing based on an attention mechanism through a feature fusion branch in a target area determination model to obtain a first fusion feature map;
before determining the target region corresponding to the first image based on the first fusion feature map, the method further includes:
based on the first color image data, performing data processing through a color image processing branch in the target area determination model to obtain a first color feature map;
based on the first depth image data, performing data processing through a depth image processing branch in the target area determination model to obtain a first depth feature map;
the determining a target region corresponding to the first image based on the first fusion feature map includes:
determining a target area corresponding to the first image based on the first fusion feature map, the first depth feature map and the first color feature map;
the target area determination model is a machine learning model obtained by training with a third sample image as a sample and a target area corresponding to the third sample image as an annotation.
8. The method of claim 7, wherein the feature fusion branch comprises a first pooling layer, a second pooling layer, a first fully-connected layer, and a second fully-connected layer;
the obtaining the first fusion feature map by performing weighting processing based on an attention mechanism through a feature fusion branch in a target area determination model based on the first color image data and the first calibration depth data comprises:
based on the first color image data, performing global pooling through a first pooling layer to obtain first color pooling data;
based on the first color pooling data, performing data processing through a first full-connection layer to obtain a first color vector;
performing global pooling through a second pooling layer based on the first depth image data to obtain first depth pooling data;
based on the first depth pooling data, performing data processing through a second full-link layer to obtain a first depth vector;
based on the first color image data and the first calibration depth data, performing channel attention weighting processing through a first color vector and a first depth vector to obtain the first fusion feature map; the first color vector is used for indicating the weight corresponding to the first color image data; the first depth vector is used to indicate a weight to which the first depth image data corresponds.
9. The method of claim 7, further comprising:
acquiring a third sample image; the third sample image includes third sample color image data and third sample depth image data;
obtaining third sample estimated depth data based on the third sample color image data;
obtaining third sample calibration depth data based on the third sample estimated depth data and the third sample depth image data;
based on the third sample color image data and the third sample calibration depth data, performing weighting processing through a feature fusion branch in a target area determination model to obtain a third sample fusion feature map;
and training the target area determination model based on the third sample fusion feature map and the target area corresponding to the third sample image.
10. The method according to claim 9, wherein before training the target area determination model based on the third sample fusion feature map and the target area corresponding to the third sample image, the method further comprises:
based on the third sample color image data, performing data processing through a color image processing branch in the target area determination model to obtain a third sample color feature map;
based on the third sample depth image data, performing data processing through a depth image processing branch in the target area determination model to obtain a third sample depth feature map;
the training of the target area determination model based on the third sample fusion feature map and the target area corresponding to the third sample image includes:
and training the target area determination model based on the third sample color feature map, the third sample depth feature map, the third sample fusion feature map and the target area corresponding to the third sample image.
11. A method for determining a target area, the method comprising:
acquiring a third sample image; the third sample image includes third sample color image data and third sample depth image data;
obtaining third sample estimated depth data based on the third sample color image data; the third sample estimated depth data is used to indicate depth information to which the third sample color image data corresponds;
obtaining third sample calibration depth data based on the third sample estimated depth data and the third sample depth image data;
based on the third sample color image data and the third sample calibration depth data, performing weighting processing through a feature fusion branch in a target area determination model to obtain a third sample fusion feature map;
training the target area determination model based on the third sample fusion feature map and a target area corresponding to the third sample image;
the trained target area determination model is used for processing color image data corresponding to the first image and calibration depth data corresponding to the first image to obtain a target area corresponding to the first image.
12. A target area determination apparatus, the apparatus comprising:
the first image acquisition module is used for acquiring a first image; the first image comprises first color image data and first depth image data;
an estimated depth obtaining module, configured to obtain first estimated depth data based on the first color image data; the first estimated depth data is used to indicate depth information to which the first color image data corresponds;
a calibration depth obtaining module, configured to obtain first calibration depth data based on the first estimated depth data and the first depth image data;
a fusion feature obtaining module, configured to perform weighting processing based on the first color image data and the first calibration depth data to obtain a first fusion feature map;
and the target area determining module is used for determining a target area corresponding to the first image based on the first fusion feature map.
13. A target area determination apparatus, the apparatus comprising:
the third sample image acquisition module is used for acquiring a third sample image; the third sample image includes third sample color image data and third sample depth image data;
a third sample estimation obtaining module, configured to obtain third sample estimation depth data based on the third sample color image data; the third sample estimated depth data is used to indicate depth information to which the third sample color image data corresponds;
a third sample calibration acquisition module, configured to obtain third sample calibration depth data based on the third sample estimated depth data and the third sample depth image data;
a third sample fusion feature obtaining module, configured to perform weighting processing through a feature fusion branch in a target area determination model based on the third sample color image data and the third sample calibration depth data, to obtain a third sample fusion feature map;
a region determination model training module, configured to train the target area determination model based on the third sample fusion feature map and a target area corresponding to the third sample image;
the trained target area determination model is used for processing color image data corresponding to the first image and calibration depth data corresponding to the first image to obtain a target area corresponding to the first image.
14. A computer device comprising a processor and a memory, said memory having stored therein at least one instruction, at least one program, set of codes, or set of instructions, said at least one instruction, said at least one program, said set of codes, or set of instructions being loaded and executed by said processor to implement a target area determination method as claimed in any one of claims 1 to 11.
15. A computer readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by a processor to implement a target area determination method as claimed in any one of claims 1 to 11.
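As an illustrative note on claims 5 and 6, the confidence score is derived from the overlap between a branch's predicted region and the annotated target region. A minimal sketch of such an overlap score, assuming binary masks and intersection-over-union as the overlap measure (the claims do not fix the exact measure), is:

```python
import numpy as np


def overlap_confidence_score(pred_region: np.ndarray, target_region: np.ndarray) -> float:
    """Overlap (intersection-over-union) between a predicted region and the
    annotated target region, usable as a color or depth confidence score."""
    pred = pred_region.astype(bool)
    target = target_region.astype(bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0  # both regions empty: treat as perfect agreement
    return float(np.logical_and(pred, target).sum() / union)
```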
CN202110234692.6A | 2021-03-03 | 2021-03-03 | Target area determination method, device, equipment and storage medium | Pending | CN113705562A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202110234692.6A (CN113705562A (en)) | 2021-03-03 | 2021-03-03 | Target area determination method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202110234692.6A (CN113705562A (en)) | 2021-03-03 | 2021-03-03 | Target area determination method, device, equipment and storage medium

Publications (1)

Publication Number | Publication Date
CN113705562A (en) | 2021-11-26

Family

ID=78647811

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202110234692.6A | Pending | CN113705562A (en) | 2021-03-03 | 2021-03-03

Country Status (1)

Country | Link
CN (1) | CN113705562A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN114332559A (en)*2021-12-172022-04-12安徽理工大学 An RGB-D Saliency Object Detection Method Based on Adaptive Cross-modal Fusion Mechanism and Deep Attention Network
WO2024222197A1 (en)*2023-04-272024-10-31腾讯科技(深圳)有限公司Depth image generation method and apparatus, electronic device, computer readable storage medium, and computer program product


Similar Documents

Publication | Publication Date | Title
CN113657400A (en) A Text-Guided Image Segmentation Method Based on Attention Mechanism for Cross-modal Text Retrieval
CN111050219A (en)Spatio-temporal memory network for locating target objects in video content
WO2022156640A1 (en)Gaze correction method and apparatus for image, electronic device, computer-readable storage medium, and computer program product
CN112836625A (en)Face living body detection method and device and electronic equipment
WO2021208601A1 (en)Artificial-intelligence-based image processing method and apparatus, and device and storage medium
CN115050064A (en)Face living body detection method, device, equipment and medium
CN114283315A (en) An RGB-D Saliency Object Detection Method Based on Interactive Guided Attention and Trapezoid Pyramid Fusion
CN110210492B (en)Stereo image visual saliency detection method based on deep learning
CN112329662B (en)Multi-view saliency estimation method based on unsupervised learning
CN114282059A (en) Method, device, device and storage medium for video retrieval
CN117557775A (en)Substation power equipment detection method and system based on infrared and visible light fusion
KR20240144139A (en) Facial pose estimation method, apparatus, electronic device and storage medium
CN115661482B (en) A RGB-T Salient Object Detection Method Based on Joint Attention
AU2021240205B1 (en)Object sequence recognition method, network training method, apparatuses, device, and medium
CN114693951A (en) An RGB-D Saliency Object Detection Method Based on Global Context Information Exploration
CN116958267B (en)Pose processing method and device, electronic equipment and storage medium
CN116051944A (en) Defect image generation method, system, and storage medium based on attribute semantic separation
CN119762847A (en)Small sample anomaly detection and classification framework based on reconstruction guided cross-modal alignment
Zong et al.A cascaded refined rgb-d salient object detection network based on the attention mechanism
CN113705562A (en)Target area determination method, device, equipment and storage medium
CN114743162A (en)Cross-modal pedestrian re-identification method based on generation of countermeasure network
CN120147111B (en)Method for automatically processing portrait photo into standard certificate photo
CN116452914A (en)Self-adaptive guide fusion network for RGB-D significant target detection
He et al.Sihenet: Semantic interaction and hierarchical embedding network for 360 salient object detection
CN116630758A (en)Space-time action detection method based on double-branch multi-stage feature fusion

Legal Events

Date | Code | Title | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
