Detailed Description
Various exemplary embodiments, features and aspects of the disclosure will be described in detail below with reference to the drawings. In the drawings, like reference numbers indicate identical or functionally similar elements. Although various aspects of the embodiments are illustrated in the accompanying drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
The term "and/or" is merely an association relationship describing the associated object, and means that three relationships may exist, for example, a and/or B may mean that a exists alone, while a and B exist together, and B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, may mean including any one or more elements selected from the group consisting of A, B and C.
Furthermore, numerous specific details are set forth in the following detailed description in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements, and circuits well known to those skilled in the art have not been described in detail in order not to obscure the present disclosure.
Fig. 1 shows a flowchart of an image processing method according to an embodiment of the present disclosure. As shown in Fig. 1, the method includes:
 In step S11, feature extraction is performed on a color image and a corresponding depth image respectively to obtain N-level first feature maps of the color image and N-level second feature maps of the depth image, wherein the scales of the first feature maps at different levels of the N-level first feature maps are different from one another, the scales of the second feature maps at different levels of the N-level second feature maps are different from one another, N is an integer, and N is greater than or equal to 2;
 In step S12, an nth first feature map of the N-level first feature maps is fused with an nth second feature map of the N-level second feature maps to obtain an nth third feature map of N-level third feature maps, wherein n is an integer and 1 ≤ n ≤ N;
 In step S13, fusion processing is performed on the N-level third feature maps, and a fusion feature map is obtained according to the fusion processing result;
 in step S14, the processing results of the color image and the corresponding depth image are determined according to the fusion feature map.
In a possible implementation manner, the image processing method may be performed by an electronic device such as a terminal device or a server, where the terminal device may be a User Equipment (UE), a mobile device, a user terminal, a cellular phone, a cordless phone, a Personal Digital Assistant (PDA), a handheld device, a computing device, an in-vehicle device, a wearable device, or the like, and the method may be implemented by a processor invoking computer-readable instructions stored in a memory. Alternatively, the method may be performed by a server.
In one possible implementation, in step S11, the color image and the corresponding depth image may be images of a real scene acquired by an image acquisition device (e.g., a depth camera). The color image may be a multi-channel image, for example an RGB (Red, Green, Blue) image, and the depth image may be a depth map (Depth Map) of the color image.
In one possible implementation, the color image and the corresponding depth image may be acquired by an image acquisition device, or may be acquired from a database or other device in which the color image and the corresponding depth image are stored. The embodiments of the present disclosure do not limit the source of the color image and the corresponding depth image.
In one possible implementation manner, in step S11, multi-level feature extraction may be performed on the color image and the corresponding depth image by applying a multi-level convolution operation to each of them respectively, or by applying a multi-level convolution operation together with a multi-level batch normalization (Batch Normalization) operation, so as to obtain the N-level first feature maps of the color image and the N-level second feature maps of the depth image respectively. When the multi-level batch normalization operation is used, it may be performed after each level of convolution operation.
In one possible implementation, the color image and the corresponding depth image may each be subjected to a multi-level convolution operation, i.e., multi-level feature extraction, through multi-level convolution layers of a neural network. The multi-level convolution layers may include two sets of multi-level convolution layers, and each level of convolution layer may be followed by a batch normalization process. The first feature maps and the second feature maps may respectively be the outputs of the two sets of multi-level convolution layers at different levels, where the scale of the feature maps (i.e., the resolution, or the height and width, of the feature maps) output by the convolution layers decreases level by level. The feature maps output at the same level of the two sets of multi-level convolution layers have the same scale, so that the feature maps output at the same level can be fused in step S12.
For example, the RGB image and the corresponding depth image may each be subjected to multi-level feature extraction by two sets of 3-level convolution layers, i.e., the RGB image is subjected to feature extraction by one set of 3-level convolution layers, and the corresponding depth image is subjected to feature extraction by the other set of 3-level convolution layers. The scales of the feature maps output by the successive levels of one set of 3-level convolution layers may be, for example, 1/2, 1/4 and 1/8 of the color image in sequence, and the scales of the feature maps output by the successive levels of the other set of 3-level convolution layers may be, for example, 1/2, 1/4 and 1/8 of the depth image in sequence. The scales of the feature maps output by the same level of the two sets of 3-level convolution layers may be the same; for example, the scale of the level-1 first feature map output by the level-1 convolution layer of one set is the same as the scale of the level-1 second feature map output by the level-1 convolution layer of the other set, and so on.
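For illustration only, the two-branch, 3-level feature extraction described above can be sketched in PyTorch-style code as follows; the module name, channel counts and stride-2 3×3 convolutions are assumptions of this sketch rather than details fixed by the present disclosure.

import torch
import torch.nn as nn

def conv_bn(in_ch, out_ch):
    # One "level": a 3x3 convolution with stride 2 (halving the scale) followed by batch normalization.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class TwoBranchEncoder(nn.Module):
    # Hypothetical 3-level extractors: one set of convolution layers for the RGB image, one for the depth image.
    def __init__(self, channels=(32, 64, 128)):
        super().__init__()
        self.rgb_levels = nn.ModuleList(
            [conv_bn(3, channels[0]), conv_bn(channels[0], channels[1]), conv_bn(channels[1], channels[2])])
        self.depth_levels = nn.ModuleList(
            [conv_bn(1, channels[0]), conv_bn(channels[0], channels[1]), conv_bn(channels[1], channels[2])])

    def forward(self, rgb, depth):
        first_maps, second_maps = [], []
        x, y = rgb, depth
        for rgb_layer, depth_layer in zip(self.rgb_levels, self.depth_levels):
            x = rgb_layer(x)    # level-n first feature map: 1/2, 1/4, 1/8 of the color image
            y = depth_layer(y)  # level-n second feature map, same scale as the first feature map
            first_maps.append(x)
            second_maps.append(y)
        return first_maps, second_maps

# Toy usage: feature maps of matching scales can then be fused level by level in step S12.
firsts, seconds = TwoBranchEncoder()(torch.randn(1, 3, 256, 256), torch.randn(1, 1, 256, 256))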
In one possible implementation manner, as described above, the color image and the corresponding depth image may each be subjected to multi-level feature extraction by a multi-level convolution layer, and the scales of the feature maps output at the same level of the two multi-level convolution layers may be the same. In step S12, the nth first feature map of the N-level first feature maps is fused with the nth second feature map of the N-level second feature maps, so that a first feature map and a second feature map of the same scale are fused and the nth third feature map of the N-level third feature maps is obtained. It can be understood that the N-level first feature maps and the N-level second feature maps are fused to correspondingly obtain N-level third feature maps. For example, the level-1 first feature map and the level-1 second feature map may be fused to obtain the level-1 third feature map X0,0, and so on for the 3 levels of first feature maps and the 3 levels of second feature maps, obtaining 3 levels of third feature maps.
In one possible implementation manner, feature maps of the same scale may be fused by connecting them, i.e., concatenating the feature maps in the channel direction, to generate a fused feature map; for example, a first feature map of length×width×channel = 64×64×3 may be connected with a second feature map of 64×64×3 in the channel direction to obtain a third feature map of 64×64×6. Alternatively, the fused feature map may be generated by adding the pixel values of the corresponding channels of the feature maps, i.e., adding the pixel values of a first feature map of 64×64×3 and a second feature map of 64×64×3 on each corresponding channel to obtain a third feature map of 64×64×3. The embodiments of the present disclosure do not limit the fusion mode between the feature maps.
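The two fusion options mentioned above can be illustrated with a short, assumed example; the shapes follow the 64×64 figures in the text, and this is a sketch of one possible fusion rather than the only one.

import torch

a = torch.randn(1, 3, 64, 64)  # first feature map: 64x64 with 3 channels
b = torch.randn(1, 3, 64, 64)  # second feature map of the same scale

# Option 1: connect in the channel direction -> third feature map of 64x64 with 3 + 3 = 6 channels.
concat_fused = torch.cat([a, b], dim=1)

# Option 2: add the pixel values of corresponding channels -> third feature map of 64x64 with 3 channels.
add_fused = a + b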
In a possible implementation manner, in step S13, the fusion processing of the N-level third feature maps may be performed by extracting multi-scale feature maps from the N-level third feature maps through N levels of fusion layers of the neural network, and then performing feature fusion among the extracted multi-scale feature maps. The N-level third feature maps may respectively serve as the inputs of the N levels of fusion layers. The scale of the feature maps output by the different fusion layers decreases level by level. Each level of fusion layer may include multi-level convolution layers, so as to perform a multi-level convolution operation on the third feature map input to that layer and extract multi-level features. The convolution layers included in the same fusion layer do not change the scale of the output feature map, that is, the scales (i.e., resolutions) of the feature maps output by the convolution layers in the same fusion layer are the same, and the scale of the feature maps output by each level of fusion layer is the same as the scale of the third feature map of the corresponding level. Each level of convolution operation may be followed by a batch normalization process.
In one possible implementation, the N-level fusion layer may implement fusion processing of the N-level third feature map by adopting, but not limited to, operations such as scale-up, scale-down, jump connection, and transverse connection.
In one possible implementation manner, scale reduction may be realized through a downsampling and convolution operation and may shrink a feature map to 1/2 of its original scale; the reduced feature map is then used as an input of a fusion process in the next-level fusion layer. Scale enlargement may be realized through an upsampling and convolution operation and may enlarge a feature map to 2 times its original scale; the enlarged feature map is then used as an input of a fusion process in the previous-level fusion layer. Feature maps of different scales can be fused through scale enlargement and scale reduction.
In one possible implementation, a jump connection may be used to fuse feature maps output by non-adjacent convolution layers in the same fusion layer, and a transverse connection may be used to fuse feature maps output by adjacent convolution layers in the same fusion layer. The transverse connection can be realized through a convolution operation that does not change the scale of the feature map, that is, the feature map obtained by the previous fusion processing is subjected to a convolution operation that does not change its scale, and the feature map output by that convolution operation is then used as an input of the next fusion processing. Different feature maps of the same scale can be fused through jump connections and transverse connections.
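The three operations described above (scale reduction, scale enlargement and the scale-preserving transverse connection) can be sketched as follows; the exact kernel sizes and the use of bilinear upsampling are assumptions of this sketch, since the disclosure only states that downsampling halves and upsampling doubles the feature-map scale.

import torch.nn as nn
import torch.nn.functional as F

class ScaleDown(nn.Module):
    # Scale reduction: convolution + downsampling, shrinking the feature map to 1/2 of its scale.
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
                                  nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
    def forward(self, x):
        return self.conv(x)

class ScaleUp(nn.Module):
    # Scale enlargement: upsampling + convolution, enlarging the feature map to 2 times its scale.
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1),
                                  nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
    def forward(self, x):
        return self.conv(F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=False))

class Lateral(nn.Module):
    # Transverse connection: a convolution operation that does not change the scale of the feature map.
    def __init__(self, ch):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1),
                                  nn.BatchNorm2d(ch), nn.ReLU(inplace=True))
    def forward(self, x):
        return self.conv(x)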
In one possible implementation, the fusion processing result may include the feature maps obtained by the fusion processing. Obtaining the fusion feature map according to the fusion processing result may mean taking a feature map obtained by a certain fusion processing as the fusion feature map; for example, the feature map obtained by the last fusion processing of the N-level third feature maps may be used as the fusion feature map.
In one possible implementation, the neural network in embodiments of the present disclosure may be trained using a random gradient descent method, the batch size may be 64, and all parameters of the neural network may be randomly initialized. The training mode of the neural network is not limited by the embodiment of the disclosure.
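A rough sketch of the stated training configuration (stochastic gradient descent, batch size 64, randomly initialized parameters) is given below; the learning rate, momentum, loss function and the placeholder model are assumptions of this sketch and are not specified by the disclosure.

import torch
import torch.nn as nn

model = nn.Conv2d(4, 2, kernel_size=1)  # placeholder for the fusion network; PyTorch initializes parameters randomly
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)  # stochastic gradient descent
criterion = nn.CrossEntropyLoss()

# One illustrative step on a synthetic batch of 64 RGB-D samples (3 RGB channels + 1 depth channel).
inputs = torch.randn(64, 4, 32, 32)
labels = torch.randint(0, 2, (64, 32, 32))  # per-pixel class labels
optimizer.zero_grad()
loss = criterion(model(inputs), labels)
loss.backward()
optimizer.step()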
In a possible implementation manner, in step S14, the processing result of determining the color image and the corresponding depth image according to the fusion feature map may be that the fusion feature map is directly subjected to segmentation processing to obtain a segmentation result, or the fusion feature map may be further processed, and the segmentation processing is performed according to the processed fusion feature map to obtain a segmentation result. Where the segmentation process may include semantic segmentation, instance segmentation, etc., embodiments of the present disclosure do not limit the manner in which the segmentation process may be performed.
In one possible implementation manner, the processing result of the color image and the corresponding depth image may be the above-mentioned segmentation result, or may be a result obtained by reprocessing the above-mentioned segmentation result according to an actual image processing task. For example, in the image editing task, the foreground region and the background region may be distinguished according to the segmentation result, and corresponding processing may be performed on the foreground region and/or the background region, for example, blurring processing may be performed on the background region, to obtain a final image processing result. The embodiment of the disclosure does not limit the specific content included in the segmentation mode and the processing result of the feature map.
It should be noted that, although the segmentation processing based on the fusion feature map is described above as an example, those skilled in the art will understand that the present disclosure is not limited thereto. In fact, the fusion feature map can be used for various image processing tasks according to the actual application scenario; for example, an image classification task or a target detection task may be performed based on the fusion feature map, where the image classification task may be determining classification results of the color image and the corresponding depth image according to the fusion feature map, and the target detection task may be determining target detection results of the color image and the corresponding depth image according to the fusion feature map.
In the embodiments of the present disclosure, feature extraction is performed on the color image and the corresponding depth image respectively to obtain N-level first feature maps of the color image and N-level second feature maps of the depth image, so that semantic information of a certain depth can be extracted from the color image and from the depth image respectively. The nth first feature map of the N-level first feature maps is fused with the nth second feature map of the N-level second feature maps to obtain the nth third feature map of the N-level third feature maps, so that the semantic information contained in the obtained N-level third feature maps is more accurate and more critical. Fusion processing is then performed on the N-level third feature maps, and a fusion feature map is obtained according to the fusion processing result, so that the fusion feature map contains richer and more accurate semantic information without excessive loss of resolution. Determining the processing results of the color image and the depth image according to the fusion feature map can therefore improve the accuracy of image processing.
In a possible implementation manner, in step S13, performing fusion processing on the N-level third feature maps and obtaining the fusion feature map according to the fusion processing result may include:
 taking the third feature map of the 1st stage as the first fourth feature map of the 1st stage;
 carrying out fusion processing on the third feature map of the kth stage to obtain the first fourth feature map of the kth stage, wherein k is an integer and 1 < k ≤ N;
 and carrying out fusion processing on the first fourth feature maps of the respective stages, and obtaining the fusion feature map according to the fusion processing result.
In one possible implementation manner, fusing the third feature map of the kth stage to obtain the first fourth feature map of the kth stage may mean that the first fourth feature map of the kth stage is obtained by fusing the first fourth feature map of the (k-1)th stage with the third feature map of the kth stage. The first fourth feature map of the (k-1)th stage may be processed, by scale reduction, into a feature map of the same scale as the third feature map of the kth stage, so that fusion between feature maps of different scales is realized.
In one possible implementation, performing fusion processing on the first fourth feature maps of the respective stages may mean performing fusion processing multiple times on the fourth feature maps of the 1st stage to the Nth stage. As described above, the fusion processing result may be the feature maps obtained by the fusion processing; multiple fusion processing operations yield multiple feature maps, and the fusion feature map is then determined according to these feature maps.
As described above, the fusion processing of the N-level third feature maps may be implemented through a neural network. Fig. 2 shows a schematic structural diagram of a neural network according to an embodiment of the present disclosure. The left network in the neural network shown in Fig. 2 may be used to derive the 4 levels of third feature maps from the RGB image and the depth image, and the right network is used to fuse the 4 levels of third feature maps, where the right network adopts connection modes such as scale enlargement, scale reduction, jump connection and transverse connection to realize the fusion between the third feature maps. In Fig. 2, the symbols in the left network represent the first feature maps, the second feature maps and the third feature maps, and X0,0~X3,4 each represent a fourth feature map, among which X0,7 may be the fusion feature map.
For ease of understanding, the above fusion process is described using the neural network shown in Fig. 2 as an example. Taking the third feature map of the 1st stage as the first fourth feature map of the 1st stage may be taking the level-1 third feature map as X0,0. Carrying out fusion processing on the third feature map of the kth stage to obtain the first fourth feature map of the kth stage may, for example, be carrying out fusion processing on the level-2 third feature map to obtain X1,0, carrying out fusion processing on the level-3 third feature map to obtain X2,0, and carrying out fusion processing on the level-4 third feature map to obtain X3,0. Carrying out fusion processing on the first fourth feature maps of the respective stages is then performing fusion processing multiple times on X0,0~X3,0 to obtain a plurality of fourth feature maps, and the fusion feature map is further obtained according to the plurality of fourth feature maps.
In the embodiment of the disclosure, the N-level third feature images are subjected to fusion processing to obtain N-level first fourth feature images, and then the first fourth feature images at all levels are subjected to fusion processing, so that fusion feature images are obtained according to fusion processing results, and the fusion feature images can be effectively obtained.
In one possible implementation manner, carrying out fusion processing on the third feature map of the kth level to obtain the first fourth feature map of the kth level may include:
 Performing scale reduction on the first fourth feature map of the k-1 level to obtain a first fifth feature map of the k level;
 And carrying out fusion processing on the third characteristic diagram of the kth level and the first fifth characteristic diagram of the kth level to obtain the first fourth characteristic diagram of the kth level.
To facilitate understanding of the above fusion process, the neural network shown in Fig. 2 is again taken as an example. Performing scale reduction on the first fourth feature map of the (k-1)th level to obtain the first fifth feature map of the kth level may, for example, be performing scale reduction on the 2nd-level first fourth feature map X1,0 to obtain the 3rd-level first fifth feature map, and then performing fusion processing on the 3rd-level first fifth feature map and the 3rd-level third feature map to obtain the 3rd-level first fourth feature map X2,0. The process of obtaining the first fourth feature maps of the other levels is similar and is not repeated here.
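Continuing the sketches above, the first fourth feature map of each level (the first column X0,0~X3,0 in Fig. 2) could be computed roughly as follows; the helper names and the choice of fusion function are assumptions of this sketch.

# x3   : list of the N third feature maps, ordered from level 1 (largest scale) to level N.
# down : list of scale-reduction modules (e.g. ScaleDown above); down[k-1] maps the level-k scale to the level-(k+1) scale.
# fuse : any same-scale fusion, e.g. lambda a, b: a + b (an assumed choice).
def first_column(x3, down, fuse):
    x4 = [x3[0]]                        # the level-1 third feature map serves as X0,0
    for k in range(1, len(x3)):
        fifth = down[k - 1](x4[k - 1])  # first fifth feature map of level k+1, obtained by scale reduction
        x4.append(fuse(x3[k], fifth))   # fuse with the level-(k+1) third feature map -> its first fourth feature map
    return x4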
In the embodiment of the disclosure, the first fourth feature map of the kth stage can be effectively obtained according to the third feature map of the kth stage.
In a possible implementation manner, performing fusion processing on the first fourth feature maps of the respective stages may include:
 carrying out the (m-1)th fusion processing on the first fourth feature map of the nth stage to obtain the mth fourth feature map of the nth stage, wherein m is an integer, 1 < m ≤ M-n+2, and M is the number of times of fusion processing performed on the first fourth feature map of the 1st stage.
In one possible implementation, M may be a preset number of fusion processes performed by the level 1 fusion layer of the neural network, that is, a number of fusion processes performed on the level 1 first fourth feature map. The fusion processing times of the whole neural network are related to the fusion processing times of the 1 st fusion layer.
In one possible implementation manner, the number of times of fusion processing performed on the first fourth feature map of the stage by the 1 st stage to the nth stage fusion layer can be set to be gradually reduced, so that the neural network can use less downsampling and learn more on the high-resolution feature map.
In one possible implementation manner, after the number of times of fusion processing performed by the level-1 fusion layer on the level-1 first fourth feature map is set, the number of times of fusion processing performed by the level-n fusion layer on the level-n first fourth feature map may be determined according to M, for example as M-n+1. For example, if the level-1 fusion layer performs 7 fusion processing operations on the level-1 first fourth feature map, that is, M=7, then the level-2 fusion layer performs 6 fusion processing operations and the level-3 fusion layer performs 5 fusion processing operations.
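As a quick arithmetic check of this counting rule (level n performs M-n+1 fusion processing operations), using the Fig. 2 values:

M, N = 7, 4
fusion_counts = [M - n + 1 for n in range(1, N + 1)]
print(fusion_counts)  # [7, 6, 5, 4]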
To facilitate understanding of the above fusion process, a neural network shown in fig. 2 is taken as an example. For example, the first fourth feature map X1,0 of the 2 nd stage is subjected to the 1 st fusion processing to obtain the 2 nd fourth feature map X1,1 of the 2 nd stage, then the 2 nd fusion processing is performed to obtain the 3rd fourth feature map X1,2 of the 2 nd stage, and so on until the 6 th fusion processing is performed to obtain the 7 th fourth feature map X1,6 of the 2 nd stage.
In one possible implementation, considering that a high-resolution feature map contains more image detail when the processing result of the corresponding image is determined from a feature map, obtaining the fusion feature map according to the fusion processing result in step S13 may include taking the (M+1)th fourth feature map of the 1st level (such as X0,7 in Fig. 2), i.e., the last fourth feature map output by the level-1 fusion layer, as the fusion feature map, so that the processing result is determined based on a high-resolution feature map. Meanwhile, the (M+1)th fourth feature map of the 1st level, obtained by fusion through the multi-level fusion layers of the neural network, also contains semantic information of different levels, so that the fusion feature map contains deeper semantic features while remaining a high-resolution feature map.
In the embodiments of the present disclosure, the process of performing the (m-1)th fusion processing on the first fourth feature map of the nth stage to obtain the mth fourth feature map of the nth stage is relatively complex; for clarity of description, it will be described in detail below.
In the embodiments of the present disclosure, the (m-1)th fusion processing is performed on the first fourth feature map of the nth stage to obtain the mth fourth feature map of the nth stage, where m is an integer and 1 < m ≤ M-n+2, so that the neural network can effectively use less downsampling and perform more learning on high-resolution feature maps, and the obtained fusion feature map therefore contains deeper semantic features without loss of resolution.
The fusion processing procedure by which the mth fourth feature map of the nth stage is obtained by performing the (m-1)th fusion processing on the first fourth feature map of the nth stage is described below.
In one possible implementation manner, in response to the case where n=1, the m-1 th fusion process is performed on the first fourth feature map of the nth stage to obtain an mth fourth feature map of the nth stage, including:
 performing scale enlargement on the (m-1)th fourth feature map of the (n+1)th stage to obtain the (m-1)th fifth feature map of the nth stage;
 and carrying out fusion processing on the first m-1 fourth feature maps of the nth stage and the (m-1)th fifth feature map of the nth stage to obtain the mth fourth feature map of the nth stage.
In one possible implementation, n=1 corresponds to the 1st fusion layer of the N levels of fusion layers of the neural network.
Fig. 2 is a schematic structural diagram of a neural network according to an embodiment of the present disclosure, and for convenience of understanding the above-mentioned process of obtaining the nth-level mth fourth feature map, the neural network shown in fig. 2 is described as follows:
 For example, when m=2, the 2nd fourth feature map X0,1 of level 1 is to be obtained (i.e., the 1st fusion processing for the first fourth feature map of level 1): the first fourth feature map X1,0 of level 2 is scale-enlarged into the 1st fifth feature map of level 1, which has the same scale as X0,1, and this fifth feature map of level 1 is fused with the first fourth feature map X0,0 of level 1 to obtain the 2nd fourth feature map X0,1 of level 1.
It can be understood that the above description takes the fourth feature map X0,1 obtained by fusion as an example to describe, in the case where n=1, the 1st fusion processing for the first fourth feature map of level 1; the other fusion processing operations for the first fourth feature map of level 1 are performed in the same manner and are not repeated here.
In the embodiment of the disclosure, the first fourth feature map of the 1 st level is subjected to multiple fusion processing, so that multiple fusion between the fourth feature map of the 1 st level and the feature maps of different scales is realized, sufficient fusion between the feature maps can be realized, and further, the fused feature map with high resolution and rich semantics is obtained.
In one possible implementation manner, in response to the case where 1 < n < N, the (m-1)th fusion processing is performed on the first fourth feature map of the nth stage to obtain the mth fourth feature map of the nth stage, including:
 performing scale reduction on the mth fourth feature map of the (n-1)th stage to obtain the mth sixth feature map of the nth stage;
 performing scale enlargement on the (m-1)th fourth feature map of the (n+1)th stage to obtain the (m-1)th fifth feature map of the nth stage;
 and carrying out fusion processing on the first m-1 fourth feature maps of the nth stage, the (m-1)th fifth feature map of the nth stage and the mth sixth feature map of the nth stage to obtain the mth fourth feature map of the nth stage.
In one possible implementation, 1 < n < N corresponds to the fusion layers of the N levels of fusion layers of the neural network other than the 1st and Nth fusion layers.
In order to facilitate understanding of the process of obtaining the mth fourth feature map of the nth stage in the case where 1 < n < N in the embodiments of the present disclosure, the neural network shown in Fig. 2 is described as follows:
 For example, n=3 and m=3, that is, the 3rd fourth feature map X2,2 of level 3 is obtained (i.e., the 2nd fusion processing for the first fourth feature map of level 3). The 3rd fourth feature map X1,2 of level 2 is scale-reduced to obtain the 3rd sixth feature map of level 3, which has the same scale as X2,2; the 2nd fourth feature map X3,1 of level 4 is scale-enlarged to obtain the 2nd fifth feature map of level 3, which also has the same scale as X2,2; and the 3rd sixth feature map of level 3, the 2nd fifth feature map of level 3 and the first 2 fourth feature maps of level 3 (namely X2,0 and X2,1) are fused to obtain the 3rd fourth feature map X2,2 of level 3.
It can be understood that the above description takes the fourth feature map X2,2 obtained by fusion as an example to describe, in the case where 1 < n < N, the 2nd fusion processing for the first fourth feature map of the 3rd stage; the other fusion processing operations for the first fourth feature map of the nth stage are performed in the same manner and are not repeated here.
In the embodiment of the disclosure, the feature images with different scales can be fused by carrying out multiple fusion processing on the first fourth feature image of the nth stage, so that the feature images can be fully fused, and further the fused feature images with high resolution and rich semantics are obtained.
In one possible implementation manner, in response to the case of n=n, the m-1 th fusion process is performed on the first fourth feature map of the nth stage to obtain an mth fourth feature map of the nth stage, including:
 performing scale reduction on the mth fourth feature map of the (n-1)th stage to obtain the mth sixth feature map of the nth stage;
 and carrying out fusion processing on the first m-1 fourth feature maps of the nth stage and the mth sixth feature map of the nth stage to obtain the mth fourth feature map of the nth stage.
In one possible implementation, n=n represents an nth fusion layer of the N-th fusion layers for the neural network.
To facilitate understanding of the above-described procedure of obtaining the mth fourth feature map of the nth stage in response to n=n, the neural network structure shown in fig. 2 is exemplified as follows:
 For example, m=3, that is, the 3 rd fourth feature map X3,2 of the 4 th level (i.e., the 2 nd fusion process for the first fourth feature map of the 4 th level) is obtained. The 3 rd fourth feature map X2,2 of the 3 rd stage is scaled down to the 3 rd sixth feature map of the 4 th stage, which is the same scale as the fourth feature map X3,2, and the 3 rd sixth feature map of the 4 th stage and the first 2 fourth feature maps of the 4 th stage (namely, X3,0 and X3,1) are fused to obtain the 3 rd fourth feature map X3,2 of the 4 th stage.
It can be understood that the above description takes the fourth feature map X3,2 obtained by fusion as an example to describe, in the case where n=N, the 2nd fusion processing for the first fourth feature map of the 4th stage; the other fusion processing operations for the first fourth feature map of the Nth stage are performed in the same manner and are not repeated here.
In the embodiment of the disclosure, fusion processing is performed on the first fourth feature map of the nth stage for multiple times, so that fusion between deep semantic information and shallow semantic information can be realized, and further a fusion feature map with high resolution and rich semantics is obtained.
In the embodiments of the present disclosure, a fifth feature map refers to a feature map obtained by scale-enlarging a fourth feature map, and a sixth feature map refers to a feature map obtained by scale-reducing a fourth feature map.
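The three cases above (n=1, 1<n<N and n=N) can be tied together in one sketch of the fusion grid of Fig. 2. The sketch below is illustrative only: it assumes a single shared fuse/scale-down/scale-up callable, whereas a real network would use separate convolution modules at each position; the fusion rule itself follows the description above.

import torch
import torch.nn.functional as F

def grid_fusion(x3, scale_down, scale_up, fuse, M):
    # x3[i] is the third feature map of level i+1; M is the number of fusions performed at level 1.
    N = len(x3)
    # First column X{n},0: level 1 reuses its third feature map; each deeper level fuses its third
    # feature map with a scale-reduced copy of the previous level's first fourth feature map.
    x4 = [[x3[0]]]
    for n in range(1, N):
        x4.append([fuse([x3[n], scale_down(x4[n - 1][0])])])
    # Remaining columns: each fusion at a level combines all earlier fourth feature maps of that level,
    # a fifth feature map (scale enlargement from the next level, if any) and a sixth feature map
    # (scale reduction from the previous level, if any).
    for m in range(1, M + 1):
        for n in range(N):
            if m > M - n:                                   # level n+1 performs only M-n fusions
                continue
            inputs = list(x4[n])                            # the earlier fourth feature maps of this level
            if n + 1 < N:
                inputs.append(scale_up(x4[n + 1][m - 1]))   # fifth feature map from the next level
            if n > 0:
                inputs.append(scale_down(x4[n - 1][m]))     # sixth feature map from the previous level
            x4[n].append(fuse(inputs))
    return x4[0][-1]                                        # last fourth feature map of level 1, e.g. X0,7 when M=7

# Toy usage with shape-preserving stand-ins (a real network would also change channel counts):
down = lambda x: F.avg_pool2d(x, 2)
up = lambda x: F.interpolate(x, scale_factor=2)
merge = lambda maps: torch.stack(maps).sum(dim=0)
maps = [torch.randn(1, 8, 64 // 2 ** i, 64 // 2 ** i) for i in range(4)]
fused = grid_fusion(maps, down, up, merge, M=7)             # shape (1, 8, 64, 64), the level-1 scale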
In one possible implementation, as described above, the fused feature map may be further processed, and the processed fused feature map may be subjected to a segmentation process, to obtain a segmentation result. In step S14, determining the processing result of the color image and the corresponding depth image according to the fusion feature map may include:
 The fusion feature map is subjected to scale enlargement to obtain a segmentation feature map, and the height and width of the segmentation feature map are the same as those of the color image;
 and according to the segmentation feature map, segmenting the color image and the corresponding depth image to obtain a processing result of the color image and the corresponding depth image.
In one possible implementation, the upscaling may be achieved by upsampling and convolution operations. By upscaling, the fused feature map can be enlarged to the same resolution (i.e., height and width) as the color image and corresponding depth image. The amplified fusion feature map is the segmentation feature map.
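A rough sketch of this step: the fusion feature map is scale-enlarged back to the height and width of the color image and then classified per pixel. The 1×1 convolution head, bilinear upsampling and argmax decision are assumed choices of this sketch; the disclosure does not fix the form of the segmentation head.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SegmentationHead(nn.Module):
    def __init__(self, in_ch, num_classes):
        super().__init__()
        self.classifier = nn.Conv2d(in_ch, num_classes, kernel_size=1)

    def forward(self, fused, image_size):
        # Scale up the fusion feature map to the height/width of the color image (the segmentation
        # feature map), then produce a per-pixel class label.
        seg_feat = F.interpolate(fused, size=image_size, mode='bilinear', align_corners=False)
        return self.classifier(seg_feat).argmax(dim=1)

# Toy usage: a fused map at 1/2 resolution of a 256x256 input, two classes (e.g. foreground/background).
labels = SegmentationHead(in_ch=8, num_classes=2)(torch.randn(1, 8, 128, 128), image_size=(256, 256))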
In one possible implementation manner, the color image and the corresponding depth image are segmented according to the segmentation feature map, so as to obtain a processing result of the color image and the corresponding depth image, wherein the processing result can be obtained by directly segmenting the segmentation feature map to obtain a segmentation result, or can be obtained by reprocessing the segmentation result according to an actual image processing task. The segmentation process may include semantic segmentation, instance segmentation, and the like, among others. The embodiment of the disclosure does not limit the specific content included in the segmentation method and the processing result of the segmentation feature map.
In the embodiment of the disclosure, the segmentation feature map is obtained by carrying out scale amplification on the fusion feature map, and the image processing result is determined according to the segmentation feature map, so that the image processing based on the fusion feature map with high resolution and rich semantics can be realized, and the accuracy of the image processing is improved.
At present, related studies of semantic segmentation and instance segmentation using RGB-D images are becoming a trend. Most of RGB-D semantic segmentation schemes in the related technology have the defects of inaccurate segmentation, poor real-time performance, complex network structure, poor reproducibility and the like.
In view of the above, according to the image processing method of the embodiment of the present disclosure, real-time RGB-D semantic segmentation with high accuracy can be achieved.
Fig. 2 shows a schematic diagram of a neural network structure according to an embodiment of the present disclosure. As shown in Fig. 2, the neural network of the embodiments of the present disclosure may be a fully convolutional neural network. The inputs of the network are an RGB image and a depth image; the two are mutually independent and, after feature extraction through a plurality of convolution layers in the left network, are fused at a plurality of different levels.
There are a large number of connections between the feature layers at different levels (the fusion network layers) of the right-hand network of the neural network to achieve multi-scale fusion and feature extraction. The deeper a feature layer, the higher its semantic level and the lower its resolution; the shallower a feature layer, the lower its semantic level, the higher its resolution and the more accurate its localization.
By connecting feature layers, the embodiment of the disclosure can generate the fine resolution feature map with strong semantic information to construct a network with fine resolution and high resolution capability. The connection comprises a transverse connection, a downward connection, an upward connection and a jump connection, wherein the transverse connection is a convolution operation which does not change the size of the feature map, the downward connection is used for reducing the feature map to 1/2 of the original feature map through convolution and downsampling, the upward connection is used for amplifying the feature map to 2 times of the original feature map through convolution and upsampling, and the jump connection is used for connecting the non-adjacent feature maps. Compared with the feature fusion mode in the related art, the embodiment of the disclosure adopts a richer connection mode, increases the connection between different feature graphs, and realizes more complete fusion between the feature graphs.
In one possible implementation, batch normalization (Batch Normalization) is used after each convolution layer. During training, a stochastic gradient descent method is used, with a batch size of 64, and all parameters are initialized randomly.
In the embodiments of the present disclosure, learning can be performed through multi-layer convolution on feature maps of higher resolution, so as to avoid the reduction of localization accuracy caused by the irreversible resolution loss of excessive downsampling. Compared with the feature fusion modes in the related art, the image processing method of the embodiments of the present disclosure uses less downsampling and performs more learning on high-resolution feature maps, which effectively improves the accuracy of the final semantic segmentation.
In the embodiment of the disclosure, the high resolution precision of the finally output segmentation result is ensured by carrying out feature extraction and learning on the feature map with higher resolution, and the common learning and mutual complementation of the network to the high-level semantic information and the low-level positioning information are ensured by the fusion of the feature maps with different levels.
In the embodiment of the disclosure, a richer connection mode is adopted, so that the connection between different feature images is increased, more sufficient fusion is realized, and the feature images with strong semantic information and fine resolution are generated, thereby constructing a network with fine resolution and high resolution capability.
It will be appreciated that the above-mentioned method embodiments of the present disclosure may be combined with each other to form combined embodiments without departing from the principles and logic; such combinations are not repeated in the present disclosure for reasons of space. It will be appreciated by those skilled in the art that, in the above-described methods of the embodiments, the specific order of execution of the steps should be determined by their functions and possible inherent logic.
In addition, the present disclosure further provides an image processing apparatus, an electronic device, a computer-readable storage medium, and a program, all of which may be used to implement any of the image processing methods provided in the present disclosure; the corresponding technical solutions and descriptions can be found in the corresponding descriptions of the method parts and are not repeated here.
Fig. 3 shows a block diagram of an image processing apparatus according to an embodiment of the present disclosure, as shown in fig. 3, the apparatus including:
 the feature extraction module 101 is configured to perform feature extraction on a color image and a corresponding depth image, so as to obtain N-level first feature images of the color image and N-level second feature images of the depth image, where the scales of the first feature images at each level in the N-level first feature images are different, the scales of the second feature images at each level in the N-level second feature images are different, N is an integer, and N is greater than or equal to 2;
 The first fusion module 102 is configured to fuse an nth first feature map of the N-level first feature maps with an nth second feature map of the N-level second feature maps to obtain an nth third feature map of the N-level third feature maps, where n is an integer and 1 ≤ n ≤ N;
 The second fusion module 103 is configured to perform fusion processing on the N-level third feature graphs, and obtain a fusion feature graph according to a fusion processing result;
 and the processing module 104 is configured to determine a processing result of the color image and the corresponding depth image according to the fusion feature map.
In a possible implementation manner, the second fusion module 103 includes a first processing sub-module, a first fusion sub-module and a second fusion sub-module, wherein the first processing sub-module is configured to use the third feature map of the 1st stage as the first fourth feature map of the 1st stage, the first fusion sub-module is configured to perform fusion processing on the third feature map of the kth stage to obtain the first fourth feature map of the kth stage, k being an integer and 1 < k ≤ N, and the second fusion sub-module is configured to perform fusion processing on the first fourth feature maps of the respective stages and obtain the fusion feature map according to the fusion processing result.
In one possible implementation manner, the first fusion submodule comprises a first shrinking unit and a first fusion unit, wherein the first shrinking unit is used for carrying out scale shrinking on a first fourth characteristic diagram of a k-1 level to obtain a first fifth characteristic diagram of the k level, and the first fusion unit is used for carrying out fusion processing on the third characteristic diagram of the k level and the first fifth characteristic diagram of the k level to obtain the first fourth characteristic diagram of the k level.
In one possible implementation manner, the second fusion sub-module includes a second fusion unit, a first fusion processing unit and a second fusion processing unit, wherein the second fusion unit is configured to carry out the (m-1)th fusion processing on the first fourth feature map of the nth stage to obtain the mth fourth feature map of the nth stage, where m is an integer, 1 < m ≤ M-n+2, and M is the number of times of fusion processing performed on the first fourth feature map of the 1st stage.
In one possible implementation manner, in response to the case where n=1, the second fusion unit includes a first enlargement subunit and a first fusion subunit, wherein the first enlargement subunit is configured to perform scale enlargement on the (m-1)th fourth feature map of the (n+1)th stage to obtain the (m-1)th fifth feature map of the nth stage, and the first fusion subunit is configured to perform fusion processing on the first m-1 fourth feature maps of the nth stage and the (m-1)th fifth feature map of the nth stage to obtain the mth fourth feature map of the nth stage.
In one possible implementation manner, in response to the case where 1 < n < N, the second fusion unit includes a first reduction subunit, a second enlargement subunit and a second fusion subunit, wherein the first reduction subunit is configured to perform scale reduction on the mth fourth feature map of the (n-1)th stage to obtain the mth sixth feature map of the nth stage, the second enlargement subunit is configured to perform scale enlargement on the (m-1)th fourth feature map of the (n+1)th stage to obtain the (m-1)th fifth feature map of the nth stage, and the second fusion subunit is configured to perform fusion processing on the first m-1 fourth feature maps of the nth stage, the (m-1)th fifth feature map of the nth stage and the mth sixth feature map of the nth stage to obtain the mth fourth feature map of the nth stage.
In one possible implementation manner, in response to the case where n=N, the second fusion unit includes a second reduction subunit, configured to perform scale reduction on the mth fourth feature map of the (n-1)th stage to obtain the mth sixth feature map of the nth stage, and a third fusion subunit, configured to perform fusion processing on the first m-1 fourth feature maps of the nth stage and the mth sixth feature map of the nth stage to obtain the mth fourth feature map of the nth stage.
In one possible implementation manner, the second fusion sub-module includes a first processing unit, and the first processing unit is configured to take the (M+1)th fourth feature map of the 1st stage as the fusion feature map.
In one possible implementation manner, the processing module 104 includes a second processing sub-module configured to scale-up the fusion feature map to obtain a segmentation feature map, where the height and width of the segmentation feature map are the same as those of the color image, and a segmentation sub-module configured to segment the color image and the corresponding depth image according to the segmentation feature map to obtain a processing result of the color image and the corresponding depth image.
In the embodiments of the present disclosure, feature extraction is performed on the color image and the corresponding depth image respectively to obtain N-level first feature maps of the color image and N-level second feature maps of the depth image, so that semantic information of a certain depth can be extracted from the color image and from the depth image respectively. The nth first feature map of the N-level first feature maps is fused with the nth second feature map of the N-level second feature maps to obtain the nth third feature map of the N-level third feature maps, so that the semantic information contained in the obtained N-level third feature maps is more accurate and more critical. Fusion processing is then performed on the N-level third feature maps, and the fusion feature map is obtained according to the fusion processing result, so that the fusion feature map contains richer and more accurate semantic information without excessive loss of resolution. Determining the processing results of the color image and the depth image according to the fusion feature map can therefore improve the accuracy of image processing.
In some embodiments, functions or modules included in an apparatus provided by the embodiments of the present disclosure may be used to perform a method described in the foregoing method embodiments, and specific implementations thereof may refer to descriptions of the foregoing method embodiments, which are not repeated herein for brevity.
The disclosed embodiments also provide a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method. The computer readable storage medium may be a non-volatile computer readable storage medium.
The embodiment of the disclosure also provides electronic equipment, which comprises a processor and a memory for storing instructions executable by the processor, wherein the processor is configured to call the instructions stored by the memory so as to execute the method.
The disclosed embodiments also provide a computer program product comprising computer readable code which, when run on a device, causes a processor in the device to execute instructions for implementing the image processing method as provided in any of the embodiments above.
The disclosed embodiments also provide another computer program product for storing computer readable instructions that, when executed, cause a computer to perform the operations of the image processing method provided in any of the above embodiments.
The electronic device may be provided as a terminal, server or other form of device.
Fig. 4 illustrates a block diagram of an electronic device 800, according to an embodiment of the disclosure. For example, electronic device 800 may be a mobile phone, computer, digital broadcast terminal, messaging device, game console, tablet device, medical device, exercise device, personal digital assistant, or the like.
Referring to FIG. 4, the electronic device 800 can include one or more of a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions to perform all or part of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interactions between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operations at the electronic device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or nonvolatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The power supply component 806 provides power to the various components of the electronic device 800. The power components 806 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the electronic device 800.
The multimedia component 808 includes a screen between the electronic device 800 and the user that provides an output interface. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may sense not only the boundary of a touch or slide action, but also the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. When the electronic device 800 is in an operational mode, such as a shooting mode or a video mode, the front camera and/or the rear camera may receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capabilities.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may be further stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 further includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be a keyboard, click wheel, buttons, etc. These buttons may include, but are not limited to, a home button, a volume button, an activate button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing status assessment of various aspects of the electronic device 800. For example, the sensor assembly 814 may detect an on/off state of the electronic device 800, a relative positioning of the components, such as a display and keypad of the electronic device 800, the sensor assembly 814 may also detect a change in position of the electronic device 800 or a component of the electronic device 800, the presence or absence of a user's contact with the electronic device 800, an orientation or acceleration/deceleration of the electronic device 800, and a change in temperature of the electronic device 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor assembly 814 may also include a photosensor, such as a Complementary Metal Oxide Semiconductor (CMOS) or Charge Coupled Device (CCD) image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscopic sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate communication between the electronic device 800 and other devices, either wired or wireless. The electronic device 800 may access a wireless network based on a communication standard, such as a wireless network (WiFi), a second generation mobile communication technology (2G) or a third generation mobile communication technology (3G), or a combination thereof. In one exemplary embodiment, the communication component 816 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, ultra Wideband (UWB) technology, bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), digital Signal Processors (DSPs), digital Signal Processing Devices (DSPDs), programmable Logic Devices (PLDs), field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements for executing the methods described above.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as memory 804 including computer program instructions executable by processor 820 of electronic device 800 to perform the above-described methods.
Fig. 5 illustrates a block diagram of an electronic device 1900 according to an embodiment of the disclosure. For example, electronic device 1900 may be provided as a server. Referring to FIG. 5, electronic device 1900 includes a processing component 1922 that further includes one or more processors and memory resources represented by memory 1932 for storing instructions, such as application programs, that can be executed by processing component 1922. The application programs stored in memory 1932 may include one or more modules each corresponding to a set of instructions. Further, processing component 1922 is configured to execute instructions to perform the methods described above.
The electronic device 1900 may also include a power component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958. The electronic device 1900 may operate based on an operating system stored in memory 1932, such as the Microsoft Server operating system (Windows ServerTM), the apple Inc. promoted graphical user interface-based operating system (Mac OS XTM), the multi-user, multi-process computer operating system (UnixTM), the free and open source Unix-like operating system (LinuxTM), the open source Unix-like operating system (FreeBSDTM), or the like.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as the memory 1932 including computer program instructions executable by the processing component 1922 of the electronic device 1900 to perform the methods described above.
The present disclosure may be a system, method, and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for causing a processor to implement aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium include a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable Compact Disc Read-Only Memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as a transitory signal per se, such as a radio wave or other freely propagating electromagnetic wave, an electromagnetic wave propagating through a waveguide or other transmission medium (e.g., a light pulse passing through a fiber-optic cable), or an electrical signal transmitted through a wire.
The computer readable program instructions described herein may be downloaded from a computer readable storage medium to respective computing/processing devices, or to an external computer or external storage device over a network, such as the Internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
The computer program instructions for performing the operations of the present disclosure may be assembly instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk or C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The computer readable program instructions may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present disclosure are implemented by personalizing electronic circuitry, such as programmable logic circuitry, Field Programmable Gate Arrays (FPGAs), or Programmable Logic Arrays (PLAs), with state information of the computer readable program instructions, and the electronic circuitry can execute the computer readable program instructions.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The computer program product may be specifically implemented by hardware, software, or a combination thereof. In an alternative embodiment, the computer program product is embodied as a computer storage medium, and in another alternative embodiment, the computer program product is embodied as a software product, such as a software development kit (Software Development Kit, SDK) or the like.
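As a non-limiting illustration of the SDK form of such a software product, the following Python sketch shows one possible public interface a kit of this kind might expose. The class name RGBDFusionSDK, the method run, and the weight-loading behaviour are assumptions made solely for this example and are not an interface defined by the present disclosure.
    # Hypothetical SDK facade; all names and signatures here are illustrative
    # assumptions, not an interface defined by this disclosure.
    import numpy as np


    class RGBDFusionSDK:
        """Wraps a color/depth processing pipeline behind a single call."""

        def __init__(self, model_path: str):
            # A real kit would load trained parameters from model_path here.
            self.model_path = model_path

        def run(self, color: np.ndarray, depth: np.ndarray) -> np.ndarray:
            # The color image and its corresponding depth map must describe
            # the same scene at the same resolution.
            if color.shape[:2] != depth.shape[:2]:
                raise ValueError("color and depth images must share the same resolution")
            # Placeholder: a real kit would perform feature extraction,
            # per-level fusion, and return the processing result.
            return np.zeros(color.shape[:2], dtype=np.float32)


    # Example use of the hypothetical interface:
    # sdk = RGBDFusionSDK("weights.bin")
    # result = sdk.run(color_image, depth_map)
Such a facade keeps the caller's code independent of whether the underlying processing runs locally or on a remote server, which is one reason an SDK packaging may be chosen.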
The foregoing description of the embodiments of the present disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application, or the technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.