Detailed Description
Illustrative embodiments of the present application include, but are not limited to, image processing methods, readable storage media, program products, and vehicle-mounted devices.
In order to make the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions of the embodiments of the present application will be described in detail below with reference to the accompanying drawings and specific embodiments of the present application.
In order to facilitate understanding, some of the terms and related techniques involved in the present application are explained below.
BEV technology:
BEV technology is a technology that converts image information from an image space to a BEV space through a neural network, by which a complex three-dimensional environment can be reduced to a two-dimensional image. For example, in the field of autonomous driving, a panoramic view looking down from above a vehicle can be generated by BEV technology based on spatial information about the vehicle, thereby comprehensively presenting the environment around the vehicle, including the conditions to the front, rear, left, right, and above, which enables the autonomous driving system to better understand the surrounding environment and improves the accuracy of perception and decision making.
Next, a process of generating BEV space by the on-board device of the vehicle is described.
For example, FIG. 1A shows a flow chart for generating BEV space.
It is understood that the following processes may be performed by the vehicle-mounted device, and the vehicle-mounted device in the embodiments of the present application may also be referred to as a vehicle-mounted terminal. The in-vehicle terminal may be a mobile phone, a car machine (in-vehicle head unit), a terminal in a self-driving vehicle, a wireless terminal in transportation safety, a terminal in a smart city, and the like; the car machine is described below as an example. However, it is understood that the technical solution described in the present application is applicable to the above-mentioned various vehicle-mounted electronic devices for three-dimensional object detection and is not limited to the vehicle.
S101, acquiring sensor data acquired by a plurality of sensors.
In some embodiments of the present application, a vehicle machine deployed on the vehicle 10 may acquire sensor data acquired by a plurality of sensors.
For example, FIG. 1B shows a schematic diagram of a vehicle during travel.
It is appreciated that the vehicle 10 is typically equipped with a plurality of sensors, such as cameras (e.g., including front view, side view, rear view cameras, etc.), radar, lidar, and the like. As shown in fig. 1B, the vehicle 10 may acquire a plurality of sensor data at the same time during traveling by a time stamp synchronization mechanism or a hardware synchronization mechanism. The sensor data may be, for example, camera-captured image data of different perspectives of the vehicle 10, and/or radar-captured point cloud data of various obstacles, etc.
For example, the vehicle 10 may collect environmental data in the vicinity of the vehicle 10 through sensors, and referring to fig. 1B, the environmental data collected by the sensors may include, for example, sensor data of the first vehicle 01, the pedestrian 02, the first building 03, the second vehicle 04, and the second building 05. The vehicle machine on the vehicle 10 can acquire environmental data of various perspectives acquired by the sensors on the vehicle 10 at the same time.
S102, preprocessing the sensor data.
For example, the sensor data may be pre-processed by the vehicle machine on the vehicle 10 after the sensor data are acquired. For example, in the image processing process, processing such as denoising and distortion correction can be performed on the images acquired by the cameras. In the point cloud processing process, filtering, denoising and segmentation can be performed on the point cloud data acquired by the lidar and/or the radar. Then, the data from the different sensors are aligned to the same point in time to ensure data consistency.
S103, coordinate conversion is performed on the sensor data.
For example, the vehicle machine may also perform coordinate conversion on the sensor data collected by the sensors to convert the sensor data from the sensor coordinate systems to the vehicle coordinate system. For example, the camera, radar, and lidar data are converted to a vehicle coordinate system centered on the vehicle 10.
The sensor data converted to the vehicle coordinate system is then converted from the vehicle coordinate system to the BEV coordinate system, for example, the sensor data in the vehicle coordinate system is converted to a bird's eye view coordinate system (top view).
The image data acquired by the camera can be converted from a perspective view to a top view by inverse perspective mapping (IPM). The radar and lidar data can be mapped from 3D point clouds to a 2D top view by projection.
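As an illustration only, the following sketch shows one possible way to perform the IPM step in Python. It assumes that the pixel coordinates of four reference points on the ground plane in the perspective image, and the coordinates of the same points in the desired top view, are already known from calibration; the function and variable names are illustrative and are not part of the method described above.

    import cv2
    import numpy as np

    def ipm_top_view(perspective_image, src_pixels, dst_pixels, out_size):
        # src_pixels: 4x2 pixel coordinates of known ground points in the camera image
        # dst_pixels: 4x2 coordinates of the same points in the top-view image
        H, _ = cv2.findHomography(np.float32(src_pixels), np.float32(dst_pixels))
        # Warp the perspective image onto the ground plane to obtain the top view
        return cv2.warpPerspective(perspective_image, H, out_size)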
It will be appreciated that, in processing the sensor data, the vehicle machine on the vehicle 10 may process only the sensor data within a predetermined range of the vehicle 10, such as a circular range of radius R centered on the vehicle 10.
It is understood that in other embodiments, the range of data collected by the sensors of the vehicle 10 may be other shapes, and that the shape of the range of data collected by the sensors of the vehicle 10 is not limited by the embodiments of the present application.
S104, fusing the sensor data.
For example, during sensor data fusion, the vehicle machine may perform feature extraction on the sensor data. For example, the vehicle machine may extract features such as lane lines and obstacles from the images acquired by the cameras, or extract information such as object position and speed from the radar and lidar point clouds. The vehicle machine then performs data fusion so that the features from the different sensors are fused.
S105, generating a BEV graph according to the fused sensor data.
Illustratively, in some embodiments, the vehicle machine may map the fused data into a two-dimensional grid map, each grid cell representing a fixed range of space (e.g., 0.1 m). In some embodiments, the vehicle machine may also perform semantic information addition (e.g., annotating semantic information such as lane lines, obstacles, pedestrians, and vehicles in the BEV map) and dynamic updating (e.g., dynamically adjusting the BEV map based on real-time updates of the sensor data), etc. For example, the BEV map generated by the vehicle machine may also be understood with reference to the view of fig. 1B.
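As an illustration only, the following sketch shows one possible way to rasterize fused obstacle points into such a two-dimensional grid map. It assumes a square detection range of side 2R centered on the vehicle, a cell size of 0.1 m, and integer semantic labels; these choices and the names used are assumptions for illustration.

    import numpy as np

    def build_bev_grid(points_xy, labels, R=50.0, cell=0.1):
        size = int(2 * R / cell)                  # number of cells per side
        grid = np.zeros((size, size), dtype=np.uint8)
        for (x, y), label in zip(points_xy, labels):
            if abs(x) < R and abs(y) < R:         # keep only points within range
                col = int((x + R) / cell)
                row = int((R - y) / cell)         # image rows grow downwards
                grid[row, col] = label            # e.g., 1 = lane line, 2 = vehicle
        return grid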
It will be appreciated that, in the above-described generation of the BEV map, only sensor data for the space within the radius R of the vehicle 10 are processed. Therefore, if the detection range of the vehicle 10 needs to be increased (R becomes larger), more sensor data in the space need to be processed, thereby increasing the computing power demand of the vehicle machine of the vehicle 10. For example, if the detection range of the vehicle 10 increases, the data processed in the above-described processes of S102 to S105 increase, thereby increasing the computing power demand of the vehicle machine of the vehicle 10.
For example, in the process of S104, in order to increase the detection range of the vehicle 10, the sampling points of the vehicle machine in the sensor data need to be denser so that the features of the sensor data can be obtained more accurately, and the computing power demand of the vehicle machine is therefore higher. In the process of S105, since more features of the sensor data are extracted by the vehicle machine, projecting the features of the fused sensor data to the BEV space also consumes more computing power.
As mentioned above, as the requirements of vehicles on the detection range of the surrounding environment become larger, more and more data need to be processed in the process of detecting targets in three-dimensional space, so that the computing power demand of the vehicle machine becomes higher, and the corresponding technology is difficult to deploy on a vehicle-end platform.
In order to solve the problem that the computing power required by a vehicle to generate three-dimensional space data is high, the present application provides an image processing method. The vehicle-mounted device stores first spatial data of N first obstacle models, where the first spatial data of each first obstacle model include first position data of M feature points of the first obstacle model. The method includes the following steps:
the in-vehicle apparatus acquires, in response to a request to generate a first environment image of the vehicle, a plurality of images of different perspectives acquired by a plurality of sensors of the vehicle, the plurality of images including an image of an actual obstacle around the vehicle.
The vehicle-mounted device projects the first space data of the N first obstacle models into the coordinate space of the plurality of images, and samples the characteristics of the plurality of images according to the first space data to obtain the first characteristics of the N first obstacle models. The vehicle-mounted device selects N1 first obstacle models from N first obstacle models according to the plurality of images, wherein the similarity between the first features of the N1 first obstacle models and the features of the actual obstacle is larger than the similarity between the first features of the other first obstacle models and the features of the actual obstacle.
The vehicle-mounted device optimizes the first spatial data of the N1 first obstacle models and the second spatial data of N2 historical obstacle models to obtain third spatial data of the N1 first obstacle models and the N2 historical obstacle models, where N=N1+N2, and each piece of third spatial data includes second position data of the M feature points of the corresponding obstacle model. The similarity between the second features, obtained by sampling the features of the plurality of images according to the second position data of each feature point in the third spatial data, and the features of the actual obstacles is greater than a set threshold. The in-vehicle device generates the first environment image according to the second features.
By this scheme, the vehicle-mounted device does not need to generate the first environment image based on features corresponding to a large amount of sensor data; instead, it generates the first environment image based on the second features sampled from the plurality of images at the optimized M feature points of the N1 preset first obstacle models and the optimized M feature points of the N2 historical obstacle models. Because each obstacle model (including the first obstacle models and the historical obstacle models) can represent the spatial information of a complete obstacle (such as its spatial size, position, and rotation angle) with only a few feature points, the process of acquiring the second features does not occupy much computing power, thereby reducing the computing power demand of the vehicle machine.
In some embodiments of the application, the plurality of first obstacle models stored by the in-vehicle device may include objects common during traffic travel, such as buildings, people, animals, traffic lights, vehicles (including cars, motorcycles, bicycles, etc.), trees, rivers, etc. The first spatial data of a first obstacle model may be the first positions of the M feature points of the first obstacle model. The M feature points of the first obstacle model are, for example, M points on three mutually orthogonal coordinate axes in the coordinate space of the first obstacle model, and the first position data of the M feature points are used to represent the initial spatial dimensions (for example, length, width and height data), position, rotation angle, and the like of the first obstacle model. The spatial dimensions of the first obstacle model may be the average dimensions of an object commonly found during traffic travel. For example, for a first obstacle model of a person, the average height of an adult male is 1.75 m and the average height of an adult female is 1.62 m. For a first obstacle model of a sedan, the length is about 4.0 to 4.5 meters, the width is about 1.7 to 1.8 meters, and the height is about 1.4 to 1.5 meters.
For example, FIG. 2 shows a schematic diagram of a feature point according to some embodiments of the application.
As shown in fig. 2, in the coordinate space of the first obstacle model, the first obstacle model may be regarded as one rectangular block, and the geometric center of the rectangular block may be the coordinate center of the first obstacle model. In some embodiments of the present application, taking M=13 as an example, 13 feature points may be established according to the dimensions of the first obstacle model in the three directions. The 13 feature points may include 5 feature points evenly distributed along each of the length axis, the width axis, and the height axis of the first obstacle model. Among the 5 feature points in each coordinate axis direction, the distance between the two feature points at the ends equals the size of the first obstacle model along that axis. It can be understood that, since the feature point at the geometric center of the first obstacle model lies on all three coordinate axes, it is counted twice redundantly (3 × 5 − 2 = 13), so the first obstacle model includes 13 feature points in total.
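As an illustration only, the following sketch builds the M = 13 feature points of an obstacle model from its length, width, height, center position and yaw angle, following the layout described above (5 evenly spaced points per axis, sharing the geometric center). The function and variable names are assumptions for illustration.

    import numpy as np

    def obstacle_feature_points(length, width, height, center, yaw):
        pts = []
        for axis, size in enumerate((length, width, height)):
            for t in (-0.5, -0.25, 0.0, 0.25, 0.5):
                p = [0.0, 0.0, 0.0]
                p[axis] = t * size               # evenly spaced along this axis
                pts.append(p)
        pts = np.unique(np.array(pts), axis=0)   # the center appears 3 times; keep one copy -> 13 points
        c, s = np.cos(yaw), np.sin(yaw)
        rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
        return pts @ rot.T + np.asarray(center)  # 13 x 3 points in the vehicle coordinate system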
It will be appreciated that, in other embodiments, the first obstacle model may also have another number of feature points, e.g., M may be 7, 19, etc., or the feature points of the first obstacle model may be distributed in other ways, as long as spatial data such as the size and rotation angle of the first obstacle model can be represented. The number and positions of the feature points of the first obstacle model are not limited in the embodiments of the present application.
Next, a process of generating a first environment image by the vehicle-mounted terminal based on data acquired by a sensor on the vehicle in the embodiment of the present application will be described.
FIG. 3 illustrates an implementation flow chart for generating an image of a vehicle environment, in accordance with an embodiment of the present application.
Illustratively, in some embodiments of the present application, the vehicle-mounted device has stored therein first spatial data of N first obstacle models, the first spatial data of each first obstacle model including first position data of the M feature points of the first obstacle model. For example, in the embodiments of the present application, N is greater than 384. The first position data are used to indicate the initial spatial dimensions (e.g., length, width and height data), position, rotation angle, etc., of the first obstacle model, where the initial spatial dimensions may be the average dimensions of an object commonly found during traffic travel. In the initial state, the positions of the N first obstacle models may be uniformly distributed, and the rotation angles of the N first obstacle models may be a preset value, for example, 0°.
It is understood that if N is large, the in-vehicle apparatus takes a longer time to generate the environmental data of the vehicle, but the accuracy of the generated environmental data is higher. Therefore, when the value of N is specifically set, it may be determined with reference to the computing power and the accuracy requirement of the vehicle-mounted device. The embodiments of the present application do not limit the value of N; N may be, for example, 400, 500, 600, or the like.
It is to be understood that the execution subjects of the following respective flows may be all in-vehicle devices arranged on a vehicle, and the execution subjects of the respective flows are not limited in the description of the following respective flows.
As shown in fig. 3, the process includes:
S301, acquiring a plurality of images in response to a request for generating a first environment image of a vehicle.
Illustratively, in some embodiments of the application, the plurality of images are images of different perspectives acquired by a plurality of sensors of the vehicle, the plurality of images including an image of an actual obstacle surrounding the vehicle.
For example, referring to fig. 1B, during travel of the vehicle 10, multiple images may be acquired at the same time by multiple sensors through a time stamp synchronization mechanism or a hardware synchronization mechanism. The plurality of images may be, for example, camera-captured image data of different perspectives of the vehicle 10, and/or radar-captured point cloud data of individual obstacles, etc. It is to be appreciated that in some embodiments of the present application, the plurality of images may include images of actual obstacles around the vehicle 10, for example, the actual obstacles may be the first vehicle 01, the pedestrian 02, the building 03, the second vehicle 04, the building 05, and so on.
After detecting the request for generating the first environment image, the vehicle-mounted device may acquire a plurality of images of different perspectives acquired by a plurality of sensors on the vehicle.
S302, the features of the plurality of images are sampled according to the first position data of each feature point in the first spatial data, and the first features of the N first obstacle models are obtained.
In some embodiments of the present application, after the vehicle-mounted device acquires the plurality of images, the first spatial data of the N first obstacle models may be projected into the coordinate space of the plurality of images, and then the features at the first positions of the respective feature points of the N first obstacle models in the plurality of images may be acquired as the first features. It will be appreciated that the coordinate space of the plurality of images may be determined by the cameras on the vehicle; cameras at different positions on the vehicle correspond to different image coordinate spaces.
In some embodiments of the present application, the images captured by the cameras on the vehicle 10 may include distorted images, so that the effect of image distortion needs to be compensated in the process of projecting the first spatial data of the N first obstacle models into the coordinate space of the plurality of images.
Illustratively, the vehicle-mounted device stores distortion errors for mapping the spatial size, position and rotation angle of the first obstacle model from the undistorted image to the distorted image, and the first spatial data of the first obstacle model need to be adjusted by the distortion errors after being projected into the coordinate space of the distorted image.
For example, fig. 4A illustrates a schematic diagram of obtaining distortion errors, according to some embodiments of the application.
As shown in fig. 4A, the vehicle-mounted device may acquire the undistorted image pixel coordinates first, and then determine the distorted image pixel coordinates according to a distortion calculation formula, where the distortion calculation formula is determined mainly based on the physical characteristics of the camera lens and the imaging principle. The in-vehicle apparatus may determine the distortion calculation formula from the camera lenses of the respective cameras on the vehicle.
The in-vehicle apparatus may subtract the undistorted image pixel coordinates from the distorted image pixel coordinates to obtain a distortion offset table. The vehicle-mounted device stores the distortion offset table, so that it can subsequently look up the distortion errors from the table, which reduces the amount of calculation and therefore the computing power demand of the vehicle-mounted device.
It will be appreciated that, in other embodiments, the distortion offset table may be determined by other devices and then stored in the in-vehicle device.
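As an illustration only, the following sketch pre-computes a distortion offset table in the manner described above, assuming a simple radial distortion model with coefficients k1 and k2 and camera intrinsics (fx, fy, cx, cy); the actual distortion calculation formula depends on the physical characteristics of the lens.

    import numpy as np

    def distortion_offset_table(w, h, fx, fy, cx, cy, k1, k2):
        u, v = np.meshgrid(np.arange(w), np.arange(h))    # undistorted pixel coordinates
        x, y = (u - cx) / fx, (v - cy) / fy               # normalized image coordinates
        r2 = x * x + y * y
        scale = 1.0 + k1 * r2 + k2 * r2 * r2              # radial distortion factor
        u_d = fx * x * scale + cx                         # distorted pixel coordinates
        v_d = fy * y * scale + cy
        # Offset = distorted minus undistorted pixel coordinates, stored per pixel
        return np.stack([u_d - u, v_d - v], axis=-1)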
Fig. 4B illustrates a schematic diagram of acquiring first features of N first obstacle models, according to some embodiments of the application.
As shown in fig. 4B, after obtaining the first spatial data of the N first obstacle models, the in-vehicle apparatus may determine the three-dimensional positions of the first obstacle models projected into the space of the undistorted plurality of images, the three-dimensional position including the spatial size, position, and rotation angle of the first obstacle model in the spatial data of the plurality of images, that is, the positions of the respective feature points of the first obstacle model in the space of the plurality of images. The vehicle-mounted device then queries the distortion offset table to obtain the distortion errors, and may adjust the three-dimensional positions of the first obstacle models by the distortion errors to obtain the three-dimensional positions of the first obstacle models in the space of the distorted plurality of images. Features of the distorted plurality of images are then acquired based on these three-dimensional positions, thereby obtaining the first features of the N first obstacle models.
It will be appreciated that, by looking up the distortion errors from the distortion offset table, the in-vehicle device does not need to calculate the distortion errors, thereby reducing the computing power demand in the process of generating the first environment image, so that the corresponding solution can be better deployed on the vehicle-end platform.
It will be appreciated that, hereinafter, the process of projecting the target spatial data of the target obstacle models into the coordinate space of the plurality of images may also determine the three-dimensional positions of the target obstacle models projected into the coordinate space of the plurality of images by querying the distortion offset table. That is, in projecting the various obstacle models into the coordinate space of the plurality of images, the three-dimensional positions of the corresponding obstacles in the distorted image coordinate space can be determined by referring to the distortion offset table, thereby reducing the calculation amount of the in-vehicle apparatus.
In some embodiments of the present application, in order to increase the operation speed of the in-vehicle apparatus, some in-vehicle apparatuses support only low-precision data input in the form of 8-bit integers (int8). It will be appreciated that, since int8 quantization uses a smaller number of bits to represent data, it has a significant advantage in terms of storage requirements, and this saving of storage space can significantly reduce deployment costs. In some embodiments of the present application, the distortion errors in the distortion offset table may be stored in the vehicle-mounted device by means of int8 quantization. However, the accuracy of an int8-quantized distortion error is low. In order to guarantee the accuracy of the distortion error, in some embodiments of the application, a distortion error quantized as a high-accuracy 16-bit integer (int16) can be represented by two pieces of data in int8 form.
For example, FIG. 5 illustrates a schematic diagram of a process for distortion error according to some embodiments of the application.
As shown in fig. 5, a distortion error (int16), where int16 represents data in the form of a high-precision 16-bit integer, can be represented by a front (int8) value and a rear (int8) value, where int8 represents data in the form of a low-precision 8-bit integer. Here, front = distortion error (int16) / 2^8 rounded down, and rear = distortion error (int16) − front × 2^8 − 2^7. It will be appreciated that, since the first bit of data in int8 form is the sign bit, the value of data in int8 form occupies only the remaining 7 bits, and therefore 2^7 needs to be subtracted. In the process of calculation, distortion error (int16) = front × 2^8 + rear + 2^7.
It will be appreciated that int16-quantized data is a 16-bit binary number, and because the first bit is a sign bit, int16 can represent decimal values from −32768 to 32767. That is, the binary number of the distortion error after int16 quantization can represent a decimal value X in the range −32768 to 32767. The value obtained by dividing X by 2^8 and rounding down lies in the range −128 to 127, which is exactly the range of values that an 8-bit binary number can represent (the most significant bit of an 8-bit binary number is the sign bit, so the range of decimal numbers represented by an 8-bit binary number is −128 to 127); that is, front is in int8 form.
However, since front is obtained by rounding down X divided by 2^8, front × 2^8 may be smaller than X, and the remaining portion X − front × 2^8 is denoted by rear. Since the maximum value of X − front × 2^8 is 255, which is greater than the value that int8 can represent, 2^7 also needs to be subtracted from rear. That is, X − front × 2^8 − 2^7 ranges from −128 to 127, so that rear can be represented by one int8. For example, for the integer 255, the int16 representation is 0000000011111111; the decimal value of front is 0, i.e., 255/256 = 0.996 rounded down to 0, and the int8 representation of front is 00000000. The decimal value of rear is 255 − 0 × 2^8 − 2^7 = 127, so the int8 representation of rear is 01111111.
Similarly, for the integer −255, the int16 representation is 1000000011111111 (in the sign-magnitude convention used here); front is −255/256 = −0.996 rounded down to −1, so the int8 form of front is 10000001; the decimal value of rear is −255 − (−1) × 2^8 − 2^7 = −127, and the int8 form of rear is 11111111.
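As an illustration only, the following sketch encodes an int16 distortion error into the two int8 values front and rear and recombines them, following the formulas above. Two's-complement int8 values are used here, whereas the worked example above uses a sign-magnitude convention; the arithmetic is the same either way.

    import numpy as np

    def encode_int16(x):
        front = x // 256                 # floor division by 2^8
        rear = x - front * 256 - 128     # shift into the int8 range [-128, 127]
        return np.int8(front), np.int8(rear)

    def decode_int16(front, rear):
        return np.int16(int(front) * 256 + int(rear) + 128)

    assert decode_int16(*encode_int16(255)) == 255
    assert decode_int16(*encode_int16(-255)) == -255
    assert decode_int16(*encode_int16(32767)) == 32767
    assert decode_int16(*encode_int16(-32768)) == -32768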
In this way, the int16-quantized distortion error data can be represented by two pieces of int8 data, so that both the operation speed and the operation precision of the vehicle-mounted device can be improved.
It will be appreciated that, since the number of first obstacle models is N, the number of actual obstacles may be greater or less than N. If the number of actual obstacles is less than N, the first spatial data of the redundant first obstacle models (i.e., the first obstacle models that are not the closest to any actual obstacle) may not be adjusted, and the similarity between a redundant first obstacle model and the image of an actual obstacle in the front-view image is 0; that is, no actual obstacle corresponds to that first obstacle model. If the number of actual obstacles is greater than N, the in-vehicle apparatus may, according to the neural network, associate the actual obstacles close to the vehicle with the first obstacle models and adjust the first spatial data of the corresponding first obstacle models, so as to ensure that the images of the obstacles nearest to the vehicle are preferentially generated in the first environment image.
For example, in the embodiment of the present application, the in-vehicle apparatus determines actual obstacles in the front-view image only within a distance of 100 m to 200 m from the vehicle 10; for example, a space within 150 m in front of the vehicle 10 may be selected. The space in which actual obstacles are determined to the left, right, above and below the vehicle 10 may be within 30 m to 80 m from the vehicle 10, and for example, 50 m may be preferable. The space behind the vehicle 10 in which actual obstacles are determined may be within 80 m to 120 m from the vehicle 10, and for example, 100 m may be preferable.
S303, selecting N1 first obstacle models from N first obstacle models according to the plurality of images and the first features.
For example, in some embodiments of the present application, the vehicle-mounted device may select N1 first obstacle models from the N first obstacle models, where the similarity between the first features of the N1 first obstacle models and the features of the actual obstacles is greater than the similarity between the first features of the remaining first obstacle models and the features of the actual obstacles, and 0 < N1 ≤ N. For example, in an embodiment of the present application, N1 is half of N, that is, N1 is 192 when N is 384. In other embodiments, N1 may also take other values, such as N1 = N/3, N1 = N/4, or fixed values such as 200 or 100, etc.
In some embodiments, the N1 first obstacle models whose first features have the highest similarity to the features of the corresponding actual obstacles may be selected from the N first obstacle models. However, this requires sorting the similarities of the N first obstacle models, and the sorting process is computationally intensive. Therefore, in other embodiments, according to the preset arrangement positions of the N first obstacle models, the first obstacle model whose first feature has the higher similarity to the features of the actual obstacles may be selected from every two adjacent first obstacle models and used as one of the N1 first obstacle models.
For example, FIG. 6 illustrates a process of selecting N1 first obstacle models according to some embodiments of the application.
As shown in fig. 6, the first obstacle model whose first feature has the higher similarity to the features of the actual obstacles in the plurality of images may be selected from every two adjacent first obstacle models. It can be appreciated that, since the positions of the N first obstacle models are arranged in advance, the N first obstacle models do not need to be sorted in the process of selecting the N1 first obstacle models, which improves the efficiency of generating the first environment image by the vehicle-mounted device. In other embodiments, the similarities between the first features of the N first obstacle models and the features of the actual obstacles in the images may be sorted, and the N1 first obstacle models with the highest similarities are then selected; however, the sorting process occupies more computing power of the vehicle machine.
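As an illustration only, the following sketch selects N1 = N/2 obstacle models by comparing each pair of adjacent models instead of sorting all N similarities. It assumes that N is even and that sims holds, in the preset arrangement order, the similarity of each model's first feature to the features of the actual obstacles; the names are illustrative.

    import numpy as np

    def select_pairwise(sims):
        sims = np.asarray(sims).reshape(-1, 2)      # group adjacent models into pairs
        winners = np.argmax(sims, axis=1)           # the better model of each pair
        return np.arange(len(sims)) * 2 + winners   # indices of the N/2 models kept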
And S304, optimizing the first spatial data of the N1 first obstacle models and the second spatial data of the N2 historical obstacle models to obtain third spatial data of the N1 first obstacle models and the N2 historical obstacle models.
Illustratively, in some embodiments of the present application, N = N1 + N2; that is, the obstacle models used to generate the first environment image are always N in number (hereinafter, the N1 first obstacle models and the N2 historical obstacle models are collectively referred to as N target obstacle models). It is understood that the third spatial data include the spatial data obtained by optimizing the first spatial data of the N1 first obstacle models multiple times, and the spatial data obtained by optimizing the second spatial data of the N2 historical obstacle models multiple times. After the first spatial data and the second spatial data are optimized multiple times, the positions of the corresponding M feature points also change; that is, each piece of third spatial data includes second position data of the M feature points of the corresponding target obstacle model.
For example, in the embodiment of the present application, the vehicle-mounted device may further acquire historical obstacle models as a reference in the process of generating the first environment image of the current frame. For example, the vehicle-mounted device stores third features of N third obstacle models corresponding to a historically generated second environment image, and fourth spatial data of the N third obstacle models. The similarity between the third features and the image features of the actual obstacles corresponding to the second environment image is greater than the set threshold, and the third features are obtained by sampling the image features of the actual obstacles corresponding to the second environment image according to the fourth spatial data of the N third obstacle models.
The in-vehicle apparatus may select, from the third features of the N third obstacle models, the N2 third features having the highest similarity to the features of the images of the actual obstacles corresponding to the second environment image.
Then, the in-vehicle apparatus may use fourth spatial data of the third obstacle model corresponding to the N2 third features as second spatial data of the N2 history obstacle models. The first spatial data of the N1 first obstacle models and the second spatial data of the N2 historical obstacle models are hereinafter referred to as target spatial data. That is, after the target space data of the N target obstacle models are optimized for a plurality of times, third space data of the target obstacle models can be obtained, where the third space data includes second position data of M feature points corresponding to the target obstacle models.
In some embodiments of the application, the second environment image may be the environment image generated by the vehicle 10 at the frame previous to the current frame. After determining that each third obstacle model corresponds to an actual obstacle image of the previous frame (i.e., an actual obstacle image corresponding to the second environment image), the vehicle-mounted device may acquire the speed of each third obstacle model relative to the vehicle 10, and calculate the position of each third obstacle model in the current frame according to the interval time between frames. That is, the fourth spatial data may also be spatial data calculated from the speed of each third obstacle model relative to the vehicle 10 and the time interval relative to the current frame.
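As an illustration only, the following sketch propagates the position of a historical obstacle model to the current frame, assuming that its velocity relative to the vehicle and the inter-frame time interval are known; only the position part of the spatial data is moved here.

    def propagate_history(position, velocity, dt):
        # position, velocity: (x, y, z) components in the vehicle coordinate system
        return tuple(p + v * dt for p, v in zip(position, velocity))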
In some embodiments of the present application, the vehicle-mounted device may, for example, perform at least one adjustment on the positions of the plurality of feature points of the N1 first obstacle models and the positions of the plurality of feature points of the N2 historical obstacle models according to the spatial data of the actual obstacle in the plurality of images, so as to obtain the third spatial data.
It is understood that, after obtaining the target spatial data of the target obstacle model, the in-vehicle apparatus may perform multiple adjustments (for example, three adjustments) on the target spatial data of the target obstacle model according to the spatial data of the actual obstacle model in the multiple images.
For example, taking the front-view camera as an example, the vehicle machine may project the target spatial data of the N target obstacle models into the image collected by the front-view camera (hereinafter referred to as the front-view image), and sample the target features of the N target obstacle models from the front-view image. It may be understood that, since the N target obstacle models include the N2 historical obstacle models, the N target obstacle models need to be resampled to obtain the target features.
The vehicle-mounted device may determine each actual obstacle in the front-view image using the neural network model; referring to fig. 1B, the actual obstacles in the front-view image may include the pedestrian 02, the second vehicle 04, and the like. After the vehicle-mounted device determines the actual obstacles, the target obstacle model closest to the pedestrian 02 can be associated with the pedestrian 02 through the neural network model. Similarly, the target obstacle model closest to the second vehicle 04 may be associated with the second vehicle 04.
In some embodiments, the vehicle-mounted device may adjust the target spatial data of the target obstacle model corresponding to the pedestrian 02 in terms of spatial size, position, rotation angle, and the like, so as to update the target spatial data of the target obstacle model. The adjustment process is, for example: acquiring the offsets in spatial size, position and rotation angle between the pedestrian 02 and the corresponding target obstacle model, adding the offsets to the spatial size, position and rotation angle of the target obstacle model, and then adjusting the first positions of the plurality of feature points of the target obstacle model according to the updated spatial size, position and rotation angle, so as to update the target spatial data of the target obstacle model corresponding to the pedestrian 02; that is, the position information of the corresponding M feature points in the target spatial data is updated.
Similarly, in the above manner, each actual obstacle in the plurality of images may be associated with one of the N target obstacle models, and the target spatial data of the N target obstacle models may be updated. After the above optimization is performed a number of times (e.g., 2, 3, 4 or 5 times), the target spatial data of the N target obstacle models may be updated to the third spatial data.
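As an illustration only, the following sketch shows one such adjustment step: predicted offsets between the matched actual obstacle and the target obstacle model are added to the model's size, position and yaw, after which the M feature points can be rebuilt from the updated values (for example, with the feature-point sketch given earlier). The dictionary keys and offset names are assumptions for illustration; the offset prediction itself (a neural network in the embodiment) is not shown.

    def adjust_obstacle_model(model, offset):
        # model: dict with keys "length", "width", "height", "center" (x, y, z), "yaw"
        # offset: predicted differences between the matched actual obstacle and the model
        model["length"] += offset["dl"]
        model["width"] += offset["dw"]
        model["height"] += offset["dh"]
        model["center"] = [c + d for c, d in zip(model["center"], offset["dxyz"])]
        model["yaw"] += offset["dyaw"]
        # The M feature points are then rebuilt from the updated size, position and yaw
        return model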
S305, sampling the characteristics of the multiple images according to the third space data to obtain second characteristics of the target obstacle model, and generating a first environment image according to the second characteristics.
For example, after determining the third spatial data of the target obstacle models, the features of the plurality of images may be sampled according to the second position data of each feature point in the third spatial data to obtain the second features of the target obstacle models. The sampling process may refer to the process in S302 of sampling the image features of the plurality of images according to the first spatial data of the first obstacle models to obtain the first features. It will be appreciated that the similarity between the second features of the target obstacle models and the features of the actual obstacles is greater than the set threshold. The set threshold may be, for example, any value between 60% and 95%.
For example, in some embodiments of the present application, after determining the second characteristic of the target obstacle, the in-vehicle apparatus may generate the first environment image according to the second characteristic.
In some embodiments, since some target obstacle models do not correspond to any actual obstacle, the similarity between the second features of those target obstacle models and the features of the corresponding actual obstacles in the plurality of images may be small, and may be smaller than the set threshold. The vehicle-mounted device may still retain these target obstacle models, but in the process of generating the first environment image based on the second features, the target obstacle models whose second features have a similarity to the features of the actual obstacles smaller than the set threshold may be discarded; that is, only the second features whose similarity to the features of the actual obstacles exceeds the set threshold are used to generate the first environment image.
It will be appreciated that, since the second features are obtained by sampling the features of the plurality of images at the M feature points of the target obstacle models, the number M may be predetermined; the value of M only needs to be large enough to express the spatial size, position and rotation angle of the corresponding target obstacle model, so M may be set small, for example, 13 in the embodiment of the present application, or 7 or 19 in other embodiments. In this way, the data amount of the second features can be greatly reduced, thereby reducing the computing power demand of the in-vehicle apparatus. Moreover, the first environment image generated by the vehicle-mounted device based on the second features has three-dimensional properties; for example, the spatial size, position and rotation angle of each obstacle can be determined, so that the loss of obstacle information (for example, the loss of obstacle height information in the BEV space) is avoided, and the first environment image therefore has a better visual effect than a BEV map. Also, when the vehicle 10 travels on a slope, information about obstacles on the uphill and downhill sections can be displayed in the first environment image, thereby improving the accuracy with which the in-vehicle apparatus detects obstacles.
Next, a process of generating the first environment image by the in-vehicle apparatus is described.
For example, fig. 7 illustrates a schematic diagram of an in-vehicle device generating a first environmental image, according to some embodiments of the application.
For example, after obtaining the plurality of images of the current frame, the vehicle-mounted device may map the first spatial data of the N first obstacle models to a coordinate space of the plurality of images, and sample the plurality of images based on the first positions of the M feature points, thereby obtaining the first features of the N first obstacle models, and the process of obtaining the first features of the N first obstacle models may refer to the flow of S302.
After the vehicle-mounted device obtains the first features of the N first obstacle models, the N1 first obstacle models whose first features have the highest similarity to the features of the actual obstacles can be selected; it can be understood that the actual obstacles are the obstacles in the plurality of images. After the N1 first obstacle models are selected, the first spatial data and first features of the N1 first obstacle models may be determined. The process of determining the first spatial data and the first features of the N1 first obstacle models may refer to the flow of S303. Then, the vehicle-mounted device may further acquire the third features and second spatial data of the N2 historical obstacle models, combine the second spatial data of the N2 historical obstacle models and the first spatial data of the N1 first obstacle models into the target spatial data of the N target obstacle models, and combine the first features of the N1 first obstacle models and the third features of the N2 historical obstacle models into the target features of the N target obstacle models, thereby acquiring the target spatial data and target features of the N target obstacle models. The process of acquiring the second spatial data and the third features of the N2 historical obstacle models may refer to the flow of S304.
In the present embodiment, the process of optimizing the target spatial data of the target obstacle models by the in-vehicle apparatus is, for example, to sample the features of the plurality of images based on the target spatial data of the N target obstacle models so as to update the target features of the N target obstacle models. It will be appreciated that, since the target obstacle models include the second spatial data and the third features of the N2 historical obstacle models, the features of the N target obstacle models need to be updated, and the original target features of the N target obstacle models may also be used by the self-attention module to determine the features of the more important target obstacles.
For example, the target features of the N target obstacle models (e.g., the target features that have not yet been resampled according to the target spatial data) may be subjected to an inter-frame interaction to obtain inter-frame interaction features, where the inter-frame interaction refers to, for example, interaction between the first features of the N1 first obstacle models of the current frame among the target obstacle models and the third features of the N2 historical obstacle models of the previous frame.
For example, FIG. 8A illustrates a schematic diagram of an inter-frame interaction, according to some embodiments of the application.
As shown in fig. 8A, a self-attention module is configured in the in-vehicle apparatus, and the self-attention module can perform self-attention optimization by means of a Transformer model.
For example, the process of inter-frame interaction may be to multiply the target features of the N target obstacle models, i.e., the first features of the N1 first obstacle models and the third features of the N2 historical obstacle models, by the query weight matrix in the Transformer model, thereby obtaining the query features.
The third features of the N2 historical obstacle models are then multiplied by the key weight matrix in the Transformer model to obtain the key features.
Similarly, the third features of the N2 historical obstacle models are multiplied by the value weight matrix in the Transformer model to obtain the value features.
Illustratively, in the Transformer model, the key features are used to match the query features to determine which features are most important for generating the features of the first environment image. The value features contain the actual information corresponding to the key features, and are weighted and summed in the attention mechanism to generate the features of the final first environment image.
Illustratively, the query weight matrix, key weight matrix and value weight matrix are trainable parameters of the Transformer model, which are updated and optimized by the back-propagation algorithm during model training. Therefore, after the corresponding weight matrices are trained, the corresponding query features, key features and value features can be obtained simply by multiplying the input feature data by the corresponding weight matrices.
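As an illustration only, the following sketch forms the query, key and value features for the inter-frame interaction as described above, assuming trained weight matrices Wq, Wk and Wv and NumPy arrays whose rows are obstacle-model features; the names are illustrative.

    import numpy as np

    def make_qkv(target_feats, history_feats, Wq, Wk, Wv):
        Q = target_feats @ Wq      # queries from the features of all N target obstacle models
        K = history_feats @ Wk     # keys from the third features of the N2 historical obstacle models
        V = history_feats @ Wv     # values from the third features of the N2 historical obstacle models
        return Q, K, V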
After obtaining the query feature, the key feature, and the value feature, the input features may be optimized based on the self-attention module to obtain corresponding outputs, and the optimization process of the self-attention module is described in detail below. For example, the output result of an inter-frame interaction may be referred to as an inter-frame interaction feature. It will be appreciated that the N inter-frame interaction features also include the inter-frame interaction features of the N1 first obstacle models, and the inter-frame interaction features of the N2 historical obstacle models.
After finishing the inter-frame interaction, the vehicle-mounted device can perform the intra-frame interaction on the features of the N target obstacle models after the inter-frame interaction, so as to obtain the features after the intra-frame interaction.
For example, fig. 8B illustrates a schematic diagram of an intra-frame interaction, according to some embodiments of the application.
As shown in fig. 8B, the input of the intra-frame interaction may be the features of the N target obstacle models after the inter-frame interaction. The interaction process is, for example, to multiply the inter-frame interacted features of the N target obstacle models by the query weight matrix, the key weight matrix and the value weight matrix of the Transformer model, respectively, so as to obtain the corresponding query features, key features and value features; it can be understood that the inter-frame interacted features of the N target obstacle models include the inter-frame interacted features of the N1 first obstacle models and the inter-frame interacted features of the N2 historical obstacle models. The vehicle-mounted device then optimizes the query features, key features and value features by the self-attention module to obtain the corresponding output; the optimization process of the self-attention module is described in detail below.
It can be understood that, after the intra-frame interaction is completed, the intra-frame interacted features of the N target obstacle models can be obtained. The vehicle-mounted device may then perform feature fusion on the target features resampled according to the target spatial data and the intra-frame interacted features of the target obstacle models, so as to obtain updated target features. For example, in an embodiment of the present application, the resampled target features of the N target obstacle models may be added to the intra-frame interacted features of the corresponding N target obstacle models to obtain new features of the N target obstacle models, which may be used as the updated target features of the N target obstacle models. It will be appreciated that, since the target features of the N target obstacle models include the first features of the N1 first obstacle models and the third features of the N2 historical obstacle models, after the target features of the N target obstacle models are updated, the corresponding first features of the N1 first obstacle models and the third features of the N2 historical obstacle models are updated as well.
Next, the optimization process of the self-attention module is described.
For example, FIG. 8C illustrates a schematic diagram of an optimization process for a self-attention module, according to some embodiments of the application.
As shown in fig. 8C, after the vehicle-mounted device obtains the corresponding query feature, key feature, and value feature, a matrix multiplication operation (e.g., dot product operation) may be performed on each query feature and all key features to obtain the attention score. This score reflects the degree of similarity or association between the query feature and each key feature.
The vehicle machine then normalizes the attention scores, e.g., by a softmax function, such that the sum of all the attention scores is 1. Thus, each attention score represents the weight of the corresponding value feature.
Then, the normalized attention score is assigned as a weight to the corresponding value feature, i.e., the attention score is multiplied, as a weight, by each value feature through matrix multiplication. This means that the value feature corresponding to the key feature with higher similarity to the query feature will receive a larger weight. Then, the results of multiplying all value features by their corresponding weights are summed; that is, the value features are weighted and summed with the attention scores as weights. This weighted sum is the output of the self-attention mechanism.
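As an illustration only, the following sketch implements the attention computation described above: dot-product scores between the query and key features, softmax normalization so that the scores sum to 1, and a weighted sum of the value features. The feature shapes are illustrative (one row per obstacle model).

    import numpy as np

    def attention(Q, K, V):
        scores = Q @ K.T                                         # similarity of each query to all keys
        scores = scores - scores.max(axis=-1, keepdims=True)     # for numerical stability
        weights = np.exp(scores)
        weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
        return weights @ V                                       # weighted sum of the value features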
It will be appreciated that the optimization process of the self-attention module in the inter-frame interaction and intra-frame interaction processes described above may refer to the process of fig. 8C.
After the updated target features of the N target obstacle models are obtained, the vehicle-mounted device may further readjust the target spatial data of the N target obstacle models based on the spatial data of the plurality of images, and the adjustment process refers to the flow of S304, so as to update the target spatial data of the N target obstacle models. It will be appreciated that so far, both the target space data and the target feature data for the N target obstacle models have been updated. In order to ensure the accuracy of the generated first environment image, the vehicle-mounted device may further update the target space data of the N target obstacle models multiple times based on the neural network to obtain third space data of the N target obstacle models, and similarly, the target features of the N target obstacle models may also be circularly optimized multiple times. For example, in some embodiments of the application, the third spatial data is obtained by performing 2-cycle optimization adjustments to the target spatial data.
For example, the vehicle-mounted device may cyclically optimize the updated target features and target spatial data of the N target obstacle models multiple times. One cycle of the optimization process is, for example: performing inter-frame interaction and intra-frame interaction on the target features; projecting the target spatial data into the coordinate space of the plurality of images and resampling the N target obstacle models to obtain resampled features of the N target obstacle models; adding and fusing the resampled features of the N target obstacle models with the intra-frame interacted features to update the target features; and further optimizing the target spatial data based on the updated target features to update the target spatial data.
In some embodiments of the present application, in the last cycle performed by the vehicle-mounted device, the features obtained after the intra-frame interaction may be taken as fourth features, and the features obtained by projecting the target spatial data into the coordinate space of the plurality of images and resampling the target obstacle models may be taken as the second features. It can be understood that the target spatial data at this point are the spatial data updated in the last cycle and may be taken as the third spatial data; that is, the second features are obtained by sampling in the coordinate space of the plurality of images based on the third spatial data. In the last cycle, the features obtained after fusing the second features and the fourth features may be taken as fifth features, and the vehicle-mounted device may generate the first environment image based on the fifth features.
By the above-described loop optimization process, the accuracy of generating the first environmental image can be improved.
The vehicle-mounted apparatus referred to in the above respective embodiments is described below.
For example, fig. 9 shows a schematic structural diagram of an in-vehicle apparatus 100 according to some embodiments of the present application.
The in-vehicle apparatus 100 may be used to implement the image processing method provided by the foregoing embodiments.
As shown in fig. 9, the in-vehicle device 100 includes one or more processors 101, a system memory 102, a non-volatile memory (NVM) 103, a communication interface 104, an input/output device 105, and a system control logic unit 106 for coupling the processor 101, the system memory 102, the non-volatile memory 103, the communication interface 104, and the input/output device 105. Wherein:
The processor 101 may include one or more processing units, e.g., processing modules or processing circuits that may include a central processing unit (CPU), a graphics processing unit (GPU), a digital signal processor (DSP), a microcontroller unit (MCU), an artificial intelligence (AI) processor, a field programmable gate array (FPGA), a neural-network processing unit (NPU), etc., and may include one or more single-core or multi-core processors. In some embodiments, the CPU may be configured to optimize the neural network model to be run, e.g., in some embodiments of the present application, the neural network model may optimize the spatial data of the first obstacle models, and the NPU may be configured to run the neural network model to be run.
The system memory 102 is a volatile memory such as random-access memory (RAM), double data rate synchronous dynamic random access memory (double data rate synchronous dynamic random access memory, DDR SDRAM), or the like. The system memory is used to temporarily store data and/or instructions, for example, in some embodiments, the system memory 102 may be used to store data provided by different services, such as sensor data, image data, or video data, etc., and may also be used to store instructions of the image processing methods provided by the foregoing embodiments, etc.
Nonvolatile memory 103 may include one or more tangible, non-transitory computer-readable media for storing data and/or instructions. In some embodiments, the non-volatile memory 103 may include any suitable non-volatile memory, such as flash memory, and/or any suitable non-volatile storage device, such as a hard disk drive (HDD), compact disc (CD), digital versatile disc (DVD), solid-state drive (SSD), and the like. In some embodiments, the nonvolatile memory 103 may also be a removable storage medium, such as a secure digital (SD) memory card or the like. In other embodiments, the nonvolatile memory 103 may be used to store instructions or the like of the image processing method provided in the foregoing embodiments.
In particular, system memory 102 and non-volatile storage 103 may include temporary and permanent copies of instructions 107, respectively. The instructions 107 may include instructions that, when executed by at least one of the processors 101, cause the in-vehicle apparatus 100 to implement the image processing method provided by the embodiments of the present application.
The communication interface 104 may include a transceiver to provide a wired or wireless communication interface for the in-vehicle device 100 to communicate with any other suitable device via one or more networks. In some embodiments, the communication interface 104 may be integrated with other components of the in-vehicle device 100, e.g., the communication interface 104 may be integrated in the processor 101. In some embodiments, the in-vehicle device 100 may communicate with other devices through the communication interface 104, e.g., the in-vehicle device 100 may obtain corresponding data from other devices through the communication interface 104, etc.
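As a minimal illustrative sketch only, the following Python fragment shows one conceivable way the in-vehicle device 100 could request data from another device over the communication interface 104; the host, port, and the simple length-prefixed framing are assumptions made for illustration and are not a protocol defined by this application.

# Hypothetical sketch: fetching data from a peer device over a TCP connection.
import socket
import struct


def _recv_exact(conn: socket.socket, n: int) -> bytes:
    # Read exactly n bytes, since recv() may return fewer bytes than requested.
    buf = b""
    while len(buf) < n:
        chunk = conn.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection early")
        buf += chunk
    return buf


def fetch_data(host: str, port: int, request: bytes, timeout: float = 1.0) -> bytes:
    """Send a request and read back a 4-byte length-prefixed response."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(struct.pack("!I", len(request)) + request)
        (length,) = struct.unpack("!I", _recv_exact(conn, 4))
        return _recv_exact(conn, length)


# Example usage (hypothetical peer device and request format):
# image_bytes = fetch_data("192.168.1.20", 9000, b"GET front_camera_frame")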
The input/output device 105 may include an input device such as a keyboard or a mouse, and an output device such as a display. A user may interact with the in-vehicle device 100 through the input/output device 105.
The system control logic 106 may include any suitable interface controller to provide any suitable interface with other modules of the in-vehicle device 100. For example, in some embodiments, the system control logic 106 may include one or more memory controllers to provide an interface to the system memory 102 and the non-volatile memory 103.
In some embodiments, at least one of the processors 101 may be packaged together with logic for one or more controllers of the system control logic unit 106 to form a system in package (SiP). In other embodiments, at least one of the processors 101 may also be integrated on the same chip with logic for one or more controllers of the system control logic unit 106 to form a system on chip (SoC).
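Purely for readability, the following sketch models the composition of the in-vehicle device 100 described above as a simple Python data structure; the class and field names are assumptions introduced here and do not correspond to any concrete product structure or claimed implementation.

# Hypothetical sketch: the main components of the in-vehicle device 100.
from dataclasses import dataclass, field
from typing import List


@dataclass
class InVehicleDevice:
    processors: List[str]                 # e.g., ["CPU", "NPU"]       (processor 101)
    system_memory_mb: int                 # volatile memory            (system memory 102)
    nonvolatile_storage_gb: int           # flash/SSD/SD card          (non-volatile memory 103)
    communication_interfaces: List[str]   # e.g., ["ethernet", "wifi"] (communication interface 104)
    io_devices: List[str] = field(default_factory=lambda: ["display"])  # input/output device 105
    # The system control logic unit 106 couples the components above; it is
    # assumed rather than modeled explicitly in this sketch.


device = InVehicleDevice(
    processors=["CPU", "NPU"],
    system_memory_mb=8192,
    nonvolatile_storage_gb=256,
    communication_interfaces=["ethernet"],
)
print(device)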
It is to be understood that the configuration of the in-vehicle apparatus 100 shown in fig. 9 is only one example. In other embodiments, the in-vehicle apparatus 100 may include more or fewer components than illustrated, certain components may be combined or split, or the components may be arranged differently. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
It is understood that the in-vehicle device 100 may be any device configured on a vehicle, including but not limited to a mobile phone, a car machine, a terminal in an unmanned vehicle (self-driving), a wireless terminal in transportation safety, a terminal in a smart city, and the like.
An embodiment of the present application also provides a program product which, when executed on the vehicle-mounted device, causes the vehicle-mounted device to implement the image processing method provided by the foregoing embodiments.
An embodiment of the present application also provides a readable storage medium storing one or more programs which, when executed by the vehicle-mounted device, cause the vehicle-mounted device to implement the image processing method provided in the foregoing embodiments.
Embodiments of the disclosed mechanisms may be implemented in hardware, software, firmware, or a combination of these implementations. Embodiments of the application may be implemented as a computer program or program code that is executed on a programmable system comprising at least one processor, a storage system (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.
Program code may be applied to input instructions to perform the functions described herein and generate output information. The output information may be applied to one or more output devices in a known manner. For the purposes of this disclosure, a processing system includes any system having a processor such as, for example, a digital signal processor, a microcontroller, an application specific integrated circuit, or a microprocessor.
The program code may be implemented in a high level procedural or object oriented programming language to communicate with a processing system. Program code may also be implemented in assembly or machine language, if desired. Indeed, the mechanisms described in the present application are not limited in scope by any particular programming language. In either case, the language may be a compiled or interpreted language.
In some cases, the disclosed embodiments may be implemented in hardware, firmware, software, or any combination thereof. The disclosed embodiments may also be implemented as instructions carried by or stored on one or more transitory or non-transitory machine-readable (e.g., computer-readable) storage media, which may be read and executed by one or more processors. For example, the instructions may be distributed over a network or through other computer-readable media. Thus, a machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer), including, but not limited to, floppy diskettes, optical disks, compact disc read-only memories (CD-ROMs), magneto-optical disks, read-only memories (ROMs), random-access memories (RAMs), erasable programmable read-only memories (EPROMs), electrically erasable programmable read-only memories (EEPROMs), magnetic or optical cards, flash memory, or a tangible machine-readable memory used to transmit information over the Internet in an electrical, optical, acoustical, or other form of propagated signal (e.g., carrier waves, infrared signals, digital signals, etc.). Thus, a machine-readable medium includes any type of machine-readable medium suitable for storing or transmitting electronic instructions or information in a form readable by a machine (e.g., a computer).
In the drawings, some structural or methodological features may be shown in a particular arrangement and/or order. However, it should be understood that such a particular arrangement and/or ordering may not be required. Rather, in some embodiments, these features may be arranged in a different manner and/or order than shown in the illustrative figures. Additionally, the inclusion of structural or methodological features in a particular figure is not meant to imply that such features are required in all embodiments; in some embodiments, such features may not be included or may be combined with other features.
It should be noted that, in the embodiments of the present application, each unit/module mentioned in each device is a logic unit/module. Physically, one logic unit/module may be one physical unit/module, may be a part of one physical unit/module, or may be implemented by a combination of multiple physical units/modules; the physical implementation of the logic unit/module itself is not the most important consideration, and the combination of functions implemented by these logic units/modules is the key to solving the technical problem posed by the present application. Furthermore, in order to highlight the innovative part of the present application, the above-described device embodiments do not introduce units/modules that are less closely related to solving the technical problem posed by the present application, which does not indicate that the above-described device embodiments do not have other units/modules.
It should be noted that, in the embodiments and description of the present application, relational terms such as first and second are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
While the application has been shown and described with reference to certain preferred embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the scope of the application.