METHOD AND DEVICE FOR GENERATING AN OUTSIDER PERSPECTIVE
IMAGE
TECHNICAL FIELD
[0001] The disclosure relates to image generation. Specifically, the disclosure relates to a method and a device for generating an outsider perspective image that incorporates edits based on text prompts.
BACKGROUND
[0002] Modern vehicles generally have larger blind spots for drivers as compared to older vehicles due to their design and shape. As a result, the traditional rear-view and side mirrors only provide a limited view of areas surrounding the vehicle, and the driver may miss seeing objects such as pedestrians, cyclists, or other vehicles that may be located in blind spots, which may lead to an increased risk of collision or accidents. There is therefore a rising need for surround view monitoring which may help the driver have better situational awareness, thereby potentially preventing accidents caused by collisions with objects in blind spots.
[0003] Existing 360-degree surround view systems suffer from image distortion and may not capture certain blind spots or obscured areas, potentially leading to safety concerns. Such existing systems use multiple cameras to capture images from different angles to create a single panoramic image by stitching the captured images together, potentially resulting in artifacts such as ghosting, blurring, or distortion, which may make it difficult for a driver to interpret the generated image. In addition, the quality of images captured by sensors may also be a limiting factor. For example, low light conditions or adverse weather or environmental conditions, such as rain or snow, may lead to images of reduced quality or visibility, such that a driver may find the resulting images presented hard to see or interpret.
[0004] Furthermore, drivers often interact with existing 360-degree surround view systems by swiping the display screen to get different views, which may lead to the driver losing control of the vehicle due to frequent interactions with the system. In addition, the driver may potentially miss important information about their surroundings if the appropriate viewpoint is not selected, or the rendered image has information missing, which may lead to accidents or near misses.
SUMMARY
[0005] It is the object of the disclosure to provide an improved method and device for generating outsider perspective images that incorporate edits based on text prompts, especially for drivers, in order to alleviate the above-described problems.
[0006] The object is achieved by the subject matter of the independent claims. Preferred embodiments are subject-matter of the dependent claims.
[0007] It shall be noted that all embodiments of the present disclosure concerning a method or a series of performed steps might be carried out in the order of the steps as described; nevertheless, this does not have to be the only or essential order of the steps of the method. The herein presented methods or series of performed steps can be carried out with another order of the disclosed steps without departing from the respective method embodiment, unless explicitly mentioned to the contrary hereinafter.
[0008] To solve the above technical problems, the present disclosure provides a computer-implemented method for generating an outsider perspective image, the method comprising: receiving a plurality of images with depth information that capture surroundings of an object; representing the surroundings of the object as a continuous function based on the received plurality of images with depth information; refining the continuous function based on one or more text prompts; and generating an outsider perspective image based on the refined continuous function and a desired viewpoint.
[0009] The method of the present disclosure is advantageous over known methods as representing the surroundings of the object as a continuous function allows the reconstruction of detailed scene geometry, even with complex occlusions, which may result in higher quality outsider perspective images generated with reduced artifacts. The method is also advantageous as the refining of the continuous function using a machine learning model further removes any unwanted or occluding features, thereby potentially resulting in higher quality outsider perspective images generated with reduced unwanted or occluding features. The usage of text prompts also allows a user to easily interact with and refine the continuous function representative of the scene with minimal interactions and visual attention, thereby potentially reducing the number of user interactions required and potentially increasing user safety in applications such as automotive applications.
[0010] A preferred method of the present disclosure is a method as described above, wherein refining the continuous function based on one or more text prompts comprises using a conditioned diffusion model.
[0011] The above-described aspect of the present disclosure has the advantage that using a conditioned diffusion model with instructional edits to refine the continuous scene allows an intuitive and content-aware 3D editing on the continuous function, which may allow for photorealistic rendering of the scene surrounding the object with significantly improved image quality for ease of viewing and interpretation by a user or driver.
[0012] A preferred method of the present disclosure is a method as described above or as described above as preferred, wherein the one or more text prompts are from one or more of the following categories: addition or reconstruction of the object; scene refinement; and/or stylization.
[0013] The above-described aspect of the present disclosure has the advantage that undesirable features are removed from the continuous function representative of the scene surrounding the object, thereby potentially allowing photorealistic rendering of the scene surrounding the object with significantly improved image quality and fewer undesirable features that may make the generated outsider perspective image difficult to interpret or view.
[0014] A preferred method of the present disclosure is a method as described above or as described above as preferred, wherein the one or more text prompts are selected from a predefined list of editing instructions which are applied sequentially, or are generated based on voice input from a user which is converted to text using a speech-to-text model.
[0015] The above-described aspect of the present disclosure has the advantage that the methods of input of the one or more text prompts require minimal visual attention from a user (e.g., using predefined list or using vocal input), thereby potentially reducing the need for user interaction and input to potentially increase user safety when used in an automotive application as it reduces the duration that a driver has to take their eyes off the road.
[0016] A preferred method of the present disclosure is a method as described above or as described above as preferred, wherein refining the continuous function comprises using mask-free techniques.
[0017] The above-described aspect of the present disclosure has the advantage that using mask-free techniques reduces the need for user interaction and input, thereby potentially increasing user safety when used in an automotive application as it reduces the number of inputs and interactions required from the driver.
[0018] A preferred method of the present disclosure is a method as described above or as described above as preferred, wherein the desired viewpoint is predicted using a viewpoint prediction model configured to receive as input a bird's-eye view image rendered using the refined continuous function; and optionally, additional inputs comprising one or more of the following: previously predicted desired viewpoints; previously generated bird's-eye view images; a lidar signal; a radar signal; a turn or blinker signal; a steering wheel position signal; and/or a gear position signal.
[0019] The above-described aspect of the present disclosure has the advantage that the usage of a viewpoint prediction model allows the prediction of a potentially more optimal viewpoint based on the scene context, thereby potentially allowing the generation of a more optimal perspective image which may reduce the need for user input or interaction. The additional inputs may also be provided as input to allow the prediction of an even more optimal viewpoint as the additional inputs provide more information on the scene context.
[0020] A preferred method of the present disclosure is a method as described above or as described above as preferred, wherein the viewpoint prediction model is trained using reinforcement learning techniques, and preferably model-free reinforcement learning techniques.
[0021] The above-described aspect of the present disclosure has the advantage that reinforcement learning techniques do not require the specific collection of a training dataset.
In addition, model-free reinforcement learning techniques may be preferred as the policy network may be able to learn directly from experience without the need for a model of the environment, and may be computationally less intensive as compared to model-based reinforcement learning techniques. In addition, certain model-free reinforcement learning techniques such as DQN-like algorithms can better handle high-dimensional state spaces, which may make them advantageous in image-based environments.
[0022] A preferred method of the present disclosure is a method as described above or as described above as preferred, wherein the viewpoint prediction model uses a regression approach, and preferably using a multi-modal fusion approach.
[0023] The above-described aspect of the present disclosure has the advantage that a regression approach allows the prediction of continuous values for the desired viewpoint coordinates (x, y, z, roll, pitch, yaw), which is a more dynamic approach which may potentially result in the prediction of a more accurate or optimal viewpoint as compared to predefined viewpoints which may miss or encompass only a small subset of all possible viewpoints.
[0024] A preferred method of the present disclosure is a method as described above or as described above as preferred, wherein the viewpoint prediction model uses a classification approach.
[0025] The above-described aspect of the present disclosure has the advantage that a classification approach ensures that the predicted results fall within the predefined viewpoints, thereby potentially minimizing the risk of obtaining unrealistic or imaginary viewpoints.
[0026] A preferred method of the present disclosure is a method as described above or as described above as preferred, further comprising refining the viewpoint prediction model, comprising: presenting the generated outsider perspective image on a display; receiving one or more interactions from a user adjusting the presented image; determining an adjusted viewpoint based on the adjusted image; and refining the viewpoint prediction model based on the adjusted viewpoint.
[0027] The above-described advantageous aspects of a computer-implemented method of the present disclosure also hold for all aspects of a below-described device of the present disclosure. All below-described advantageous aspects of a device of the present disclosure also hold for all aspects of an above-described computer-implemented method of the present disclosure.
[0028] The present disclosure also relates to a device for generating an outsider perspective image, the device comprising: a processor configured to perform the computer-implemented method of any of the preceding claims to generate an outsider perspective image; and a display configured to present the generated outsider perspective image.
[0029] A preferred device of the present disclosure is a device as described above, further comprising a plurality of sensors positioned to capture a plurality of images representative of a 360-degree view surrounding the object.
[0030] A preferred device of the present disclosure is a device as described above or as described above as preferred, wherein the object includes a vehicle.
[0031] A preferred device of the present disclosure is a device as described above or as described above as preferred, wherein the plurality of sensors are placed on a rooftop, front, and rear section of the vehicle, wherein the plurality of sensors comprises: a first plurality of sensors placed along a perimeter of the rooftop section of the vehicle, the first plurality of sensors having: a distance of between 0.05 to 0.3 m between sensors, preferably a distance of between 0.2 to 0.3 m between sensors positioned at sides of the rooftop and a distance of between 0.05 to 0.1 m between sensors positioned at corners of the rooftop; heights of between 0 to 10 cm from a rooftop surface of the vehicle, wherein the sensors preferably have the same height; pitch angles between -20 and -25 degrees; and yaw angles between 0 and 360 degrees; a second plurality of sensors placed on a front section of the vehicle, the second plurality of sensors positioned within an area between headlights and a front bumper of the vehicle and having: a distance of between 0.1 to 0.2 m between sensors; pitch angles between -20 and -25 degrees; and yaw angles between 0 and 30 degrees; and a third plurality of sensors placed on a rear section of the vehicle, the third plurality of sensors positioned within an area between brake lights and a rear bumper of the vehicle and having: a distance of between 0.1 to 0.2 m between sensors; pitch angles between -20 and -25 degrees; and yaw angles between 180 and 220 degrees.
[0032] The above-described aspect of the present disclosure has the advantage that the disclosed arrangement of sensors facilitates the complete capture of a 360-degree scene around the ego vehicle, to eliminate or minimise any blind spots. The disclosed arrangement also ensures that the images captured provide an ample degree of overlap between images, to facilitate the generation of a potentially seamless continuous function representative of the scene without any unwanted artifacts. In addition, the disclosed arrangement also provides potentially more photorealistic rendering for the continuous function in both inward and outward directions from an ego vehicle.
[0033] The above-described advantageous aspects of a computer-implemented method or device of the present disclosure also hold for all aspects of a below-described use of a computer-implemented method or a device of the present disclosure. All below-described advantageous aspects of a use of a computer-implemented method or device of the present disclosure also hold for all aspects of an above-described computer-implemented method or
device of the present disclosure.
[0034] The present disclosure also relates to use of a computer-implemented method or device of the present disclosure for the display of a generated outsider perspective image.
[0035] The above-described advantageous aspects of a computer-implemented method, device, or use of a computer-implemented method or a device of the present disclosure also hold for all aspects of a below-described computer program, a machine-readable storage medium, or a data carrier signal of the present disclosure. All below-described advantageous aspects of a computer program, a machine-readable storage medium, or a data carrier signal of the present disclosure also hold for all aspects of an above-described computer-implemented method, device, or use of a computer-implemented method or a device of the
present disclosure.
[0036] The present disclosure also relates to a computer program, a machine-readable storage medium, or a data carrier signal that comprises instructions, that upon execution on a processor, cause the processor to perform the steps of a computer-implemented method according to the present disclosure.
[0037] In some embodiments, the present disclosure provides a multi-view camera setup to capture a 360-degree scene around an object, such as a vehicle. The series of multi-view camera images may be used to represent the scene as a continuous 3-dimensional (3D) function. The continuous 3-dimensional function may be generated using neural radiance field (NeRF) method for the reconstruction of detailed scene geometry, even with complex occlusions, ultimately resulting in reduced artifacts. Instructional edits may be performed on the 3D function to accomplish a wide variety of edits on people, objects, and large scenes to generate images with improved image quality. Images from a desired viewpoint may be generated using the 3D function. The desired viewpoint may be an appropriate viewpoint predicted automatically based on a surrounding scene context, to minimise user interaction with the screen while driving. Reinforcement learning may also be incorporated to further train or personalise the prediction of appropriate viewpoint.
[0038] In some embodiments, the methods disclosed in the present disclosure may be used and/or adapted for use in a variety of applications. For example, the methods may be useful for the generation of outsider perspective images in the automotive field. For example, the methods may be useful in virtual reality applications, with 3D reconstruction used to create a virtual environment that closely resembles the real world, with a viewpoint prediction model used to identify a best viewpoint for the user to experience the virtual world while accounting for factors such as the user's position, orientation, and current task. For example, the methods may be useful in robotic applications, with the 3D reconstruction and viewpoint prediction model used to create virtual models of real-world environments, which may then be used to plan robot paths and trajectories.
BRIEF DESCRIPTION OF THE DRAWINGS
[0039] These and other features, aspects, and advantages will become better understood with regard to the following description, appended claims, and accompanying drawings where:
[0040] Fig. 1 is a schematic illustration of a computer-implemented method for generating an outsider perspective image, in accordance with embodiments of the present disclosure;
[0041] Fig. 2 is a schematic illustration of a method of predicting a desired viewpoint using a viewpoint prediction model trained using reinforcement learning techniques, in accordance with embodiments of the present disclosure;
[0042] Fig. 3 is a schematic illustration of a method of predicting a desired viewpoint using a classification approach, in accordance with embodiments of the present disclosure;
[0043] Fig. 4 is a schematic illustration of a method of predicting a desired viewpoint using a classification approach incorporating additional inputs, in accordance with embodiments of the present disclosure;
[0044] Fig. 5 is a schematic illustration of a method of predicting a desired viewpoint using a regression approach, in accordance with embodiments of the present disclosure;
[0045] Fig. 6 is a schematic illustration of a method of refining a viewpoint prediction model, according to embodiments of the present disclosure;
[0046] Fig. 7 is a schematic illustration of a device for implementing some embodiments of the present disclosure; and
[0047] Fig. 8 is a schematic illustration of an arrangement of a plurality of sensors placed on a vehicle, in accordance with embodiments of the present disclosure.
[0048] In the drawings, like parts are denoted by like reference numerals.
[0049] It should be appreciated by those skilled in the art that any block diagrams herein represent conceptual views of illustrative systems embodying the principles of the present subject matter. Similarly, it will be appreciated that any flow charts, flow diagrams, state transition diagrams, pseudo code, and the like represent various processes which may be substantially represented in computer readable medium and executed by a computer or processor, whether or not such computer or processor is explicitly shown.
DETAILED DESCRIPTION
[0050] In the summary above, in this description, in the claims below, and in the accompanying drawings, reference is made to particular features (including method steps) of the disclosure. It is to be understood that the disclosure in this specification includes all possible combinations of such particular features. For example, where a particular feature is disclosed in the context of a particular aspect or embodiment of the disclosure, or a particular claim, that feature can also be used, to the extent possible, in combination with and/or in the context of other particular aspects and embodiments of the disclosure, and in the disclosure generally.
[0051] In the present document, the word "exemplary" is used to mean "serving as an example, instance, or illustration." Any embodiment or implementation of the present subject matter described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
[0052] While the disclosure is susceptible to various modifications and alternative forms, specific embodiments thereof have been shown by way of example in the drawings and will be described in detail below. It should be understood, however, that it is not intended to limit the disclosure to the forms disclosed, but on the contrary, the disclosure is to cover all modifications, equivalents, and alternatives falling within the scope of the disclosure.
[0053] Fig. 1 is a schematic illustration of a computer-implemented method 100 for generating an outsider perspective image, in accordance with embodiments of the present disclosure. Method 100 for generating an outsider perspective image may be used in and/or adapted for use in a wide variety of applications, including automotive, virtual reality and/or robotic applications.
[0054] According to some embodiments, method 100 may comprise step S2 wherein a plurality of images 104 with depth information is received. The plurality of images 104 with depth information captures the surroundings of an object. The object may be any object, including a vehicle, a robot, or a human. In some embodiments, the plurality of images 104 with depth information may be captured by a plurality of sensors comprising both light and depth sensors, such as RGBD sensors. In some embodiments, the plurality of images 104 with depth information may comprise images captured by light sensors (e.g., RGB sensors) combined with depth images captured by depth sensors (e.g., lidar sensors) that are calibrated and synchronised with the light sensors. In some embodiments, the plurality of images 104 with depth information may comprise images captured by light sensors (e.g., RGB sensors), with the depth information determined based on the images captured by the light sensors using any known algorithms or models for monocular depth estimation, such as the MiDaS model available at https://github.com/isl-org/MiDaS and described in "Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer" by Ranftl et al. (arXiv:1907.01341).
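By way of a non-limiting illustration, the following is a minimal sketch of monocular depth estimation with the MiDaS model referenced above, using the torch.hub entry points published in the MiDaS repository; the model identifier, the example file name, and CPU execution are assumptions for illustration and may differ between MiDaS releases.

```python
# Hedged sketch: estimating depth for one RGB camera frame with MiDaS via torch.hub.
import cv2
import torch

midas = torch.hub.load("intel-isl/MiDaS", "DPT_Large")            # depth model (assumed id)
midas_transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
transform = midas_transforms.dpt_transform                        # resize + normalise
midas.eval()

img = cv2.cvtColor(cv2.imread("camera_frame.png"), cv2.COLOR_BGR2RGB)  # hypothetical frame
batch = transform(img)

with torch.no_grad():
    prediction = midas(batch)                                      # relative inverse depth
    depth = torch.nn.functional.interpolate(
        prediction.unsqueeze(1), size=img.shape[:2],
        mode="bicubic", align_corners=False,
    ).squeeze()
# 'depth' can now be paired with the RGB frame to form an image with depth information.
```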
[0055] According to some embodiments, method 100 may comprise step S4 wherein the surroundings of the object is represented as a continuous function 108 based on the plurality of images 104 with depth information. In some embodiments, continuous function 108 may be a neural radiance field (NeRF) as disclosed and described in detail in "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis" by Mildenhall et.
al. (arXiv:2003.08934). In general, NeRFs are fully connected neural networks that can generate novel views of complex three-dimensional (3D) scenes based on a partial set of input two-dimensional (2D) images of a scene captured from various viewpoints. A neural network generally comprises an input layer comprising one or more input nodes, one or more hidden layers each comprising one or more hidden nodes, and an output layer comprising one or more output nodes. A fully connected neural network, also known as a multilayer perceptron (MLP), is a type of neural network comprising a series of fully connected layers that connect every neuron in one layer to every neuron in the preceding and subsequent layers. In general, NeRF is capable of representing a scene via values of parameters of a fully connected neural network. In some embodiments, NeRFs are trained with a rendering loss to reproduce input views of a scene and work by taking multiple input images representing a scene and interpolating between the multiple input images to render the complete scene. A NeRF network may be trained to map directly from viewing direction and location (5D input) to density and colour (4D output), using volume rendering to render new views. In particular, a continuous scene is represented as a 5D vector-valued function whose input is a 3D location x = (x, y, z) and 2D viewing direction (θ, φ), and whose output is an emitted colour c = (r, g, b) and volume density σ. In some embodiments, a NeRF may comprise eight fully connected ReLU layers, each with 256 channels, followed by an additional layer that outputs the volume density σ and a 256-dimensional feature vector, which is concatenated with the positional encoding of the input viewing direction, processed by an additional fully connected ReLU layer with 128 channels, and passed onto a final layer with a sigmoid activation which outputs the emitted RGB radiance at position x as viewed by a ray with direction d. In some embodiments, the positional encoding of the input location (γ(x)) may be input into the first layer of the NeRF, as well as through a skip connection that concatenates the input location into the fifth layer's activation.
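The architecture described in the preceding paragraph can be sketched as follows. This is a hedged, minimal PyTorch illustration rather than the exact network of the NeRF paper; the positional-encoding sizes pos_dim and dir_dim are assumptions corresponding to the commonly used 10 and 4 frequency bands.

```python
# Minimal sketch of the fully connected NeRF scene function described above.
import torch
import torch.nn as nn

class NeRFMLP(nn.Module):
    def __init__(self, pos_dim=63, dir_dim=27, width=256):
        super().__init__()
        layers, in_dim = [], pos_dim
        for i in range(8):                         # eight 256-channel ReLU layers
            if i == 5:                             # skip connection: concat gamma(x)
                in_dim = width + pos_dim
            layers.append(nn.Linear(in_dim, width))
            in_dim = width
        self.backbone = nn.ModuleList(layers)
        self.sigma = nn.Linear(width, 1)           # volume density head
        self.feature = nn.Linear(width, width)     # 256-dimensional feature vector
        self.rgb_hidden = nn.Linear(width + dir_dim, 128)
        self.rgb = nn.Linear(128, 3)               # emitted colour head

    def forward(self, gamma_x, gamma_d):
        h = gamma_x
        for i, layer in enumerate(self.backbone):
            if i == 5:
                h = torch.cat([h, gamma_x], dim=-1)
            h = torch.relu(layer(h))
        sigma = torch.relu(self.sigma(h))          # non-negative density (assumed activation)
        feat = self.feature(h)
        h = torch.relu(self.rgb_hidden(torch.cat([feat, gamma_d], dim=-1)))
        rgb = torch.sigmoid(self.rgb(h))           # colour in [0, 1]
        return rgb, sigma
```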
[0056] According to some embodiments, method 100 may comprise step S6 wherein the continuous function 108 may be refined based on one or more text prompts 112 to generate a refined continuous function 116. In some embodiments, step S6 of refining the continuous function 108 may comprise using mask-guided techniques which rely on object-level supervision to constrain regions that may be modified, or mask-free techniques which can address image retouching or refinement without mask guidance, and preferably mask-free techniques. In some embodiments, continuous function 108 may be refined based on one or more text prompts 112 using a machine learning model, in particular a conditioned diffusion model such as InstructPix2Pix as disclosed in "InstructPix2Pix: Learning to follow image editing instructions" by Brooks et al. (arXiv:2211.09800).
[0057] According to some embodiments, step S6 may comprise receiving the continuous function 108 along with its corresponding source data such as the plurality of images 104 with or without depth information, their corresponding sensor poses, and sensor calibration information (which may typically be from a structure-from-motion system such as COLMAP) as well as one or more text prompts 112 which may be a natural language editing instruction. An example of a text prompt may be "improve the image quality by eliminating rain droplets". The one or more text prompts 112 may be from one or more of the following categories: addition or reconstruction of the object; scene refinement; and/or stylization. Addition or reconstruction of the object may assist in visualising and locating the object in any images subsequently generated. Scene refinement may comprise techniques that improve or enhance the clarity, quality, and/or details of images, such as noise removal or reduction, super-resolution, deblurring, contrast enhancement, colour correction, brightness adjustment, sharpness enhancement, environmental condition removal (e.g., removal of snow, rain, haze, and fog), inpainting, etc. In some embodiments, the one or more text prompts 112 may be selected from a predefined list of editing instructions which may be applied sequentially. In some embodiments, a universal text instruction set may be utilised in the form of sequential prompts. In some embodiments, the universal text instruction set may be defined by the source code or may be sourced from the dataset. The following is an example of text prompts that may be applied sequentially:
1. Enhance the image by eliminating unwanted weather effects like snow, rain, haze
2. Increase the overall brightness of the image
3. Deblur all the traffic participants
4.
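As an illustrative, non-authoritative sketch of applying such a predefined list of editing instructions sequentially to a single rendered view, the InstructPix2Pix conditioned diffusion model can be called through the Hugging Face diffusers pipeline as below; the checkpoint name, the file names, and the guidance parameters are assumptions, and in the disclosed method the edits would be consolidated in 3D rather than applied to a single image.

```python
# Hedged sketch: sequential text-prompted edits of one rendered view with InstructPix2Pix.
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
from PIL import Image

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16   # assumed checkpoint
).to("cuda")

edit_instructions = [
    "Enhance the image by eliminating unwanted weather effects like snow, rain, haze",
    "Increase the overall brightness of the image",
    "Deblur all the traffic participants",
]

image = Image.open("rendered_view.png").convert("RGB")        # hypothetical source view
for prompt in edit_instructions:                              # apply the prompts sequentially
    image = pipe(prompt, image=image,
                 num_inference_steps=20,
                 image_guidance_scale=1.5).images[0]
image.save("refined_view.png")
```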
[0058] According to some embodiments, the one or more text prompts 112 may be generated based on a voice input from a user which is converted to text using a speech-to-text model. In some embodiments, the speech-to-text model may be configured to convert speech instructions from a user into text and forward the text as inputs into the machine learning model configured to refine the continuous function 108. An example of a speech-to-text model is DeepSpeech, available at https://github.com/mozilla/DeepSpeech.
[0059] According to some embodiments, step S6 may comprise iteratively updating the source plurality of images 104 at the captured viewpoints with the conditioned diffusion model, with the edits consolidated in 3D using standard NeRF training to generate refined continuous function 116 that was subject to the one or more text prompts 112, as well as edited versions of the plurality of images 104. More information on a method of refining the continuous function 108 may be found in "Instruct-NeRF2NeRF: Editing 3D Scenes with Instructions" by Haque et al. (arXiv:2303.12789v2). It is appreciable that any known method or machine learning model capable of refining a continuous function representative of a scene based on one or more text prompts may be employed.
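A hedged skeleton of such an iterative dataset-update loop is given below, assuming the diffusion pipeline of the earlier sketch; render_view and train_step are hypothetical placeholders for the NeRF rendering and optimisation routines, so this is a sketch of the idea rather than the Instruct-NeRF2NeRF implementation.

```python
# Hedged sketch: captured views are periodically replaced by diffusion-edited renders
# while standard NeRF training continues, so the text edit is consolidated in 3D.
import random

def refine_scene(nerf_model, source_views, camera_poses, pipe, prompt,
                 iterations=10_000, edit_every=10):
    for it in range(iterations):
        if it % edit_every == 0:
            # pick one captured viewpoint and replace its training image with an
            # edited render conditioned on the current render and the text prompt
            idx = random.randrange(len(source_views))
            rendered = render_view(nerf_model, camera_poses[idx])       # hypothetical helper
            source_views[idx] = pipe(prompt, image=rendered,
                                     image_guidance_scale=1.5).images[0]
        # one ordinary NeRF optimisation step on the (partially edited) dataset
        train_step(nerf_model, source_views, camera_poses)              # hypothetical helper
    return nerf_model
```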
[0060] According to some embodiments, method 100 may comprise step S8 wherein an outsider perspective image 120 is generated based on the refined continuous function 116 and a desired viewpoint 124. The desired viewpoint 124 may be any viewpoint. In some embodiments, the desired viewpoint 124 may be provided by a user, or a predefined viewpoint. In some embodiments, the desired viewpoint 124 may be a viewpoint predicted using a viewpoint prediction model 128. In some embodiments, the viewpoint prediction model 128 may be trained using reinforcement learning techniques, and preferably model-free reinforcement learning techniques. In some embodiments, the viewpoint prediction model 128 may use a classification approach wherein the predicted viewpoint falls into predefined or fixed classes, or a regression approach, wherein the predicted viewpoints are not predefined or fixed. In some embodiments, the viewpoint prediction model 128 may predict the desired viewpoint 124 using a rendered bird's-eye view image 132 generated using the refined continuous function 116 or the continuous function 108, wherein a bird's-eye view image is an image from an elevated view of the object and/or its location from a very steep viewing angle or viewpoint. In some embodiments, the viewpoint prediction model 128 may receive one or more additional inputs 136 to predict the desired viewpoint 124. The one or more additional inputs 136 may include: previously predicted desired viewpoints, previously generated bird's-eye view images, a lidar signal, a radar signal, a turn or blinker signal, a steering wheel position signal, and/or a gear position signal. The one or more additional inputs 136 may be received from one or more sensors of the object and/or a memory.
[0061] In some embodiments, step S8 of generating an outsider perspective image 120 may comprise computing a plurality of rays that go from a virtual camera at the desired viewpoint 124 through the scene; sampling at least one point along each of the plurality of rays to generate a sampled set of locations; using the sampled set of locations and their corresponding viewing direction as input to the continuous function (or NeRF network) to produce an output set of colours and densities; and accumulating the output set of colours and densities using classical volume rendering techniques to generate a novel image of the scene at the desired viewpoint 124. In some embodiments, the NeRF can be generally considered to be based on a hierarchical structure. Specifically, the general overall NeRF network architecture can, for example, be based on a coarse network and a fine network. In this regard, a first scene function (e.g., a "coarse" scene function) can, for example, be evaluated at various points (e.g., a first set of points) along the rays corresponding to each image pixel, and based on the density values at these coarse points (e.g., evaluation at various points along the rays) another set of points (e.g., a second set of points) can be re-sampled along the same rays. A second scene function (e.g., a "fine" scene function) can be evaluated at the re-sampled points (e.g., the second set of points) to facilitate in obtaining resulting (fine) densities and colours usable in, for example, NeRF rendering (e.g., the volume rendering mechanism associated with NeRF). In a general example, to enable gradient updates of the "coarse" scene function, NeRF can be configured to reconstruct pixel colours based on outputs associated with both the "coarse" scene function and the "fine" scene function, and minimize a sum of the coarse and fine pixel errors. Although fully connected neural networks have been generally mentioned, it is appreciable that other parametric functions designed for 3D rendering can possibly be applicable. An example can be PlenOctrees (i.e., an octree-based 3D representation which supports view-dependent effects). In this regard, it is appreciable that fully connected neural networks and/or other parametric functions designed for 3D rendering like PlenOctrees can possibly be applicable.
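The accumulation of colours and densities along a ray mentioned above amounts to classical alpha compositing; a minimal sketch for a single ray is given below, with the tensor shapes and the choice of sample depths t_vals assumed for illustration.

```python
# Hedged sketch: alpha-compositing sampled colours/densities along one ray into a pixel colour.
import torch

def composite_ray(rgb, sigma, t_vals):
    """rgb: (N, 3) colours, sigma: (N,) densities, t_vals: (N,) sample depths along the ray."""
    deltas = torch.cat([t_vals[1:] - t_vals[:-1],
                        torch.full((1,), 1e10)])           # distance between adjacent samples
    alpha = 1.0 - torch.exp(-sigma * deltas)               # opacity of each ray segment
    # transmittance: probability the ray reaches each sample unoccluded
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha + 1e-10]), dim=0)[:-1]
    weights = alpha * trans
    return (weights[:, None] * rgb).sum(dim=0)             # accumulated pixel colour (3,)
```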
[0062] Fig. 2 is a schematic illustration of a method 200 of predicting a desired viewpoint 124 using a viewpoint prediction model 128 trained using reinforcement learning techniques, in accordance with embodiments of the present disclosure. According to some embodiments, method 200 may comprise step S20 wherein the current bird's-eye view image 132 (i.e., bird's-eye view image generated at a current timepoint or iteration) and a plurality of historical bird's-eye view images 204 (i.e., bird's-eye view images generated at previous timepoints or iterations) are received and input into an object detection and localization module 208 to generate a bird's-eye view image input representation 212. According to some embodiments, object detection and localization module 208 may be configured to generate the bird's-eye view image input representation 212 comprising three parts (a minimal rendering sketch is given after the list below): * Map: a map comprising information on road geometry. In some embodiments, the drivable roads may be rendered as grey polygons, with undrivable parts rendered as black.
* Historical detected objects: historical bounding boxes of detected surrounding objects (e.g., vehicles, bicycles, pedestrians) in a past sliding window, the sliding window corresponding to the time frame of the plurality of historical bird's-eye view images. In some embodiments, the bounding boxes may be represented as green boxes. In some embodiments, the bounding boxes may be determined using any known object detection algorithms, such as YOLOv8 obtainable from https://github.com/ultralytics/ultralytics.
* Historical ego object state: historical ego object states represented as boxes. In some embodiments, the boxes may be represented as red boxes. State estimation may be determined by vision-based localization algorithms or a combination of Global Navigation Satellite System (GNSS), an inertial measurement unit, and a digital map.
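A minimal sketch of rendering this three-part input representation with OpenCV is given below; the canvas size, the pixel box format, and the colour conventions (BGR) are assumptions for illustration.

```python
# Hedged sketch: grey drivable area, green detected-object boxes, red ego-state boxes.
import cv2
import numpy as np

def render_bev_input(drivable_polygons, detected_boxes, ego_boxes, size=256):
    canvas = np.zeros((size, size, 3), dtype=np.uint8)               # undrivable parts: black
    for poly in drivable_polygons:                                   # grey road polygons
        cv2.fillPoly(canvas, [np.asarray(poly, dtype=np.int32)], (128, 128, 128))
    for (x1, y1, x2, y2) in detected_boxes:                          # green: detected objects
        cv2.rectangle(canvas, (x1, y1), (x2, y2), (0, 255, 0), thickness=-1)
    for (x1, y1, x2, y2) in ego_boxes:                               # red: historical ego states
        cv2.rectangle(canvas, (x1, y1), (x2, y2), (0, 0, 255), thickness=-1)
    return canvas
```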
[0063] According to some embodiments, method 200 may comprise step S22 wherein the bird's-eye view image input representation 212 is encoded into a low dimensional latent state encoding 216 to reduce the input complexity. In some embodiments, the low dimensional latent state encoding 216 may be generated using any known encoder algorithms. An example of an encoder algorithm is a variational auto-encoder (VAE), wherein details of the architecture and training may be found at least in sections III.B and V.B of "Model-free Deep Reinforcement Learning for Urban Autonomous Driving" by Chen et al. (arXiv:1904.09503v2). In some embodiments, the VAE may be trained on a dataset comprising between 10,000 to 50,000 bird's-eye view images collected from various driving scenarios, wherein the bird's-eye view images are raw images without labels.
[0064] According to some embodiments, method 200 may comprise step S24 wherein the low dimensional latent state encoding 216 is fed into a trained policy network 220 (corresponding to the viewpoint prediction model 128) to predict coordinates (x, y, z, roll, pitch, yaw) of the desired viewpoint 124. In some embodiments, the trained policy network 220 may be trained using model-free reinforcement learning techniques. In automotive applications, the trained policy network 220 may be trained using model-free reinforcement learning techniques using any known driving simulators, such as the CARLA simulator available at https://carla.org/. In particular, the model-free reinforcement learning techniques are used to find an optimal policy network π* which optimizes the expected future total rewards based on an input state s, the policy network π* trained to output an action a based on an input state s. The state s may be the low dimensional latent state encoding 216, and the action a may be the coordinates of the desired viewpoint 124. Any known algorithms may be used for reinforcement learning, for example Double Deep Q-Network (DDQN), Twin Delayed Deep Deterministic Policy Gradient (TD3) or Soft Actor Critic (SAC), which are described at least in sections IV and V of Chen et al. In some embodiments, the reward term for an automotive application may be a four-term reward function r defined as follows (a simple code transcription is provided after the list): r = r1 + r2 + r3 + c, where:
* r1 is a term that penalizes undesired viewpoint predictions, where r1 = -1 if the user switches to manual mode to select the desired viewpoint, otherwise r1 = 0;
* r2 is a term that encourages the ego vehicle object to be in a moving state, which is set equal to the speed of the ego vehicle object;
* r3 is a term that penalizes collision with surrounding vehicles, where r3 = -10 if there is a collision, otherwise r3 = 0; and
* c is a constant that may be set to -0.1, which may be used to penalize the ego vehicle object for stopping.
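The following is a simple transcription of this four-term reward as code; the per-step flags and the ego speed value are hypothetical inputs that would come from the simulator.

```python
# Hedged sketch of the four-term reward r = r1 + r2 + r3 + c described above.
def viewpoint_reward(user_switched_to_manual: bool,
                     ego_speed: float,
                     collided: bool,
                     c: float = -0.1) -> float:
    r1 = -1.0 if user_switched_to_manual else 0.0   # penalize undesired viewpoint predictions
    r2 = ego_speed                                   # encourage a moving ego vehicle
    r3 = -10.0 if collided else 0.0                  # penalize collisions
    return r1 + r2 + r3 + c                          # constant c penalizes stopping
```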
[0065] In some embodiments, additional techniques such as frame skip and new exploration strategies disclosed in Chen et al. may be included for the training of the trained policy network 220.
[0066] Fig. 3 is a schematic illustration of a method 300 of predicting a desired viewpoint 124 using a classification approach, in accordance with embodiments of the present disclosure. According to some embodiments, method 300 of predicting a desired viewpoint 124 using a classification approach may comprise using a machine learning model 304 or classification model 304 to predict the desired viewpoint 124. According to some embodiments, method 300 may comprise step S30 of feeding the bird's-eye view image 132 into an image encoder 308 to extract image features for the bird's-eye view image 132. The image encoder 308 may be any known machine learning model, such as ResNet-50 disclosed in "Deep Residual Learning for Image Recognition" by He et al. (arXiv:1512.03385) and Vision Transformer (ViT) disclosed in "An image is worth 16x16 words: Transformers for image recognition at scale" by Dosovitskiy et al. (arXiv:2010.11929v2).
[0067] In some embodiments, method 300 may comprise step S32 of feeding predefined viewpoint classes 3161-316n into a camera encoder 332 to extract features of the predefined viewpoint classes 3161-316n. In some embodiments, the viewpoint classes 3161-316n may be expressed in text form, and the camera encoder 332 may be a text encoder. Examples of predefined viewpoint classes include Topview, TopDownFront, TopDownRear, FrontLeft, FrontRight, FrontCenter, FrontCenterLeft, FrontCenterRight, RearMiddle, RearMiddleRight, and RearMiddleLeft, with a configuration file holding the mapping or association of each class to specific viewpoint coordinates. An example of a text encoder is a Transformer as disclosed in "Attention is all you need" by Vaswani et al. (arXiv:1706.03762) and potentially with the architecture modifications described in "Language models are unsupervised multitask learners" by Radford et al.
[0068] According to some embodiments, method 300 may comprise step S34 wherein the image features output from image encoder 308 and the features of the predefined viewpoint classes 3161-316n output from the camera encoder 332 are fed into machine learning model 304 (corresponding to the viewpoint prediction model 128) which compares the image features output from image encoder 308 and the features of the predefined viewpoint classes 3161-316n output from the camera encoder 332 against image-class pairs to estimate a most probable image-class pair. In some embodiments, machine learning model 304 may be a zero-shot linear classifier synthesized by the camera encoder 332. The most probable image-class pair is then output by the machine learning model 304 in step S36, wherein the desired viewpoint 124 corresponds to the viewpoint coordinates associated with the class of the estimated best image-class pair.
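As a hedged illustration of the zero-shot classification step, a CLIP-style image encoder and text (camera) encoder can score the bird's-eye view image against the predefined viewpoint classes as sketched below; the pretrained checkpoint, the file name, and the coordinate table are assumptions, and in the disclosure the encoders would instead be trained on bird's-eye-view/class pairs as described in the next paragraph.

```python
# Hedged sketch: zero-shot viewpoint-class selection with a CLIP-style model.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

VIEWPOINT_COORDS = {                 # hypothetical class -> (x, y, z, roll, pitch, yaw) mapping
    "Topview": (0.0, 0.0, 12.0, 0.0, -90.0, 0.0),
    "FrontLeft": (-2.0, 3.0, 2.0, 0.0, -15.0, 35.0),
    "RearMiddle": (0.0, -6.0, 2.5, 0.0, -10.0, 180.0),
}

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")      # assumed checkpoint
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

bev = Image.open("bev_render.png").convert("RGB")                      # hypothetical rendered BEV
classes = list(VIEWPOINT_COORDS)
inputs = processor(text=classes, images=bev, return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)           # image-class similarity
best_class = classes[int(probs.argmax())]
desired_viewpoint = VIEWPOINT_COORDS[best_class]
```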
[0069] In some embodiments, the image encoder 308 and camera encoder 332 may be jointly trained to predict the correct pairings of a batch of (image, class) training examples. An example of the joint training method is Contrastive Language-Image Pre-training (or CLIP) as disclosed in "Learning Transferable Visual Models From Natural Language Supervision" by Radford et al. (arXiv:2103.00020v1). In some embodiments, the image encoder 308 and camera encoder 332 may be trained on a dataset comprising image-class pairs compiled from diverse driving scenarios, each image-class pair comprising a bird's-eye view image and a corresponding viewpoint class, wherein the viewpoint class corresponds to a driver's chosen viewpoint for a specific situation or driving scenario. In some embodiments, the available viewpoint classes may be predefined and may include options such as "Topview, TopDownFront, TopDownRear, FrontLeft, FrontRight, FrontCenter, FrontCenterLeft, FrontCenterRight, RearMiddle, RearMiddleRight, and RearMiddleLeft", with a configuration file comprising the mapping or association of each class to specific viewpoint coordinates. In some embodiments, the dataset may comprise between 10,000 to 20,000 image-class pairs per class, wherein the number of classes depends on the predefined configuration. For example, if there are N classes, the dataset may comprise between 10,000N to 20,000N image-class pairs in total.
[0070] Fig. 4 is a schematic illustration of a method 400 of predicting a desired viewpoint 124 using a classification approach incorporating additional inputs 136, in accordance with embodiments of the present disclosure. According to some embodiments, additional inputs 136 may comprise sensor data such as lidar data 404 in the form of lidar point clouds and/or radar data 408 in the form of radar point clouds. According to some embodiments, additional inputs 136 may comprise signals 412 which are predominantly discrete or can be transformed into discrete values using one-hot encoding. Examples of such signals 412 include gear signals, steering wheel position signal, and turn or blinker signals.
[0071] According to some embodiments, method 400 may comprise step S40 wherein the bird's-eye view image 132, the lidar data 404 and/or the radar data 408 are fed into separate feature pyramid networks (FPNs) 4161-4163, each FPN 416 configured to extract modality-specific features 4201-4203. In some embodiments, the modality-specific features 4201-4203 may be aligned in step S42 and then aggregated or fused in step S44 to generate fused features 424. The fused features 424 may be concatenated with the values of the signals 412 in step S46 to generate a concatenation 428 which may be fed into a detection head 432 in step S48 to predict a viewpoint class 436. The details of the steps used in method 400, as well as the training and architecture of any machine learning models used in method 400, may be found at least in sections III and IV of "DeepFusion: A Robust and Modular 3D Object Detector for Lidars, Cameras and Radars" by Drews et al. (arXiv:2209.12729v2), with the modifications that the step of fusion transform in DeepFusion may be omitted as method 400 uses a bird's-eye view image as input instead of camera images, and the machine learning models are only trained with focal loss. In some embodiments, the machine learning models employed in method 400 may be trained on a dataset comprising between 10,000 to 30,000 frames, wherein each frame comprises a bird's-eye view image associated with a corresponding viewpoint class, and corresponding additional inputs 136, wherein any additional inputs 136 from sensors are calibrated and synced with the bird's-eye view image. In some embodiments, the viewpoint class corresponds to a driver's chosen viewpoint for a specific situation or driving scenario. In some embodiments, the available viewpoint classes may be predefined and may include options such as "Topview, TopDownFront, TopDownRear, FrontLeft, FrontRight, FrontCenter, FrontCenterLeft, FrontCenterRight, RearMiddle, RearMiddleRight, and RearMiddleLeft", with a configuration file comprising the mapping or association of each class to specific viewpoint coordinates.
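A heavily simplified, non-authoritative sketch of such a fusion classifier is given below; small convolutional encoders stand in for the per-modality feature pyramid networks, the pooling-based alignment and all channel sizes are assumptions, and the one-hot encoded discrete signals are concatenated with the fused features before the detection head as described above.

```python
# Hedged sketch: per-modality feature extraction, fusion, signal concatenation, classification head.
import torch
import torch.nn as nn

class ViewpointFusionClassifier(nn.Module):
    def __init__(self, num_classes=11, num_signal_bits=8, feat_dim=128):
        super().__init__()
        def encoder(in_ch):                        # stand-in for one modality FPN
            return nn.Sequential(
                nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.bev_enc = encoder(3)                  # bird's-eye view image
        self.lidar_enc = encoder(1)                # rasterised lidar point cloud
        self.radar_enc = encoder(1)                # rasterised radar point cloud
        self.head = nn.Sequential(                 # detection head on fused features + signals
            nn.Linear(3 * feat_dim + num_signal_bits, 256), nn.ReLU(),
            nn.Linear(256, num_classes))

    def forward(self, bev, lidar_bev, radar_bev, one_hot_signals):
        fused = torch.cat([self.bev_enc(bev),
                           self.lidar_enc(lidar_bev),
                           self.radar_enc(radar_bev)], dim=-1)
        return self.head(torch.cat([fused, one_hot_signals], dim=-1))
```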
[0072] Fig. 5 is a schematic illustration of a method 500 of predicting a desired viewpoint 124 using a regression approach, in accordance with embodiments of the present disclosure. In some embodiments, method 500 of predicting a desired viewpoint 124 using a regression approach may be adapted from the deep neural network model using a long short-term memory network (LSTM) as a backbone disclosed in "Content Assisted Viewport Prediction for Panoramic Video Streaming" by Tan et al. by having only two LSTM branches (trajectory LSTM and saliency map LSTM).
[0073] According to some embodiments, method 500 of predicting a desired viewpoint 124 using a regression approach may comprise using a deep learning model 504 or a multi-modal fusion model 504 (corresponding to the viewpoint prediction model 128). In some embodiments, at each prediction step, recently observed viewpoint coordinates (past decisions) and the current bird's-eye view image 132 may be input into the trained multi-modal fusion model 504 to predict the subsequent viewpoint coordinates as the desired viewpoint 124.
[0074] According to some embodiments, the deep learning model 504 or multi-modal fusion model 504 may comprise two LSTM (Long Short-Term Memory) branches: a first LSTM branch 508, and a second LSTM branch 512. The multi-modal fusion model 504 may be trained using the ADAM optimizer and Mean Absolute Error (MAE) as the loss function. In some embodiments, the deep learning model 504 or multi-modal fusion model 504 may be trained on a dataset comprising between 10,000 to 50,000 frames compiled from diverse driving scenarios, wherein each frame comprises a bird's-eye view image and associated viewpoint coordinates corresponding to a driver's chosen viewpoint for a specific situation or driving scenario.
[0075] According to some embodiments, the first LSTM branch 508 (also termed trajectory LSTM) may be configured to receive a sequence of n viewpoint coordinates from a historical time window of past viewpoint coordinates 516 and predict a series of m future coordinates in a prediction window. In particular, the first LSTM branch 508 may receive as input sequences of camera coordinates (x, y, z, roll, pitch, yaw) collected within a historical time window. The extent of the historical window may correspond to the depth of the analysis. In some embodiments, the first LSTM branch 508 may comprise a single layer of LSTM comprising 64 neurons and may incorporate a subtraction layer for point normalization after the input layer, and an Add layer to revert the values to their original state before output.
[0076] According to some embodiments, the second LSTM branch 512 (also termed saliency LSTM) may be configured to receive saliency maps 520 of the input bird's-eye view images 132 corresponding to each prediction step and predict m predictions in a prediction window. In particular, the second LSTM branch 512 may receive saliency maps 520 of bird's-eye view images 132 corresponding to each prediction step as input to generate a second set of predictions, i.e., a series of future coordinates. The saliency maps 520 identify the most visually significant or interesting regions within a video frame/image, which are usually areas within video frames/images that are temporally and visually salient over the course of the video/sequence of images. In some embodiments, the saliency maps 520 may be extracted from the bird's-eye view image 132 and historical bird's-eye view images 204 using the method disclosed in Section 3.3 of Tan et al. In some embodiments, temporal saliency maps 520 may be extracted by combining the results of a saliency map extraction and a background subtraction. A saliency map may be extracted for a specific video frame/image using a classical feature-intensive method known as Itti-Koch which decomposes the image into multiple feature channels based on characteristics such as intensity, edges, colours, and orientations, the feature channels then combined to identify areas of the image that are considered salient or visually significant. Background subtraction may be performed to reduce the areas that are less interesting or relevant, for example by using a Gaussian mixture-based background/foreground segmentation algorithm that aims to distinguish between the changing pixels (foreground) and the static or less changing background between continuous video frames. In some embodiments, the saliency maps 520 may go through 3 convolutional layers and max-pooling layers, followed by feature extraction using a dense layer and application of a flatten step and 1 dense layer to regress a coordinate (x, y, z, roll, pitch, yaw). In some embodiments, the second LSTM branch 512 may comprise a single layer of LSTM comprising 64 neurons and may incorporate a subtraction layer for point normalization after the input layer, and an Add layer to revert the values to their original state before output.
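A simplified PyTorch sketch of the two-branch model described in the preceding paragraphs is given below; the subtraction/Add normalisation layers and TimeDistributed details are omitted, the layer sizes beyond the 64-neuron LSTMs are assumptions, and the concatenation of the two branch outputs corresponds to the fusion described in the following paragraph. Training would use the ADAM optimizer with an L1 (MAE) loss as noted above.

```python
# Hedged sketch: trajectory LSTM + saliency-map LSTM, fused to regress the next viewpoint.
import torch
import torch.nn as nn

class ViewpointRegressor(nn.Module):
    def __init__(self, coord_dim=6, hidden=64, map_size=64):
        super().__init__()
        self.traj_lstm = nn.LSTM(coord_dim, hidden, batch_first=True)   # trajectory branch
        self.saliency_cnn = nn.Sequential(                              # 3 conv + max-pool stages
            nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),
            nn.Linear(32 * (map_size // 8) ** 2, hidden), nn.ReLU())
        self.sal_lstm = nn.LSTM(hidden, hidden, batch_first=True)       # saliency branch
        self.out = nn.Linear(2 * hidden, coord_dim)                     # fused regression head

    def forward(self, past_coords, saliency_maps):
        # past_coords: (B, T, 6); saliency_maps: (B, T, 1, H, W)
        _, (h_traj, _) = self.traj_lstm(past_coords)
        B, T, C, H, W = saliency_maps.shape
        feats = self.saliency_cnn(saliency_maps.view(B * T, C, H, W)).view(B, T, -1)
        _, (h_sal, _) = self.sal_lstm(feats)
        return self.out(torch.cat([h_traj[-1], h_sal[-1]], dim=-1))     # (x, y, z, roll, pitch, yaw)
```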
[0077] According to some embodiments, at each prediction step, the predictions from the first LSTM branch 508 (trajectory LSTM) and the second LSTM branch 512 (saliency map LSTM) may be concatenated to yield one final output desired viewpoint 124. In some embodiments, TimeDistributed layers may be applied to the second LSTM branch 512 to ensure that their parameters remain consistent over prediction steps. Information on further hyperparameters for model architecture and training may be found in Section 4 of Tan et al.
[0078] Fig. 6 is a schematic illustration of a method 600 of refining a viewpoint prediction model 128, according to embodiments of the present disclosure. According to some embodiments, the viewpoint prediction model 128 may be further refined or personalised to a user using reinforcement learning techniques.
[0079] According to some embodiments, method 600 may comprise step S60, wherein the outsider perspective image generated in step S8 of method 100 is presented on a display. In some embodiments, method 600 may comprise step S62, wherein one or more interactions may be received from a user adjusting the image presented in step S60. In some embodiments, method 600 may comprise step S64, wherein an adjusted viewpoint is determined based on the adjusted image in step S62. In some embodiments, method 600 may comprise step S66, wherein the viewpoint prediction model 128 is refined based on the adjusted viewpoint determined in step S64. In some embodiments, where the viewpoint prediction model 128 is trained using reinforcement learning techniques, the viewpoint prediction model 128 may be refined using techniques adapted from those disclosed in "Deep Reinforcement Learning for Page-wise Recommendations" by Zhao et.
al. (arXiv:1805.02343v2). In some embodiments, where the viewpoint prediction model 128 uses a classification or regression approach, the frequency of updating the viewpoint prediction model 128 may be set such that an additional iteration of training/updating the viewpoint prediction model 128 may be carried out after each usage episode by the user (e.g., after one or more journeys by the user).
[0080] Fig. 7 is a schematic illustration of a device for implementing some embodiments of the present disclosure, in accordance with embodiments of the present disclosure. Device 700 can be used, for example, for one or more steps of method 100, 200, 300, 400, 500, and/or 600. Device 700 can be a computer connected to a network. Device 700 can be a client or a server. As shown in Fig. 7, device 700 can be any suitable type of processor-based system, such as a personal computer, workstation, server, handheld computing device (portable electronic device) such as a phone or tablet, or an embedded system or other dedicated device. The device 700 can include, for example, one or more processors 710, one or more memories 714, one or more input devices 720, one or more output devices 730, a graphical user interface (GUI) 734, storage 740, and communication device 760.
Input device 720 and/or output device 730 can generally either be connectable or integrated with the device 700. In some embodiments, device 700 may further comprise a plurality of sensors 770 positioned to capture a plurality of images representative of a 360-degree view surrounding an object, wherein the one or more processors 710 are configured to perform one or more steps of method 100, 200, 300, 400, 500, and/or 600 to generate an outsider perspective image based on the plurality of images with depth information that capture the surroundings of the object.
[0081] Input device 720 can be any suitable device that provides input, such as a touch screen, keyboard or keypad, microphone, mouse, gesture recognition component of a virtual/augmented reality system, or audio- or voice-recognition device. Output device 730 can be or include any suitable device that provides output, such as a display, touch screen, haptics device, virtual/augmented reality display, or speaker.
[0082] Storage 740 can be any suitable device that provides storage, such as an electrical, magnetic, or optical memory including a RAM, cache, hard drive, removable storage disk, or other non-transitory computer readable medium. Communication device 760 can include any suitable device capable of transmitting and receiving signals over a network, such as a network interface chip or device. The components of the device 700 can be connected in any suitable manner, such as via a physical bus or wirelessly.
[0083] Processor(s) 710 can be any suitable processor or combination of processors, including any of or any combination of a central processing unit (CPU), graphics processing unit (GPU), field programmable gate array (FPGA), and application-specific integrated circuit (ASIC). Software 750, which can be stored in storage 740 and executed by one or more processors 710, can include, for example, the programming that embodies the functionality or portions of the functionality of the present disclosure (e.g., as embodied in the devices or methods as described above). For example, software 750 can include one or more programs for execution by one or more processor(s) 710 for performing one or more of the steps of method 100.
[0084] Software 750 can also be stored and/or transported within any non-transitory computer-readable storage medium for use by or in connection with an instruction execution system, apparatus, or device, such as those described above, that can fetch instructions associated with the software from the instruction execution system, apparatus, or device and execute the instructions. In the context of this disclosure, a computer-readable storage medium can be any medium, such as storage 740, that can contain or store programming for use by or in connection with an instruction execution system, apparatus, or device.
[0085] Software 750 can also be propagated within any transport medium for use by or in connection with an instruction execution system, apparatus, or device, such as those described above, that can fetch instructions associated with the software from the instruction execution system, apparatus, or device and execute the instructions. In the context of this disclosure, a transport medium can be any medium that can communicate, propagate or transport programming for use by or in connection with an instruction execution system, apparatus, or device. The transport computer readable medium can include, but is not limited to, an electronic, magnetic, optical, electromagnetic, or infrared wired or wireless propagation medium.
[0086] Device 700 may be connected to a network, which can be any suitable type of interconnected communication system. The network can implement any suitable communications protocol and can be secured by any suitable security protocol. The network can comprise network links of any suitable arrangement that can implement the transmission and reception of network signals, such as wireless network connections, T1 or T3 lines, cable networks, DSL, or telephone lines.
[0087] Device 700 can implement any operating system suitable for operating on the network. Software 750 can be written in any suitable programming language, such as C, C++, Java, or Python. In various embodiments, application software embodying the functionality of the present disclosure can be deployed in different configurations, such as in a client/server arrangement or through a Web browser as a Web-based application or Web service, for example.
[0088] Fig. 8 is a schematic illustration of an arrangement of a plurality of sensors 804 placed on a vehicle 800, in accordance with embodiments of the present disclosure. In some embodiments, the plurality of sensors 804 may be placed on a rooftop, front, and rear section of the vehicle 800. In some embodiments, the plurality of sensors 804 may comprise a first plurality of sensors 808 placed along a perimeter of the rooftop section of the vehicle 800, a second plurality of sensors 812 placed on a front section of the vehicle 800, and a third plurality of sensors 816 placed on a rear section of the vehicle 800.
[0089] According to some embodiments, the first plurality of sensors 808 placed along a perimeter of the rooftop section of the vehicle may have a distance of between 0.05 to 0.3 m between sensors, preferably a distance of between 0.2 to 0.3 m between sensors positioned at sides of the rooftop and a distance of between 0.05 to 0.1 m between sensors positioned at corners of the rooftop. In some embodiments, the first plurality of sensors 808 may have heights of between 0 to 10 cm from a rooftop surface of the vehicle 800, wherein the sensors preferably have the same height. In some embodiments, the first plurality of sensors 808 may have pitch angles of between -20 and -25 degrees. In some embodiments, the first plurality of sensors 808 may have yaw angles of between 0 and 360 degrees. The yaw angle as used within the disclosure pertains to the sensor's vertical axis rotation and determines its left and right movement. The vehicle's centre may be the reference point for determining yaw angles, wherein 0 degrees corresponds to the front of the vehicle, 90 degrees corresponds to the right of the vehicle, 180 degrees corresponds to the rear of the vehicle, and 270 degrees corresponds to the left of the vehicle. Preferably, the yaw angles of sensors located on straight or flat regions of the rooftop have a straight-ahead perspective corresponding to yaw angles close to 0, 90, 180, and 270 degrees depending on the position on the rooftop and the direction each sensor is facing, and sensors located on vertices or corners of the rooftop have a slight deviation, with an angle range of 20 to 30 degrees deviating from the yaw angles of 0, 90, 180, and 270 degrees.
[0090] According to some embodiments, the second plurality of sensors 812 placed on a front section of the vehicle may be positioned within an area between headlights and a front bumper of the vehicle, wherein the sensors preferably have the same height. In some embodiments, the second plurality of sensors 812 may have a distance of between 0.1 to 0.2 m between sensors, and preferably a uniform distance of between 0.1 to 0.2 m between sensors. In some embodiments, the second plurality of sensors 812 may have pitch angles between -20 and -25 degrees. In some embodiments, the second plurality of sensors 812 may have yaw angles between 0 and 30 degrees. Preferably, the yaw angles of sensors lying along a straight line may vary between 0 and 10 degrees, and the yaw angles of sensors towards the corner may vary between 20 and 30 degrees in either direction depending on where the camera is located.
[0091] According to some embodiments, the third plurality of sensors 816 placed on a rear section of the vehicle may be positioned within an area between brake lights and a rear bumper of the vehicle, wherein the sensors preferably have the same height. In some embodiments, the third plurality of sensors 816 may have a distance of between 0.1 to 0.2 m between sensors, and preferably a uniform distance of between 0.1 to 0.2 m between sensors. In some embodiments, the third plurality of sensors 816 may have pitch angles between -20 and -25 degrees. In some embodiments, the third plurality of sensors 816 may have yaw angles between 180 and 220 degrees. Preferably, the yaw angles of sensors lying along a straight line may vary between 180 and 190 degrees, and the yaw angles of sensors towards the corner may vary between 200 and 210 degrees in either direction depending on where the camera is located.
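For illustration only, the sensor-placement ranges of paragraphs [0089] to [0091] can be collected into a simple configuration structure such as the following; the field names are assumptions and the values are the ranges stated above.

```python
# Hedged sketch: sensor-placement configuration mirroring the ranges described above.
SENSOR_LAYOUT = {
    "rooftop": {
        "spacing_m": {"sides": (0.2, 0.3), "corners": (0.05, 0.1)},  # preferred spacings
        "height_above_roof_cm": (0, 10),
        "pitch_deg": (-25, -20),
        "yaw_deg": (0, 360),
    },
    "front": {
        "area": "between headlights and front bumper",
        "spacing_m": (0.1, 0.2),
        "pitch_deg": (-25, -20),
        "yaw_deg": (0, 30),
    },
    "rear": {
        "area": "between brake lights and rear bumper",
        "spacing_m": (0.1, 0.2),
        "pitch_deg": (-25, -20),
        "yaw_deg": (180, 220),
    },
}
```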
[0092] Finally, the language used in the specification has been principally selected for readability and instructional purposes, and it may not have been selected to delineate or circumscribe the inventive subject matter. It is therefore intended that the scope of the invention be limited not by this detailed description, but rather by any claims that issue on an application based hereon. Accordingly, the embodiments of the present disclosure are intended to be illustrative, but not limiting, of the scope of the invention, which is set forth in the following claims.