The present application claims priority from U.S. patent application Ser. No. 63/418,050 (the "'050 application"), entitled "Scene Segmentation for Mobile," filed on October 21, 2022 (Attorney Docket No. INNOPEAK-1022-099-P), the entire contents of which are incorporated herein by reference for all purposes.
Detailed Description
The invention relates to a graphics rendering system and method. According to a specific embodiment, the invention provides a method and a system for utilizing world space sampling clusters, programmatic instance culling masks, and instance culling clusters. The present invention may be configured for real-time applications (RTA), such as video conferencing, video gaming, virtual reality (VR), and extended reality (XR) applications. Still other embodiments exist.
Mobile and XR applications typically need to address various limitations in graphics processing. These limitations can impact the visual quality and performance of the application and present challenges to providing a smooth and attractive user experience. One limitation is the limited processing power of the mobile device. Most smartphones and tablet computers are equipped with relatively small, power-efficient processors that, unlike desktop processors, are not designed for intensive graphics processing tasks. Thus, mobile applications often need to render graphics efficiently using optimized algorithms and techniques to avoid overloading the device hardware. Another limitation is the limited memory and storage space of the mobile device. Most smartphones and tablet computers have limited and shared memory, which may limit the complexity and level of detail of the graphics that can be rendered and may require mobile applications to use efficient data structures and algorithms to minimize the memory and storage space they occupy. A third limitation is the limited battery life of the mobile device. Graphics processing can drain the device battery significantly, so mobile applications must be designed to minimize their impact on battery life. This may involve techniques such as reducing the frame rate, using lower-resolution graphics, or disabling certain graphics functions when they are not needed. Another limitation is the limited bandwidth and connectivity of mobile networks. Many mobile applications rely on network connections to access data and resources, but mobile networks can be slow and unreliable, especially in poorly covered areas. This may affect the performance of graphics-intensive applications and may require techniques such as data compression and caching to minimize the amount of data that must be transmitted over the network. To provide a smooth and attractive user experience, mobile applications must be carefully designed with these limitations fully in mind and must render graphics efficiently and effectively using optimized algorithms and techniques.
It should be appreciated that embodiments of the present invention provide techniques involving, but not limited to, spatial partitioning and forward rendering to improve graphics rendering on mobile devices. Spatial partitioning is a computer graphics technique that improves rendering performance by partitioning a three-dimensional space into smaller, more manageable areas. In real-time rendering, this is achieved by organizing objects in a scene into a data structure that allows for efficient spatial querying by traversing only the geometry relevant to the current query. One data structure used for spatial partitioning in forward rendering is a uniform spatial grid. The uniform spatial grid divides the three-dimensional space into a regular grid of cells, each cell containing a list of the objects intersecting it. At rendering time, the camera's view frustum is used to determine which cells are in view, and only the objects in these cells are added to the list of objects that need to be rendered. This avoids the need to traverse the entire scene and can significantly reduce the number of objects that need to be considered for rendering. Another data structure commonly used for spatial partitioning in forward rendering is a binary space partitioning (BSP) tree. The BSP tree organizes the objects in the scene into a hierarchical tree structure, with each node representing a plane that divides the space into two regions. At rendering time, the camera's view frustum is used to determine which nodes are in view, and only these nodes are traversed to find the objects that need to be rendered. This allows more efficient culling of objects outside the view frustum and may further reduce the number of objects that need to be considered for rendering. In addition to improving rendering performance, spatial partitioning may also be used for other tasks such as visibility determination, collision detection, and illumination calculation.
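As a minimal illustration of the uniform spatial grid described above, the following C++ sketch maps object bounding boxes into grid cells and gathers the objects contained in a given set of visible cells. The structure and names (UniformGrid, cellIndex, axisCell, gather) are hypothetical and simplified for clarity; a production implementation would also perform the frustum-versus-cell test that produces the list of visible cells.

    #include <vector>
    #include <unordered_set>

    struct Vec3 { float x, y, z; };
    struct AABB { Vec3 min, max; };

    // Hypothetical uniform spatial grid: the scene bounds are split into
    // nx * ny * nz equally sized cells, each holding the indices of the
    // objects whose bounding boxes overlap that cell.
    struct UniformGrid {
        AABB bounds;
        int nx, ny, nz;
        std::vector<std::vector<int>> cells;   // one object-index list per cell

        UniformGrid(const AABB& b, int x, int y, int z)
            : bounds(b), nx(x), ny(y), nz(z), cells(size_t(x) * y * z) {}

        int cellIndex(int ix, int iy, int iz) const {
            return (iz * ny + iy) * nx + ix;
        }

        // Clamp a world-space position to a cell coordinate on one axis.
        static int axisCell(float p, float lo, float hi, int n) {
            float t = (p - lo) / (hi - lo);
            int i = int(t * n);
            return i < 0 ? 0 : (i >= n ? n - 1 : i);
        }

        // Insert an object by adding it to every cell its AABB overlaps.
        void insert(int objectId, const AABB& box) {
            int x0 = axisCell(box.min.x, bounds.min.x, bounds.max.x, nx);
            int x1 = axisCell(box.max.x, bounds.min.x, bounds.max.x, nx);
            int y0 = axisCell(box.min.y, bounds.min.y, bounds.max.y, ny);
            int y1 = axisCell(box.max.y, bounds.min.y, bounds.max.y, ny);
            int z0 = axisCell(box.min.z, bounds.min.z, bounds.max.z, nz);
            int z1 = axisCell(box.max.z, bounds.min.z, bounds.max.z, nz);
            for (int iz = z0; iz <= z1; ++iz)
                for (int iy = y0; iy <= y1; ++iy)
                    for (int ix = x0; ix <= x1; ++ix)
                        cells[cellIndex(ix, iy, iz)].push_back(objectId);
        }

        // Gather the unique objects contained in the cells determined to be
        // inside the view frustum (the frustum-versus-cell test is omitted).
        std::vector<int> gather(const std::vector<int>& visibleCells) const {
            std::unordered_set<int> unique;
            for (int c : visibleCells)
                for (int id : cells[c]) unique.insert(id);
            return std::vector<int>(unique.begin(), unique.end());
        }
    };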
Forward+ rendering is a computer graphics technique that improves the performance and visual quality of real-time rendering. Forward+ rendering is an extension of the traditional forward rendering method, which renders each object in a scene individually by fully shading each pixel covered by the object. Forward+ rendering uses a minimal pre-processing pass (prepass) to generate at least a screen space depth buffer before the full rendering pass. This depth buffer is used to skip the more costly main shaders for pixels that are occluded, i.e., pixels whose depth values are farther than the values stored in the depth buffer. The Forward+ rendering pipeline may be further extended by dividing the three-dimensional space into smaller regions (e.g., a grid or tree structure) using spatial division techniques. The light sources in the scene are grouped by a screen space subdivision rather than a world space subdivision. A group of pixels (hereinafter referred to as a "grid cell") is associated with a number of light sources attached to those pixels. During rendering, each pixel queries only the light sources associated with its assigned subdivision, rather than the complete list of light sources in the scene. This allows for efficient culling of objects outside the view frustum and reduces the number of objects that need to be considered for rendering. Once the objects in the field of view are determined, a light source culling step is used to generate a list of light sources for each grid cell, taking into account only the light sources that affect those objects. This further reduces the number of light sources that need to be processed and may save computational resources. Next, a light source rendering step calculates the illumination of each object. This involves selecting one light source from the list of light sources, evaluating the effect of that light source on the surface of the object, accumulating the resulting color, and repeating this process for other light sources as needed. This step may be performed in parallel on the GPU, allowing multiple light sources to be processed efficiently at once. Finally, Forward+ rendering applies the calculated illumination to the object in a shading step and generates the final image. This step may also be performed on the GPU, allowing real-time rendering of the scene.
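The per-grid-cell light list described above may be summarized by the following simplified C++ sketch, which assigns point lights to screen space tiles by testing each light's projected circle of influence against each tile. The names (ScreenLight, TileLightLists, assignLightsToTiles) are hypothetical; an actual Forward+ implementation would typically perform this step in a compute shader and would additionally use the depth prepass to bound each tile's depth range.

    #include <vector>
    #include <cstdint>

    struct ScreenLight {
        float cx, cy;     // projected light center in pixels
        float radiusPx;   // projected radius of influence in pixels
        uint32_t id;
    };

    struct TileLightLists {
        int tileSize;                             // e.g., 16x16 pixels per tile
        int tilesX, tilesY;
        std::vector<std::vector<uint32_t>> lists; // one light-id list per tile
    };

    // Build a per-tile light list: each tile records only the lights whose
    // screen-space circle of influence overlaps that tile. At shading time a
    // pixel looks up its tile and iterates over that short list instead of
    // over every light in the scene.
    TileLightLists assignLightsToTiles(const std::vector<ScreenLight>& lights,
                                       int width, int height, int tileSize) {
        TileLightLists out;
        out.tileSize = tileSize;
        out.tilesX = (width + tileSize - 1) / tileSize;
        out.tilesY = (height + tileSize - 1) / tileSize;
        out.lists.resize(size_t(out.tilesX) * out.tilesY);

        for (const ScreenLight& l : lights) {
            int tx0 = int((l.cx - l.radiusPx) / tileSize);
            int tx1 = int((l.cx + l.radiusPx) / tileSize);
            int ty0 = int((l.cy - l.radiusPx) / tileSize);
            int ty1 = int((l.cy + l.radiusPx) / tileSize);
            for (int ty = ty0; ty <= ty1; ++ty) {
                if (ty < 0 || ty >= out.tilesY) continue;
                for (int tx = tx0; tx <= tx1; ++tx) {
                    if (tx < 0 || tx >= out.tilesX) continue;
                    out.lists[size_t(ty) * out.tilesX + tx].push_back(l.id);
                }
            }
        }
        return out;
    }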
Forward+ rendering with a list of screen space light sources provides a number of advantages over conventional forward rendering. It allows more efficient culling of objects and light sources, which may improve rendering performance and reduce memory usage. It also allows more light sources to be used in the scene, which may improve the visual quality of the illumination. Furthermore, parallel processing on the GPU allows complex scenes to be rendered in real time. However, there are also some limitations and trade-offs with this rendering architecture. One disadvantage is the overhead of building and maintaining the spatially partitioned data structure. For dynamic scenes with constantly changing object positions, this cost may be particularly high, and if performed inefficiently, the overall rendering performance may suffer. Another problem is the choice of data structure and cell size. The optimal data structure and cell size depend on the characteristics of the scene and the desired performance, and finding the correct balance can be challenging. Depth pre-processing and light source lists are valuable techniques for improving real-time rendering performance and visual quality. However, these methods are not directly applicable to ray tracing, because ray traced rendering effects rely on the entire scene rather than only the visible geometry. The use of a list of screen space light sources in a path tracer can result in a large amount of illumination data being lost during rendering, creating artifacts or other problems in the rendered output.
It should be appreciated that embodiments of the present invention, as described in further detail below, efficiently implement scene segmentation and forward rendering techniques for mobile applications.
The following description is presented to enable one of ordinary skill in the art to make and use the invention and is provided in the context of a particular application. Various modifications and various uses in different applications will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to a wide variety of embodiments. Thus, the present invention should not be limited to the embodiments described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
In the following detailed description, numerous specific details are set forth in order to provide a more thorough understanding of the invention. However, it will be apparent to one skilled in the art that the present invention may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form, rather than in detail, in order to avoid obscuring the present invention.
The reader of this disclosure should note that all documents and papers filed concurrently with this specification, and all documents open to public inspection with this specification, are incorporated herein by reference. All the features disclosed in this specification (including any accompanying claims, abstract, and drawings) may be replaced by alternative features serving the same, equivalent, or similar purpose, unless expressly stated otherwise. Thus, unless expressly stated otherwise, each feature disclosed herein is one example only of a generic series of equivalent or similar features.
Furthermore, any element in a claim that does not explicitly state "a means for" or "a step for" performing a particular function should not be construed as a "means" or "step" clause as specified in 35 U.S.C. § 112, sixth paragraph. In particular, the use of "step of" or "act of" in the claims herein is not intended to invoke the provisions of 35 U.S.C. § 112, sixth paragraph.
Note that the terms left, right, front, rear, top, bottom, forward, rearward, clockwise and counterclockwise, if used, are used for convenience only and are not intended to imply any particular fixed orientation. Rather, these are used to reflect the relative position and/or orientation between various portions of the object.
Fig. 1 is a simplified schematic diagram of a mobile device 100 for performing graphics rendering using scene segmentation. The diagram is merely an example, which should not unduly limit the scope of the claims. Those of ordinary skill in the art will recognize many variations, alternatives, and modifications.
As shown, mobile device 100 may be configured within a housing 110 and may include a camera device 120 (or other image or video capture device), a processor device 130, a memory 140 (e.g., volatile memory storage), and a storage 150 (e.g., persistent memory storage). The camera 120 may be mounted on the housing 110 and configured to capture an input image. The input image may be stored in the memory 140, which may include a random-access memory (RAM) device, an image/video buffer device, a frame buffer, and the like. Various software, executable instructions, and files may be stored in the storage 150, which may include a read-only memory (ROM), a hard disk, and the like. Processor 130 may be coupled to each of the components described above and configured to communicate between these components.
In a specific example, the processor 130 includes a central processing unit (CPU), a network processing unit (NPU), and the like. The device 100 may also include a graphics processing unit (GPU) 132 that is coupled to at least the processor 130 and the memory 140. In an example, memory 140 is configured to be shared between processor 130 (e.g., a CPU) and GPU 132 and is configured to hold data used by an application program when running. Because the memory 140 is shared, it must be used efficiently. For example, high memory usage by GPU 132 may negatively impact system performance.
The device 100 may also include a user interface 160 and a network interface 170. The user interface 160 may include a display area 162 configured to display text, images, video, rendered graphics, interactive elements, and the like. The display area 162 may be connected to GPU 132 and may also be configured to display at a refresh rate of at least 24 frames per second. The display area 162 may include a touch screen display (e.g., in a mobile device, tablet, etc.). The user interface 160 may also include a touch interface 164 for receiving user input (e.g., a keyboard or keypad in a mobile device, notebook computer, or other computing device). The user interface 160 may be used for real-time applications (RTA) such as multimedia streaming, video conferencing, navigation, video gaming, etc.
The network interface 170 may be configured to send and receive instructions and files for graphics rendering (e.g., using Wi-Fi, Bluetooth, Ethernet, etc.). In a particular example, the network interface 170 may be configured to compress or downsample an image for transmission or further processing. The network interface 170 may be configured to send one or more images to a server for OCR. The processor 130 may be connected to and configured to communicate between the user interface 160, the network interface 170, and/or other interfaces.
In an example, processor 130 and GPU 132 may be configured to perform steps of rendering video graphics, which may include steps related to executable instructions stored in the storage 150. The processor 130 may be configured to execute the application instructions and generate a plurality of graphics data related to a 3D scene comprising at least a first object. The plurality of graphics data may include a plurality of vertex data associated with a plurality of vertices (e.g., of each object) in the 3D scene. GPU 132 may be configured to generate a plurality of primitive data using at least the plurality of vertex data and a vertex shader. The plurality of primitive data may include at least location data, etc.
In an example, GPU 132 may be configured to provide a position vector defining the center of the overall cluster data structure in three axes. GPU 132 may also be configured to provide a subdivision scalar that is used to divide the entire cluster data structure evenly along the three axes. GPU 132 may then be configured to create clusters by partitioning the bounding box using at least the subdivision scalar and the range of the overall data structure. GPU 132 may be configured to map geometric location data to associated clusters defined by cluster center locations. GPU 132 may also be configured to generate cluster data corresponding to the clusters by one or more compute shaders executing a thread for each cluster. Using the cluster data, GPU 132 may be configured to generate a culling mask for each object. Further, GPU 132 may be configured to render at least the first object (and any other objects) through a shader using at least the culling mask and the cluster data.
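For illustration only, the configuration and outputs described above might be represented by data structures such as the following C++ sketch; the type and field names (ClusterConfig, ClusterData, cullMask, etc.) are hypothetical assumptions and are not tied to any particular graphics API.

    #include <cstdint>
    #include <vector>

    struct Vec3f { float x, y, z; };

    // Hypothetical configuration of the world space cluster structure used by
    // GPU 132: a center position, a per-axis range, and a subdivision scalar
    // that splits the bounding box evenly along the three axes.
    struct ClusterConfig {
        Vec3f center;        // center of the overall cluster data structure
        Vec3f range;         // half-extent on each axis ({0,0,0} = whole scene)
        int   subdivisions;  // even subdivision count applied on each axis
    };

    // Hypothetical per-cluster payload generated by the compute shaders
    // (one thread per cluster), later consumed by the rendering shaders.
    struct ClusterData {
        std::vector<uint32_t> lightsByImportance; // light indices, most important first
        uint8_t               cullMask;           // instance culling bits for this cluster
    };

    // Hypothetical per-object output of the culling pass.
    struct ObjectCulling {
        uint32_t objectId;
        uint8_t  cullMask;   // which clusters/segments this object may intersect
    };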
To execute the described rendering pipeline, mobile device 100 includes hardware components that need to be configured and optimized to meet the requirements of such rendering techniques. First, the CPU of the device needs to be powerful enough and energy efficient enough to handle the computation and data processing required for Forward+ rendering. This typically involves the use of a high performance CPU with multiple cores and threads, as well as specialized instructions and techniques, such as SIMD and out-of-order execution, to maximize performance and efficiency.
Second, the GPU of the device needs to be able to execute the complex shaders and algorithms used in the described rendering pipeline. This typically involves using a high performance GPU with a large number of compute units and fast memory bandwidth, and supporting advanced graphics APIs and functions, such as OpenGL ES 3.0 and compute shaders.
Third, the memory and storage subsystem of the device needs to be large enough and fast enough to support the data structures and textures used to store the spatial grid cells containing the data required for rendering. This typically involves the use of high capacity and high speed memory and storage technologies such as DDR4 RAM and UFS 2.0 or NVMe storage, as well as efficient data structures and algorithms to minimize memory and storage usage.
Fourth, the display and touch screen of the device need to have high resolution and fast refresh rates to support the visual quality and interactivity of real-time rendering. This typically involves the use of high resolution and high refresh rate displays, as well as low latency and high precision touch screens to achieve smooth and responsive interaction.
The battery and power management subsystem of the device needs to be able to support the power requirements of real-time rendering. This typically involves the use of high capacity batteries and efficient power management techniques, such as smart charging and power saving modes, to ensure that the device can operate for long periods of time without draining power.
To meet the demands of the rendering pipeline described herein, mobile device 100 possesses powerful and energy-efficient hardware components, including a high-performance CPU, GPU, memory and storage subsystem, display and touch screen, and battery and power management subsystem. These components need to be optimized and configured to support the Forward+ rendering requirements and to achieve smooth and attractive visual effects and interactions.
Other embodiments of the present system include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the steps of the methods. Details of the method are further discussed below with reference to the accompanying drawings.
Fig. 2 is a simplified flow diagram of a conventional forward pipeline 200 for rendering video graphics. As shown, the forward pipeline 200 includes a vertex shader 210 followed by a fragment shader 220. In the forward pipeline rendering process, the CPU provides graphics data (e.g., from memory, storage devices, networks, etc.) of the 3D scene to a graphics card or GPU. In the GPU, vertex shader 210 converts objects in the 3D scene from object space to screen space. This process involves projecting the geometry of each object and decomposing it into vertices, which are then transformed and divided into fragments or pixels. In the fragment shader 220, the pixels are shaded (e.g., color, illumination, texture, etc.) and then passed to a display (e.g., a screen of a smartphone, tablet, VR goggles, etc.). When lighting is applied, the rendering effect is processed for each light source, for each vertex, and for each fragment in the scene.
Fig. 3 is a simplified flow diagram of a conventional deferred pipeline 300 for rendering video graphics. Here, the preprocessing pass 310 involves receiving graphics data of a 3D scene from a CPU and generating a G-buffer containing data required for subsequent rendering passes, such as color, depth, normals, etc. The ray traced reflection pass 320 involves processing the G-buffer data to determine the reflections in the scene, while the ray traced shadow pass 330 involves processing the G-buffer data to determine the shadows in the scene. Then, the denoising pass 340 removes noise from the ray traced pixels. In the main rendering pass 350, reflections, shadows, and texture evaluation are combined to produce a rendered output with a color for each pixel. In the post-processing pass 360, the rendered output may undergo additional rendering processing, such as color correction, depth of field, etc.
The deferred pipeline 300 reduces the total number of fragments compared to the forward pipeline 200 by processing the rendering effects only for non-occluded pixels. This is achieved by decomposing the rendering process into multiple phases (i.e., passes), wherein the colors, depths, and normals of objects in the 3D scene are written to separate buffers, which are then combined to produce the final rendered frame. Subsequent passes use the depth values to skip rendering of occluded pixels when executing more complex illumination shaders. The deferred rendering pipeline approach reduces the complexity of any single shader compared to the forward rendering pipeline approach, but the multiple rendering passes require greater memory bandwidth, which is particularly problematic in modern mobile architectures where memory is limited and shared.
According to an example, the present invention provides a method and system for world space sampling clusters. The spatial cluster system is configured with a vector defining the range of the cluster data structure. The default value is {0, 0, 0}, which uses the bounding box of the entire scene and is independent of the camera position. If set to a non-zero value, a spatial cluster structure is created centered on the camera, with boundaries corresponding to the configured range value on each axis.
Once the boundaries are established, the scene is further divided into smaller clusters based on a user-configured vector that specifies the number of subdivisions on each axis. The default value is {0, 0, 0}, which allows the number of clusters and the boundaries of each cluster to be set manually (as will be further explained below). Each cluster contains data for accelerating the rendering of points within the cluster. These data are generated using a compute shader that executes a thread for each cluster (e.g., the shaders shown in Figs. 2 and 3).
Any data that is upsampled and can be used by multiple points simultaneously should be stored in these clusters. In this example, to support a more generic spatial data structure, light-source-specific optimizations are sacrificed, and therefore a list of light sources ordered by importance is stored rather than cut points in a hierarchical light source tree. ReSTIR reservoirs may also be stored in the clusters, owing to the seamless integration of ReSTIR with light source importance sampling. Cluster culling masks and environment map visibility cut-point data may also be stored in the clusters.
Finally, at run time, the cluster data is loaded once to shade a hit point based on its location. Depending on the application, if the memory footprint of a hash map and the complexity of implementing an efficient hash scheme are manageable, a hash-based lookup may be considered. In this case, however, the hit point locations are divided into their associated clusters according to the previously described range and subdivisions. If the subdivision vector is {0, 0, 0}, then cluster IDs are derived from the instance culling mask, enabling clustering methods that rely on non-spatial properties (e.g., texture parameters).
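The hit-point-to-cluster mapping described above can be sketched in C++ as follows, assuming the hypothetical ClusterConfig structure sketched earlier (center, half-extent range, and subdivision scalar) and showing the instance-culling-mask fallback for the zero-subdivision case; the function name clusterIdFromPosition is likewise hypothetical.

    #include <cstdint>

    // Assumes the hypothetical ClusterConfig and Vec3f sketched above and a
    // non-zero range on each axis. Maps a world-space hit point to its
    // cluster index.
    inline int clusterIdFromPosition(const ClusterConfig& cfg,
                                     const Vec3f& hit,
                                     uint8_t instanceCullMask) {
        // Zero subdivisions: fall back to a non-spatial clustering keyed off
        // the instance culling mask (bit position of the lowest set bit).
        if (cfg.subdivisions == 0) {
            for (int bit = 0; bit < 8; ++bit)
                if (instanceCullMask & (1u << bit)) return bit;
            return 0;
        }
        int n = cfg.subdivisions;
        auto axisCell = [n](float p, float center, float halfRange) {
            float t = (p - (center - halfRange)) / (2.0f * halfRange);
            int i = int(t * n);
            return i < 0 ? 0 : (i >= n ? n - 1 : i);  // clamp distant hits to edge cells
        };
        int ix = axisCell(hit.x, cfg.center.x, cfg.range.x);
        int iy = axisCell(hit.y, cfg.center.y, cfg.range.y);
        int iz = axisCell(hit.z, cfg.center.z, cfg.range.z);
        return (iz * n + iy) * n + ix;
    }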
Embodiments of such a generic spatial clustering system may be used to segment a scene in order to selectively load segment data and share data within a given segment. For small scenes with few clusters, the memory footprint and the number of compute dispatches required to update the clusters for rendering are small, although they vary with the number of subdivisions. The low minimum cost of such a simple spatial clustering system makes it suitable for mobile devices. Compared with screen space clustering methods, this method introduces a dependency between scene scale and performance, but the world space locality it provides is more meaningful for ray tracing than the projective locality provided by screen space clustering.
Furthermore, the spatial clusters can be reused for various sampling improvements by avoiding partitioning choices that are specific to any single sampling optimization. While this compromises the quality of any individual sampling optimization, it decouples the build and update times of the data structure from the number of optimizations implemented. This is particularly important when the grid cells are camera-centered, as each camera movement requires an update.
In general, the culling mask helps reduce the cost of initializing inline ray tracing by reducing the proportion of the acceleration structure that is loaded into memory and traversed. Only the instances in the acceleration structure that match the culling mask will be loaded. According to an example, the present invention provides a method of programmatically setting instance culling masks based on different heuristics. Five example methods are discussed below. Since the ray tracing requirements of each scene vary with the scene content, these methods can be tested during development and the most appropriate method for the scene selected by inspection. For a given scene, the selected segmentation method may be computed programmatically at load time to reduce the computation cost at run time. In the following figures, the coordinate frame used for these descriptions has the Y axis as the upward direction, the X axis running from right to left across the screen, and the Z axis pointing out of the screen.
In a specific example, there are two important culling mask values, which will be referred to as the cull-to and cull-from masks. The cull-to mask refers to a mask value that is packed with the instance data built into the acceleration structure. These instances are intersected only by rays whose mask value, after a boolean AND operation with the cull-to mask, is non-zero. The cull-from mask refers to a mask value that is packed with the instance data directly accessed by the runtime shader code. The cull-from mask provides the mask value for rays issued from hit points on the corresponding instance. An instance may have different cull-to and cull-from mask values. In the following examples, the values are numbered from 1 to 8, corresponding to which bit is set on the 8-bit culling mask, with all other bits being zero. Of course, other variations, modifications, and alternatives are also possible.
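A minimal C++ sketch of how these two mask values might be used is shown below; the names (InstanceCulling, cullTo, cullFrom, rayMask) are hypothetical, and the AND test mirrors the standard instance-mask behavior of hardware ray tracing APIs.

    #include <cstdint>

    // Hypothetical per-instance culling data.
    struct InstanceCulling {
        uint8_t cullTo;    // packed into the acceleration structure build
        uint8_t cullFrom;  // read by runtime shader code at hit points
    };

    // An instance is considered for intersection only if the ray's mask ANDed
    // with the instance's cull-to mask is non-zero.
    inline bool rayIntersectsInstance(uint8_t rayMask, const InstanceCulling& inst) {
        return (rayMask & inst.cullTo) != 0;
    }

    // Secondary rays issued from a hit point on an instance take that
    // instance's cull-from mask as their ray mask.
    inline uint8_t secondaryRayMask(const InstanceCulling& inst) {
        return inst.cullFrom;
    }

    // Helper: the mask value "numbered k" (1..8) corresponds to bit k-1 set.
    inline uint8_t maskFromNumber(int k) {
        return uint8_t(1u << (k - 1));
    }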
The spatial grid is configured with a range vector and a center vector. By default, these are filled from the scene center and scene range, but the center may also be set to the camera position. For all of the grids below, the range is divided by the number of segments on a given axis to determine the boundary locations between segments. The internal boundaries are placed computationally, but for a grid with outer faces, these faces are assumed to extend to encompass the rest of the scene. A boolean flag (e.g., traceLocalOnly) may be enabled if the grid is intended to capture only instances that are strictly within range. This functionality is applicable to scenes in which distant objects contribute little to the ray tracing effects.
According to an example, the present invention provides a method of assigning scene instances to segments based only on instance transforms (i.e., spatial-only segmentation). An instance overlapping a given segment will be assigned to that segment. An instance overlapping multiple segments will be assigned to all overlapping segments by setting the bits corresponding to those segments to 1. For spatial-only segmentation, the cull-to and cull-from masks are always equal.
Fig. 4A is a simplified schematic diagram of a spatial-only equal-division method according to an embodiment of the invention. The most straightforward way to divide a scene is to divide it into equal parts. This is the same approach employed by the world space clusters: given a list of instances, the boundary of the scene is calculated and then divided into equal-sized volumes. For an 8-bit instance culling mask, the number of volumes must be limited to eight. As shown in diagram 401, the most straightforward approach is to use two layers of four segments (a uniform 2x2x2 grid), each defined by an axis-aligned bounding box (AABB). The AABBs allow efficient comparison of the boundaries with each instance for allocation into segments.
However, most game levels (e.g., RPG or FPS levels) constrain movement to a single plane, and the scene data is grouped much more tightly in the Y direction, so there is little need to subdivide objects according to their Y position. Fig. 4B is a simplified schematic diagram of a spatial-only single-plane segmentation method according to an embodiment of the present invention. As shown in diagram 402, the number of segments along the Z axis or X axis is doubled (a 2x1x4 grid). The axis may be selected according to which has the larger extent in the overall scene bounding box.
Finally, for scenes with large areas of transmissive and reflective elements, such as bodies of water, the scene may be partitioned using an X-Z plane. However, these scenes typically have more scene geometry above or below the water surface. Fig. 4C is a simplified schematic diagram of a spatial-only asymmetric segmentation method according to an embodiment of the present invention. As shown in diagram 403, only two of the segments are allocated to the Y subsection with fewer instances, and the remaining six segments are allocated to the Y subsection with more instances (an X-Z split, with 2x1 segments above and 3x2 segments below).
In a specific example, this approach can be further refined by placing the Y split at the Y value of the instance with the largest X-Z bounding area. Those of ordinary skill in the art will recognize other variations, modifications, and alternatives to these spatial-only segmentation methods (e.g., segmentation along different axes, different numbers and proportions of segments, etc.).
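As a simplified illustration of this spatial-only segmentation family (Figs. 4A-4C), the following C++ sketch assigns each instance a culling mask by testing its AABB against up to eight segment AABBs supplied by the caller (e.g., a 2x2x2 grid, a 2x1x4 grid, or an asymmetric split); the function names are hypothetical and the AABB type from the earlier sketch is assumed.

    #include <cstdint>
    #include <vector>

    // Assumes the AABB/Vec3 types sketched earlier. Each of the (at most
    // eight) segments is itself an AABB; outer segments are expected to be
    // extended so that together they cover the whole scene.
    inline bool overlaps(const AABB& a, const AABB& b) {
        return a.min.x <= b.max.x && a.max.x >= b.min.x &&
               a.min.y <= b.max.y && a.max.y >= b.min.y &&
               a.min.z <= b.max.z && a.max.z >= b.min.z;
    }

    // Spatial-only segmentation: an instance gets bit k set for every segment
    // k it overlaps. For this family of methods, cull-to and cull-from are
    // equal, so a single mask is returned.
    inline uint8_t spatialOnlyMask(const AABB& instanceBox,
                                   const std::vector<AABB>& segments) {
        uint8_t mask = 0;
        for (size_t k = 0; k < segments.size() && k < 8; ++k)
            if (overlaps(instanceBox, segments[k]))
                mask |= uint8_t(1u << k);
        return mask;
    }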
All of the previously described grids can be centered directly on the camera position and sized according to the configured range. However, another grid layout dedicated to camera-centered rendering is also presented herein. Thus, according to an example, the present invention provides a camera-centered segmentation method.
In general, objects close to the camera draw more attention and therefore must be rendered in higher detail than objects far away. The camera-centered grid takes advantage of this and distributes the grid cells according to the camera's transform. Fig. 4D is a simplified schematic diagram of a camera-centered segmentation method according to an embodiment of the present invention. As shown in diagram 404, four segments are allocated to cover a configurable maximum radius around the camera (labeled 1 through 4), with the remaining four segments (labeled 5 through 8) dividing all remaining instances in the scene into four quadrants.
Segment boundaries may create visible artifacts in indirect lighting, especially on mirror-like surfaces, which may show sharp cutoffs in the reflections between instances assigned to different segments. In an example, the grid is aligned at a 45 degree angle to the camera local space in the X-Z plane to reduce the visibility of these cutoff artifacts. This requires maintaining a separate "rotated view matrix" for transforming instances from world space to the rotated camera local space. Note that the bounding box of each instance must also be transformed correctly. The AABB comparison can then be performed normally in the rotated view space.
In an example, to avoid complicating the comparison against the non-square outer segments 5-8, all instances are first compared against the configured radius of the inner segments. Any instance that overlaps the radius may be added to an instance list that is tested only for inclusion in the inner segments, and all other instances may be added to another instance list that is tested only for inclusion in the outer segments. This eliminates the need to test the outer boundary during boundary testing of the two lists, simplifying the boundary test to a quadrant test.
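A simplified C++ sketch of the camera-centered assignment is given below; it assumes the instance center has already been transformed into the rotated camera local space described above and, for brevity, assigns each instance to the quadrant of its center rather than to every quadrant its bounding box overlaps. The names (cameraCenteredMask, innerRadius) are hypothetical.

    #include <cstdint>
    #include <cmath>

    // Assumes positions are already in the rotated camera-local space (camera
    // at the origin, grid rotated 45 degrees about Y to hide cutoff artifacts).
    // Bits 0..3 correspond to the four inner segments within the configured
    // radius; bits 4..7 correspond to the four outer quadrants that hold all
    // remaining instances.
    inline uint8_t cameraCenteredMask(float x, float z, float boundingRadius,
                                      float innerRadius) {
        // Quadrant index in the X-Z plane: 0..3, from the instance center.
        int quadrant = (x >= 0.0f ? 0 : 1) + (z >= 0.0f ? 0 : 2);

        float dist = std::sqrt(x * x + z * z);
        if (dist - boundingRadius <= innerRadius) {
            // Overlaps the inner radius: tested only against the inner segments.
            return uint8_t(1u << quadrant);
        }
        // Entirely outside the inner radius: assigned to an outer quadrant.
        return uint8_t(1u << (4 + quadrant));
    }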
According to an example, the present invention provides a method of material-based scene segmentation. These methods take advantage of the fact that some rays only need to intersect certain types of materials. For example, subsurface rays need only intersect materials with subsurface scattering, while shadow rays must ignore emissive materials to create shadows properly. These approaches also consider that the spatial properties of an instance can affect how relevant the instance is to the indirect illumination of different materials. For example, a perfectly smooth mirror may be expected to reflect distant objects, while a rough object is typically affected only by the indirect illumination of nearby objects.
In particular examples, the material classes may include thick subsurface, light source, alpha cutoff, smooth transmission, smooth reflection, rough object, center, large projection scale, and the like. Determining the class of a material may include logically examining properties such as subsurface scattering, index of refraction (IOR), emission, opacity, roughness, distance from center, center view size, etc. In addition, each material class may define its own cull-to and cull-from values, as described above.
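A simplified classification routine is sketched below in C++; the thresholds, the type and field names (MaterialDesc, classifyMaterial), and the mapping of classes to bits are hypothetical illustrations of the kind of logical property checks described above.

    #include <cstdint>

    // Hypothetical material description gathered from instance/material data.
    struct MaterialDesc {
        bool  subsurface;   // thick subsurface scattering
        bool  emissive;     // light source / emissive material
        bool  alphaCutoff;  // alpha-tested material
        float ior;          // index of refraction
        float opacity;      // 1.0 = fully opaque
        float roughness;    // 0.0 = perfectly smooth
    };

    // Hypothetical bit assignments for the material classes (one bit each).
    enum MaterialClassBits : uint8_t {
        kThickSubsurface    = 1u << 0,
        kLightSource        = 1u << 1,
        kAlphaCutoff        = 1u << 2,
        kSmoothTransmission = 1u << 3,
        kSmoothReflection   = 1u << 4,
        kRoughObject        = 1u << 5,
        // bits 6 and 7 are reserved for the spatial "center" and
        // "large projection scale" classes assigned separately.
    };

    // Assign material-class bits by logically examining material properties.
    // The 0.1 roughness threshold is an arbitrary illustrative value.
    inline uint8_t classifyMaterial(const MaterialDesc& m) {
        uint8_t mask = 0;
        if (m.subsurface)                            mask |= kThickSubsurface;
        if (m.emissive)                              mask |= kLightSource;
        if (m.alphaCutoff)                           mask |= kAlphaCutoff;
        if (m.opacity < 1.0f && m.roughness < 0.1f)  mask |= kSmoothTransmission;
        if (m.opacity >= 1.0f && m.roughness < 0.1f) mask |= kSmoothReflection;
        if (m.roughness >= 0.1f)                     mask |= kRoughObject;
        return mask;
    }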
Based on the spatial properties of each instance, a center class and a large projection scale class are assigned. Instances that have been assigned to other material culling groups may still be considered for assignment to both culling groups. They may be set with respect to a currently active camera or a user configured view projection matrix. The latter may be useful when the scene content distribution makes the ray tracing effect only noticeable at the static location of interest.
The center class is calculated based on the view space distance of each instance relative to the origin. When considering the large projection scale class, all instances assigned to the center class are excluded.
To calculate the projection scale of an instance, the projection matrix of the camera is multiplied by a modified view matrix that rotates the instance to the position (0, 0, d) in the modified view space, where d is the distance between the camera and the instance. Then, with each instance projected at the center of the view plane, the projected range of each instance's bounding box can be used as a surrogate indicator of the importance of the instance to the indirectly rendered appearance. If the projected X or Y range exceeds a user-configured threshold, the instance is added to the designated class (e.g., class 8 for an 8-bit mask).
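The projection scale test might be approximated as in the following C++ sketch, which reduces the projection of the re-centered instance to a perspective division of its bounding extents by the distance d; the function name (exceedsProjectionScale) and parameters are hypothetical, and proj00/proj11 stand for the X and Y scale terms of the camera projection matrix.

    // Simplified projection-scale test: the instance is treated as rotated
    // onto the view axis at its original distance d, so its projected extent
    // reduces to a perspective division of its bounding-box extent by d.
    // proj00 and proj11 are the X and Y scale terms of the camera projection
    // matrix; the threshold is user configured.
    inline bool exceedsProjectionScale(float boxExtentX, float boxExtentY,
                                       float distanceToCamera,
                                       float proj00, float proj11,
                                       float threshold) {
        if (distanceToCamera <= 0.0f) return true;  // instance at/behind camera
        float projectedX = (boxExtentX * proj00) / distanceToCamera;
        float projectedY = (boxExtentY * proj11) / distanceToCamera;
        // Add the instance to the "large projection scale" class (e.g., class
        // 8 of an 8-bit mask) when either projected range exceeds the threshold.
        return projectedX > threshold || projectedY > threshold;
    }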
The bandwidth consumption of ray tracing procedures (e.g., rayQuery initialization) is a major bottleneck for mobile ray tracing. For example, by using an instance culling mask to initialize a rayQuery with only 1/8 of the scene geometry, bandwidth consumption is significantly reduced. This achieves a 15-20 fps improvement on the mobile device when rendering at 1697x760 resolution on a test scene containing 35,000 triangles. The instance culling masks of this test scene were manually configured according to the appearance of the scene, specifically according to the layout and materials of the scene geometry.
The method for programmatically setting these culling masks replaces the cumbersome manual configuration step with a simple mode selector, allowing the user to select, by inspection, the segmentation mode that best suits their scene. Furthermore, embodiments of the method are the first to specify instance masks based on an instance's scale or position in the scene. In addition, the programmatic setting of the instance culling masks requires no additional manual adjustment by the user.
According to an example, the present invention provides a method of sharing scene segmentation data between the world space clusters and the instance culling system. In an example, both the spatial sampling clusters and the programmatic instance culling masks are managed by an overall scene segmentation manager. These two functions can be selectively linked, sacrificing the resolution of the spatial clusters and the flexibility of the instance culling masks in exchange for the performance improvement brought by greater reuse.
When linked, the maximum number of spatial clusters is limited by the bit width of the instance culling mask, i.e., eight clusters for an 8-bit mask. At such a low resolution, the spatial clusters are less capable of capturing high frequency illumination differences, and therefore optimizations based on the spatial clusters will not bring much improvement for scenes with hundreds of small light sources having distinct characteristics. However, most modern games are first or third person games, where the camera view is positioned at the player character and most scene objects are similar in scale to the player character; such scenes are covered well with only eight clusters.
On the other hand, when linked, the instance culling masks may only be able to assign categories using the spatial-only methods. Attempting to construct a bounding box from, for example, all rough objects in a scene will almost certainly result in a bounding box that overlaps other bounding boxes and spans a volume so large that the available data cannot usefully be shared within that volume.
As previously described, pre-computed boundaries may be used to override the programmatic subdivision of the scene boundaries. When the spatial sampling clusters are linked with the programmatic instance culling mask class, the instance culling mask class may generate bounding boxes for the active grid pattern and pass them to the spatial cluster class. Then, since the spatial clusters are identical to the instance culling classes, the bit offset of the cull-to value of the instance being shaded can be used directly to determine which spatial cluster contains the relevant data for improving the shading. Since each instance can only be associated with one spatial cluster, this constrains the programmatic instance culling mask so that each instance's cull-to value must be "one-hot", i.e., have only one non-zero bit. Thus, the previous methods can be extended to check the overlap between an instance and all candidate categories and assign the instance only to the category with the largest overlap.
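A minimal C++ sketch of the one-hot constraint and largest-overlap assignment is shown below; the function names (overlapVolume, oneHotCullTo, clusterFromCullTo) are hypothetical, and the AABB type from the earlier sketches is assumed.

    #include <cstdint>
    #include <vector>
    #include <algorithm>

    // Assumes the AABB type sketched earlier.
    inline float overlapVolume(const AABB& a, const AABB& b) {
        float dx = std::min(a.max.x, b.max.x) - std::max(a.min.x, b.min.x);
        float dy = std::min(a.max.y, b.max.y) - std::max(a.min.y, b.min.y);
        float dz = std::min(a.max.z, b.max.z) - std::max(a.min.z, b.min.z);
        return (dx > 0 && dy > 0 && dz > 0) ? dx * dy * dz : 0.0f;
    }

    // When the spatial clusters and the instance culling masks are linked,
    // each instance must receive a one-hot cull-to value: assign it only to
    // the candidate segment with the largest overlap.
    inline uint8_t oneHotCullTo(const AABB& instanceBox,
                                const std::vector<AABB>& segments) {
        int best = 0;
        float bestVolume = -1.0f;
        for (size_t k = 0; k < segments.size() && k < 8; ++k) {
            float v = overlapVolume(instanceBox, segments[k]);
            if (v > bestVolume) { bestVolume = v; best = int(k); }
        }
        return uint8_t(1u << best);
    }

    // At shading time, the bit offset of the one-hot cull-to value directly
    // identifies the spatial cluster holding the data for this instance.
    inline int clusterFromCullTo(uint8_t cullTo) {
        for (int bit = 0; bit < 8; ++bit)
            if (cullTo & (1u << bit)) return bit;
        return 0;
    }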
Sharing scene segmentation data between the world space clusters and the instance culling system reduces the overhead of maintaining the two methods separately, because the scene data structure only needs to be computed or updated once and can then be shared between both methods. This sacrifices the resolution of the spatial clusters (and thus quality in scenes with high frequency direct illumination) and can only be used with the spatial-only methods of programmatic instance culling masks. Nevertheless, the camera-centered instance culling cluster approach achieves acceptable rendering quality and improved performance for typical first and third person game content.
Fig. 5 is a simplified flow diagram of a method 500 of rendering graphics on a computer device according to an embodiment of the invention. The diagram is merely an example, which should not unduly limit the scope of the claims. Those of ordinary skill in the art will recognize many variations, alternatives, and modifications. For example, one or more steps may be added, deleted, repeated, replaced, modified, rearranged and/or overlapped, which should not limit the scope of the claims.
According to an example, the method 500 may be performed on a rendering system, such as the system 100 in Fig. 1. More specifically, the processor of the system may be configured by executable code stored in a system memory storage (e.g., persistent storage) to perform the operations of method 500. As shown, method 500 may include a step 502 of receiving a plurality of graphics data associated with a three-dimensional (3D) scene, the 3D scene being rasterized to determine at least a first object (or all object instances in the scene) to be intersected by primary rays passing through a plurality of screen space pixels. This may include determining all data required for the first object to be intersected by the ray passing through each pixel in the viewport, wherein the plurality of graphics data includes a plurality of vertex data associated with a plurality of vertices in the 3D scene. In an example, the method includes generating a plurality of screen space primitive data using at least the plurality of vertex data. The plurality of primitive data includes at least barycentric coordinates and a triangle index, but may also include other location data.
In step 504, the method includes providing a position vector defining the center of the global cluster data structure in three axes. In a specific example, the center of the global cluster data structure is independent of the camera location or is located at the camera location. In step 506, the method includes providing a subdivision scalar for uniformly dividing the global cluster data structure along the three axes. In step 508, the method includes partitioning the bounding box using at least the subdivision scalar and the range of the overall data structure to create clusters. Partitioning the bounding box may include any of the segmentation techniques discussed previously and variations thereof.
In step 510, the method includes mapping geometric position data to an associated cluster defined by a cluster center location. In step 512, the method includes executing, by one or more compute shaders, a thread for each cluster to generate cluster data corresponding to the cluster. In step 514, the method includes generating a culling mask for each object using the cluster data. In particular examples, the culling mask includes a cull-to mask and/or a cull-from mask. As previously described, the culling mask may be specified based on clusters corresponding to equal scene segments, or based on the material of each object and its distance relative to the camera.
In step 516, the method includes rendering at least the first object (or all objects represented in the screen space primitive buffer) by the shader using at least the cull mask and the cluster data. The shader may be configured in a computer device, such as device 100 shown in fig. 1, and may include a pipeline configuration, such as that shown in fig. 2 and 3. In addition, the method may include storing ReSTIR reservoir data, an ordered list of light sources, or the like.
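For illustration, the ordering of steps 502 through 516 might be organized as in the following C++ sketch, which reuses the hypothetical helpers sketched earlier; every type and function name here (SceneSegmentationManager, buildClusters, cullMaskForObject) is an assumption introduced for clarity and is not part of the claimed method.

    #include <cstdint>
    #include <vector>

    // Hypothetical driver mirroring steps 502-516 of method 500, reusing the
    // ClusterConfig, ClusterData, AABB, and mask helpers sketched earlier.
    struct SceneSegmentationManager {
        ClusterConfig            config;    // steps 504-506: center + subdivisions
        std::vector<AABB>        segments;  // step 508: partitioned bounding boxes
        std::vector<ClusterData> clusters;  // step 512: per-cluster data

        void buildClusters(const AABB& sceneBounds) {
            // Step 508: split the bounding box using the subdivision scalar and
            // the range of the overall data structure (layout-specific).
            // Step 510: geometric position data is mapped to clusters via
            // clusterIdFromPosition() at shading time.
            // Step 512: in practice, one compute-shader thread per cluster
            // fills clusters[i]; this CPU placeholder only sizes the array.
            clusters.resize(segments.size());
            (void)sceneBounds;
        }

        // Step 514: generate a culling mask for each object from the segments.
        uint8_t cullMaskForObject(const AABB& objectBox) const {
            return spatialOnlyMask(objectBox, segments);
        }
    };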
FIG. 6 is a simplified flow diagram of a method 600 of rendering graphics on a computer device according to an embodiment of the invention. The diagram is merely an example, which should not unduly limit the scope of the claims. Those of ordinary skill in the art will recognize many variations, alternatives, and modifications. For example, one or more steps may be added, deleted, repeated, replaced, modified, rearranged and/or overlapped, which should not limit the scope of the claims.
According to an example, the method 600 may be performed on a rendering system, such as the system 100 in Fig. 1. More specifically, the processor of the system may be configured by executable code stored in a memory storage (e.g., persistent storage) of the system to perform the operations of method 600. As shown, the method 600 may include a step 602 of generating a three-dimensional (3D) scene containing a first object (or containing all object instances in the scene). In a specific example, the 3D scene includes a first object intersected by a primary ray.
In step 604, the method includes receiving a plurality of graphics data associated with a 3D scene. In a specific example, the plurality of graphics data includes a plurality of vertex data associated with a plurality of vertices in the 3D scene. In an example, the method further includes generating a plurality of primitive data using at least the plurality of vertex data and the vertex shader, the plurality of primitive data including at least the location data.
In step 606, the method includes providing a position vector defining the center of the global cluster data structure in three axes. In step 608, the method includes providing a subdivision scalar for dividing the global cluster data structure along the three axes. In step 610, the method includes partitioning the bounding box using at least the subdivision scalar and the range of the overall data structure to create clusters. In a specific example, the bounding box is divided uniformly, divided according to a horizon in the 3D scene, or the like, and combinations thereof. In an example, the method further includes obtaining settings for partitioning the bounding box. Partitioning the bounding box may include any of the segmentation techniques discussed previously and variations thereof.
In step 612, the method includes mapping the geometric position data to an associated cluster defined by a cluster center location. In step 614, the method includes executing, by one or more compute shaders, threads for each cluster to generate cluster data corresponding to the cluster. In step 616, the method includes generating a cull mask for each object (e.g., first object) using the cluster data. As previously described, the method may further include defining a culling mask based at least on the opacity of objects in the 3D scene.
In step 618, the method includes rendering, by the shader, the 3D scene using at least the cull mask and the cluster data. As previously described, the shader may be configured in a computer device, such as device 100 shown in fig. 1, and may include a pipeline configuration, such as shown in fig. 2 and 3.
While the above is a complete description of the specific embodiments, various modifications, alternative constructions, and equivalents may be used. Accordingly, the foregoing description and drawings should not be deemed to limit the scope of the invention, which is defined by the appended claims.