Current Assignee: Telefonaktiebolaget LM Ericsson AB
Original Assignee: Telefonaktiebolaget LM Ericsson AB
Application filed by Telefonaktiebolaget LM Ericsson AB
Publication of WO2025049649A2
Publication of WO2025049649A3
H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
H04N19/134—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
H04N19/167—Position within a video image, e.g. region of interest [ROI]
H—ELECTRICITY
H04—ELECTRIC COMMUNICATION TECHNIQUE
H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
H04N19/102—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
H04N19/132—Sampling, masking or truncation of coding units, e.g. adaptive resampling, frame skipping, frame interpolation or high-frequency transform coefficient masking
H—ELECTRICITY
H04—ELECTRIC COMMUNICATION TECHNIQUE
H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
H04N19/169—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
H04N19/17—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
H04N19/172—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a picture, frame or field
H—ELECTRICITY
H04—ELECTRIC COMMUNICATION TECHNIQUE
H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
H04N19/50—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
H04N19/59—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving spatial sub-sampling or interpolation, e.g. alteration of picture size or resolution
Definitions
the focus of the existing techniques is on high-precision estimation of a user's gaze using one or a few features, rather than on robust estimation of the user's gaze for the purpose of media encoding. These methods require significant adaptation before they can be used in encoding pipelines.
the feature weights may be determined based on the determined category.
the method further includes determining one or more feature maps for one of the frames of the media stream. Determining one or more feature maps may include determining at least one of: a saliency map for the frame; an object detection map for the frame; an optical flow map for the frame; a center and/or horizon bias map for the frame; and a contrast and/or brightness map for the frame.
the method further includes determining a gaze direction density map based on the feature weights and the feature maps. Determining the gaze direction density map may include summing scores across the one or more feature maps after multiplying by the associated feature weights.
the gaze direction density map may be a probability density of gaze direction over the frame.
the method further includes determining a tile-level foveation map based on the gaze direction density map.
the method further includes encoding the frame, using the tile-level foveation map, into an encoded frame that includes encoded tiles of different resolutions.
the different resolutions may include lower resolutions for tiles that have a low score and higher resolutions for tiles that have a relatively higher score.
the method further includes transmitting the encoded frame to a user device.
the method may further include transmitting the tile-level foveation map for the media stream to the user device.
one or more embodiments of a non-transitory computer-readable medium or distributed media containing computer-executable program instructions or code portions stored thereon are disclosed for performing one or more embodiments of the methods of the present invention when executed by a processor entity of an apparatus, an electronic device, or other computing device.
Figure 1 illustrates a block diagram of an exemplary system for perceptually optimized immersive video encoding, in accordance with some embodiments.
Figure 2 illustrates a flow diagram of exemplary operations that can be performed to obtain a gaze prediction for a media stream.
Figure 3A illustrates a flow diagram of exemplary operations performed for determining a set of feature weights for a media stream, in accordance with some embodiments.
Figure 4B illustrates different exemplary frames of the media stream and the Q* predicted gaze maps generated for those frames, in accordance with some embodiments.
Figure 5 shows an example of a communication system in accordance with some embodiments.
Figure 6 shows a UE in accordance with some embodiments.
Figure 7 shows a network node in accordance with some embodiments.
Figure 8 is a block diagram of a host, which may be an embodiment of the host of
Figure 9 is a block diagram illustrating a virtualization environment in which functions implemented by some embodiments may be virtualized.
Figure 10 shows a communication diagram of a host communicating via a network node with a UE over a partially wireless connection in accordance with some embodiments.
Certain aspects of the disclosure and their embodiments may provide solutions to the challenges of gaze estimation techniques or other challenges.
the embodiments herein describe a perceptually optimized zero-shot gaze estimation method for a media stream.
the media stream is a 360-degree video.
a media stream is taken as input and a series of feature extraction, refinement, and estimation techniques are used to generate a set of user gaze likelihood estimates across tiles or subpictures for each frame or set of frames that can be passed onto the encoder to refine the encoding of media.
the feature extraction, refinement, and estimation modules used across different types of media streams are optimized via a pre-processing sequence.
a "few-shot" gaze prediction optimization method uses a lightweight few-layer neural network located on the client device. This optimization takes the zero-shot gaze prediction and the on-client gaze or headset viewport paths for each frame as inputs and sends back a revised series of weights for a user or small set of users across feature modules to the server (or an intermediate server for weight aggregation) to improve the estimation for any individual or group of users.
the embodiments described herein present a method using an adaptive consensus algorithm for gaze estimation that calibrates the feature extraction, estimator selection, and aggregation protocol for gaze estimation based on perceptual features of the media (preprocessing module discussed below).
the embodiments herein describe a media pre-processing module that calibrates the weights to apply when aggregating gaze-prediction estimators using text and optical flow based cues extracted from a sample of frames; an adaptive, highly modular gaze estimation mechanism that takes as input several feature-based estimators of saliency, movement, object classification, and subjective interestingness, weights them via the pre-processing module, and adapts them iteratively to the task of gaze estimation (e.g., through thresholding, dynamic weighting, and foveation); and a protocol that transforms the gaze estimation score into a weighted quality-gaze indicator (Q*) to feed to a media encoder for downstream media transport and processing.
Some embodiments present a client-side federated learning-based solution for providing privacy-preserving refinements to the feature weighting introduced in the gaze estimation protocol based on individual or aggregated user data.
the embodiments herein enhance privacy preservation in video encoding systems.
the embodiments allow for a more privacy-preserving implementation of gaze prediction by removing the need to collect and offload highly identifiable gaze or headset viewport information to a streaming server. Even in the few-shot implementation below, gaze information is used only on the client side, thus preserving the privacy of the viewer even when gaze information is used to refine the estimator.
user device 104A can be a standalone HMD that is operative to connect to a remote electronic device 102 through network 105 without an intermediary electronic device.
One or more of user devices 104A-N are operative to receive encoded media streams, decode, and display the media streams.
User devices 104A-N are operative to decode and render several types of 360° video content that may be encoded and bandwidth-optimized according to the embodiments described in additional detail below.
one or more of the user devices 104A-N is operative to run a lightweight learning model (for few-shot refinement).
the user device includes processing capabilities that allow it to run a lightweight learning model estimating weights across features for a given user experience.
System 100 further includes remote electronic device 102.
Remote electronic device 102 is an electronic device that is remote from a user device, e.g., from user devices 104A-N (e.g., connected to the user device through a wide area network (WAN)). Alternatively, or additionally, remote electronic device 102 connects to the user device through a local area network.
Remote electronic device 102 includes optional decoder 131, media streams 112, gaze map determiner 117, encoder 114, and transmitter 121.
the remote electronic device 102 includes a graphics processing unit (GPU). The GPU is operative to perform tensor-level operations and other graphics processing operations to accelerate one or more of the operations described below with respect to the gaze map determiner 117 and/or the encoder 114.
Decoder 131 is operative to decode and process video inputs.
the video decoder may include a pipeline that takes an encoded video stream as input and decodes it into a manipulable/transmittable form.
the output of the decoder can include media streams 112.
decoder 131 may not be included, as the media stream can be received in a decoded form.
Remote electronic device 102 further includes encoder 114.
Encoder 114 is operative to encode or compress the media stream using a codec according to one or more video encoding formats, e.g., H.264 or Advanced Video Coding (MPEG-4 AVC), High Efficiency Video Coding (HEVC) or H.265 (MPEG-H Part 2), H.262 (MPEG-2), MPEG-4 Part 2, Alliance for Open Media (AOMedia) Video 1 (AV1), H.266 or Versatile Video Coding (VVC), Future Video Coding (FVC), etc.
encoder 114 is operative to perform tile encoding.
encoder 114 is operative to generate encoded media streams of multiple bitrate representations of an input video stream corresponding to a 360° immersive video asset or program.
Each bitrate representation has a certain video quality level and may be encoded to contain frames with appropriately modified tile, frame, and/or slice data that optimizes bandwidth, video quality, and/or latency of the media stream's distribution.
System 100 is operative to enable predictive mixed-scale encoding of media streams.
Gaze map determiner 117 is operative to determine a Q* predicted gaze map.
the Q* predicted gaze map is generated as described in further detail with reference to Figures 2-4.
an encoded scene is generated from the media stream and based on the Q* predicted gaze map.
the encoded scene includes encoded tiles determined based on the set of foveation weights of the Q* predicted gaze map.
the encoded scene is transmitted to be displayed on a user device.
the set of foveation weights is a first set of foveation weights and the foveation weight map includes a second set of foveation weights.
the encoded scene further includes second encoded tiles determined based on the second set of foveation weights.
the second encoded tiles are of lower resolution than the first encoded tiles.
Figure 2 illustrates a flow diagram of exemplary operations that can be performed to obtain a gaze prediction for a media stream.
a media stream 112A is input to the gaze map determiner 117.
the media stream 112A results from the decoding of an encoded video stream.
the media stream can include a sequence of RGB frames.
the media stream is fed to a pre-processing module 220.
Pre-processing module 220 is operative to categorize the media stream into one of several categories. Pre-processing module 220 further determines, based on the assigned category, a set of one or more weights. Each weight is associated with a feature type. The weights are to be used for combining the feature maps of the media stream.
pre-processing module 220 may include sampler 221. Sampler 221 is operative to select one or more frames from the media stream. In some embodiments, sampler 221 selects a sample of N frames (e.g., 20 or 30 frames) from the plurality of frames that form the media stream.
the set of frames is used by the category determiner 222 to determine a category for the media stream.
the category determiner 222 may determine the category to assign to or associate with the media stream based on one or more parameters.
determining a category includes determining a pair of parameters including a first and a second parameter.
the first parameter, which is also referred to as the clutter parameter, is indicative of a degree of clutter or dispersion of regions of interest in a frame.
a region of interest can include a person, an object, a type of vegetation or landscape, or any other type of object that can be of interest to the viewer in a scene/frame.
the second parameter, which is also referred to as the camera motion parameter, is indicative of the degree of camera motion for the media stream.
the weight determination may fall back to increasing the weights of the saliency models and of the simple biases.
the optical flow-based method is used to judge camera motion; intermediate levels of optical flow suggest one to a few moving objects, which humans tend to track with gaze. In such cases, optical flow is upweighted.
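For illustration only, the following minimal Python sketch shows one way a (clutter, camera motion) category could be mapped to feature weights. The thresholds and weight values are assumptions made for this example and are not taken from the disclosure.

```python
# Illustrative sketch only: mapping a (clutter, camera-motion) category to
# feature weights. Thresholds and weight values are assumptions.

def feature_weights(clutter, camera_motion):
    """Return weights for (saliency, objects, optical flow, biases) given
    normalized clutter and camera-motion parameters in [0, 1]."""
    if camera_motion > 0.7:
        # Strong global motion: optical flow mostly reflects the camera,
        # so fall back to saliency and simple center/horizon biases.
        return {"saliency": 0.45, "objects": 0.15, "flow": 0.05, "bias": 0.35}
    if 0.3 <= camera_motion <= 0.7:
        # Intermediate flow suggests one-to-few moving objects that viewers
        # tend to track with gaze: upweight optical flow.
        return {"saliency": 0.25, "objects": 0.25, "flow": 0.40, "bias": 0.10}
    if clutter > 0.6:
        # Static camera but cluttered scene: lean on saliency and objects.
        return {"saliency": 0.40, "objects": 0.35, "flow": 0.05, "bias": 0.20}
    # Static, uncluttered scene: simple biases dominate.
    return {"saliency": 0.30, "objects": 0.20, "flow": 0.05, "bias": 0.45}

print(feature_weights(clutter=0.2, camera_motion=0.5))
```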
Gaze map determiner 117 is operative to determine, based on the feature weights, a gaze path for the media stream. Gaze map determiner 117 initializes the gaze direction for a frame. In some embodiments, the gaze direction is initialized at the center of the frame. Gaze map determiner 117 further includes a feature maps determiner 230. Feature maps determiner 230 is operative to determine one or more feature maps for each frame of the media stream.
determining one or more feature maps for the frame includes determining one or more of a saliency map 231 for the frame, an object detection map 232 for the frame, an optical flow map 233 for the frame, a center bias and/or horizon bias map 234 for the frame, and/or a contrast/brightness map 235 for the frame.
determining an object detection map 232 for the frame includes using an object detection model to obtain bounding boxes or instance segmentations of predefined objects of interest (e.g., humans). Using the bounding boxes, a binary mask is constructed where each pixel is labelled positive or negative based on whether it is part of an object of interest or not.
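As a purely illustrative sketch, the bounding-box-to-binary-mask step could look as follows, assuming boxes are given as (x0, y0, x1, y1) pixel coordinates for detected objects of interest (e.g., humans); the detector producing the boxes is out of scope here.

```python
import numpy as np

def object_detection_map(frame_shape, boxes):
    """Label each pixel positive (1.0) if it falls inside any bounding box."""
    mask = np.zeros(frame_shape[:2], dtype=np.float32)
    h, w = mask.shape
    for x0, y0, x1, y1 in boxes:
        x0, x1 = max(0, int(x0)), min(w, int(x1))
        y0, y1 = max(0, int(y0)), min(h, int(y1))
        mask[y0:y1, x0:x1] = 1.0
    return mask

# Example: two detections in a 256x512 frame.
print(object_detection_map((256, 512), [(40, 60, 120, 200), (300, 10, 380, 90)]).sum())
```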
determining an optical flow map 233 includes using a model for optical flow to derive a map representing the motion (e.g., magnitude of motion) of each pixel in a frame.
determining center bias includes adding a positive bias in the center of the frame (e.g., the 2x2 center in a 4x8 tiled frame). This may be uniform (constant) or weighted according to the saliency in each quadrant of the frame.
determining horizon bias includes adding a positive bias in the center row of the frame (e.g., a 2x8 horizontal band in a 4x8 tiled frame).
Each of these feature map determiners produces a feature frame for each frame of the media stream that is of the same width and height as the frame, with rescaling applied if necessary.
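The uniform center and horizon bias maps can be illustrated with the following sketch, which follows the 4x8 tiling example above (a 2x2 center block and a 2x8 horizontal band); the bias value of 1.0 is an arbitrary choice made only for this example.

```python
import numpy as np

# Uniform center and horizon bias maps for a 4x8 tiled frame, expanded to
# pixel resolution. Rescaling may be needed if the frame size is not an
# exact multiple of the tile grid.

def center_bias_map(height, width, rows=4, cols=8):
    bias = np.zeros((rows, cols), dtype=np.float32)
    bias[rows // 2 - 1: rows // 2 + 1, cols // 2 - 1: cols // 2 + 1] = 1.0  # 2x2 center block
    return np.kron(bias, np.ones((height // rows, width // cols), dtype=np.float32))

def horizon_bias_map(height, width, rows=4, cols=8):
    bias = np.zeros((rows, cols), dtype=np.float32)
    bias[rows // 2 - 1: rows // 2 + 1, :] = 1.0  # 2x8 horizontal band
    return np.kron(bias, np.ones((height // rows, width // cols), dtype=np.float32))

print(center_bias_map(256, 512).shape, horizon_bias_map(256, 512).sum())
```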
Combiner 240 sums scores across the feature maps after multiplying by the associated feature weights to obtain a gaze direction density map 241, referred to as f, which can be considered a probability density of gaze direction over the frame.
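A minimal sketch of this combining step is shown below, assuming each feature map has already been rescaled to the frame dimensions; normalizing the weighted sum so that it sums to one yields a map usable as a probability density of gaze direction.

```python
import numpy as np

def combine_feature_maps(feature_maps, feature_weights):
    """Weighted sum of feature maps, normalized to a gaze direction density."""
    stacked = np.stack(feature_maps, axis=0)              # (F, H, W)
    weights = np.asarray(feature_weights).reshape(-1, 1, 1)
    density = (weights * stacked).sum(axis=0)
    density -= density.min()                               # keep non-negative
    return density / (density.sum() + 1e-12)

# Toy usage with three random maps standing in for saliency, flow, and bias.
rng = np.random.default_rng(0)
maps = [rng.random((256, 512)) for _ in range(3)]
print(combine_feature_maps(maps, [0.5, 0.3, 0.2]).sum())   # ~1.0
```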
Raw gaze direction predictor 250 determines, from the gaze direction density map 241, one or more predicted gaze directions for the frame.
a predicted gaze direction is the gaze direction expected of most viewers when viewing the frame of the media stream.
a foveation area is created.
the foveation area can be centered at the predicted gaze direction (when the predicted gaze direction is a single entity). Alternatively, the foveation area can be determined from several regions, each region being centered at one of the predicted gaze directions for the frame.
a foveation area includes a set of foveation weights.
a weight from the set of foveation weights is indicative of a resolution at which to encode a tile from one or more tiles forming a scene or a frame.
Each weight of the foveation area is a score derived from the scores determined across the feature maps after being multiplied by the associated feature weights.
Raw gaze direction predictor 250 selects the maximum of the foveation area as the updated predicted gaze direction.
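The following sketch illustrates one possible realization of this step: the peak of the density map is taken as the predicted gaze direction, the scores around it are attenuated to form a raw foveation map, and the maximum of that map gives the updated prediction. The Gaussian falloff and its width are assumptions used only for illustration.

```python
import numpy as np

def foveated_scores(density, sigma_frac=0.1):
    """Build a raw foveation map around the density peak and return it
    together with the updated predicted gaze direction."""
    h, w = density.shape
    gy, gx = np.unravel_index(np.argmax(density), density.shape)   # predicted gaze
    yy, xx = np.mgrid[0:h, 0:w]
    sigma = sigma_frac * max(h, w)
    falloff = np.exp(-((yy - gy) ** 2 + (xx - gx) ** 2) / (2 * sigma ** 2))
    scores = density * falloff                                      # raw foveation map
    new_gy, new_gx = np.unravel_index(np.argmax(scores), scores.shape)
    return scores, (new_gy, new_gx)

rng = np.random.default_rng(0)
scores, gaze = foveated_scores(rng.random((256, 512)))
print(gaze)
```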
Foveated Q* gaze predictor 260 aggregates, based on a type of tiling scheme (uniform or adaptive), the scores of the raw foveation map into a tile-level foveation map, where a tile includes a plurality of pixels.
the foveation map includes a set of foveation weights.
a foveation weight is associated with a tile of the frame.
a foveation weight from the set of foveation weights is indicative of a resolution at which to encode a tile from one or more tiles forming the frame.
the foveation weight is the average of the pixels it covers from the raw foveation map. This forms the final Q* predicted gaze map 122.
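A minimal sketch of this tile-level aggregation is shown below, assuming a uniform 4x8 tiling; an adaptive tiling would only change the tile boundaries.

```python
import numpy as np

def tile_level_foveation(raw_map, rows=4, cols=8):
    """Average the raw (pixel-level) foveation map over each tile."""
    h, w = raw_map.shape
    th, tw = h // rows, w // cols
    cropped = raw_map[:th * rows, :tw * cols]
    return cropped.reshape(rows, th, cols, tw).mean(axis=(1, 3))   # (rows, cols)

rng = np.random.default_rng(0)
q_star = tile_level_foveation(rng.random((256, 512)))
print(q_star.shape)   # (4, 8)
```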
the Q* predicted gaze map 122 is then used by the encoder 114 to encode the frame of the media stream into an encoded frame that includes encoded tiles of different resolutions. For example, if a given tile has a low score, low-quality encoding is generated for that tile, as the chance of the user looking at that tile is low. Alternatively, when a given tile has a high score, high-quality encoding is generated for that tile, as the chance of the user looking at the tile is high.
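As an illustration of how the Q* tile scores could drive per-tile quality selection, the sketch below buckets tiles into three quality levels using quantile thresholds; the number of levels and the thresholds are assumptions and would in practice be tuned to the encoder and the bitrate budget.

```python
import numpy as np

def tile_quality_levels(q_star, quantiles=(0.5, 0.85)):
    """Return 0 (low), 1 (medium) or 2 (high) encoding quality per tile."""
    lo, hi = np.quantile(q_star, quantiles)
    levels = np.zeros_like(q_star, dtype=np.int64)
    levels[q_star >= lo] = 1
    levels[q_star >= hi] = 2
    return levels

rng = np.random.default_rng(0)
print(tile_quality_levels(rng.random((4, 8))))
```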
a privacy-aware few-shot refinement can be performed.
a lightweight neural network is initialized with identity weights. These will henceforth be called the refiner networks, as they refine the zero-shot gaze estimation (Q* predicted gaze map) based on real-world data.
At the user device 104A, as the user watches videos, gaze data for that specific user, representing their unique viewing patterns, is collected.
the Q* predicted gaze maps received from remote electronic device 102 are fed into the lightweight refiner network of the user device 104A, whose outputs are refined Q* predicted gaze maps optimized for the specific user device 104A.
the architecture of the refiner network is a lightweight image-to-image dense prediction model, such as a lightweight UNet.
the refiner network is trained on the collected gaze data using an appropriate objective function, such as minimizing the mean squared error.
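A sketch of such a refiner network and its on-device training loop is given below, assuming PyTorch; the stripped-down residual convolutional model stands in for the lightweight UNet mentioned above and is not the actual architecture of the disclosure.

```python
import torch
import torch.nn as nn

class TinyRefiner(nn.Module):
    """Illustrative image-to-image refiner; the residual connection means the
    network initially behaves close to the identity mapping."""
    def __init__(self, channels=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, 1, 3, padding=1),
        )

    def forward(self, q_star):               # q_star: (N, 1, H, W)
        return q_star + self.net(q_star)     # residual refinement

def train_refiner(refiner, q_star_maps, gaze_maps, epochs=5, lr=1e-3):
    """q_star_maps / gaze_maps: tensors of shape (N, 1, H, W) collected on-device."""
    opt = torch.optim.Adam(refiner.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(refiner(q_star_maps), gaze_maps)
        loss.backward()
        opt.step()
    return refiner

# Toy usage with random stand-ins for the collected data.
refiner = train_refiner(TinyRefiner(), torch.rand(16, 1, 4, 8), torch.rand(16, 1, 4, 8))
```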
remote electronic device 102 requests, from a selection of user devices (e.g., a random selection), the weights of their refiner networks.
the server will then average the received weights and set the result as the weights of its own refiner network, which has the same architecture.
no gaze data is communicated between the user devices and the remote electronic device 102, consequently preserving users' privacy.
remote electronic device 102 first computes the raw zero-shot score and then passes it through its refiner network to obtain an optimized score, which it then uses for encoding the media stream before transmitting the encoded media stream to a user device.
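The server-side aggregation can be sketched as plain federated averaging of the clients' refiner weights, as below; only model weights are exchanged, never gaze data. The tiny architecture used here is an assumption made only to keep the example self-contained.

```python
import torch
import torch.nn as nn

def make_refiner():
    # Assumed tiny architecture, identical on the server and on every client.
    return nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(8, 1, 3, padding=1))

def average_refiner_weights(state_dicts):
    """Element-wise average of client state_dicts sharing one architecture."""
    avg = {}
    for key in state_dicts[0]:
        avg[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return avg

# Toy usage: three "clients" report their weights, the server averages them.
clients = [make_refiner().state_dict() for _ in range(3)]
server = make_refiner()
server.load_state_dict(average_refiner_weights(clients))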
Figure 3A illustrates a flow diagram of exemplary operations performed for determining a set of feature weights for a media stream, in accordance with some embodiments.
the operations can be performed in a remote electronic device.
the remote electronic device selects one or more frames from the media stream. In some embodiments, prior to selecting the one or more frames, the remote electronic device is operative to receive the media stream. In some embodiments, the remote electronic device receives the media stream in an encoded format and is operative to decode the media stream prior to the selection of the frames. In some embodiments, the selection of the frames can be performed as described with respect to the sampler 221. The flow of operations moves to operation 322.
the remote electronic device determines a category for the media stream. In some embodiments, determining the category for the media stream can include operations 322A and 322B.
the remote electronic device determines, based on the set of one or more frames, a camera motion parameter for the media stream. The camera motion parameter for the media stream is indicative of a degree of motion of the camera for the media stream.
the remote electronic device determines, based on the set of one or more frames, a clutter parameter for the media stream. The clutter parameter is indicative of a degree of clutter of regions of interest for the media stream.
operations 322, 322A, and 322B are performed as described with reference to Figure 2 and the category determiner 222, camera motion parameter determiner 222A, and the clutter parameter determiner 222B.
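For illustration, the sketch below computes crude proxies for these two parameters: camera motion as the mean absolute difference between consecutive sampled frames, and clutter as the spatial spread of high-saliency pixels. A production pipeline would more likely use an optical-flow model and an object detector or saliency model, as described with reference to Figure 2; the measures below are assumptions used only to illustrate the idea.

```python
import numpy as np

def camera_motion_parameter(frames):
    """Mean absolute difference between consecutive sampled frames, scaled to [0, 1]."""
    diffs = [np.abs(a.astype(np.float32) - b.astype(np.float32)).mean()
             for a, b in zip(frames[:-1], frames[1:])]
    return float(np.mean(diffs) / 255.0)

def clutter_parameter(saliency_maps, threshold=0.5):
    """Spatial dispersion of salient pixels, averaged over the sampled frames."""
    spreads = []
    for sal in saliency_maps:
        ys, xs = np.nonzero(sal > threshold * sal.max())
        h, w = sal.shape
        spreads.append((ys.std() / h + xs.std() / w) / 2 if len(ys) > 1 else 0.0)
    return float(np.mean(spreads))

# Toy usage on random grayscale frames and saliency maps.
rng = np.random.default_rng(0)
frames = [rng.integers(0, 256, (120, 240), dtype=np.uint8) for _ in range(5)]
sals = [rng.random((120, 240)) for _ in range(5)]
print(camera_motion_parameter(frames), clutter_parameter(sals))
```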
the flow of operations moves to operation 323.
the remote electronic device determines, based on the category for the media stream, a set of one or more feature weights.
the category includes a pair of parameters including a clutter parameter and a camera motion parameter.
Each feature weight is associated with a feature parameter and is indicative of how much to emphasize the effect of that parameter in the determination of a gaze direction. If the category is not used, the determination of feature weight(s) can be done differently.
Figure 3B illustrates a flow diagram of exemplary operations performed for determining a gaze path for the media stream, in accordance with some embodiments. While the embodiments herein are described with respect to operations performed for a frame, the operations can be performed for a scene from the media stream, where a scene includes a portion of a frame, a frame, or one or more frames with similar visual content. In some embodiments, the operations are performed in a remote electronic device.
the remote electronic device initializes the gaze direction for the frame.
the remote electronic device determines one or more feature maps for a frame of the media stream. The flow moves to operation 330.
the remote electronic device 102 determines one or more feature maps for each frame of the media stream. In some embodiments, determining one or more feature maps includes operations 331-335.
the remote electronic device 102 determines a saliency map for the frame.
remote electronic device 102 determines an object detection map for the frame.
remote electronic device 102 determines an optical flow map for the frame.
remote electronic device 102 determines a center and/or horizon bias map for the frame.
remote electronic device 102 determines a contrast and/or brightness map for the frame. The flow moves to operation 335.
remote electronic device 102 determines, based on the feature weights and the feature maps, a gaze direction density map.
the flow of operations moves to operation 350.
the remote electronic device 102 determines, based on the gaze direction density map, a tile-level foveation map (e.g., Q* predicted gaze map) that is to be used for encoding the frame of the media stream into an encoded frame that includes encoded tiles of different resolutions.
Figure 4A illustrates exemplary frames of the media stream and the Q* predicted gaze maps generated for those frames, in accordance with some embodiments.
Figure 4B illustrates different exemplary frames of the media stream and the Q* predicted gaze maps generated for those frames, in accordance with some embodiments.
the leftmost column in each Figure represents the raw frame.
the middle column in each Figure is the foveated gaze density prediction produced by the model.
the rightmost column in each Figure is the ground-truth gaze information used for testing.
the ground-truth gaze information represents where real viewers have looked when shown the same frames, based on gaze tracking of those viewers.
Video of a drone flying over a lake: based on the pre-processing module, this video is categorized as having high camera motion and low clutter.
the network node 1004 receives user data from the UE 1006 and initiates transmission of the received user data towards the host 1002.
the host 1002 receives the user data carried in the transmission initiated by the UE 1006.
a measurement procedure may be provided for the purpose of monitoring data rate, latency, and other factors on which the one or more embodiments improve.
the measurement procedure and/or the network functionality for reconfiguring the OTT connection may be implemented in software and hardware of the host 1002 and/or UE 1006.
sensors (not shown) may be deployed in or in association with other devices through which the OTT connection 1050 passes; the sensors may participate in the measurement procedure by supplying values of the monitored quantities exemplified above, or supplying values of other physical quantities from which software may compute or estimate the monitored quantities.
Determining, calculating, obtaining, or similar operations described herein may be performed by processing circuitry, which may process information by, for example, converting the obtained information into other information, comparing the obtained information or converted information to information stored in the network node, and/or performing one or more operations based on the obtained information or converted information, and as a result of said processing making a determination.
computing devices may comprise multiple different physical components that make up a single illustrated component, and functionality may be partitioned between separate components.
a communication interface may be configured to include any of the components described herein, and/or the functionality of the components may be partitioned between the processing circuitry and the communication interface.
non-computationally intensive functions of any of such components may be implemented in software or firmware, and computationally intensive functions may be implemented in hardware.
Embodiment 3. The method of any of the previous embodiments, wherein the determining a category for the media stream includes: determining, based on the set of one or more frames, a camera motion parameter for the media stream, wherein the camera motion parameter is indicative of a degree of motion of the camera for the media stream.
Embodiment 4. The method of any of the previous embodiments, wherein the determining one or more feature maps includes: determining at least one of a saliency map for the frame; an object detection map for the frame; an optical flow map for the frame; a center and/or horizon bias map for the frame; and a contrast and/or brightness map for the frame.
Landscapes
Engineering & Computer Science (AREA)
Multimedia (AREA)
Signal Processing (AREA)
Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
Compression Or Coding Systems Of Tv Signals (AREA)