PERCEPTUALLY OPTIMIZED IMMERSIVE VIDEO ENCODING
RELATED APPLICATIONS
[0001] This application claims the benefit of U.S. Provisional Patent Application No. 63/535,004, filed August 28, 2023, the disclosure of which is hereby incorporated herein by reference in its entirety.
TECHNICAL FIELD
[0002] Embodiments of the invention relate to the field of video encoding; and more specifically, to perceptually-optimized immersive video encoding.
BACKGROUND
[0003] Scanpath prediction involves predicting where a viewer will focus their gaze while viewing a piece of media. In the case of video media, a scanpath is a time-indexed series of points. Some existing scanpath prediction techniques use neural network-based methods for gaze prediction in panoramic videos, where existing gaze data is used to train a Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM)-based model for this task. Zero-shot scanpath prediction departs from conventional scanpath prediction techniques in assuming that no such training data is available. One such zero-shot scanpath prediction technique assumes that the user's gaze is attracted to specific regions according to a model derived from gravitational mechanics. Another zero-shot approach, in the 2D image case, uses a multimodal model such as contrastive language-image pretraining (CLIP) to provide more informed gaze predictions when text captions are available.
[0004] Saliency prediction involves identifying which areas of a scene, either objects or regions, appear to stand out from their neighboring regions. One such saliency prediction technique integrates information about brightness, contrast, and patterns at different scales to produce a final saliency map. Recently, neural network-based approaches have been outperforming classical models on this task, where training data collected from human subjective experiments and specific training objectives are employed. One neural network-based technique uses deformable convolutions and patch embeddings from vision transformers with consistency losses to find interesting regions. However, saliency maps tend to account for a limited portion of the variance in actual scanpaths.
[0005] Mixed-scale encoding refers to encoding techniques that allow for the encoding of tiles in a frame at differing pixel densities and resolutions according to a set of one or more rules. Mixed-scale encoding reduces the bandwidth requirement of video transmission by taking advantage of foveation and reducing the resolution of tiles unlikely to be within the foveation area of a user's gaze.
[0006] There currently exist certain challenges with each of scanpath prediction, saliency prediction, and mixed-scale encoding.
Challenges of Scanpath Prediction Techniques
[0007] Scanpath prediction techniques can include non-deterministic/neural network-based techniques or deterministic/parametric techniques. The non-deterministic/neural network-based techniques require large amounts of training data, which are expensive to gather at scale and constitute a significant risk to user privacy since gaze data can be linked to media to infer what individuals were focusing on at any given time. These techniques also assume that virtual reality (VR) client devices have gaze tracking capabilities, which is not currently the state of the market for consumer-grade VR client devices and is unlikely to be so for the foreseeable future given their high cost. In the case of the deterministic/parametric scanpath prediction techniques, the focus of the existing techniques is on high-precision estimation of a user's gaze using one or a few features rather than robust estimation of the user's gaze for the purpose of media encoding. These methods require significant adaptation for usability in encoding pipelines.
Challenges of Saliency Prediction Techniques
[0008] Saliency indicators are useful as baseline estimators of gaze for further refinement, but they are not aligned with the goals of gaze estimation for encoding pipelines. Saliency prediction techniques produce broad saliency maps that require significant refinement for the estimation of gaze, and the process of transforming them into gaze predictors is non-trivial. Further, these techniques tend to work best on still images and do not produce predictions with enough temporal precision for encoding pipeline tasks, since they handle temporal dependence broadly without relation to gaze prediction aside from general interestingness.
Challenges with Mixed-Scale Encoding
[0009] Mixed-scale encoding provides a method for achieving higher effective content resolutions at a lower bandwidth cost, while also allowing for high-resolution VR experiences at lower bitrates. However, the existing mixed-scale encoding systems rely on a large set of potential content encodings across candidate resolutions and the constant transmission of gaze or headset inertial measurement unit (IMU) data from a client to determine which resolutions to splice together in the encoding pipeline.
SUMMARY
[0010] In one aspect, a method for a remote electronic device encoding a media stream is described. The method includes selecting a set of frames from the media stream. The method may further include determining a category for the media stream based on the selected set of frames. Determining the category may include, based on the selected set of frames, determining a clutter parameter for the media stream, where the clutter parameter is indicative of a degree of clutter of regions of interest for the media stream. Determining the category may include, based on the selected set of frames, determining a camera motion parameter for the media stream, where the camera motion parameter is indicative of a degree of motion of the camera for the media stream. The method further includes determining a set of one or more feature weights. The feature weights may be determined based on the determined category. The method further includes determining one or more feature maps for one of the frames of the media stream. Determining one or more feature maps may include determining at least one of a saliency map for the frame; an object detection map for the frame; an optical flow map for the frame; a center and/or horizon bias map for the frame; and a contrast and/or brightness map for the frame. The method further includes determining a gaze direction density map based on the feature weights and the feature maps. Determining the gaze direction density map may include summing scores across the one or more feature maps after multiplying by the associated feature weights. The gaze direction density map may be a probability density of gaze direction over the frame. The method further includes determining a tile-level foveation map based on the gaze direction density map. The method further includes encoding the frame using the tile-level foveation map into an encoded frame that includes encoded tiles of different resolutions. The different resolutions may include lower resolutions for tiles that have a low score and higher resolutions for tiles that have a relatively higher score. The method further includes transmitting the encoded frame to a user device. The method may further include transmitting the tile-level foveation map for the media stream to the user device.
[0011] In further aspects, one or more embodiments of a non-transitory computer-readable medium or distributed media containing computer-executable program instructions or code portions stored thereon are disclosed for performing one or more embodiments of the methods of the present invention when executed by a processor entity of an apparatus, an electronic device, or other computing device.
BRIEF DESCRIPTION OF THE DRAWINGS
[0012] The invention may best be understood by referring to the following description and accompanying drawings that are used to illustrate embodiments of the invention. In the drawings:
[0013] Figure 1 illustrates a block diagram of an exemplary system for perceptually optimized immersive video encoding, in accordance with some embodiments.
[0014] Figure 2 illustrates a flow diagram of exemplary operations that can be performed to obtain a gaze prediction for a media stream.
[0015] Figure 3A illustrates a flow diagram of exemplary operations performed for determining a set of feature weights for a media stream, in accordance with some embodiments.
[0016] Figure 3B illustrates a flow diagram of exemplary operations performed for determining a gaze path for the media stream, in accordance with some embodiments.
[0017] Figure 4A illustrates exemplary frames of the media stream and the Q* predicted gaze maps generated for those frames, in accordance with some embodiments.
[0018] Figure 4B illustrates different exemplary frames of the media stream and the Q* predicted gaze maps generated for those frames, in accordance with some embodiments.
[0019] Figure 5 shows an example of a communication system in accordance with some embodiments.
[0020] Figure 6 shows a UE in accordance with some embodiments.
[0021] Figure 7 shows a network node in accordance with some embodiments.
[0022] Figure 8 is a block diagram of a host, which may be an embodiment of the host of Figure 5, in accordance with various aspects described herein.
[0023] Figure 9 is a block diagram illustrating a virtualization environment in which functions implemented by some embodiments may be virtualized.
[0024] Figure 10 shows a communication diagram of a host communicating via a network node with a UE over a partially wireless connection in accordance with some embodiments.
DETAILED DESCRIPTION
[0025] Certain aspects of the disclosure and their embodiments may provide solutions to the challenges of gaze estimation techniques or other challenges. The embodiments herein describe a perceptually optimized zero-shot gaze estimation method for a media stream. In some embodiments, the media stream is a 360-degree video. According to these embodiments, a media stream is taken as input and a series of feature extraction, refinement, and estimation techniques are used to generate a set of user gaze likelihood estimates across tiles or subpictures for each frame or set of frames that can be passed onto the encoder to refine the encoding of media. In some embodiments, the feature extraction, refinement, and estimation modules used across different types of media streams are optimized via a pre-processing sequence.
[0026] In additional embodiments, a “few shot” gaze prediction optimization method is presented. This method uses a lightweight few-layer neural network located on the client device. This optimization takes zero-shot gaze prediction and the on-client gaze or headset viewport paths for each frame as inputs and sends back a revised series of weights for a user or small set of users across feature modules to the server (or an intermediate server for weight aggregation) to improve the estimation for any individual or group of users.
[0027] The embodiments described herein present a method using an adaptive consensus algorithm for gaze estimation that calibrates the feature extraction, estimator selection, and aggregation protocol for gaze estimation based on perceptual features of the media (preprocessing module discussed below).
[0028] The embodiments herein describe a media pre-processing module that calibrates the weights to apply when aggregating gaze-prediction estimators using text- and optical flow-based cues extracted from a sample of frames; an adaptive, highly modular gaze estimation mechanism that takes as input several feature-based estimators of saliency, movement, object classification, and subjective interestingness, weights them via the pre-processing module, and adapts them iteratively to the task of gaze estimation (e.g., through thresholding, dynamic weighting, and foveation); and a protocol that transforms the gaze estimation score into a weighted quality-gaze indicator (Q*) to feed to a media encoder for downstream media transport and processing.
[0029] Some embodiments present a client-side federated learning-based solution for providing privacy-preserving refinements to the feature weighting introduced in the gaze estimation protocol based on individual or aggregated user data.
[0030] Certain embodiments may provide one or more of the following technical advantage(s). For example, the embodiments disclosed herein provide an adaptive protocol transforming the information they glean from scanpath prediction methods into useful predictive gaze information to feed to the encoder. Gaze prediction and refinement methods described herein reduce the set of encodings needed for the process and allow optimizations to be made on which encodings can be stitched in the encoding pipeline several (to several hundred) frames ahead in the streaming process. This method enables mixed-scale encoding methods using gaze prediction and client-based refinement of gaze estimation instead of relying on client gaze or IMU data, which constitutes a significant privacy improvement on existing systems.
[0031] The embodiments herein enhance the efficiency of video encoding systems. The embodiments allow for the prediction of gaze without the use of eye-tracking or headset-tracking data, which constitutes a significant improvement over neural network-based methods that require significant training data to operate. Even with such data available, this method provides a preliminary gaze-tracking functionality for media assets for which no gaze data is yet available, which constitutes a significant advantage since it allows for the refinement of the encoding process before the threshold of gaze data needed to train the models is available for new media assets.
[0032] The embodiments herein enhance the performance of video encoding systems. The embodiments presented herein outperform off-the-shelf saliency and gaze predictors, including neural network-based methods, indicating substantial potential performance advantages. In addition, the modularity of the embodiments provides for the potential of stronger out-of-sample performance via the pre-processing module introduced below.
[0033] The embodiments herein enhance privacy preservation in video encoding systems. The embodiments allow for a more privacy-preserving implementation of gaze prediction by removing the need to collect and offload highly identifiable gaze or headset viewport information to a streaming server. Even in the few-shot implementation below, gaze information is used only on the client-side, thus preserving the privacy of the viewer even when gaze information is used to refine the estimator.
[0034] Some of the embodiments contemplated herein will now be described more fully with reference to the accompanying drawings. Embodiments are provided by way of example to convey the scope of the subject matter to those skilled in the art.
[0035] Figure 1 illustrates a block diagram of an exemplary system for perceptually optimized immersive video encoding, in accordance with some embodiments. System 100 includes one or more user devices 104A-N, remote electronic device 102, and network 105. System 100 is operative to record, generate, encode, decode, and/or display media streams for 2D and/or 3D (or 360°) video viewing on a user device. By way of illustration, system 100 includes user devices 104A-N. A user device, e.g., user device 104A, can be or include a computer 103A such as a laptop or a desktop and/or a tablet or smartphone 103B associated with head-mounted displays (HMDs) or headsets 103C. In some embodiments, user device 104A can be a standalone HMD that is operative to connect to a remote electronic device 102 through network 105 without an intermediary electronic device. One or more of user devices 104A-N are operative to receive encoded media streams, decode, and display the media streams. User devices 104A-N are operative to decode and render several types of 360° video content that may be encoded and bandwidth-optimized according to the embodiments described in additional detail below. In some embodiments, one or more of the user devices 104A-N is operative to run a lightweight learning model (for few-shot refinement). For example, the user device includes processing capabilities that allow it to run a lightweight learning model estimating weights across features for a given user experience.
[0036] System 100 further includes remote electronic device 102. Remote electronic device 102 is an electronic device that is remote from a user device, e.g., from user devices 104A-N (e.g., connected to the user device through a wide area network (WAN)). Alternatively, or additionally, remote electronic device 102 connects to the user device through a local area network. Remote electronic device 102 includes optional decoder 131, media streams 112, gaze map determiner 117, encoder 114, and transmitter 121. In some embodiments, the remote electronic device 102 includes a graphics processing unit (GPU). The GPU is operative to perform tensor-level operations and other graphics processing operations to accelerate one or more of the operations described below with respect to the gaze map determiner 117 and/or the encoder 114.
[0037] Decoder 131 is operative to decode and process video inputs. The video decoder may include a pipeline that takes an encoded video stream as input and decodes it into a manipulable/transmittable form. The output of the decoder can include media streams 112. In some embodiments, decoder 131 may not be included as the media stream can be received in a decoded form.
[0038] Remote electronic device 102 further includes encoder 114. Encoder 114 is operative to encode or compress the media stream using a codec according to one or more video encoding formats, e.g., H.264 or Advanced Video Coding (MPEG-4 AVC), High Efficiency Video Coding (HEVC) or H.265 (MPEG-H Part 2), H.262 (MPEG-2), MPEG-4 Part 2, Alliance for Open Media (AOMedia) Video 1 (AV1), H.266 or Versatile Video Coding (VVC), Future Video Coding (FVC), etc. In some embodiments, encoder 114 is operative to perform tile encoding. In some embodiments, encoder 114 is operative to generate encoded media streams of multiple bitrate representations of an input video stream corresponding to a 360° immersive video asset or program. Each bitrate representation has a certain video quality level and may be encoded to contain frames with appropriately modified tile, frame, and/or slice data that optimizes bandwidth, video quality, and/or latency of the media stream's distribution.
[0039] System 100 is operative to enable predictive mixed-scale encoding of media streams. Gaze map determiner 117 is operative to determine a Q* predicted gaze map. In some embodiments, the Q* predicted gaze map is generated as described in further detail with reference to Figures 2-4.
[0040] In some embodiments, an encoded scene is generated from the media stream and based on the Q* predicted gaze map. The encoded scene includes first encoded tiles determined based on the set of foveation weights of the Q* predicted gaze map. The encoded scene is transmitted to be displayed on a user device. In some embodiments, the set of foveation weights is a first set of foveation weights and the foveation weight map includes a second set of foveation weights. In these embodiments, the encoded scene further includes second encoded tiles determined based on the second set of foveation weights. The second encoded tiles are of lower resolution than the first encoded tiles.
[0041] The operations in the flow diagrams will be described with reference to the exemplary embodiments of the other figures. However, the operations of the flow diagrams can be performed by embodiments of the invention other than those discussed with reference to the other figures, and the embodiments of the invention discussed with reference to these other figures can perform operations different than those discussed with reference to the flow diagrams.
[0042] Figure 2 illustrates a flow diagram of exemplary operations that can be performed to obtain a gaze prediction for a media stream. A media stream 112A is input to the gaze map determiner 117. In some embodiments, the media stream 112A results from the decoding of an encoded video stream. In an exemplary embodiment, the media stream can include a sequence of RGB frames.
[0043] In some embodiments, the media stream is fed to a pre-processing module 220. Pre-processing module 220 is operative to categorize the media stream into one of several categories. Pre-processing module 220 further determines, based on the assigned category, a set of one or more weights. Each weight is associated with a feature type. The weights are to be used for combining the feature maps of the media stream. In some embodiments, pre-processing module 220 may include sampler 221. Sampler 221 is operative to select one or more frames from the media stream. In some embodiments, sampler 221 selects a sample of N frames (e.g., 20 or 30 frames) from the plurality of frames that form the media stream. The set of frames is used by the category determiner 222 to determine a category for the media stream. The category determiner 222 may determine the category to assign to or associate with the media stream based on one or more parameters. In some embodiments, determining a category includes determining a pair of parameters including a first and second parameter. The first parameter, which is also referred to as the clutter parameter, is indicative of a degree of clutter or dispersion of regions of interest in a frame. In a non-limiting example, a region of interest can include a person, an object, a type of vegetation or landscape, or any other type of object that can be of interest to the viewer in a scene/frame. The second parameter, which is also referred to as the camera motion parameter, is indicative of the degree of camera motion for the media stream. In some embodiments, the category for the media stream includes the pair (camera motion parameter, clutter parameter). In one embodiment, the camera motion parameter is a camera motion class from a low camera motion class, a medium camera motion class, or a high camera motion class; and the clutter parameter is a clutter class from a low clutter class, a medium clutter class, or a high clutter class. While three classes are described for each of the parameters, any number of classes can be defined for each one of the parameters without departing from the scope of the embodiments herein. While some embodiments are described where a single category (e.g., a pair of camera motion parameter and clutter parameter) is determined for the entire media stream, in other embodiments, multiple categories can be determined for the media stream, where each category is associated with a portion of the media stream (e.g., a scene, a frame, a set of frames, etc.).
[0044] In one embodiment, category determiner 222 determines the clutter parameter by classifying, based on the sampled frames, the media stream into one of multiple clutter classes (e.g., low clutter class, medium clutter class, or high clutter class), where each class indicates a different degree of clutter of the regions of interest in the media stream. In some embodiments, classifying the media stream is performed according to an object detection model (e.g., a YOLO model trained on the COCO dataset) to count the average number of interesting objects (e.g., humans) across the sampled frames, and classify based on set thresholds on the count. In some embodiments, classifying the media stream is performed according to a saliency prediction model (e.g., Itti et al. or PAVER) to obtain a map of regions of interest averaged over the sampled frames, and classify based on the overall magnitude and dispersion of regions of interest (mean and standard deviation).
[0045] In one embodiment, category determiner 222 determines the camera motion parameter by classifying, based on the sampled frames, the media stream into one of multiple camera motion classes (e.g., low camera motion class, medium camera motion class, or high camera motion class), where each class indicates a different degree of camera motion for the media stream. In some embodiments, the degree of camera motion can be determined according to an optical flow model (e.g., RAFT) to obtain the average flow magnitude across the sampled frames and classify based on set thresholds on the optical flow. In some embodiments, the degree of camera motion can be determined according to a visual Simultaneous Localization and Mapping (SLAM) or Structure from Motion (SfM) model (e.g., SfM-Learner) to estimate camera pose from the sampled frames and use the rate of change of camera pose for classification.
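The sketch below gives a minimal illustration of the categorization described in the two preceding paragraphs. It assumes that per-frame statistics (a count of detected objects of interest and a mean optical flow magnitude) have already been produced by whichever object detection and optical flow models are used; the threshold values and helper names are hypothetical placeholders rather than prescribed values.

```python
import numpy as np

# Hypothetical thresholds separating low/medium/high classes; in practice these
# would be tuned for the chosen object detection and optical flow models.
CLUTTER_THRESHOLDS = (2.0, 6.0)   # mean objects of interest per sampled frame
MOTION_THRESHOLDS = (0.5, 2.0)    # mean optical flow magnitude per sampled frame

def classify(value, thresholds, labels=("low", "medium", "high")):
    """Map a scalar statistic onto a low/medium/high class via set thresholds."""
    low, high = thresholds
    if value < low:
        return labels[0]
    return labels[1] if value < high else labels[2]

def categorize(object_counts, flow_magnitudes):
    """Return the (camera motion, clutter) category for a set of sampled frames."""
    clutter = classify(float(np.mean(object_counts)), CLUTTER_THRESHOLDS)
    motion = classify(float(np.mean(flow_magnitudes)), MOTION_THRESHOLDS)
    return motion, clutter

# Example: 20 sampled frames with moderate clutter and low camera motion.
category = categorize(object_counts=np.random.poisson(3, 20),
                      flow_magnitudes=np.random.uniform(0.1, 0.4, 20))
```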
[0046] The pre-processing module 220 may include the feature weights determiner 223. The feature weights determiner 223 determines, based on the category for the media stream, a set of one or more feature weights for the video stream. Each feature weight is associated with a feature parameter and is indicative of how much to emphasize the effect of that parameter in the determination of a gaze direction. In some embodiments, a feature weight lookup table can be used to determine, based on the category, a corresponding weight for each feature from a set of features. The weight selection may be based on functional relationships between certain types of media and the features that are likely to increase or decrease the relevance of those features for performance on that type of media.
[0047] In some embodiments, the feature weights can be determined according to one or more of the following relationships. When the camera's motion increases, optical flow is composed mostly of relative motion between the camera and its surroundings, and not absolute motion of objects. In this case, optical flow may be downweighted, and object detection is instead upweighted to find areas of interest. Further, in the same scenario, a user may have a sense of progressing towards a destination, which causes them to focus on the center of the frame. Thus, the center bias may be weighted highly in cases with high camera motion. When the clutter in the scene increases, the number of interesting points for the viewer to watch increases, which leads to more exploration. Hence, the weights for horizon bias and object detection maps increase with increasing clutter. When both camera motion and clutter are low, the weight determination may fall back to increasing the weights of the saliency models and the simple biases. When the optical flow-based method is used to judge camera motion, intermediate levels of optical flow suggest one to a few moving objects, which humans tend to track with their gaze. In such cases, optical flow is upweighted.
[0048] In some embodiments, the set of feature weights can be determined based on Table 1 below. Selecting the feature weights based on the category (clutter parameter, camera motion parameter) provides a significant performance improvement over merely adding the features together. This is because each parameter relates to a specific feature of human vision and therefore translates into different weights in the determination of gaze based on underlying features of the image stream.
Table 1
[0049] Table 1: Values of weights used in each category of content. The tuples in each cell refer to weights for saliency maps (both structural and neural), object detection, optical flow, center bias and/or horizon bias, and contrast/brightness, respectively. The categories on the horizontal axis represent the camera motion parameter, while those on the vertical axis represent the clutter parameter. In some embodiments, the weights are user-modified, optimized based on an objective function, or obtained over a series of runs for each feature (an "active learning" approach).
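One possible in-memory representation of such a lookup table is sketched below. The numeric weight tuples are illustrative placeholders only and do not reproduce the values of Table 1; the tuple ordering follows the caption above (saliency, object detection, optical flow, center/horizon bias, contrast/brightness).

```python
# Illustrative feature-weight lookup keyed by (camera motion, clutter) category.
# The weight values are placeholders, not the values of Table 1; in practice they
# would come from the table, from user modification, or from optimization.
FEATURE_NAMES = ("saliency", "object_detection", "optical_flow",
                 "center_horizon_bias", "contrast_brightness")

FEATURE_WEIGHTS = {
    ("low", "low"):       (0.5, 0.10, 0.10, 0.20, 0.10),  # fall back to saliency and simple biases
    ("low", "high"):      (0.2, 0.40, 0.10, 0.20, 0.10),  # high clutter: emphasize objects and horizon bias
    ("medium", "medium"): (0.2, 0.30, 0.30, 0.10, 0.10),  # few moving objects: track via optical flow
    ("high", "low"):      (0.3, 0.30, 0.05, 0.30, 0.05),  # camera motion: downweight flow, upweight center bias
    # ... remaining (camera motion, clutter) categories filled in analogously
}

def lookup_feature_weights(category):
    """Return a dict mapping each feature name to its weight for the given category."""
    return dict(zip(FEATURE_NAMES, FEATURE_WEIGHTS[category]))

weights = lookup_feature_weights(("medium", "medium"))
```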
[0050] Gaze map determiner 117 is operative to determine, based on the feature weights, a gaze path for the media stream. Gaze map determiner 117 initializes the gaze direction for a frame. In some embodiments, the gaze direction is initialized at the center of the frame. Gaze map determiner 117 further includes a feature maps determiner 230. Feature maps determiner 230 is operative to determine one or more feature maps for each frame of the media stream. In some embodiments, determining one or more feature maps for the frame includes determining one or more of a saliency map 231 for the frame, an object detection map 232 for the frame, an optical flow map 233 for the frame, a center bias and/or horizon bias map 234 for the frame, and/or a contrast/brightness map 235 for the frame.
[0051] In some embodiments, determining a saliency map 231 is based on structural information, e.g., using a model that uses intensity, color, and orientation at multiple scales to define points of interest. In some embodiments, determining a saliency map 231 for the frame is performed according to a neural network, using the output of a neural network explicitly trained to predict saliency.
[0052] In some embodiments, determining an object detection map 232 for the frame includes using an object detection model to obtain bounding boxes or instance segmentations of predefined objects of interest (e.g., humans). Using the bounding boxes, a binary mask is constructed where each pixel is labelled positive or negative based on whether it is part of an object of interest or not.
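As a minimal sketch of this step, the function below rasterizes detector bounding boxes, assumed to be given as (x0, y0, x1, y1) pixel coordinates for the predefined classes of interest, into the binary mask described above.

```python
import numpy as np

def object_detection_map(frame_height, frame_width, boxes):
    """Build a binary object-of-interest mask from detector bounding boxes.

    boxes: iterable of (x0, y0, x1, y1) pixel coordinates, assumed to come from
    an object detection model restricted to predefined classes of interest.
    """
    mask = np.zeros((frame_height, frame_width), dtype=np.float32)
    for x0, y0, x1, y1 in boxes:
        # Pixels inside any box are labelled positive; all others remain negative.
        mask[int(y0):int(y1), int(x0):int(x1)] = 1.0
    return mask

# Example: two detected persons in a 1080x1920 frame.
mask = object_detection_map(1080, 1920, [(100, 300, 220, 600), (900, 350, 1010, 640)])
```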
[0053] In some embodiments, determining an optical flow map 233 includes using a model for optical flow, deriving a map representing the motion (e.g., magnitude of motion) of each pixel in a frame.
[0054] In some embodiments, determining center bias includes adding a positive bias in the center of the frame (e.g., the 2x2 center in a 4x8 tiled frame). This bias may be uniform (constant) or weighted as per saliency in each quadrant of the frame. In some embodiments, determining horizon bias includes adding a positive bias in the center row of the frame (e.g., a 2x8 horizontal band in a 4x8 tiled frame). Each of these feature map determiners produces a feature map for each frame of the media stream that is of the same width and height as the frame, with rescaling applied if necessary.
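A minimal sketch of the uniform (constant) variant of these two bias maps is shown below, assuming the 4x8 tile grid of the example above; the bias magnitudes are illustrative placeholders.

```python
import numpy as np

def center_and_horizon_bias(frame_height, frame_width, rows=4, cols=8,
                            center_bias=1.0, horizon_bias=0.5):
    """Uniform center and horizon bias maps on a rows x cols tile grid.

    Following the example above, the center bias covers the 2x2 central tiles of
    a 4x8 tiled frame, and the horizon bias covers the central 2x8 horizontal
    band. Both maps have the same width and height as the frame.
    """
    tile_h, tile_w = frame_height // rows, frame_width // cols
    r0, c0 = rows // 2 - 1, cols // 2 - 1   # top-left tile of the 2x2 center block

    center = np.zeros((frame_height, frame_width), dtype=np.float32)
    center[r0 * tile_h:(r0 + 2) * tile_h, c0 * tile_w:(c0 + 2) * tile_w] = center_bias

    horizon = np.zeros((frame_height, frame_width), dtype=np.float32)
    horizon[r0 * tile_h:(r0 + 2) * tile_h, :] = horizon_bias  # central two-row band

    return center, horizon

center_map, horizon_map = center_and_horizon_bias(1080, 1920)
```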
[0055] Combiner 240 sums scores across the feature maps after multiplying by the associated feature weights to obtain a gaze direction density map 241, which can be considered a probability density of gaze direction over the frame. Raw gaze direction predictor 250 determines from the gaze direction density map 241 one or more predicted gaze directions for the frame. A predicted gaze direction is the direction in which most viewers are expected to gaze when viewing the frame of the media stream. Based on the one or more predicted gaze directions, a foveation area is created. The foveation area can be centered at the predicted gaze direction (when the predicted gaze direction is a single entity). Alternatively, the foveation area can be determined from several regions, each region being centered at one of the predicted gaze directions for the frame. A foveation area includes a set of foveation weights. A weight from the set of foveation weights is indicative of a resolution at which to encode a tile from one or more tiles forming a scene or a frame. Each weight of the foveation area is a score derived from the scores determined across the feature maps after being multiplied by the associated feature weights. Raw gaze direction predictor 250 selects the maximum of the foveation area as the updated predicted gaze direction.
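The sketch below illustrates these two steps: a weighted sum of per-pixel feature maps normalized into a gaze direction density, followed by a foveation window around the previous gaze direction whose maximum becomes the updated prediction. The Gaussian window shape and its spread are assumptions made for illustration, and the feature maps are assumed to be pre-scaled to a common range.

```python
import numpy as np

def gaze_density(feature_maps, weights):
    """Weighted sum of per-pixel feature maps, normalized to a probability density."""
    combined = sum(weights[name] * feature_maps[name] for name in feature_maps)
    total = combined.sum()
    return combined / total if total > 0 else np.full_like(combined, 1.0 / combined.size)

def foveated_prediction(density, prev_gaze, sigma=150.0):
    """Foveate the density around the previous gaze and pick the new gaze direction.

    The Gaussian window and its spread (sigma, in pixels) are illustrative
    assumptions; any window centered on the predicted gaze direction(s) could be used.
    """
    h, w = density.shape
    ys, xs = np.mgrid[0:h, 0:w]
    window = np.exp(-((xs - prev_gaze[1]) ** 2 + (ys - prev_gaze[0]) ** 2) / (2 * sigma ** 2))
    foveated = density * window                                        # raw foveation map
    new_gaze = np.unravel_index(np.argmax(foveated), foveated.shape)   # max as updated gaze
    return foveated, new_gaze

# Example with placeholder feature maps on a small equirectangular frame.
maps = {"saliency": np.random.rand(180, 360), "optical_flow": np.random.rand(180, 360)}
w = {"saliency": 0.7, "optical_flow": 0.3}
foveated_map, gaze = foveated_prediction(gaze_density(maps, w), prev_gaze=(90, 180))
```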
[0056] The operations described above are repeated for each frame of the media stream to obtain raw foveation maps for the media stream, each raw foveation map associated with a frame from the media stream.
[0057] Foveated Q* gaze predictor 260 aggregates, based on a type of tiling scheme (uniform or adaptive), the scores of the raw foveation map into a tile-level foveation map, where a tile includes a plurality of pixels. The foveation map includes a set of foveation weights. A foveation weight is associated with a tile of the frame. A foveation weight from the set of foveation weights is indicative of a resolution at which to encode a tile from one or more tiles forming the frame. In some embodiments, the foveation weight is the average of the pixels it covers from the raw foveation map. This forms the final Q* predicted gaze map 122.
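For the uniform tiling case, this aggregation reduces to block-averaging the raw foveation map over the tile grid, as in the following sketch (an adaptive scheme would average over non-uniform tile boundaries instead).

```python
import numpy as np

def tile_level_foveation_map(raw_map, rows=4, cols=8):
    """Aggregate a per-pixel raw foveation map into a uniform rows x cols tile map.

    Each tile's foveation weight is the average of the raw-map pixels that the
    tile covers, forming the Q* predicted gaze map for the frame.
    """
    h, w = raw_map.shape
    tile_h, tile_w = h // rows, w // cols
    trimmed = raw_map[:rows * tile_h, :cols * tile_w]      # drop any remainder pixels
    tiles = trimmed.reshape(rows, tile_h, cols, tile_w)
    return tiles.mean(axis=(1, 3))                         # one Q* weight per tile

q_star = tile_level_foveation_map(np.random.rand(180, 360))   # 4x8 tile-level map
```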
[0058] Referring back to Figure 1, the Q* predicted gaze map 122 is then used by the encoder 114 to encode the frame of the media stream into an encoded frame that includes encoded tiles of different resolutions. For example, if a given tile has a low score, a low-quality encoding is generated for that tile, as the chance of the user looking at that tile is low. Alternatively, when a given tile has a high score, a high-quality encoding is generated for that tile, as the chance of the user looking at the tile is high.
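One simple, purely illustrative way to map tile-level Q* weights to encoding tiers is sketched below; the normalization and the tier thresholds are assumptions, and an encoder could equally map the weights to quantization parameters or to pre-encoded tile representations.

```python
import numpy as np

def resolution_tiers(q_star, thresholds=(0.33, 0.66)):
    """Map tile-level Q* weights to illustrative encoding tiers (0=low ... 2=high)."""
    normalized = q_star / q_star.max()          # rescale so the most likely tile is 1.0
    return np.digitize(normalized, thresholds)  # 0 below 0.33, 1 in between, 2 above 0.66

tiers = resolution_tiers(np.array([[0.01, 0.05], [0.07, 0.09]]))
```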
Few-Shot Gaze Optimization (client-side federated learning architecture)
[0059] In some embodiments, when gaze data is available (e.g., a log of user gaze information), a privacy-aware few-shot refinement can be performed. At both the user device 104A and the remote electronic device 102, a lightweight neural network is initialized with identity weights. These networks will henceforth be called the refiner networks, as they refine the zero-shot gaze estimation (Q* predicted gaze map) based on real-world data. Whenever a user device 104A requests a media stream, both the encodings and the Q* predicted gaze maps for the media stream are delivered by the remote electronic device via the network 105.
[0060] On the user device 104A, as the user watches videos, gaze data for that specific user representing their unique viewing patterns is collected. The Q* predicted gaze maps received from remote electronic device 102 are fed into the lightweight refiner network of the user device 104A, whose outputs are refined Q* predicted gaze maps optimized for the specific user device 104A. The architecture of the refiner network is a lightweight image-to-image dense prediction model, such as a lightweight UNet. The refiner network is trained on the collected gaze data using an appropriate objective function, such as minimizing the mean square error. Periodically, remote electronic device 102 requests the weights of the refiner networks from a selection of user devices (e.g., a random selection). Remote electronic device 102 then averages the received weights and sets them as the weights of its own refiner network, which has the same architecture. According to these embodiments, no gaze data is communicated between the user devices and the remote electronic device 102, consequently preserving users' privacy. For further computations of the Q* predicted gaze maps, remote electronic device 102 first computes the raw zero-shot score and then passes it through its refiner network to obtain an optimized score, which it then uses for encoding the media stream before transmitting the encoded media streams to a user device.
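The weight-averaging step on the remote electronic device can be as simple as the sketch below, in which each refiner network's parameters are represented as a dictionary of numpy arrays; the refiner architecture itself (a lightweight UNet-like model) is not reproduced here, and only the weights, never the gaze data, are assumed to leave the user devices.

```python
import numpy as np

def average_refiner_weights(client_state_dicts):
    """Federated averaging of refiner-network weights collected from user devices.

    client_state_dicts: list of dicts mapping parameter names to numpy arrays,
    one dict per selected user device. Only these weights are transmitted;
    the gaze data used to train them stays on each device.
    """
    averaged = {}
    for name in client_state_dicts[0]:
        averaged[name] = np.mean([sd[name] for sd in client_state_dicts], axis=0)
    return averaged

# Example with two clients sharing a single (hypothetical) 3x3 convolution kernel.
clients = [{"conv1.weight": np.random.randn(3, 3)} for _ in range(2)]
server_weights = average_refiner_weights(clients)
```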
[0061] Figure 3A illustrates a flow diagram of exemplary operations performed for determining a set of feature weights for a media stream, in accordance with some embodiments. In some embodiments, the operations can be performed in a remote electronic device.
[0062] At operation 321, the remote electronic device selects one or more frames from the media stream. In some embodiments, prior to selecting the one or more frames, the remote electronic device is operative to receive the media stream. In some embodiments, the remote electronic device receives the media stream in an encoded format and is operative to decode the media stream prior to the selection of the frames. In some embodiments, the selection of the frames can be performed as described with respect to the sampler 221. The flow of operations moves to operation 322.
[0063] At operation 322, which is optional in some embodiments, the remote electronic device determines a category for the media stream. In some embodiments, determining the category for the media stream can include operations 322A and 322B. At operation 322A, the remote electronic device determines, based on the set of one or more frames, a camera motion parameter for the media stream. The camera motion parameter for the media stream is indicative of a degree of motion of the camera for the media stream. At operation 322B, the remote electronic device determines, based on the set of one or more frames, a clutter parameter for the media stream. The clutter parameter is indicative of a degree of clutter of regions of interest for the media stream. In some embodiments, operations 322, 322A, and 322B are performed as described with reference to Figure 2 and the category determiner 222, the camera motion parameter determiner 222A, and the clutter parameter determiner 222B. The flow of operations moves to operation 323.
[0064] At operation 323, the remote electronic device determines, based on the category for the media stream, a set of one or more feature weights. The category includes a pair of parameters including the clutter parameter and the camera motion parameter. Each feature weight is associated with a feature parameter and is indicative of how much to emphasize the effect of that parameter in the determination of a gaze direction. If the category is not used, the determination of the feature weight(s) can be done differently. As an example, the feature weight(s) may be deterministically selected, randomly selected, uniformly selected, or learned based on a model.
[0065] Figure 3B illustrates a flow diagram of exemplary operations performed for determining a gaze path for the media stream, in accordance with some embodiments. While the embodiments herein are described with respect to operations performed for a frame, the operations can be performed for a scene from the media stream, where a scene includes a portion of a frame, a frame, or one or more frames with similar visual content. In some embodiments, the operations are performed in a remote electronic device.
[0066] At operation 325, the remote electronic device initializes the gaze direction for the frame. The flow moves to operation 330. At operation 330, the remote electronic device 102 determines one or more feature maps for each frame of the media stream. In some embodiments, determining one or more feature maps includes operations 331-335. At operation 331, the remote electronic device 102 determines a saliency map for the frame. At operation 332, remote electronic device 102 determines an object detection map for the frame. At operation 333, remote electronic device 102 determines an optical flow map for the frame. At operation 334, remote electronic device 102 determines a center and/or horizon bias map for the frame. At operation 335, remote electronic device 102 determines a contrast and/or brightness map for the frame. The flow moves to operation 340. At operation 340, remote electronic device 102 determines, based on the feature weights and the feature maps, a gaze direction density map. The flow of operations moves to operation 350. At operation 350, the remote electronic device 102 determines, based on the gaze direction density map, a tile-level foveation map (e.g., Q* predicted gaze map) that is to be used for encoding the frame of the media stream into an encoded frame that includes encoded tiles of different resolutions.
[0067] Figure 4A illustrates exemplary frames of the media stream and the Q* predicted gaze maps generated for those frames, in accordance with some embodiments. Figure 4B illustrates different exemplary frames of the media stream and the Q* predicted gaze maps generated for those frames, in accordance with some embodiments. The leftmost column in each Figure represents the raw frame, the middle column in each Figure is the foveated gaze density prediction by the model, and the rightmost column in each Figure is the ground-truth gaze information used for testing. The ground-truth gaze information represents where real viewers have looked when shown the same frames, based on gaze tracking of the real viewers. A video of a drone flying over a lake, for example, is categorized by the pre-processing module as having high camera motion and low clutter. As a result, a high weight is assigned to central bias and saliency, and low weights to the other features. A video of a person walking their dog through a park, such as shown in Figures 4A and 4B, may be categorized by the pre-processing module as having medium camera motion and medium clutter. Such a frame includes a few moving objects, and human gaze is expected to track them. The weights are balanced among central bias, object detection, and optical flow. In another example (not illustrated), when the media stream is a video of a crowd of humans standing in front of the Eiffel tower, the pre-processing module would categorize it as having low camera motion and high clutter. In this case, humans are expected to explore the scene in a free-moving way. Thus, horizon bias and object detection receive high weights, with lower weights for the other modules.
[0068] Figure 5 shows an example of a communication system 500 in accordance with some embodiments.
[0069] In the example, the communication system 500 includes a telecommunication network 502 that includes an access network 504, such as a radio access network (RAN), and a core network 506, which includes one or more core network nodes 508. The access network 504 includes one or more access network nodes, such as network nodes 510a and 510b (one or more of which may be generally referred to as network nodes 510), or any other similar 3rd Generation Partnership Project (3GPP) access nodes or non-3GPP access points. Moreover, as will be appreciated by those of skill in the art, a network node is not necessarily limited to an implementation in which a radio portion and a baseband portion are supplied and integrated by a single vendor. Thus, it will be understood that network nodes include disaggregated implementations or portions thereof. For example, in some embodiments, the telecommunication network 502 includes one or more Open-RAN (ORAN) network nodes. An ORAN network node is a node in the telecommunication network 502 that supports an ORAN specification (e.g., a specification published by the O-RAN Alliance, or any similar organization) and may operate alone or together with other nodes to implement one or more functionalities of any node in the telecommunication network 502, including one or more network nodes 510 and/or core network nodes 508.
[0070] Examples of an ORAN network node include an open radio unit (O-RU), an open distributed unit (O-DU), an open central unit (O-CU), including an O-CU control plane (O-CU-CP) or an O-CU user plane (O-CU-UP), a RAN intelligent controller (near-real time or non-real time) hosting software or software plug-ins, such as a near-real time control application (e.g., xApp) or a non-real time control application (e.g., rApp), or any combination thereof (the adjective "open" designating support of an ORAN specification). The network node may support a specification by, for example, supporting an interface defined by the ORAN specification, such as an A1, F1, W1, E1, E2, X2, or Xn interface, an open fronthaul user plane interface, or an open fronthaul management plane interface. Moreover, an ORAN access node may be a logical node in a physical node. Furthermore, an ORAN network node may be implemented in a virtualization environment (described further below) in which one or more network functions are virtualized. For example, the virtualization environment may include an O-Cloud computing platform orchestrated by a Service Management and Orchestration Framework via an O2 interface defined by the O-RAN Alliance or comparable technologies. The network nodes 510 facilitate direct or indirect connection of user equipment (UE), such as by connecting UEs 512a, 512b, 512c, and 512d (one or more of which may be generally referred to as UEs 512) to the core network 506 over one or more wireless connections.
[0071] Example wireless communications over a wireless connection include transmitting and/or receiving wireless signals using electromagnetic waves, radio waves, infrared waves, and/or other types of signals suitable for conveying information without the use of wires, cables, or other material conductors. Moreover, in different embodiments, the communication system 500 may include any number of wired or wireless networks, network nodes, UEs, and/or any other components or systems that may facilitate or participate in the communication of data and/or signals whether via wired or wireless connections. The communication system 500 may include and/or interface with any type of communication, telecommunication, data, cellular, radio network, and/or other similar type of system.
[0072] The UEs 512 may be any of a wide variety of communication devices, including wireless devices arranged, configured, and/or operable to communicate wirelessly with the network nodes 510 and other communication devices. Similarly, the network nodes 510 are arranged, capable, configured, and/or operable to communicate directly or indirectly with the UEs 512 and/or with other network nodes or equipment in the telecommunication network 502 to enable and/or provide network access, such as wireless network access, and/or to perform other functions, such as administration in the telecommunication network 502.
[0073] In the depicted example, the core network 506 connects the network nodes 510 to one or more hosts, such as host 516. These connections may be direct or indirect via one or more intermediary networks or devices. In other examples, network nodes may be directly coupled to hosts. The core network 506 includes one or more core network nodes (e.g., core network node 508) that are structured with hardware and software components. Features of these components may be substantially similar to those described with respect to the UEs, network nodes, and/or hosts, such that the descriptions thereof are generally applicable to the corresponding components of the core network node 508. Example core network nodes include functions of one or more of a Mobile Switching Center (MSC), Mobility Management Entity (MME), Home Subscriber Server (HSS), Access and Mobility Management Function (AMF), Session Management Function (SMF), Authentication Server Function (AUSF), Subscription Identifier De-concealing Function (SIDF), Unified Data Management (UDM), Security Edge Protection Proxy (SEPP), Network Exposure Function (NEF), and/or a User Plane Function (UPF).
[0074] The host 516 may be under the ownership or control of a service provider other than an operator or provider of the access network 504 and/or the telecommunication network 502, and may be operated by the service provider or on behalf of the service provider. The host 516 may host a variety of applications to provide one or more services. Examples of such applications include live and pre-recorded audio/video content, data collection services such as retrieving and compiling data on various ambient conditions detected by a plurality of UEs, analytics functionality, social media, functions for controlling or otherwise interacting with remote devices, functions for an alarm and surveillance center, or any other such function performed by a server.
[0075] As a whole, the communication system 500 of Figure 5 enables connectivity between the UEs, network nodes, and hosts. In that sense, the communication system may be configured to operate according to predefined rules or procedures, such as specific standards that include, but are not limited to: Global System for Mobile Communications (GSM); Universal Mobile Telecommunications System (UMTS); Long Term Evolution (LTE), and/or other suitable 2G, 3G, 4G, 5G standards, or any applicable future generation standard (e.g., 6G); wireless local area network (WLAN) standards, such as the Institute of Electrical and Electronics Engineers (IEEE) 802.11 standards (WiFi); and/or any other appropriate wireless communication standard, such as the Worldwide Interoperability for Microwave Access (WiMax), Bluetooth, Z-Wave, Near Field Communication (NFC), ZigBee, LiFi, and/or any low-power wide-area network (LPWAN) standards such as LoRa and Sigfox.
[0076] In some examples, the telecommunication network 502 is a cellular network that implements 3GPP standardized features. Accordingly, the telecommunications network 502 may support network slicing to provide different logical networks to different devices that are connected to the telecommunication network 502. For example, the telecommunications network 502 may provide Ultra Reliable Low Latency Communication (URLLC) services to some UEs, while providing Enhanced Mobile Broadband (eMBB) services to other UEs, and/or Massive Machine Type Communication (mMTC)/Massive IoT services to yet further UEs.
[0077] In some examples, the UEs 512 are configured to transmit and/or receive information without direct human interaction. For instance, a UE may be designed to transmit information to the access network 504 on a predetermined schedule, when triggered by an internal or external event, or in response to requests from the access network 504. Additionally, a UE may be configured for operating in single- or multi-RAT or multi-standard mode. For example, a UE may operate with any one or combination of Wi-Fi, NR (New Radio), and LTE, i.e., being configured for multi-radio dual connectivity (MR-DC), such as E-UTRAN (Evolved-UMTS Terrestrial Radio Access Network) New Radio - Dual Connectivity (EN-DC).
[0078] In the example, the hub 514 communicates with the access network 504 to facilitate indirect communication between one or more UEs (e.g., UE 512c and/or 512d) and network nodes (e.g., network node 510b). In some examples, the hub 514 may be a controller, router, content source and analytics, or any of the other communication devices described herein regarding UEs. For example, the hub 514 may be a broadband router enabling access to the core network 506 for the UEs. As another example, the hub 514 may be a controller that sends commands or instructions to one or more actuators in the UEs. Commands or instructions may be received from the UEs, network nodes 510, or by executable code, script, process, or other instructions in the hub 514. As another example, the hub 514 may be a data collector that acts as temporary storage for UE data and, in some embodiments, may perform analysis or other processing of the data. As another example, the hub 514 may be a content source. For example, for a UE that is a VR headset, display, loudspeaker or other media delivery device, the hub 514 may retrieve VR assets, video, audio, or other media or data related to sensory information via a network node, which the hub 514 then provides to the UE either directly, after performing local processing, and/or after adding additional local content. In still another example, the hub 514 acts as a proxy server or orchestrator for the UEs, in particular if one or more of the UEs are low-energy IoT devices.
[0079] The hub 514 may have a constant/persistent or intermittent connection to the network node 510b. The hub 514 may also allow for a different communication scheme and/or schedule between the hub 514 and UEs (e.g., UE 512c and/or 512d), and between the hub 514 and the core network 506. In other examples, the hub 514 is connected to the core network 506 and/or one or more UEs via a wired connection. Moreover, the hub 514 may be configured to connect to an M2M service provider over the access network 504 and/or to another UE over a direct connection. In some scenarios, UEs may establish a wireless connection with the network nodes 510 while still connected via the hub 514 via a wired or wireless connection. In some embodiments, the hub 514 may be a dedicated hub - that is, a hub whose primary function is to route communications to/from the UEs from/to the network node 510b. In other embodiments, the hub 514 may be a non-dedicated hub - that is, a device which is capable of operating to route communications between the UEs and network node 510b, but which is additionally capable of operating as a communication start and/or end point for certain data channels.
[0080] Figure 6 shows a UE 600 in accordance with some embodiments. As used herein, a UE refers to a device capable, configured, arranged and/or operable to communicate wirelessly with network nodes and/or other UEs. Examples of a UE include, but are not limited to, a smart phone, mobile phone, cell phone, voice over IP (VoIP) phone, wireless local loop phone, desktop computer, personal digital assistant (PDA), wireless cameras, gaming console or device, music storage device, playback appliance, wearable terminal device, wireless endpoint, mobile station, tablet, laptop, laptop-embedded equipment (LEE), laptop-mounted equipment (LME), smart device, wireless customer-premise equipment (CPE), vehicle, vehicle-mounted or vehicle embedded/integrated wireless device, etc. Other examples include any UE identified by the 3rd Generation Partnership Project (3GPP), including a narrow band internet of things (NB-IoT) UE, a machine type communication (MTC) UE, and/or an enhanced MTC (eMTC) UE.
[0081] A UE may support device-to-device (D2D) communication, for example by implementing a 3GPP standard for sidelink communication, Dedicated Short-Range Communication (DSRC), vehicle-to-vehicle (V2V), vehicle-to-infrastructure (V2I), or vehicle-to-everything (V2X). In other examples, a UE may not necessarily have a user in the sense of a human user who owns and/or operates the relevant device. Instead, a UE may represent a device that is intended for sale to, or operation by, a human user but which may not, or which may not initially, be associated with a specific human user (e.g., a smart sprinkler controller).
Alternatively, a UE may represent a device that is not intended for sale to, or operation by, an end user but which may be associated with or operated for the benefit of a user (e.g., a smart power meter).
[0082] The UE 600 includes processing circuitry 602 that is operatively coupled via a bus 604 to an input/output interface 606, a power source 608, a memory 610, a communication interface 612, and/or any other component, or any combination thereof. Certain UEs may utilize all or a subset of the components shown in Figure 6. The level of integration between the components may vary from one UE to another UE. Further, certain UEs may contain multiple instances of a component, such as multiple processors, memories, transceivers, transmitters, receivers, etc.
[0083] The processing circuitry 602 is configured to process instructions and data and may be configured to implement any sequential state machine operative to execute instructions stored as machine-readable computer programs in the memory 610. The processing circuitry 602 may be implemented as one or more hardware-implemented state machines (e.g., in discrete logic, field-programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), etc.); programmable logic together with appropriate firmware; one or more stored computer programs, general-purpose processors, such as a microprocessor or digital signal processor (DSP), together with appropriate software; or any combination of the above. For example, the processing circuitry 602 may include multiple central processing units (CPUs).
[0084] In the example, the input/output interface 606 may be configured to provide an interface or interfaces to an input device, output device, or one or more input and/or output devices. Examples of an output device include a speaker, a sound card, a video card, a display, a monitor, a printer, an actuator, an emitter, a smartcard, another output device, or any combination thereof. An input device may allow a user to capture information into the UE 600. Examples of an input device include a touch-sensitive or presence-sensitive display, a camera (e.g., a digital camera, a digital video camera, a web camera, etc.), a microphone, a sensor, a mouse, a trackball, a directional pad, a trackpad, a scroll wheel, a smartcard, and the like. The presence-sensitive display may include a capacitive or resistive touch sensor to sense input from a user. A sensor may be, for instance, an accelerometer, a gyroscope, a tilt sensor, a force sensor, a magnetometer, an optical sensor, a proximity sensor, a biometric sensor, etc., or any combination thereof. An output device may use the same type of interface port as an input device. For example, a Universal Serial Bus (USB) port may be used to provide an input device and an output device.
[0085] In some embodiments, the power source 608 is structured as a battery or battery pack. Other types of power sources, such as an external power source (e.g., an electricity outlet), photovoltaic device, or power cell, may be used. The power source 608 may further include power circuitry for delivering power from the power source 608 itself, and/or an external power source, to the various parts of the UE 600 via input circuitry or an interface such as an electrical power cable. Delivering power may be, for example, for charging of the power source 608. Power circuitry may perform any formatting, converting, or other modification to the power from the power source 608 to make the power suitable for the respective components of the UE 600 to which power is supplied.
[0086] The memory 610 may be or be configured to include memory such as random access memory (RAM), read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), magnetic disks, optical disks, hard disks, removable cartridges, flash drives, and so forth. In one example, the memory 610 includes one or more application programs 614, such as an operating system, web browser application, a widget, gadget engine, or other application, and corresponding data 616. The memory 610 may store, for use by the UE 600, any of a variety of operating systems or combinations of operating systems.
[0087] The memory 610 may be configured to include a number of physical drive units, such as redundant array of independent disks (RAID), flash memory, USB flash drive, external hard disk drive, thumb drive, pen drive, key drive, high-density digital versatile disc (HD-DVD) optical disc drive, internal hard disk drive, Blu-Ray optical disc drive, holographic digital data storage (HDDS) optical disc drive, external mini-dual in-line memory module (DIMM), synchronous dynamic random access memory (SDRAM), external micro-DIMM SDRAM, smartcard memory such as tamper resistant module in the form of a universal integrated circuit card (UICC) including one or more subscriber identity modules (SIMs), such as a USIM and/or ISIM, other memory, or any combination thereof. The UICC may for example be an embedded UICC (eUICC), integrated UICC (iUICC) or a removable UICC commonly known as ‘SIM card.’ The memory 610 may allow the UE 600 to access instructions, application programs and the like, stored on transitory or non-transitory memory media, to off-load data, or to upload data. An article of manufacture, such as one utilizing a communication system may be tangibly embodied as or in the memory 610, which may be or comprise a device-readable storage medium.
[0088] The processing circuitry 602 may be configured to communicate with an access network or other network using the communication interface 612. The communication interface 612 may comprise one or more communication subsystems and may include or be communicatively coupled to an antenna 622. The communication interface 612 may include one or more transceivers used to communicate, such as by communicating with one or more remote transceivers of another device capable of wireless communication (e.g., another UE or a network node in an access network). Each transceiver may include a transmitter 618 and/or a receiver 620 appropriate to provide network communications (e.g., optical, electrical, frequency allocations, and so forth). Moreover, the transmitter 618 and receiver 620 may be coupled to one or more antennas (e.g., antenna 622) and may share circuit components, software or firmware, or alternatively be implemented separately.
[0089] In the illustrated embodiment, communication functions of the communication interface 612 may include cellular communication, Wi-Fi communication, LPWAN communication, data communication, voice communication, multimedia communication, short-range communications such as Bluetooth, near-field communication, location-based communication such as the use of the global positioning system (GPS) to determine a location, another like communication function, or any combination thereof. Communications may be implemented in accordance with one or more communication protocols and/or standards, such as IEEE 802.11, Code Division Multiplexing Access (CDMA), Wideband Code Division Multiple Access (WCDMA), GSM, LTE, New Radio (NR), UMTS, WiMax, Ethernet, transmission control protocol/internet protocol (TCP/IP), synchronous optical networking (SONET), Asynchronous Transfer Mode (ATM), QUIC, Hypertext Transfer Protocol (HTTP), and so forth.
[0090] Regardless of the type of sensor, a UE may provide an output of data captured by its sensors, through its communication interface 612, via a wireless connection to a network node. Data captured by sensors of a UE can be communicated through a wireless connection to a network node via another UE. The output may be periodic (e.g., once every 15 minutes if it reports the sensed temperature), random (e.g., to even out the load from reporting from several sensors), in response to a triggering event (e.g., when moisture is detected an alert is sent), in response to a request (e.g., a user initiated request), or a continuous stream (e.g., a live video feed of a patient).
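The reporting behaviors described in paragraph [0090] can be summarized, purely as an illustrative sketch, by the following Python snippet; the function name, interval, and jitter values are hypothetical examples and are not prescribed by this disclosure.

```python
# Illustrative sketch of the reporting modes described in [0090]; the function name,
# intervals, and jitter are hypothetical examples only.
import random

def next_report_delay(mode, event=False, requested=False,
                      base_interval=900.0, jitter=60.0):
    """Return seconds until the next sensor report (0.0 means report immediately)."""
    if requested or event:          # on-request or event-triggered (e.g., moisture alert)
        return 0.0
    if mode == "periodic":          # e.g., report the sensed temperature every 15 minutes
        return base_interval
    if mode == "randomized":        # spread reports in time to even out network load
        return base_interval + random.uniform(-jitter, jitter)
    return 0.0                      # continuous streaming: report as soon as data is available
```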
[0091] As another example, a UE comprises an actuator, a motor, or a switch, related to a communication interface configured to receive wireless input from a network node via a wireless connection. In response to the received wireless input, the states of the actuator, the motor, or the switch may change. For example, the UE may comprise a motor that adjusts the control surfaces or rotors of a drone in flight according to the received input, or a robotic arm performing a medical procedure according to the received input.
[0092] A UE, when in the form of an Internet of Things (IoT) device, may be a device for use in one or more application domains, these domains comprising, but not limited to, city wearable technology, extended industrial application and healthcare. Non-limiting examples of such an IoT device are a device which is or which is embedded in: a connected refrigerator or freezer, a TV, a connected lighting device, an electricity meter, a robot vacuum cleaner, a voice controlled smart speaker, a home security camera, a motion detector, a thermostat, a smoke detector, a door/window sensor, a flood/moisture sensor, an electrical door lock, a connected doorbell, an air conditioning system like a heat pump, an autonomous vehicle, a surveillance system, a weather monitoring device, a vehicle parking monitoring device, an electric vehicle charging station, a smart watch, a fitness tracker, a head-mounted display for Augmented Reality (AR) or Virtual Reality (VR), a wearable for tactile augmentation or sensory enhancement, a water sprinkler, an animal- or item-tracking device, a sensor for monitoring a plant or animal, an industrial robot, an Unmanned Aerial Vehicle (UAV), and any kind of medical device, like a heart rate monitor or a remote-controlled surgical robot. A UE in the form of an IoT device comprises circuitry and/or software in dependence of the intended application of the IoT device in addition to other components as described in relation to the UE 600 shown in Figure 6.
[0093] As yet another specific example, in an IoT scenario, a UE may represent a machine or other device that performs monitoring and/or measurements, and transmits the results of such monitoring and/or measurements to another UE and/or a network node. The UE may in this case be an M2M device, which may in a 3GPP context be referred to as an MTC device. As one particular example, the UE may implement the 3GPP NB-IoT standard. In other scenarios, a UE may represent a vehicle, such as a car, a bus, a truck, a ship and an airplane, or other equipment that is capable of monitoring and/or reporting on its operational status or other functions associated with its operation.
[0094] In practice, any number of UEs may be used together with respect to a single use case. For example, a first UE might be or be integrated in a drone and provide the drone’s speed information (obtained through a speed sensor) to a second UE that is a remote controller operating the drone. When the user makes changes from the remote controller, the first UE may adjust the throttle on the drone (e.g. by controlling an actuator) to increase or decrease the drone’s speed. The first and/or the second UE can also include more than one of the functionalities described above. For example, a UE might comprise the sensor and the actuator, and handle communication of data for both the speed sensor and the actuators.
[0095] Figure 7 shows a network node 700 in accordance with some embodiments. As used herein, network node refers to equipment capable, configured, arranged and/or operable to communicate directly or indirectly with a UE and/or with other network nodes or equipment, in a telecommunication network. Examples of network nodes include, but are not limited to, access points (APs) (e.g., radio access points), base stations (BSs) (e.g., radio base stations, Node Bs, evolved Node Bs (eNBs) and NR NodeBs (gNBs)), O-RAN nodes or components of an O-RAN node (e.g., O-RU, O-DU, O-CU).
[0096] Base stations may be categorized based on the amount of coverage they provide (or, stated differently, their transmit power level) and so, depending on the provided amount of coverage, may be referred to as femto base stations, pico base stations, micro base stations, or macro base stations. A base station may be a relay node or a relay donor node controlling a relay. A network node may also include one or more (or all) parts of a distributed radio base station such as centralized digital units, distributed units (e.g., in an O-RAN access node) and/or remote radio units (RRUs), sometimes referred to as Remote Radio Heads (RRHs). Such remote radio units may or may not be integrated with an antenna as an antenna integrated radio. Parts of a distributed radio base station may also be referred to as nodes in a distributed antenna system (DAS).
[0097] Other examples of network nodes include multiple transmission point (multi-TRP) 5G access nodes, multi-standard radio (MSR) equipment such as MSR BSs, network controllers such as radio network controllers (RNCs) or base station controllers (BSCs), base transceiver stations (BTSs), transmission points, transmission nodes, multi-cell/multicast coordination entities (MCEs), Operation and Maintenance (O&M) nodes, Operations Support System (OSS) nodes, Self-Organizing Network (SON) nodes, positioning nodes (e.g., Evolved Serving Mobile Location Centers (E-SMLCs)), and/or Minimization of Drive Tests (MDTs).
[0098] The network node 700 includes a processing circuitry 702, a memory 704, a communication interface 706, and a power source 708. The network node 700 may be composed of multiple physically separate components (e.g., a NodeB component and a RNC component, or a BTS component and a BSC component, etc.), which may each have their own respective components. In certain scenarios in which the network node 700 comprises multiple separate components (e.g., BTS and BSC components), one or more of the separate components may be shared among several network nodes. For example, a single RNC may control multiple NodeBs. In such a scenario, each unique NodeB and RNC pair may, in some instances, be considered a single separate network node. In some embodiments, the network node 700 may be configured to support multiple radio access technologies (RATs). In such embodiments, some components may be duplicated (e.g., separate memory 704 for different RATs) and some components may be reused (e.g., a same antenna 710 may be shared by different RATs). The network node 700 may also include multiple sets of the various illustrated components for different wireless technologies integrated into network node 700, for example GSM, WCDMA, LTE, NR, Wi-Fi, Zigbee, Z-wave, LoRaWAN, Radio Frequency Identification (RFID) or Bluetooth wireless technologies. These wireless technologies may be integrated into the same or different chip or set of chips and other components within network node 700.
[0099] The processing circuitry 702 may comprise a combination of one or more of a microprocessor, controller, microcontroller, central processing unit, digital signal processor, application-specific integrated circuit, field programmable gate array, or any other suitable computing device, resource, or combination of hardware, software and/or encoded logic operable to provide, either alone or in conjunction with other network node 700 components, such as the memory 704, network node 700 functionality.
[0100] In some embodiments, the processing circuitry 702 includes a system on a chip (SOC). In some embodiments, the processing circuitry 702 includes one or more of radio frequency (RF) transceiver circuitry 712 and baseband processing circuitry 714. In some embodiments, the radio frequency (RF) transceiver circuitry 712 and the baseband processing circuitry 714 may be on separate chips (or sets of chips), boards, or units, such as radio units and digital units. In alternative embodiments, part or all of RF transceiver circuitry 712 and baseband processing circuitry 714 may be on the same chip or set of chips, boards, or units.
[0101] The memory 704 may comprise any form of volatile or non-volatile computer-readable memory including, without limitation, persistent storage, solid-state memory, remotely mounted memory, magnetic media, optical media, random access memory (RAM), read-only memory (ROM), mass storage media (for example, a hard disk), removable storage media (for example, a flash drive, a Compact Disk (CD) or a Digital Video Disk (DVD)), and/or any other volatile or non-volatile, non-transitory device-readable and/or computer-executable memory devices that store information, data, and/or instructions that may be used by the processing circuitry 702. The memory 704 may store any suitable instructions, data, or information, including a computer program, software, an application including one or more of logic, rules, code, tables, and/or other instructions capable of being executed by the processing circuitry 702 and utilized by the network node 700. The memory 704 may be used to store any calculations made by the processing circuitry 702 and/or any data received via the communication interface 706. In some embodiments, the processing circuitry 702 and memory 704 are integrated.
[0102] The communication interface 706 is used in wired or wireless communication of signaling and/or data between a network node, access network, and/or UE. As illustrated, the communication interface 706 comprises port(s)/terminal(s) 716 to send and receive data, for example to and from a network over a wired connection. The communication interface 706 also includes radio front-end circuitry 718 that may be coupled to, or in certain embodiments a part of, the antenna 710. Radio front-end circuitry 718 comprises filters 720 and amplifiers 722. The radio front-end circuitry 718 may be connected to an antenna 710 and processing circuitry 702. The radio front-end circuitry may be configured to condition signals communicated between antenna 710 and processing circuitry 702. The radio front-end circuitry 718 may receive digital data that is to be sent out to other network nodes or UEs via a wireless connection. The radio front-end circuitry 718 may convert the digital data into a radio signal having the appropriate channel and bandwidth parameters using a combination of filters 720 and/or amplifiers 722. The radio signal may then be transmitted via the antenna 710. Similarly, when receiving data, the antenna 710 may collect radio signals which are then converted into digital data by the radio front-end circuitry 718. The digital data may be passed to the processing circuitry 702. In other embodiments, the communication interface may comprise different components and/or different combinations of components.
[0103] In certain alternative embodiments, the network node 700 does not include separate radio front-end circuitry 718; instead, the processing circuitry 702 includes radio front-end circuitry and is connected to the antenna 710. Similarly, in some embodiments, all or some of the RF transceiver circuitry 712 is part of the communication interface 706. In still other embodiments, the communication interface 706 includes one or more ports or terminals 716, the radio front-end circuitry 718, and the RF transceiver circuitry 712, as part of a radio unit (not shown), and the communication interface 706 communicates with the baseband processing circuitry 714, which is part of a digital unit (not shown).
[0104] The antenna 710 may include one or more antennas, or antenna arrays, configured to send and/or receive wireless signals. The antenna 710 may be coupled to the radio front-end circuitry 718 and may be any type of antenna capable of transmitting and receiving data and/or signals wirelessly. In certain embodiments, the antenna 710 is separate from the network node 700 and connectable to the network node 700 through an interface or port.
[0105] The antenna 710, communication interface 706, and/or the processing circuitry 702 may be configured to perform any receiving operations and/or certain obtaining operations described herein as being performed by the network node. Any information, data and/or signals may be received from a UE, another network node and/or any other network equipment. Similarly, the antenna 710, the communication interface 706, and/or the processing circuitry 702 may be configured to perform any transmitting operations described herein as being performed by the network node. Any information, data and/or signals may be transmitted to a UE, another network node and/or any other network equipment.
[0106] The power source 708 provides power to the various components of network node 700 in a form suitable for the respective components (e.g., at a voltage and current level needed for each respective component). The power source 708 may further comprise, or be coupled to, power management circuitry to supply the components of the network node 700 with power for performing the functionality described herein. For example, the network node 700 may be connectable to an external power source (e.g., the power grid, an electricity outlet) via an input circuitry or interface such as an electrical cable, whereby the external power source supplies power to power circuitry of the power source 708. As a further example, the power source 708 may comprise a source of power in the form of a battery or battery pack which is connected to, or integrated in, power circuitry. The battery may provide backup power should the external power source fail.
[0107] Embodiments of the network node 700 may include additional components beyond those shown in Figure 7 for providing certain aspects of the network node’s functionality, including any of the functionality described herein and/or any functionality necessary to support the subject matter described herein. For example, the network node 700 may include user interface equipment to allow input of information into the network node 700 and to allow output of information from the network node 700. This may allow a user to perform diagnostic, maintenance, repair, and other administrative functions for the network node 700.
[0108] Figure 8 is a block diagram of a host 800, which may be an embodiment of the host 516 of Figure 5, in accordance with various aspects described herein. The remote electronic device 102 may be provided by the host 800 in some aspects. As used herein, the host 800 may be or comprise various combinations of hardware and/or software, including a standalone server, a blade server, a cloud-implemented server, a distributed server, a virtual machine, a container, or processing resources in a server farm. The host 800 may provide one or more services to one or more UEs, such as providing video frames encoded as described herein.
[0109] The host 800 includes processing circuitry 802 that is operatively coupled via a bus 804 to an input/output interface 806, a network interface 808, a power source 810, and a memory 812. Other components may be included in other embodiments. Features of these components may be substantially similar to those described with respect to the devices of previous figures, such as Figures 6 and 7, such that the descriptions thereof are generally applicable to the corresponding components of host 800.
[0110] The memory 812 may include one or more computer programs including one or more host application programs 814 and data 816, which may include user data, e.g., data generated by a UE for the host 800 or data generated by the host 800 for a UE. Embodiments of the host 800 may utilize only a subset or all of the components shown. The host application programs 814 may be implemented in a container-based architecture and may provide support for video codecs (e.g., Versatile Video Coding (VVC), High Efficiency Video Coding (HEVC), Advanced Video Coding (AVC), MPEG, VP9) and audio codecs (e.g., FLAC, Advanced Audio Coding (AAC), MPEG, G.711), including transcoding for multiple different classes, types, or implementations of UEs (e.g., handsets, desktop computers, wearable display systems, heads-up display systems). The host application programs 814 may also provide for user authentication and licensing checks and may periodically report health, routes, and content availability to a central node, such as a device in or on the edge of a core network. Accordingly, the host 800 may select and/or indicate a different host for over-the-top services for a UE. The host application programs 814 may support various protocols, such as the HTTP Live Streaming (HLS) protocol, Real-Time Messaging Protocol (RTMP), Real-Time Streaming Protocol (RTSP), Dynamic Adaptive Streaming over HTTP (MPEG-DASH), etc.
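As a purely illustrative sketch of the kind of per-UE-class codec and streaming-protocol selection the host application programs 814 may perform when transcoding, consider the following Python snippet; the UE classes, codec choices, and default shown are hypothetical examples rather than requirements of this disclosure.

```python
# Hypothetical mapping from UE class to the codec/streaming-protocol pair used for transcoding.
# The classes, codecs, and defaults are illustrative only.
UE_PROFILES = {
    "handset":      {"codec": "HEVC", "protocol": "HLS"},
    "desktop":      {"codec": "AVC",  "protocol": "MPEG-DASH"},
    "wearable_hmd": {"codec": "VVC",  "protocol": "RTSP"},   # e.g., a VR head-mounted display
    "heads_up":     {"codec": "VP9",  "protocol": "MPEG-DASH"},
}

def select_delivery(ue_class: str) -> dict:
    """Return the codec/protocol pair for a given UE class (conservative default otherwise)."""
    return UE_PROFILES.get(ue_class, {"codec": "AVC", "protocol": "HLS"})

print(select_delivery("wearable_hmd"))  # {'codec': 'VVC', 'protocol': 'RTSP'}
```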
[0111] Figure 9 is a block diagram illustrating a virtualization environment 900 in which functions implemented by some embodiments may be virtualized. In the present context, virtualizing means creating virtual versions of apparatuses or devices which may include virtualizing hardware platforms, storage devices and networking resources. As used herein, virtualization can be applied to any device described herein, or components thereof, and relates to an implementation in which at least a portion of the functionality is implemented as one or more virtual components. Some or all of the functions described herein may be implemented as virtual components executed by one or more virtual machines (VMs) implemented in one or more virtual environments 900 hosted by one or more hardware nodes, such as a hardware computing device that operates as a network node, UE, core network node, or host. Further, in embodiments in which the virtual node does not require radio connectivity (e.g., a core network node or host), the node may be entirely virtualized. In some embodiments, the virtualization environment 900 includes components defined by the O-RAN Alliance, such as an O-Cloud environment orchestrated by a Service Management and Orchestration Framework via an O2 interface.
[0112] Applications 902 (which may alternatively be called software instances, virtual appliances, network functions, virtual nodes, virtual network functions, etc.) are run in the virtualization environment 900 to implement some of the features, functions, and/or benefits of some of the embodiments disclosed herein.
[0113] Hardware 904 includes processing circuitry, memory that stores software and/or instructions executable by hardware processing circuitry, and/or other hardware devices as described herein, such as a network interface, input/output interface, and so forth. Software may be executed by the processing circuitry to instantiate one or more virtualization layers 906 (also referred to as hypervisors or virtual machine monitors (VMMs)), provide VMs 908a and 908b (one or more of which may be generally referred to as VMs 908), and/or perform any of the functions, features and/or benefits described in relation with some embodiments described herein. The virtualization layer 906 may present a virtual operating platform that appears like networking hardware to the VMs 908.
[0114] The VMs 908 comprise virtual processing, virtual memory, virtual networking or interface and virtual storage, and may be run by a corresponding virtualization layer 906. Different embodiments of the instance of a virtual appliance 902 may be implemented on one or more of VMs 908, and the implementations may be made in different ways. Virtualization of the hardware is in some contexts referred to as network function virtualization (NFV). NFV may be used to consolidate many network equipment types onto industry standard high volume server hardware, physical switches, and physical storage, which can be located in data centers, and customer premise equipment.
[0115] In the context of NFV, a VM 908 may be a software implementation of a physical machine that runs programs as if they were executing on a physical, non-virtualized machine. Each of the VMs 908, and that part of hardware 904 that executes that VM, be it hardware dedicated to that VM and/or hardware shared by that VM with others of the VMs, forms a separate virtual network element. Still in the context of NFV, a virtual network function is responsible for handling specific network functions that run in one or more VMs 908 on top of the hardware 904 and corresponds to the application 902.
[0116] Hardware 904 may be implemented in a standalone network node with generic or specific components. Hardware 904 may implement some functions via virtualization.
Alternatively, hardware 904 may be part of a larger cluster of hardware (e.g. such as in a data center or CPE) where many hardware nodes work together and are managed via management and orchestration 910, which, among others, oversees lifecycle management of applications 902. In some embodiments, hardware 904 is coupled to one or more radio units that each include one or more transmitters and one or more receivers that may be coupled to one or more antennas. Radio units may communicate directly with other hardware nodes via one or more appropriate network interfaces and may be used in combination with the virtual components to provide a virtual node with radio capabilities, such as a radio access node or a base station. In some embodiments, some signaling can be provided with the use of a control system 912 which may alternatively be used for communication between hardware nodes and radio units.
[0117] Figure 10 shows a communication diagram of a host 1002 communicating via a network node 1004 with a UE 1006 over a partially wireless connection in accordance with some embodiments. Example implementations, in accordance with various embodiments, of the UE (such as a UE 512a of Figure 5 and/or UE 600 of Figure 6), network node (such as network node 510a of Figure 5 and/or network node 700 of Figure 7), and host (such as host 516 of Figure 5 and/or host 800 of Figure 8) discussed in the preceding paragraphs will now be described with reference to Figure 10.
[0118] Like host 800, embodiments of host 1002 include hardware, such as a communication interface, processing circuitry, and memory. The host 1002 also includes software, which is stored in or accessible by the host 1002 and executable by the processing circuitry. The software includes a host application that may be operable to provide a service to a remote user, such as the UE 1006 connecting via an over-the-top (OTT) connection 1050 extending between the UE 1006 and host 1002. In providing the service to the remote user, a host application may provide user data which is transmitted using the OTT connection 1050.
[0119] The network node 1004 includes hardware enabling it to communicate with the host 1002 and UE 1006. The connection 1060 may be direct or pass through a core network (like core network 506 of Figure 5) and/or one or more other intermediate networks, such as one or more public, private, or hosted networks. For example, an intermediate network may be a backbone network or the Internet.
[0120] The UE 1006 includes hardware and software, which is stored in or accessible by UE 1006 and executable by the UE’s processing circuitry. The software includes a client application, such as a web browser or operator-specific “app” that may be operable to provide a service to a human or non-human user via UE 1006 with the support of the host 1002. In the host 1002, an executing host application may communicate with the executing client application via the OTT connection 1050 terminating at the UE 1006 and host 1002. In providing the service to the user, the UE’s client application may receive request data from the host's host application and provide user data in response to the request data. The OTT connection 1050 may transfer both the request data and the user data. The UE's client application may interact with the user to generate the user data that it provides to the host application through the OTT connection 1050.
[0121] The OTT connection 1050 may extend via a connection 1060 between the host 1002 and the network node 1004 and via a wireless connection 1070 between the network node 1004 and the UE 1006 to provide the connection between the host 1002 and the UE 1006. The connection 1060 and wireless connection 1070, over which the OTT connection 1050 may be provided, have been drawn abstractly to illustrate the communication between the host 1002 and the UE 1006 via the network node 1004, without explicit reference to any intermediary devices and the precise routing of messages via these devices.
[0122] As an example of transmitting data via the OTT connection 1050, in step 1008, the host 1002 provides user data, which may be performed by executing a host application. In some embodiments, the user data is associated with a particular human user interacting with the UE 1006. In other embodiments, the user data is associated with a UE 1006 that shares data with the host 1002 without explicit human interaction. In step 1010, the host 1002 initiates a transmission carrying the user data towards the UE 1006. The host 1002 may initiate the transmission responsive to a request transmitted by the UE 1006. The request may be caused by human interaction with the UE 1006 or by operation of the client application executing on the UE 1006. The transmission may pass via the network node 1004, in accordance with the teachings of the embodiments described throughout this disclosure. Accordingly, in step 1012, the network node 1004 transmits to the UE 1006 the user data that was carried in the transmission that the host 1002 initiated, in accordance with the teachings of the embodiments described throughout this disclosure. In step 1014, the UE 1006 receives the user data carried in the transmission, which may be performed by a client application executed on the UE 1006 associated with the host application executed by the host 1002.
[0123] In some examples, the UE 1006 executes a client application which provides user data to the host 1002. The user data may be provided in reaction or response to the data received from the host 1002. Accordingly, in step 1016, the UE 1006 may provide user data, which may be performed by executing the client application. In providing the user data, the client application may further consider user input received from the user via an input/output interface of the UE 1006. Regardless of the specific manner in which the user data was provided, the UE 1006 initiates, in step 1018, transmission of the user data towards the host 1002 via the network node 1004. In step 1020, in accordance with the teachings of the embodiments described throughout this disclosure, the network node 1004 receives user data from the UE 1006 and initiates transmission of the received user data towards the host 1002. In step 1022, the host 1002 receives the user data carried in the transmission initiated by the UE 1006.
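The downlink and uplink exchanges of steps 1008-1022 can be mimicked, purely for illustration, by the following Python sketch; the Host, NetworkNode, and UE classes and their methods are hypothetical stand-ins for the entities of Figure 10 and do not implement any particular protocol stack.

```python
# Hypothetical sketch of the user-data flow of steps 1008-1022 (Figure 10).
class Host:
    def provide_user_data(self, request=None):
        # Step 1008: the host application produces user data (e.g., encoded video tiles).
        return {"payload": b"encoded-tiles", "request": request}

    def receive(self, user_data):
        # Step 1022: the host receives user data that the UE initiated.
        print("host received:", user_data)

class NetworkNode:
    def forward_downlink(self, user_data, ue):
        # Step 1012: the network node transmits host-initiated data toward the UE.
        ue.receive(user_data)

    def forward_uplink(self, user_data, host):
        # Step 1020: the network node relays UE-initiated data toward the host.
        host.receive(user_data)

class UE:
    def __init__(self, network_node, host):
        self.network_node, self.host = network_node, host

    def receive(self, user_data):
        # Step 1014: the client application consumes the user data.
        print("UE received:", user_data["payload"])
        # Steps 1016-1018: the client application provides user data in response.
        self.network_node.forward_uplink({"ack": True}, self.host)

host, node = Host(), NetworkNode()
ue = UE(node, host)
node.forward_downlink(host.provide_user_data(request="play"), ue)  # steps 1008-1014
```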
[0124] One or more of the various embodiments improve the performance of OTT services provided to the UE 1006 using the OTT connection 1050, in which the wireless connection 1070 forms the last segment.
[0125] In an example scenario, factory status information may be collected and analyzed by the host 1002. As another example, the host 1002 may process audio and video data which may have been retrieved from a UE for use in creating maps. As another example, the host 1002 may collect and analyze real-time data to assist in controlling vehicle congestion (e.g., controlling traffic lights). As another example, the host 1002 may store surveillance video uploaded by a UE. As another example, the host 1002 may store or control access to media content such as video, audio, VR or AR which it can broadcast, multicast or unicast to UEs. As other examples, the host 1002 may be used for energy pricing, remote control of non-time critical electrical load to balance power generation needs, location services, presentation services (such as compiling diagrams etc. from data collected from remote devices), or any other function of collecting, retrieving, storing, analyzing and/or transmitting data.
[0126] In some examples, a measurement procedure may be provided for the purpose of monitoring data rate, latency and other factors on which the one or more embodiments improve. There may further be an optional network functionality for reconfiguring the OTT connection 1050 between the host 1002 and UE 1006, in response to variations in the measurement results. The measurement procedure and/or the network functionality for reconfiguring the OTT connection may be implemented in software and hardware of the host 1002 and/or UE 1006. In some embodiments, sensors (not shown) may be deployed in or in association with other devices through which the OTT connection 1050 passes; the sensors may participate in the measurement procedure by supplying values of the monitored quantities exemplified above, or supplying values of other physical quantities from which software may compute or estimate the monitored quantities. The reconfiguring of the OTT connection 1050 may include changes to the message format, retransmission settings, preferred routing, etc.; the reconfiguring need not directly alter the operation of the network node 1004. Such procedures and functionalities may be known and practiced in the art. In certain embodiments, measurements may involve proprietary UE signaling that facilitates measurements of throughput, propagation times, latency and the like, by the host 1002. The measurements may be implemented in that the software causes messages to be transmitted, in particular empty or ‘dummy’ messages, using the OTT connection 1050 while monitoring propagation times, errors, etc.
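As a minimal sketch of the dummy-message measurement idea above, the following Python snippet estimates latency from round-trip times of empty messages and invokes a reconfiguration callback when latency degrades; the send/receive callables, the threshold, and the callback are illustrative assumptions rather than parts of this disclosure.

```python
# Illustrative sketch: estimate latency from round trips of empty 'dummy' messages sent
# over the OTT connection, and trigger reconfiguration when latency degrades.
# The send/receive callables, threshold, and callback are hypothetical.
import time
import statistics

def measure_latency(send, receive, samples=10):
    """Estimate one-way latency (in seconds) as half the median dummy-message round trip."""
    rtts = []
    for _ in range(samples):
        t0 = time.monotonic()
        send(b"")                      # empty 'dummy' message
        receive()                      # wait for the corresponding echo/acknowledgement
        rtts.append(time.monotonic() - t0)
    return statistics.median(rtts) / 2.0

def maybe_reconfigure(latency, threshold=0.050, reconfigure=lambda: None):
    """Invoke the reconfiguration callback when measured latency exceeds the threshold."""
    if latency > threshold:            # hypothetical 50 ms threshold
        reconfigure()                  # e.g., adjust preferred routing or retransmission settings
```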
[0127] Although the computing devices described herein (e.g., UEs, network nodes, hosts) may include the illustrated combination of hardware components, other embodiments may comprise computing devices with different combinations of components. It is to be understood that these computing devices may comprise any suitable combination of hardware and/or software needed to perform the tasks, features, functions and methods disclosed herein.
Determining, calculating, obtaining or similar operations described herein may be performed by processing circuitry, which may process information by, for example, converting the obtained information into other information, comparing the obtained information or converted information to information stored in the network node, and/or performing one or more operations based on the obtained information or converted information, and as a result of said processing making a determination. Moreover, while components are depicted as single boxes located within a larger box, or nested within multiple boxes, in practice, computing devices may comprise multiple different physical components that make up a single illustrated component, and functionality may be partitioned between separate components. For example, a communication interface may be configured to include any of the components described herein, and/or the functionality of the components may be partitioned between the processing circuitry and the communication interface. In another example, non-computationally intensive functions of any of such components may be implemented in software or firmware and computationally intensive functions may be implemented in hardware.
[0128] In certain embodiments, some or all of the functionality described herein may be provided by processing circuitry executing instructions stored in memory, which in certain embodiments may be a computer program product in the form of a non-transitory computer-readable storage medium. In alternative embodiments, some or all of the functionality may be provided by the processing circuitry without executing instructions stored on a separate or discrete device-readable storage medium, such as in a hard-wired manner. In any of those particular embodiments, whether executing instructions stored on a non-transitory computer-readable storage medium or not, the processing circuitry can be configured to perform the described functionality. The benefits provided by such functionality are not limited to the processing circuitry alone or to other components of the computing device, but are enjoyed by the computing device as a whole, and/or by end users and a wireless network generally.
[0129] While the flow diagrams in the figures show a particular order of operations performed by certain embodiments of the invention, it should be understood that such order is exemplary (e.g., alternative embodiments may perform the operations in a different order, combine certain operations, overlap certain operations, etc.).
[0130] While the invention has been described in terms of several embodiments, those skilled in the art will recognize that the invention is not limited to the embodiments described and can be practiced with modification and alteration within the spirit and scope of the appended claims. The description is thus to be regarded as illustrative instead of limiting.
[0131] Some embodiments are as described in the following enumerated embodiments.
[0132] Embodiment 1. A method performed by a network node for encoding a media stream, the method comprising: selecting one or more frames from the media stream; determining a category for the media stream based on the one or more frames; determining, based on the category for the media stream, a set of one or more feature weights; determining one or more feature maps for a frame of the media stream; determining, based on the feature weights and the feature maps, a gaze direction density map; and determining, based on the gaze direction density map, a tile-level foveation map that is to be used to encode the frame of the media stream into an encoded frame that includes encoded tiles of different resolutions.
[0133] Embodiment 2. The method of the previous embodiment, wherein the determining a category for the media stream includes: determining, based on the set of one or more frames, a clutter parameter for the media stream, wherein the clutter parameter is indicative of a degree of clutter of regions of interest for the media stream.
[0134] Embodiment 3. The method of any of the previous embodiments, wherein the determining a category for the media stream includes: determining, based on the set of one or more frames, a camera motion parameter for the media stream, wherein the camera motion parameter is indicative of a degree of motion of the camera for the media stream.
[0135] Embodiment 4. The method of any of the previous embodiments, wherein the determining one or more feature maps includes: determining at least one of a saliency map for the frame; an object detection map for the frame; an optical flow map for the frame; a center and/or horizon bias map for the frame; and a contrast and/or brightness map for the frame.
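To make the flow of Embodiments 1 through 4 concrete, the following Python sketch combines hypothetical feature maps with category-dependent weights into a gaze direction density map and reduces it to a tile-level foveation map. All names, weights, thresholds, and the clutter/motion proxies are illustrative assumptions and do not define the claimed method; feature maps are assumed to be NumPy arrays of equal shape normalized to [0, 1].

```python
# Illustrative sketch of Embodiments 1-4; all weights, thresholds, and proxies are hypothetical.
import numpy as np

# Hypothetical category-dependent feature weights (Embodiment 1).
CATEGORY_WEIGHTS = {
    "static_sparse":    {"saliency": 0.5, "objects": 0.2, "flow": 0.1, "center_bias": 0.2},
    "moving_cluttered": {"saliency": 0.3, "objects": 0.2, "flow": 0.3, "center_bias": 0.2},
}

def categorize(frames):
    """Embodiments 2-3: derive clutter and camera-motion proxies from sampled frames."""
    clutter = np.mean([np.std(f) for f in frames])                         # clutter proxy
    motion = np.mean([np.mean(np.abs(frames[i + 1] - frames[i]))
                      for i in range(len(frames) - 1)])                    # camera-motion proxy
    return "moving_cluttered" if (clutter > 0.25 or motion > 0.05) else "static_sparse"

def gaze_density_map(feature_maps, weights):
    """Embodiment 1: weighted combination of the feature maps (Embodiment 4) into a density map."""
    combined = sum(weights[name] * fmap for name, fmap in feature_maps.items())
    return combined / (combined.sum() + 1e-9)

def tile_foveation_map(density, tile_size=64, levels=(1.0, 0.5, 0.25)):
    """Embodiment 1: map per-tile gaze probability mass to a resolution scale factor."""
    tiles_y, tiles_x = density.shape[0] // tile_size, density.shape[1] // tile_size
    fovea = np.zeros((tiles_y, tiles_x))
    for ty in range(tiles_y):
        for tx in range(tiles_x):
            mass = density[ty * tile_size:(ty + 1) * tile_size,
                           tx * tile_size:(tx + 1) * tile_size].sum()
            # Hypothetical thresholds on gaze mass per tile.
            fovea[ty, tx] = levels[0] if mass > 0.002 else levels[1] if mass > 0.0005 else levels[2]
    return fovea

def foveation_for_frame(frame_feature_maps, sampled_frames):
    """End to end: frames -> category -> weights -> density map -> tile-level foveation map."""
    weights = CATEGORY_WEIGHTS[categorize(sampled_frames)]
    return tile_foveation_map(gaze_density_map(frame_feature_maps, weights))
```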
[0136] Embodiment 5. A network node for encoding a media stream, the network node comprising: processing circuitry configured to perform any of the steps of the preceding embodiments; and power supply circuitry configured to supply power to the processing circuitry.
REFERENCES
1. Y. Xu et al., "Gaze Prediction in Dynamic 360° Immersive Videos," 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 2018, pp. 5333-5342, doi: 10.1109/CVPR.2018.00559.
2. D. Zanca, S. Melacci and M. Gori, "Gravitational Laws of Focus of Attention," in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 12, pp. 2983-2995, 1 Dec. 2020, doi: 10.1109/TPAMI.2019.2920636.
3. D. Zanca, A. Zugarini, S. Dietz, T. Altstidl, M. Ndjeuha, L. Schwinn and B. Eskofier, "Contrastive Language-Image Pretrained Models are Zero-Shot Human Scanpath Predictors," 2023.
4. L. Itti, C. Koch and E. Niebur, "A model of saliency-based visual attention for rapid scene analysis," in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 11, pp. 1254-1259, Nov. 1998, doi: 10.1109/34.730558.
5. H. Yun, S. Lee and G. Kim, "Panoramic Vision Transformer for Saliency Detection in 360° Videos."
6. T. Foulsham and G. Underwood, "What can saliency models predict about eye movements? Spatial and sequential aspects of fixations during encoding and recognition," Journal of Vision, vol. 8, no. 2, article 6, Feb. 2008, doi: 10.1167/8.2.6.
7. U.S. Patent No. 10,432,970 B1, "System and Method for Encoding 360 Degree Immersive Video," 2019.