BACKGROUND

Computer games and multimedia applications have begun employing cameras and software gesture recognition engines to provide a human computer interface (“HCI”). With HCI, user body parts and movements are detected, interpreted and used to control game characters or other aspects of an application.
One technique for identifying objects such as body parts is computer vision. Some computer vision techniques develop a “classifier” by analyzing one or more example images. As the name implies, an example image is an image that contains one or more examples of the objects that are to be identified. Often, many example images need to be analyzed to adequately develop or “train” the classifier to recognize the object. In some techniques, features are extracted from the example image. Those features which work best to identify the object may be kept for use at run time.
The classifier may later be used during “run time” to identify objects such as body parts. For example, a computer vision system may capture an image in real time, such as an image of a user interacting with a computer system. The computer vision system uses the classifier to identify objects, such as the hand of the user. In some techniques, the classifier analyzes features that are extracted from the image in order to identify the object.
One difficulty with computer vision is that during run time objects such as body parts could have many possible orientations relative to the camera. For example, the user might have their hand rotated at virtually any angle relative to the camera. Note that for some techniques the features that are extracted are not invariant to the possible orientations of the object. For example, the features may not be invariant to possible rotations of a user's hand.
To account for the multitude of possible rotations of the object (e.g., hand), the example images that are used to build the classifier could theoretically contain many different rotations. For example, example images that show a multitude of possible rotations of a hand could be used to train the classifier. At one extreme, if the example images do not contain enough possible rotations, then the accuracy of the classifier may be poor. At the other extreme, including a multitude of rotations in the example images may lead to an overly complex classifier, which may result in slow processing speed and high memory usage at run time. For example, the features that work well for one rotation may not work well for another rotation. This may result in the classifier needing to account for all of the possible rotations.
SUMMARY

Technology is described for determining and using features that may be used to identify objects using computer vision. The features may be invariant to various orientations of the object to be identified relative to the camera. For example, the features may be rotation invariant. Therefore, fewer example images may be needed to train the classifier to recognize the object. Consequently, the classifier may be simplified without sacrificing accuracy during run time. Techniques are also described for identifying objects at run time using computer vision with rotation invariant features.
One embodiment includes a method of processing a depth map that includes the following. A depth map that includes depth pixels is accessed. The depth map is associated with an image coordinate system having a plane. A local orientation for each depth pixel in a subset of the depth pixels is estimated. The local orientation is one or both of an in-plane orientation and an out-of-plane orientation relative to the plane of the image coordinate system. A local coordinate system for each of the depth pixels in the subset is determined. Each local coordinate system is based on the local orientation of the corresponding depth pixel. A feature region is defined relative to the local coordinate system for each of the depth pixels in the subset. The feature region for each of the depth pixels in the subset is transformed from the local coordinate system to the image coordinate system. The transformed feature regions are used to process the depth map. The depth map may be processed at either training time or run time.
One embodiment includes a system comprising a depth camera and logic coupled to the depth camera. The depth camera is for generating depth maps that include a plurality of depth pixels. Each pixel has a depth value, and each depth map is associated with a 2D image coordinate system. The logic is operable to access a depth map from the depth camera; the depth map is associated with an image coordinate system having a plane. The logic is operable to estimate a local orientation for each depth pixel in a subset of the depth pixels. The local orientation includes one or both of an in-plane orientation that is in the plane of the 2D image coordinate system and an out-of-plane orientation that is out of the plane of the 2D image coordinate system. The logic is operable to define a local 3D coordinate system for each of the depth pixels in the subset; each local 3D coordinate system is based on the local orientation of the corresponding depth pixel. The logic is operable to define a feature region relative to the local coordinate system for each of the depth pixels in the subset. The logic is operable to transform the feature region for each of the depth pixels in the subset from the local 3D coordinate system to the 2D image coordinate system. The logic is operable to identify an object in the depth map based on the transformed feature regions.
One embodiment is a computer readable storage medium having instructions stored thereon which, when executed on a processor, cause the processor to perform the following steps. A depth map that includes an array of depth pixels is accessed. Each depth pixel has a depth value, and the depth map is associated with a 2D image coordinate system. A local orientation for each depth pixel in a subset of the depth pixels is determined. The local orientation includes an in-plane orientation that is in the plane of the 2D image coordinate system and an out-of-plane orientation that is out of the plane of the 2D image coordinate system. A 3D model for the depth map is determined. The model includes 3D points that are based on the depth pixels; each of the points has a corresponding depth pixel. A local 3D coordinate system is defined for each of the plurality of points; each local 3D coordinate system is based on the position and local orientation of the corresponding depth pixel. Feature test points are determined relative to the local coordinate system for each of the points. The feature test points are transformed from the local 3D coordinate system to the 2D image coordinate system. An object is identified in the depth map based on the transformed feature test points.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 depicts one embodiment of a target detection and tracking system tracking a user.
FIG. 2 depicts one embodiment of a target detection and tracking system.
FIG. 3A is a flowchart of one embodiment of a process of training a machine learning classifier using invariant features.
FIG. 3B is a flowchart that describes a process of using invariant features to identify objects using computer vision.
FIG. 4A depicts a depth map of an object for which in-plane local orientation of depth pixels has been estimated based on edges, in accordance with one embodiment.
FIG. 4B depicts a depth map of an object for which in-plane local orientation of depth pixels has been estimated based on edges, in accordance with one embodiment.
FIG. 4C is a flowchart of one embodiment of a process of assigning angles to depth pixels based on edges.
FIG. 4D depicts a depth map of an object for which in-plane local orientation of depth pixels has been estimated based on medial axes, in accordance with one embodiment.
FIG. 4E depicts a depth map of an object for which in-plane local orientation of depth pixels has been estimated based on medial axes, in accordance with one embodiment.
FIG. 4F is a flowchart of one embodiment of a process of assigning angles to depth pixels based on medial axes.
FIG. 5 is a flowchart of one embodiment of a process of estimating local orientation of depth pixels for out-of-plane orientation.
FIG. 6A and FIG. 6B depict different rotations of a point cloud model with one embodiment of a local coordinate system.
FIG. 7 depicts a 2D image coordinate system and a 3D local coordinate system used in various embodiments, with a corresponding feature window in each coordinate system.
FIG. 8 is a flowchart of one embodiment of a process of establishing a local in-plane and/or out-of-plane orientation for a depth pixel.
FIG. 9 illustrates an example of a computing environment in accordance with embodiments of the present disclosure.
FIG. 10 illustrates an example of a computing environment in accordance with embodiments of the present disclosure.
DETAILED DESCRIPTION

Technology is described for developing and using features that may be used to automatically identify objects using computer vision. The features may be rotation invariant. The features may also be translation invariant and/or scale invariant. In one embodiment, the features are in-plane rotation invariant. In one embodiment, the features are out-of-plane rotation invariant. In one embodiment, the features are both in-plane and out-of-plane rotation invariant. By being invariant to transformations such as rotation, the training data requirements and the memory and processing requirements of the classifier can be reduced without adversely affecting test accuracy.
In some embodiments, the invariant features are used in a motion capture system having a capture device. For example, rotation invariant features may be used to identify a user's hand such that the hand can be tracked. One example application is to determine gestures made by the user to allow the user to interact with the system. Therefore, an example motion capture system will be described. However, it will be understood that technology described herein is not limited to a motion capture system.
FIG. 1 depicts an example of a motion capture system 10 in which a person interacts with an application. The motion capture system 10 includes a display 96, a capture device 20, and a computing environment or apparatus 12. The capture device 20 may include an image camera component 22 having a light transmitter 24, light receiver 25, and a red-green-blue (RGB) camera 28. In one embodiment, the light transmitter 24 emits a collimated light beam. Examples of collimated light include, but are not limited to, Infrared (IR) and laser. In one embodiment, the light transmitter 24 is an LED. Light that reflects off from an object 8 in the field of view is detected by the light receiver 25.
A user, also referred to as a person or player, stands in a field of view 6 of the capture device 20. Lines 2 and 4 denote a boundary of the field of view 6. In this example, the capture device 20 and computing environment 12 provide an application in which an avatar 97 on the display 96 tracks the movements of the object 8 (e.g., a user). For example, the avatar 97 may raise an arm when the user raises an arm. The avatar 97 is standing on a road 98 in a 3-D virtual world. A Cartesian world coordinate system may be defined which includes a z-axis which extends along the focal length of the capture device 20, e.g., horizontally, a y-axis which extends vertically, and an x-axis which extends laterally and horizontally. Note that the perspective of the drawing is modified as a simplification, as the display 96 extends vertically in the y-axis direction and the z-axis extends out from the capture device 20, perpendicular to the y-axis and the x-axis, and parallel to a ground surface on which the user stands.
Generally, the motion capture system 10 is used to recognize, analyze, and/or track an object. Invariant features (e.g., rotation invariant features) that are developed in accordance with embodiments can be used in the motion capture system 10. The computing environment 12 can include a computer, a gaming system or console, or the like, as well as hardware components and/or software components to execute applications.
The capture device 20 may include a camera which is used to visually monitor one or more objects 8, such as the user, such that gestures and/or movements performed by the user may be captured, analyzed, and tracked to perform one or more controls or actions within an application, such as animating an avatar or on-screen character or selecting a menu item in a user interface (UI). A gesture may be dynamic, comprising a motion, such as mimicking throwing a ball. A gesture may be a static pose, such as holding one's forearms crossed. A gesture may also incorporate props, such as swinging a mock sword.
Some movements of the object 8 may be interpreted as controls that may correspond to actions other than controlling an avatar. For example, in one embodiment, the player may use movements to end, pause, or save a game, select a level, view high scores, communicate with a friend, and so forth. The player may use movements to select the game or other application from a main user interface, or to otherwise navigate a menu of options. Thus, a full range of motion of the object 8 may be available, used, and analyzed in any suitable manner to interact with an application.
The person can hold an object such as a prop when interacting with an application. In such embodiments, the movement of the person and the object may be used to control an application. For example, the motion of a player holding a racket may be tracked and used for controlling an on-screen racket in an application which simulates a tennis game. In another example embodiment, the motion of a player holding a toy weapon such as a plastic sword may be tracked and used for controlling a corresponding weapon in the virtual world of an application which provides a pirate ship.
The motion capture system 10 may further be used to interpret target movements as operating system and/or application controls that are outside the realm of games and other applications which are meant for entertainment and leisure. For example, virtually any controllable aspect of an operating system and/or application may be controlled by movements of the object 8.
The motion capture system 10 may be connected to an audiovisual device such as the display 96, e.g., a television, a monitor, a high-definition television (HDTV), or the like, or even a projection on a wall or other surface, that provides a visual and audio output to the user. An audio output can also be provided via a separate device. To drive the display, the computing environment 12 may include a video adapter such as a graphics card and/or an audio adapter such as a sound card that provides audiovisual signals associated with an application. The display 96 may be connected to the computing environment 12 via, for example, an S-Video cable, a coaxial cable, an HDMI cable, a DVI cable, a VGA cable, or the like.
FIG. 2 illustrates one embodiment of a target detection and tracking system 10 including a capture device 20 and computing environment 12 that may be used to recognize human and non-human targets in a capture area (with or without special sensing devices attached to the subjects), uniquely identify them, and track them in three dimensional space. In one embodiment, the capture device 20 may be a depth camera (or depth sensing camera) configured to capture video with depth information including a depth map that may include depth values via any suitable technique including, for example, time-of-flight, structured light, stereo image, or the like. In one embodiment, the capture device 20 may include a depth sensing image sensor. In one embodiment, the capture device 20 may organize the calculated depth information into “Z layers,” or layers that may be perpendicular to a Z-axis extending from the depth camera along its line of sight.
As shown in FIG. 2, the capture device 20 may include an image camera component 32. In one embodiment, the image camera component 32 may be a depth camera that may capture a depth map of a scene. The depth map may include a two-dimensional (2-D) pixel area of the captured scene where each pixel in the 2-D pixel area may represent a depth value such as a distance in, for example, centimeters, millimeters, or the like of an object in the captured scene from the camera. The image camera component 32 may be pre-calibrated to obtain estimates of camera intrinsic parameters such as focal length, principal point, lens distortion parameters, etc. Techniques for camera calibration are discussed in Z. Zhang, “A Flexible New Technique for Camera Calibration,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(11):1330-1334, 2000, which is hereby incorporated by reference.
As shown in FIG. 2, the image camera component 32 may include an IR light component 34, a three-dimensional (3-D) camera 36, and an RGB camera 38 that may be used to capture the depth map of a capture area. For example, in time-of-flight analysis, the IR light component 34 of the capture device 20 may emit an infrared light onto the capture area and may then use sensors to detect the backscattered light from the surface of one or more targets and objects in the capture area using, for example, the 3-D camera 36 and/or the RGB camera 38. In some embodiments, the capture device 20 may include an IR CMOS image sensor. In some embodiments, pulsed infrared light may be used such that the time between an outgoing light pulse and a corresponding incoming light pulse may be measured and used to determine a physical distance from the capture device 20 to a particular location on the targets or objects in the capture area. Additionally, the phase of the outgoing light wave may be compared to the phase of the incoming light wave to determine a phase shift. The phase shift may then be used to determine a physical distance from the capture device to a particular location on the targets or objects.
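For reference, the pulse-timing and phase-shift measurements described above relate to distance through the standard time-of-flight relations (general background, not equations from this disclosure): d = c·Δt/2 for a pulse with round-trip time Δt, and d = c·Δφ/(4πf) for a phase shift Δφ of a signal modulated at frequency f, where c is the speed of light.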
In one embodiment, time-of-flight analysis may be used to indirectly determine a physical distance from the capture device 20 to a particular location on the targets or objects by analyzing the intensity of the reflected beam of light over time via various techniques including, for example, shuttered light pulse imaging.
In another example, the capture device 20 may use structured light to capture depth information. In such an analysis, patterned light (i.e., light displayed as a known pattern such as a grid pattern or a stripe pattern) may be projected onto the capture area via, for example, the IR light component 34. Upon striking the surface of one or more targets (or objects) in the capture area, the pattern may become deformed in response. Such a deformation of the pattern may be captured by, for example, the 3-D camera 36 and/or the RGB camera 38 and analyzed to determine a physical distance from the capture device to a particular location on the targets or objects.
In some embodiments, two or more different cameras may be incorporated into an integrated capture device. For example, a depth camera and a video camera (e.g., an RGB video camera) may be incorporated into a common capture device. In some embodiments, two or more separate capture devices may be cooperatively used. For example, a depth camera and a separate video camera may be used. When a video camera is used, it may be used to provide target tracking data, confirmation data for error correction of target tracking, image capture, face recognition, high-precision tracking of fingers (or other small features), light sensing, and/or other functions.
In one embodiment, the capture device 20 may include two or more physically separated cameras that may view a capture area from different angles to obtain visual stereo data that may be resolved to generate depth information. Depth may also be determined by capturing images using a plurality of detectors that may be monochromatic, infrared, RGB, or any other type of detector and performing a parallax calculation. Other types of depth map sensors can also be used to create a depth map.
As shown in FIG. 2, capture device 20 may include a microphone 40. The microphone 40 may include a transducer or sensor that may receive and convert sound into an electrical signal. In one embodiment, the microphone 40 may be used to reduce feedback between the capture device 20 and the computing environment 12 in the target detection and tracking system 10. Additionally, the microphone 40 may be used to receive audio signals that may also be provided by the user to control applications such as game applications, non-game applications, or the like that may be executed by the computing environment 12.
The capture device 20 may include logic 42 that is in communication with the image camera component 22. The logic 42 may include a standardized processor, a specialized processor, a microprocessor, or the like that may execute instructions. The logic 42 may also include hardware such as an ASIC, electronic circuitry, logic gates, etc. In the event that the logic 42 is a processor, the processor 42 may execute instructions that may include instructions for storing profiles, receiving the depth map, determining whether a suitable target may be included in the depth map, converting the suitable target into a skeletal representation or model of the target, or any other suitable instructions.
It is to be understood that at least some target analysis and tracking operations may be executed by processors contained within one or more capture devices. A capture device may include one or more onboard processing units configured to perform one or more target analysis and/or tracking functions. Moreover, a capture device may include firmware to facilitate updating such onboard processing logic.
As shown in FIG. 2, the capture device 20 may include a memory component 44 that may store the instructions that may be executed by the processor 42, images or frames of images captured by the 3-D camera or RGB camera, user profiles or any other suitable information, images, or the like. In one example, the memory component 44 may include random access memory (RAM), read only memory (ROM), cache, Flash memory, a hard disk, or any other suitable storage component. The memory component 44 may also be referred to as a computer storage medium. As shown in FIG. 2, the memory component 44 may be a separate component in communication with the image capture component 32 and the processor 42. In another embodiment, the memory component 44 may be integrated into the processor 42 and/or the image capture component 32. In one embodiment, some or all of the components 32, 34, 36, 38, 40, 42 and 44 of the capture device 20 illustrated in FIG. 2 are housed in a single housing.
As shown in FIG. 2, the capture device 20 may be in communication with the computing environment 12 via a communication link 46. The communication link 46 may be a wired connection including, for example, a USB connection, a Firewire connection, an Ethernet cable connection, or the like and/or a wireless connection such as a wireless 802.11b, g, a, or n connection. The computing environment 12 may provide a clock to the capture device 20 that may be used to determine when to capture, for example, a scene via the communication link 46.
In one embodiment, the capture device 20 may provide the depth information and images captured by, for example, the 3-D camera 36 and/or the RGB camera 38 to the computing environment 12 via the communication link 46. The computing environment 12 may then use the depth information and captured images to, for example, create a virtual screen, adapt the user interface and control an application such as a game or word processor.
As shown in FIG. 2, computing environment 12 includes gestures library 192, structure data 198, gesture recognition engine 190, depth map processing and object reporting module 194, and operating system 196. Depth map processing and object reporting module 194 uses the depth maps to track the motion of objects, such as the user and other objects. To assist in the tracking of the objects, depth map processing and object reporting module 194 uses gestures library 192, structure data 198 and gesture recognition engine 190. In some embodiments, the depth map processing and object reporting module 194 uses a classifier 195 and a feature library 199 to identify objects. The feature library 199 may contain invariant features, such as rotation invariant features.
In one example, structure data 198 includes structural information about objects that may be tracked. For example, a skeletal model of a human may be stored to help understand movements of the user and recognize body parts. In another example, structural information about inanimate objects, such as props, may also be stored to help recognize those objects and help understand movement.
In one example, gestures library 192 may include a collection of gesture filters, each comprising information concerning a gesture that may be performed by the skeletal model. A gesture recognition engine 190 may compare the data captured by capture device 20 in the form of the skeletal model and movements associated with it to the gesture filters in the gesture library 192 to identify when a user (as represented by the skeletal model) has performed one or more gestures. Those gestures may be associated with various controls of an application. Thus, the computing environment 12 may use the gesture recognition engine 190 to interpret movements of the skeletal model and to control operating system 196 or an application based on the movements.
In one embodiment, depth map processing and object reporting module 194 will report to operating system 196 an identification of each object detected and the position and/or orientation of the object for each frame. Operating system 196 will use that information to update the position or movement of an object (e.g., an avatar) or other images in the display or to perform an action on the provided user interface.
FIG. 3A is a flowchart of one embodiment of a process 350 of training a machine learning classifier using invariant features. The features may be invariant to any combination of rotation, translation, and scaling. Rotation invariance includes in-plane and/or out-of-plane invariance. Process 350 may involve use of a capture device 20. The process 350 may create the machine learning classifier that is later used at run time to identify objects.
In step 352, one or more example depth maps (or depth images) are accessed. These images may have been captured by a capture device 20. These depth maps may be labeled such that each depth pixel has been classified, for instance manually, or procedurally using computer generated imagery (CGI). For example, each depth pixel may be manually or procedurally classified as being part of a finger, hand, torso, specific segment of a body, etc. The labeling of the depth pixels may involve a person studying the depth map and assigning a label to each pixel, or assigning a label to a group of pixels. The labels might instead be continuous in a regression problem. For example, one might label each pixel with a distance to nearby body joints. Note that because the process 350 may use rotation invariant features to train the classifier, the number of example depth maps may be kept fairly low. For example, it may not be necessary to provide example images which show a hand (or other object) in a wide variety of rotations.
In step 354, canonical features are computed using an invariant feature transform. Briefly, each labeled example image may be processed in order to extract rotation-invariant features. In one embodiment, a local coordinate system is defined for any given pixel using a combination of in-plane and out-of-plane orientation estimates, and depth. This local coordinate system may be used to transform a feature window prior to computing the features to achieve rotation invariance. The result of step 354 may be a set of canonical features. Step 354 will be discussed in more detail with respect to FIG. 3B. In step 356, class labels (or continuous regression labels) are assigned to corresponding features based on the pixel labels in the example images.
In step 358, the canonical features and corresponding labels are passed to a machine learning classification system to train a classifier 195. Note that this is performed after the transformation of step 354. Therefore, the features may be rotation invariant. If step 354 determined both in-plane and out-of-plane orientations, then the features may be both in-plane and out-of-plane invariant. If step 354 determined only in-plane orientations, then the features may be in-plane rotation invariant. If step 354 determined only out-of-plane orientations, then the features may be out-of-plane rotation invariant. The classifier 195 may be used at run time to classify rotationally-normalized features extracted from new input images. The features may also be invariant to translation and/or scaling. In some embodiments, features that are determined to be useful at identifying objects are saved, such that they may be stored in a feature library 199 for use at run time.
FIG. 3B is a flowchart that describes a process 300 of using invariant features to identify objects using computer vision. The features may be rotation invariant. Rotation invariant features may be in-plane rotation invariant, out-of-plane rotation invariant, or both. The features may also be invariant to translation and/or scaling. Process 300 may be performed when a user is interacting with a motion capture system 10. Thus, process 300 could be used in a system such as depicted in FIG. 1 or 2. Process 300 may be used in a wide variety of other computer vision scenarios.
In step 302, a depth map is accessed. The capture device 20 may be used to capture the depth map. The depth map may include depth pixels. The depth map may be associated with an image coordinate system. For example, each depth pixel may have two coordinates (u, v) and a depth value. The depth map may be considered to be in a plane that is defined by the two coordinates (u, v). This plane may be based on the orientation of the depth camera and may be referred to herein as an imaging plane. If an object in the camera's field of view moves, it may be described as moving in-plane, out-of-plane or both. For example, rotating movement in the u, v plane (with points on the object retaining their depth values) may be referred to as in-plane rotation (the axis of rotation is orthogonal to the u, v plane). Rotating movement that causes changes in depth values at different rates for different points on the object may be referred to as out-of-plane rotation. For example, rotation of a hand with the palm facing the camera is one example of in-plane rotation. Rotation of a hand with the thumb pointing towards and then away from the camera is one example of out-of-plane rotation.
In step 304, the depth map is filtered. In one embodiment, the depth map may be undistorted to remove the distortion effects from the lens. In other embodiments, upon receiving the depth map, the depth map may be down-sampled to a lower processing resolution such that the depth map may be more easily used and/or more quickly processed with less computing overhead. Additionally, one or more high-variance and/or noisy depth values may be removed and/or smoothed from the depth map, and portions of missing and/or removed depth information may be filled in and/or reconstructed.
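As one illustration (a minimal sketch, not the specific filtering of this disclosure), assuming the depth map is a NumPy array in which zeros mark missing depth values, the down-sampling, smoothing, and hole filling of step 304 might look like:

```python
import numpy as np
from scipy import ndimage

def preprocess_depth(depth_map, downsample=2):
    """Illustrative filtering for step 304: smooth, down-sample, and fill holes."""
    d = ndimage.median_filter(depth_map, size=3)      # suppress noisy/high-variance values
    d = d[::downsample, ::downsample]                 # reduce to a lower processing resolution
    missing = (d == 0)                                # zeros assumed to mark missing depth
    if missing.any():
        # Replace each missing value with the depth of the nearest valid pixel.
        _, (iv, iu) = ndimage.distance_transform_edt(missing, return_indices=True)
        d = d[iv, iu]
    return d
```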
In step 306, the acquired depth map may be processed to distinguish foreground pixels from background pixels. Foreground pixels may be associated with some object (or objects) of interest to be analyzed. As used herein, the term “background” is used to describe anything in an image that is not part of the one or more objects of interest. For ease of discussion, a single object will be referred to when discussing process 300. Process 300 analyzes pixels in that object of interest. These pixels will be referred to as a subset of the pixels in the depth map.
Steps 308-316 describe processing individual pixels associated with the object of interest. In general, these steps involve performing an invariant feature transform. For example, this may be a rotation invariant transform. The transform may also be invariant to translation and/or scale. Note that steps 308-316 are one embodiment of step 354 from FIG. 3A.
In step 308, a determination is made whether there are more pixels in the subset to process. If so, processing continues with step 310 with one of the depth pixels. In step 310, a local orientation of the depth pixel is estimated. In one embodiment, the local orientation is an in-plane orientation. In one embodiment, the local orientation is an out-of-plane orientation. In one embodiment, the local orientation is both an in-plane orientation and an out-of-plane orientation. Further details of estimating a local orientation are discussed below.
In step 312, a local coordinate system is defined for the depth pixel. In some embodiments, the local coordinate system is a 3D coordinate system. The local coordinate system is based on the local orientation of the depth pixel. For example, if the user's hand moves, rotates, etc., then the local coordinate system moves with the hand. Further details of defining a local coordinate system are discussed below.
In step 314, a feature region is defined relative to the local coordinate system for the presently selected depth pixel. For example, a feature window is defined with its center at the depth pixel. One or more feature test points, feature test rectangles, Haar wavelets, or other such features may be defined based on the geometry of the feature window.
In step 316, the feature region is transformed from the local coordinate system to the image coordinate system. Further details of performing the transform are discussed below. Note that this may involve a transformation from the 3D space of the local coordinate system to the 2D space of the depth map.
Processing then returns to step 308 to determine if there are more depth pixels to analyze. If not, then processing continues at step 318. In step 318, the transformed feature regions are used to attempt to identify one or more objects in the depth map. For example, an attempt is made to identify a user's hand. This attempt may include classifying each pixel. For example, each pixel may be assigned a probability that it is part of a hand, head, arm, certain segment of an arm, etc.
In one embodiment, a decision tree is used to classify pixels. Such analysis can determine a best-guess of a target assignment for that pixel and the confidence that the best-guess is correct. In some embodiments, the best-guess may include a probability distribution over two or more possible targets, and the confidence may be represented by the relative probabilities of the different possible targets. In other embodiments the best-guess may include a spatial distribution over 3D offsets to body or hand joint positions. At each node of a decision tree, an observed depth value comparison between two pixels is made, and, depending on the result of the comparison, a subsequent depth value comparison between two pixels is made at the child node of the decision tree. The result of such comparisons at each node determines the pixels that are to be compared at the next node. The terminal nodes of each decision tree result in a target classification or regression with an associated confidence.
In some embodiments, subsequent decision trees may be used to iteratively refine the best-guess of the one or more target assignments for each pixel and the confidence that the best-guess is correct. For example, once the pixels have been classified with the first classifier tree (based on neighboring depth values), a refining classification may be performed to classify each pixel by using a second decision tree that looks at the previously classified or regressed pixels and/or depth values. A third pass may also be used to further refine the classification or regression of the current pixel by looking at the previously classified or regressed pixels and/or depth values. It is to be understood that virtually any number of iterations may be performed, with fewer iterations resulting in less computational expense and more iterations potentially offering more accurate classifications or regressions, and/or confidences.
In some embodiments, the decision trees may have been constructed during a training mode in which the example images were analyzed to determine the questions (i.e., tests) that can be asked at each node of the decision trees in order to produce accurate pixel classifications. In one embodiment, foreground pixel assignment is stateless, meaning that the pixel assignments are made without reference to prior states (or prior image frames). One example of a stateless process for assigning probabilities that a particular pixel or group of pixels represents one or more objects is the Exemplar process. The Exemplar process uses a machine-learning approach that takes a depth map and classifies each pixel by assigning to each pixel a probability distribution over the one or more objects to which it could correspond. For example, a given pixel, which is in fact a tennis racquet, may be assigned a 70% chance that it belongs to a tennis racquet, a 20% chance that it belongs to a ping pong paddle, and a 10% chance that it belongs to a right arm. Further details of using decision trees are discussed in US Patent Application Publication 2010/0278384, titled “Human Body Pose Estimation,” by Shotton et al., published on Nov. 4, 2010, which is hereby incorporated by reference. Note that it is not required that decision trees be used. Another technique that may be used to classify pixels is a Support Vector Machine (SVM). Step 318 may include using a classifier that was developed during a training session such as that of FIG. 3A.
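A minimal sketch of the per-pixel depth-comparison test described above is given below; the node structure and field names are hypothetical, not the implementation from the incorporated reference.

```python
class TreeNode:
    """A node of a depth-comparison decision tree (hypothetical layout)."""
    def __init__(self, offset_a=None, offset_b=None, threshold=0.0,
                 left=None, right=None, distribution=None):
        self.offset_a = offset_a          # (du, dv) probe offset for the first pixel
        self.offset_b = offset_b          # (du, dv) probe offset for the second pixel
        self.threshold = threshold
        self.left = left                  # child taken when the comparison is below threshold
        self.right = right
        self.distribution = distribution  # at a leaf: per-target probabilities

def classify_pixel(depth_map, u, v, node):
    """Walk one tree: compare depths at two probe pixels at each node until a leaf."""
    h, w = depth_map.shape
    while node.distribution is None:
        ua, va = u + node.offset_a[0], v + node.offset_a[1]
        ub, vb = u + node.offset_b[0], v + node.offset_b[1]
        da = depth_map[min(max(va, 0), h - 1), min(max(ua, 0), w - 1)]
        db = depth_map[min(max(vb, 0), h - 1), min(max(ub, 0), w - 1)]
        node = node.left if (da - db) < node.threshold else node.right
    return node.distribution              # e.g., {"hand": 0.7, "arm": 0.3}
```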
As discussed above, part of step 354 (of both FIGS. 3A and 3B) is to estimate a local orientation of depth pixels. FIGS. 4A-4F will be referred to in order to discuss estimating a local orientation of depth pixels with respect to the (u, v) coordinate system of the depth map. In these examples, the depth values are not factored in to the local orientation. Therefore, this may be considered to be an in-plane orientation.
FIG. 4A depicts a depth map of an object for which in-plane local orientation of depth pixels has been estimated, in accordance with one embodiment. Each depth pixel is assigned a value between 0-360 degrees, in this embodiment. The assignment is made such that if the object is rotated in-plane (e.g., in the (u, v) image plane) the depth pixel will have the same local orientation, or at least very close to the same value. For example, the depth pixel may have the same angle assigned to it regardless of rotation in the (u, v) image plane.
Note that the angle is with respect to any convenient reference axis. As one example, the depth map has a u-axis and a v-axis. The angle may be with respect to either axis, or some other axis. Two example depth pixels p1, p2 are shown. Two points q1, q2 are also depicted. The point q is the nearest point on the edge of the hand to the given depth pixel. A line is depicted from p to q. The angle θ is the angle of that line to the u-axis (or more precisely to a line that runs parallel to the u-axis). Note that if the hand were to be rotated in the (u, v) plane, the angle θ would change by the same amount for all pixels. Therefore, the angle θ serves as a way of describing a local orientation of a depth pixel that is in-plane rotation invariant.
FIG. 4B depicts a depth map of an object for which in-plane local orientation of depth pixels has been estimated, in accordance with one embodiment. This embodiment uses a different technique for determining the angle than the embodiment of FIG. 4A. In this embodiment, the angle is based on a tangent to the object at a point q. The two example depth pixels p1, p2 and the two points q1, q2 are depicted. The angle θ1b for point p1 is based on the tangent to the hand at q1. Similar reasoning applies for angle θ2b. Note that if the hand were to be rotated in the (u, v) plane, the angle θ would change by the same amount for all pixels. Therefore, the angle θ serves as a way of describing a local orientation of a depth pixel that is in-plane rotation invariant.
InFIGS. 4A and 4B, for the purpose of illustration, the depth pixels are grouped into those with angles between 60-180, those between 180-300, and those between 300-60. In actual practice, no such grouping is required. Also, note that it is not required that the angle assignment be between 0-360 degrees. For example, it could be between −180 to +180 degrees, or another scheme. It may also be between 0-180, in which case the feature transform is rotationally invariant only up to a two-way ambiguity.
FIG. 4C is a flowchart of one embodiment of a process 450 of assigning an angle to a depth pixel. The process 450 may be performed once for each depth pixel in an object of interest. In process 450, the angle is determined relative to the nearest edge of the object. Therefore, process 450 may be used for either embodiment of FIG. 4A or 4B. Note that process 450 is one embodiment of estimating local orientation of step 310. In particular, process 450 is one embodiment of estimating in-plane local orientation.
In step 452, edges of the object are detected. The edge is one example of a reference line of the object of interest. A variety of edge detection techniques may be used. Since edge detection is well known by those of ordinary skill in the art, it will not be discussed in detail. Note that edge detection could be performed in a step prior to step 310.
In step 456, the closest edge to the present depth pixel is determined. For example, q1 in FIG. 4A or 4B is determined as the closest point on the edge of the hand to depth pixel p1. Likewise, q2 is determined as the closest point on the edge of the hand to depth pixel p2, when processing that depth pixel. This can be efficiently computed using, for example, distance transforms.
In step 458, a rotation invariant angle to assign to the depth pixel is determined. In one embodiment, the angle may be defined based on the tangent to the edge of the hand at the edge point (e.g., q1, q2). This angle is one example of a rotation invariant angle for the closest edge point. Since the closest edge point (e.g., q1) is associated with the depth pixel (p1), the angle may also be considered to be one example of a rotation invariant angle for the depth pixel. As noted, any convenient reference axis may be used, such as the u-axis of the depth map. This angle is assigned to the present depth pixel. Referring to FIG. 4B, θ1b is assigned to p1 and θ2b is assigned to p2.
In one embodiment, the angle may be defined based on the technique shown in FIG. 4A. As noted, any convenient reference axis may be used, such as the u-axis of the depth map. In this case, the angle is defined as the angle between the u-axis and the line between p and q. This angle is assigned to the present depth pixel. Referring to FIG. 4A, θ1a is assigned to p1 and θ2a is assigned to p2. Process 450 continues until angles have been assigned to all depth pixels of interest.
After all depth pixels have been assigned an angle, smoothing of the results may be performed in step 460. For example, the angle of each depth pixel may be compared to its neighbors, with outliers being smoothed.
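A minimal sketch of process 450 follows, under a few assumptions: the object of interest is given as a binary mask, the FIG. 4A style angle (line from p to its closest edge point q, measured against the u-axis) is used, and SciPy's distance transform supplies the closest-edge lookup.

```python
import numpy as np
from scipy import ndimage

def edge_based_in_plane_angles(mask):
    """Assign each object pixel the angle of the line to its nearest edge point (FIG. 4A style)."""
    # Step 452: edge pixels are object pixels with at least one background neighbor.
    edges = mask & ~ndimage.binary_erosion(mask)

    # Step 456: for every pixel, the indices of its closest edge pixel (the zeros of ~edges).
    _, (qv, qu) = ndimage.distance_transform_edt(~edges, return_indices=True)

    # Step 458: angle of the line p -> q with respect to the u-axis, in [0, 360).
    vv, uu = np.indices(mask.shape)
    theta = np.degrees(np.arctan2(qv - vv, qu - uu)) % 360.0
    theta[~mask] = np.nan                 # only defined for pixels of the object of interest
    # (Step 460 smoothing of neighboring angles is omitted for brevity.)
    return theta
```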
Another technique for estimating a local in-plane orientation of depth pixels is based on medial axes. FIG. 4D depicts a hand to discuss such an embodiment. A medial axis may be defined as a line that is roughly a mid-point between two edges. To some extent, a medial axis may serve to represent a skeletal model. FIG. 4D shows a depth pixel p3, with its closest medial axis point q3. The angle θ3a represents a local orientation of depth pixel p3. Note that if the hand were to be rotated in the (u, v) plane, the angle θ3a would change by the same amount for all pixels. Therefore, the angle θ3a serves as a way of describing a local orientation of a depth pixel that is in-plane rotation invariant. In this embodiment, the angle is defined based on the line parallel to the u-axis and the line between p and q.
FIG. 4E depicts a hand to discuss another embodiment for determining a rotation invariant angle. FIG. 4E shows a depth pixel p3, with its closest medial axis point q3. The angle θ3b represents a local orientation of depth pixel p3. In this example, θ3b is defined based on the tangent at point q3 to the medial axis. Note that if the hand were to be rotated in the (u, v) plane, the angle θ3b would change by the same amount for all pixels. Therefore, the angle θ3b serves as a way of describing a local orientation of a depth pixel that is in-plane rotation invariant.
FIG. 4F is a flowchart of one embodiment of a process 480 of assigning angles to depth pixels. In this process 480, the angle is determined relative to the nearest medial axis of the object. Therefore, FIGS. 4D and 4E will be referred to when discussing process 480. Process 480 is one embodiment of steps 308 and 310. In particular, process 480 is one embodiment of estimating an in-plane orientation of depth pixels.
In step 482, medial axes of the object are determined. A medial axis may be defined based on the contour of the object. It can be implemented by iteratively eroding the boundaries of the object without allowing the object to break apart. The remaining pixels make up the medial axes. Since medial axis computation is well known by those of ordinary skill in the art, it will not be discussed in detail. Example medial axes are depicted in FIGS. 4D and 4E. A medial axis is one example of a reference line in the object.
Next, depth pixels in the object are processed one by one. In step 486, the closest point on a medial axis to the present depth pixel is determined. Referring to either FIG. 4D or 4E, point q3 may be determined to be the closest point to p3.
In step 488, a rotation invariant angle for the depth pixel is determined. The angle may be based on the tangent to the medial axis at point q3, as depicted in FIG. 4E. The angle may also be determined based on the technique shown in FIG. 4D. As noted, any convenient reference axis may be used, such as the u-axis of the depth map. The angle is one example of a rotation invariant angle for the point q3. Since the closest medial axis point (e.g., q3) is associated with the depth pixel (p3), the angle may also be considered to be one example of a rotation invariant angle for the depth pixel. Referring to either FIG. 4D or 4E, angle θ3a or θ3b is determined. This angle is assigned to the present depth pixel. The process 480 continues until angles have been assigned to all depth pixels of interest.
After all depth pixels have been assigned an angle, smoothing of the results is performed in step 490. For example, the angle of each depth pixel may be compared to its neighbors, with outliers being smoothed.
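The medial-axis variant of process 480 can be sketched the same way; here scikit-image's medial_axis performs the iterative thinning described in step 482 (again an illustrative choice, with the mask and angle conventions assumed as before).

```python
import numpy as np
from scipy import ndimage
from skimage.morphology import medial_axis

def medial_axis_in_plane_angles(mask):
    """Assign each object pixel the angle of the line to its nearest medial-axis point (FIG. 4D style)."""
    skeleton = medial_axis(mask)                                   # step 482: medial-axis pixels
    _, (qv, qu) = ndimage.distance_transform_edt(~skeleton, return_indices=True)  # step 486
    vv, uu = np.indices(mask.shape)
    theta = np.degrees(np.arctan2(qv - vv, qu - uu)) % 360.0       # step 488: angle of p -> q
    theta[~mask] = np.nan
    return theta                                                   # step 490 smoothing omitted
```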
As noted, the estimate of the local pixel orientation may be an estimate of the out-of-plane orientation. FIG. 5 is a flowchart of one embodiment of a process 500 of estimating local orientation of depth pixels for out-of-plane orientation. In this embodiment, the out-of-plane orientation is based on the surface normal of the object of interest at the depth pixel. Process 500 is one embodiment of steps 308-310 and will be discussed with reference to FIG. 6A.
In step 502, a point cloud model is developed. The point cloud model may be a 3D model in which each depth pixel in the depth map is assigned a coordinate in 3D space, for example. The point cloud may have one point for each depth pixel in the depth map, but that is not an absolute requirement. To facilitate discussion, it will be assumed that each point in the point cloud has a corresponding depth pixel in the depth map. However, note that this one-to-one correspondence is not a requirement. Herein, the term “depth point” will be used to refer to a point in the point cloud.
FIG. 6A depicts a point cloud model 605 of a hand and a portion of an arm. The point cloud model is depicted within an (a, b, c) global coordinate system. Thus, an a-axis, b-axis, and c-axis of a global coordinate system are depicted. In some embodiments, two of the axes in the global coordinate system correspond to the u-axis and v-axis of the depth map. However, this correspondence is not a requirement. The position along the third axis in the global coordinate system may be determined based on the depth value for a depth pixel in the depth map. Note that the point cloud model 605 may be generated in another manner. Also note that using a point cloud model 605 is just one way to determine a surface normal. Other techniques could be used.
In step 504 of FIG. 5, a determination is made whether there are more depth pixels to process. Note that the processing here is of depth points in the point cloud 605.
In step 506, a surface normal is determined at the present point. By surface normal it is meant a line that is perpendicular to the surface of the object of interest. The surface normal may be determined by analyzing nearby depth points. The surface normal may be defined in terms of the (a, b, c) global coordinate system. In FIG. 6A, the surface normal is depicted as the z-axis that touches the second finger of the hand. The x-axis, y-axis, and z-axis form a local coordinate system for the pixel presently being analyzed. The local coordinate system will be further discussed below. Processing continues until a surface normal is determined for all depth points.
In step 508, smoothing of the surface normals is performed. Note that using surface normals is one example of how to determine a local orientation for depth pixels that may be used for out-of-plane rotation. However, other parameters could be determined. Also, as noted above, there may be one depth point in the point cloud 605 for each depth pixel in the depth map. Therefore, the assignment of surface normals to depth pixels may be straightforward. However, if such a one-to-one correspondence does not exist, a suitable calculation can be made to assign surface normals to depth pixels in the depth map. Finally, it will be understood that although the discussion of FIG. 5 was of determining a surface normal to a depth point, this is one technique for determining a local orientation of a depth pixel.
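A minimal sketch of the per-point normal estimate follows, assuming the point cloud is stored as an organized (H, W, 3) array with one 3D point per depth pixel (the one-to-one case assumed in the text); the normal is taken as the cross product of two local tangent directions, though a neighborhood plane fit could be used instead.

```python
import numpy as np

def estimate_surface_normals(points):
    """points: (H, W, 3) array of 3D points in the (a, b, c) global coordinate system."""
    # Tangent vectors along image columns (u) and rows (v), via central differences.
    du = np.roll(points, -1, axis=1) - np.roll(points, 1, axis=1)
    dv = np.roll(points, -1, axis=0) - np.roll(points, 1, axis=0)
    normals = np.cross(du, dv)                        # perpendicular to the local surface patch
    length = np.linalg.norm(normals, axis=2, keepdims=True)
    normals = normals / np.maximum(length, 1e-9)      # unit length, one normal per depth pixel
    # Step 508: a simple smoothing pass (e.g., averaging neighboring normals) could follow here.
    return normals
```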
As noted in step 354, after determining the local orientation of depth pixels, a local coordinate system is determined for each of the depth pixels. FIGS. 6A and 6B show an object with a local coordinate system (labeled as x-axis, y-axis, z-axis) for one of the depth points. The local coordinate system has three perpendicular axes, in this embodiment. The origin of the local coordinate system is at one of the 3D depth points in the object of interest. One axis (e.g., the z-axis) is normal to the surface of the object of interest. That is, it is the surface normal at a certain depth point. Determining the x-axis and the y-axis will be discussed below. Also note that, although for purposes of illustration the local coordinate system is depicted relative to one of the depth points in the point cloud 605, the local coordinate system is considered to be a local coordinate system for one of the depth pixels in the depth map.
A feature region or window 604 is also depicted in FIGS. 6A-6B. The dashed lines are depicted to demonstrate the position of the feature window 604 relative to the local coordinate system. The feature window 604 may be used to help define features (also referred to as “feature probes”). For example, a feature probe can be defined based on the origin of the local coordinate system and some point in the feature window. Note that the feature window may be transformed to the depth map prior to using the feature probe.
In an embodiment in which the object is a hand, the local coordinate system moves consistently with the hand. For example, if the hand rotates, the local coordinate system rotates by a corresponding amount. Of course, the object could be any object. Thus, more generally, the local coordinate system moves consistently with the object. In some embodiments, features are defined based on the local coordinate system. Therefore, the features may be invariant to factors such as rotation, translation, scale, etc.
Referring now to FIG. 6B, the hand has been rotated relative to the hand of FIG. 6A. However, note that the x-axis and the y-axis are in the same position relative to the hand. The z-axis is not depicted in FIG. 6B, but it will be understood that it is still normal to the surface at the location of the depth point. The feature window 604 is also in the same position relative to the local coordinate system. Therefore, the feature window 604 is also in the same position relative to the hand. Note that this means that if a feature is defined in the local coordinate system, the feature will automatically rotate with the hand (or other object).
As discussed above, in some embodiments, there is a 2D coordinate system for the depth map (with each depth pixel having a depth value) and a 3D local coordinate system for each depth pixel of interest. FIG. 7 depicts an image window 702 associated with a 2D depth image coordinate system and a corresponding window 604 in a 3D local coordinate system. The image window 702 represents a portion of the depth map. The point p(u, v, d) represents the test pixel from the depth map, where (u, v) are image coordinates and d is depth. The point q(u+λ cos(θ), v+λ sin(θ), d) represents the point of interest, also in the depth map. The point of interest could be any pixel in the depth image.
The arrows in the image window 702 that originate from pixel p are parallel to the u-axis and the v-axis. A line is depicted between the pixel p and the point of interest q. The angle θ is the estimated in-plane rotation, which in this example is defined as the angle between the line and a reference axis. In this example, the reference axis is the u-axis, but any reference axis could be chosen.
Referring back to FIG. 4A or 4B, the point p(u, v, d) might represent one of the depth pixels, such as p1. The point q might represent the nearest point on the edge of the hand, such as q1. The angle θ might represent the angle θ1a between the line from p1 to q1 and a reference axis, as shown in FIG. 4A. The angle θ might represent the angle θ1b between the tangent to the edge of the hand at point q1 and some reference axis, as shown in FIG. 4B.
Referring back to FIG. 4D or 4E, the point p(u, v, d) might represent the depth pixel p3. The point q might represent the nearest point on the medial axis, q3. The angle θ might represent the angle θ3a between the line from p3 to q3 and a reference axis, as depicted in FIG. 4D. The angle θ might represent the angle θ3b between the tangent to the medial axis at point q3 and some reference axis, as depicted in FIG. 4E. The window 604 in the local 3D coordinate system contains the point P, which corresponds to pixel p in the 2D depth map. For the sake of illustration, point P could be the point in the point cloud of FIGS. 6A and 6B from which the surface normal (z-axis) originates. Window 604 represents a feature window 604 in the local 3D coordinate system. Examples of a local coordinate system and feature window 604 were discussed with respect to FIGS. 6A and 6B.
A vector n, which corresponds to the surface normal, is depicted with its tail at point P. A vector V has its tail at point P and its head at point Q. Point Q is the point in 3D space that corresponds to point q in the 2D depth map. Vectors r1 and r2 may correspond to the x-axis and the y-axis in the local coordinate system (see, for example, FIGS. 6A-6B). Techniques for transforming between the local 3D coordinate system and the 2D image coordinate system will now be discussed. These techniques may be used for step 316.
The following describes a transformation from a 3D point Xw (where the first two coordinates are usually defined between [−1, 1] and the 3rd coordinate is typically zero) in a canonical window into depth pixel coordinates x. Equation 1 states a general form for the transformation equation.
x = deHom(Φ(R S Xw + t))  Eq. 1
The transformation equation applies a rotation matrix R, a diagonal scaling matrix S, and a camera projection function Φ. The vector t is a translation. The camera matrix projects from 3D into 2D.
In Equation 1, deHom(·) denotes the dehomogenization operation, which converts a homogeneous pixel coordinate into a 2D pixel coordinate by dividing the first two coordinates by the third.
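As a hedged illustration (not taken from the original equations), dehomogenization may be sketched as follows; the function name and the use of numpy are assumptions.

```python
import numpy as np

def de_hom(x_hom):
    """Dehomogenize a pixel coordinate: (x1, x2, x3) -> (x1/x3, x2/x3)."""
    x_hom = np.asarray(x_hom, dtype=float)
    return x_hom[:2] / x_hom[2]
```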
In order to derive the rotation matrix R and the vector $\vec{t}$, the following is considered. The present pixel in the depth map being examined may be defined as p(u, v, d), where (u, v) are the depth map pixel coordinates and “d” is a depth value for the depth pixel.
Next, some point of interest “q” relative to the present depth pixel is considered. The point of interest may be any point. One example is the closest edge point, as discussed in FIGS. 4A-4C. Another example is the closest medial axis point, as discussed in FIGS. 4D-4F. However, it will be understood that some other point of interest may be determined. Note that these points of interest may be selected such that a local orientation of the depth pixel that is in-plane rotation invariant may be determined. An estimated in-plane rotation θ is also determined, as in the examples above.
Furthermore, an estimated local out-of-plane orientation is determined. For example, the surface normal is estimated as discussed with respect to FIG. 5.
Additionally, window scaling factors (sx, sy, sz) are pre-specified, with S = diag([sx, sy, sz]). This window may be used for the feature window 604. Note that if the window scaling is defined in 3D, then the window may be given actual measurements, such that after it is projected to 2D it will scale properly. For example, the window could be defined as being 100 mm on each of three sides. When projecting back to the 2D space, the feature window 604 scales properly. Referring back to FIGS. 6A-6B, the feature window 604 was depicted in two dimensions (x, y) for clarity. However, the feature window 604 can also be defined as a three-dimensional object, using the z-axis.
Referring again to transformation equation (Eq. 1), Φ(.) refers to a generic camera projection function that transforms a 3D point in the camera coordinate system into a pixel homogeneous coordinate. The inverse transformation is given by Φ−1(.). The camera projection function may be used to factor in various physical properties such as focal lengths (f1, f2), principal point (c1, c2), skew coefficient (α), lens distortion parameters etc. An example of a camera projection function that does not account for lens distortion is given by Φ(X)=KX, where K is a camera matrix as shown in Equation 3. A more general camera projection function that does account for radial distortion can be used instead. Camera projection functions are well known and, therefore, will not be discussed in detail.
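The camera matrix K of Equation 3 is not reproduced above; the sketch below uses the conventional pinhole form built from the focal lengths, principal point, and skew coefficient, ignoring lens distortion. The parameterization (in particular the skew term α·f1) is an illustrative assumption rather than the original equation.

```python
import numpy as np

def camera_matrix(f1, f2, c1, c2, alpha=0.0):
    """Conventional pinhole camera matrix K (lens distortion ignored)."""
    return np.array([[f1,  alpha * f1, c1],
                     [0.0, f2,         c2],
                     [0.0, 0.0,        1.0]])

def project(K, X_cam):
    """Phi(X) = K X: map a 3D camera-space point to a homogeneous pixel coordinate."""
    return K @ np.asarray(X_cam, dtype=float)
```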
The rotation matrix may be computed as in Equation 4.
$R_{3\times 3} = [\vec{r}_1\ \ \vec{r}_2\ \ \vec{r}_3]$  (Eq. 4)
In Equation 4, the vector $\vec{r}_3$ may be a unitized version of the surface normal. Note that this may be the z-axis of the window 604. The vector $\vec{r}_1$ (x-axis) may be the component of $\vec{V}$ that is orthogonal to the surface normal. Recall that $\vec{V}$ was defined in FIG. 7 as Q−P. The vector $\vec{V}$ may be referred to herein as an in-plane rotation-variant vector. The vector $\vec{r}_2$ (y-axis) may be computed as the cross product of $\vec{r}_3$ and $\vec{r}_1$. The following equations summarize the foregoing.
$\vec{r}_3 = \mathrm{unitize}(\vec{n})$  (Eq. 5)
$\vec{r}_1 = \mathrm{unitize}(\vec{V} - (\vec{V}^{\mathsf T}\vec{r}_3)\,\vec{r}_3)$  (Eq. 6)
$\vec{r}_2 = \vec{r}_3 \times \vec{r}_1$  (Eq. 7)
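A minimal Python sketch of Equations 5-7 follows; the helper names and the use of numpy are illustrative assumptions, with the surface normal n and the in-plane rotation-variant vector V = Q − P as inputs.

```python
import numpy as np

def unitize(v):
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def local_rotation(n, V):
    """Build R = [r1 r2 r3] (Eq. 4) from the surface normal n and V = Q - P."""
    r3 = unitize(n)                       # Eq. 5: z-axis is the unitized surface normal
    V = np.asarray(V, dtype=float)
    r1 = unitize(V - (V @ r3) * r3)       # Eq. 6: V with its normal component removed
    r2 = np.cross(r3, r1)                 # Eq. 7: y-axis orthogonal to both
    return np.column_stack([r1, r2, r3])
```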
The translation vector $\vec{t}$ may be computed as in Equation 8.
The vector $\vec{V}$ may be computed as in Equations 9A-9C.
For a 3D feature transform, and in the absence of radial distortion, the full 3D transform may be computed as in Equations 10A and 10B.
For a 2D feature in the canonical XY-plane, the direct transformation from canonical coordinates (xw, yw) in a [−1, 1] window to depth pixel coordinates in the depth map may be determined by pre-computing the homography transformation H as in Equation 11A and then calculating x as in Equation 11B.
Performing the transform in the other direction may be done as in Equation 12.
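Equations 8 through 12 are not reproduced above; the following Python sketch simply composes Equation 1 from the pieces already defined, under the stated assumptions that Φ(X) = KX (no lens distortion) and that the translation $\vec{t}$ is the camera-space position of the window center P. The 2D homography shortcut and the inverse transform are not sketched.

```python
import numpy as np

def window_to_pixel(X_w, R, s_xyz, t, K):
    """Eq. 1: x = deHom(Phi(R S X_w + t)).

    X_w   : 3D point in the canonical window (third coordinate typically 0).
    R     : local rotation matrix [r1 r2 r3] (Eq. 4).
    s_xyz : window scaling (sx, sy, sz), so S = diag([sx, sy, sz]).
    t     : translation; assumed here to be the camera-space position of P.
    K     : camera matrix (lens distortion ignored).
    """
    S = np.diag(s_xyz)
    X_cam = R @ (S @ np.asarray(X_w, dtype=float)) + np.asarray(t, dtype=float)
    x_hom = K @ X_cam
    return x_hom[:2] / x_hom[2]      # dehomogenize to depth-pixel coordinates
```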
As noted above, the local orientation may be based on an in-plane orientation, an out-of-plane orientation, or both. FIG. 8 is a flowchart of one embodiment of a process 800 of establishing a local orientation for a depth pixel factoring in the different possibilities. Process 800 may be repeated for each depth pixel for which a local orientation is to be determined.
In step 802, a determination is made whether an estimate of a local in-plane orientation is to be made. If so, then the in-plane estimate is made in step 804. Techniques for determining a local in-plane orientation have been discussed with respect to FIGS. 4A-4F. As noted, the estimate of the in-plane orientation may be an angle with respect to some reference axis in the depth map (or 2D image coordinate system). If the in-plane estimate is not to be made, then the angle θ may be set to a default value in step 806. As one example, the angle θ may be set to 0 degrees. Therefore, all depth pixels will have the same angle.
Note that regardless of whether or not the local in-plane estimate is made, the processing to determine the local coordinate system may be the same. For example, referring to Equations above that use the angle θ, the calculations may be performed in a similar manner by using the default value for θ.
In step 808, a determination is made whether an estimate of a local out-of-plane orientation is to be made. If so, then the out-of-plane estimate is made in step 810. Note that if the in-plane orientation was not determined, then the out-of-plane orientation is determined in step 810. Techniques for determining a local out-of-plane orientation have been discussed with respect to FIG. 5. As noted, the estimate of the out-of-plane orientation may be a surface normal of the object at a given depth pixel or point in a point cloud model. Thus, the output of the estimate may be a vector.
If the out-of-plane estimate is not to be made, then the vector may be set to a default value in step 812. As one example, the vector may be set to be parallel to the optical axis of the camera. Therefore, all depth pixels will have the same vector.
Note that regardless of whether or not the local out-of-plane estimate is made, the processing to determine the local coordinate system may be the same. For example, referring to the equations above that use the vector $\vec{n}$, the calculations may be performed in a similar manner by using the default value for the vector $\vec{n}$.
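The branching of process 800 can be summarized with a brief Python sketch; the function, its parameters, and the choice of the camera z-axis to represent the optical axis are illustrative assumptions rather than part of the described embodiment.

```python
import numpy as np

def establish_local_orientation(p, estimate_theta=None, estimate_normal=None):
    """Process 800: pick an in-plane angle and an out-of-plane normal for pixel p.

    estimate_theta / estimate_normal are optional callables; when an estimate
    is not requested, the corresponding default is used (steps 806 and 812).
    """
    if estimate_theta is not None:
        theta = estimate_theta(p)          # step 804: local in-plane estimate
    else:
        theta = 0.0                        # step 806: default angle (0 degrees)
    if estimate_normal is not None:
        n = estimate_normal(p)             # step 810: local out-of-plane estimate
    else:
        n = np.array([0.0, 0.0, 1.0])      # step 812: default, parallel to the optical axis
    return theta, n
```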
FIG. 9 illustrates an example of a computing environment including a multimedia console (or gaming console)100 that may be used to implement thecomputing environment12 ofFIG. 2. Thecapture device20 may be coupled to the computing environment. As shown inFIG. 9, themultimedia console100 has a central processing unit (CPU)101 having alevel 1cache102, alevel 2cache104, and a flash ROM (Read Only Memory)106. Thelevel 1cache102 and alevel 2cache104 temporarily store data and hence reduce the number of memory access cycles, thereby improving processing speed and throughput. TheCPU101 may be provided having more than one core, and thus,additional level 1 andlevel 2caches102 and104. Theflash ROM106 may store executable code that is loaded during an initial phase of a boot process when themultimedia console100 is powered ON.
A graphics processing unit (GPU)108 and a video encoder/video codec (coder/decoder)114 form a video processing pipeline for high speed and high resolution graphics processing. Data is carried from thegraphics processing unit108 to the video encoder/video codec114 via a bus. The video processing pipeline outputs data to an A/V (audio/video)port140 for transmission to a television or other display. Amemory controller110 is connected to theGPU108 to facilitate processor access to various types ofmemory112, such as, but not limited to, a RAM (Random Access Memory).
Themultimedia console100 includes an I/O controller120, asystem management controller122, anaudio processing unit123, anetwork interface controller124, a first USB host controller126, asecond USB controller128 and a front panel I/O subassembly130 that are preferably implemented on amodule118. TheUSB controllers126 and128 serve as hosts for peripheral controllers142(1)-142(2), awireless adapter148, and an external memory device146 (e.g., flash memory, external CD/DVD ROM drive, removable media, etc.). Thenetwork interface124 and/orwireless adapter148 provide access to a network (e.g., the Internet, home network, etc.) and may be any of a wide variety of various wired or wireless adapter components including an Ethernet card, a modem, a Bluetooth module, a cable modem, and the like.
System memory143 is provided to store application data that is loaded during the boot process. A media drive144 is provided and may comprise a DVD/CD drive, hard drive, or other removable media drive, etc. The media drive144 may be internal or external to themultimedia console100. Application data may be accessed via the media drive144 for execution, playback, etc. by themultimedia console100. The media drive144 is connected to the I/O controller120 via a bus, such as a Serial ATA bus or other high speed connection (e.g., IEEE 1394).
Thesystem management controller122 provides a variety of service functions related to assuring availability of themultimedia console100. Theaudio processing unit123 and anaudio codec132 form a corresponding audio processing pipeline with high fidelity and stereo processing. Audio data is carried between theaudio processing unit123 and theaudio codec132 via a communication link. The audio processing pipeline outputs data to the A/V port140 for reproduction by an external audio player or device having audio capabilities.
The front panel I/O subassembly130 supports the functionality of thepower button150 and theeject button152, as well as any LEDs (light emitting diodes) or other indicators exposed on the outer surface of themultimedia console100. A systempower supply module136 provides power to the components of themultimedia console100. Afan138 cools the circuitry within themultimedia console100.
TheCPU101,GPU108,memory controller110, and various other components within themultimedia console100 are interconnected via one or more buses, including serial and parallel buses, a memory bus, a peripheral bus, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures can include a Peripheral Component Interconnects (PCI) bus, PCI-Express bus, etc.
When themultimedia console100 is powered ON, application data may be loaded from thesystem memory143 intomemory112 and/orcaches102,104 and executed on theCPU101. The application may present a graphical user interface that provides a consistent user experience when navigating to different media types available on themultimedia console100. In operation, applications and/or other media contained within the media drive144 may be launched or played from the media drive144 to provide additional functionalities to themultimedia console100.
Themultimedia console100 may be operated as a standalone system by simply connecting the system to a television or other display. In this standalone mode, themultimedia console100 allows one or more users to interact with the system, watch movies, or listen to music. However, with the integration of broadband connectivity made available through thenetwork interface124 or thewireless adapter148, themultimedia console100 may further be operated as a participant in a larger network community.
When the multimedia console 100 is powered ON, a set amount of hardware resources is reserved for system use by the multimedia console operating system. These resources may include a reservation of memory (e.g., 16 MB), CPU and GPU cycles (e.g., 5%), networking bandwidth (e.g., 8 kbps), etc. Because these resources are reserved at system boot time, the reserved resources do not exist from the application's view.
In particular, the memory reservation preferably is large enough to contain the launch kernel, concurrent system applications and drivers. The CPU reservation is preferably constant such that if the reserved CPU usage is not used by the system applications, an idle thread will consume any unused cycles.
With regard to the GPU reservation, lightweight messages generated by the system applications (e.g., popups) are displayed by using a GPU interrupt to schedule code to render the popup into an overlay. The amount of memory required for an overlay depends on the overlay area size, and the overlay preferably scales with screen resolution. Where a full user interface is used by the concurrent system application, it is preferable to use a resolution independent of the application resolution. A scaler may be used to set this resolution such that the need to change frequency and cause a TV resynch is eliminated.
After themultimedia console100 boots and system resources are reserved, concurrent system applications execute to provide system functionalities. The system functionalities are encapsulated in a set of system applications that execute within the reserved system resources described above. The operating system kernel identifies threads that are system application threads versus gaming application threads. The system applications are preferably scheduled to run on theCPU101 at predetermined times and intervals in order to provide a consistent system resource view to the application. The scheduling is to minimize cache disruption for the gaming application running on the console.
When a concurrent system application requires audio, audio processing is scheduled asynchronously to the gaming application due to time sensitivity. A multimedia console application manager controls the gaming application audio level (e.g., mute, attenuate) when system applications are active.
Input devices (e.g., controllers 142(1) and 142(2)) are shared by gaming applications and system applications. The input devices are not reserved resources, but are to be switched between system applications and the gaming application such that each will have a focus of the device. The application manager preferably controls the switching of the input stream without the gaming application's knowledge, and a driver maintains state information regarding focus switches. In some embodiments, the capture device 20 of FIG. 2 may be an additional input device to the multimedia console 100.
FIG. 10 illustrates another example of a computing environment that may be used to implement thecomputing environment12 ofFIG. 2. Thecapture device20 may be coupled to the computing environment. The computing environment ofFIG. 10 is only one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality of the presently disclosed subject matter. Neither should thecomputing environment12 ofFIG. 2 be interpreted as having any dependency or requirement relating to any one or combination of components illustrated in the exemplary operating environment ofFIG. 10. In some embodiments, the various depicted computing elements may include circuitry configured to instantiate specific aspects of the present disclosure. For example, the term circuitry used in the disclosure can include specialized hardware components configured to perform function(s) by firmware or switches. In other examples, the term circuitry can include a general-purpose processing unit, memory, etc., configured by software instructions that embody logic operable to perform function(s). In embodiments where circuitry includes a combination of hardware and software, an implementer may write source code embodying logic and the source code can be compiled into machine readable code that can be processed by the general purpose processing unit.
InFIG. 10, thecomputing system220 comprises acomputer241, which typically includes a variety of computer readable media. Computer readable media can be any available media that can be accessed bycomputer241 and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computer. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media.
Thecomputer241 may also include other removable/non-removable, volatile/nonvolatile computer storage media. By way of example,FIG. 10 illustrates a hard disk drive238 that reads from or writes to non-removable, nonvolatile magnetic media, amagnetic disk drive239 that reads from or writes to a removable, nonvolatilemagnetic disk254, and anoptical disk drive240 that reads from or writes to a removable, nonvolatileoptical disk253 such as a CD ROM or other optical media. Other removable/non-removable, volatile/nonvolatile computer storage media that can be used in the exemplary operating environment include, but are not limited to, magnetic tape cassettes, flash memory cards, digital versatile disks, digital video tape, solid state RAM, solid state ROM, and the like. The hard disk drive238 is typically connected to the system bus221 through a non-removable memory interface such asinterface234, andmagnetic disk drive239 andoptical disk drive240 are typically connected to the system bus221 by a removable memory interface, such asinterface235.
A basic input/output system224 (BIOS), containing the basic routines that help to transfer information between elements withincomputer241, such as during start-up, is typically stored inROM223.RAM260 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processingunit259. By way of example, and not limitation,FIG. 10 illustratesoperating system225,application programs226,other program modules227, andprogram data228.
The drives and their associated computer storage media discussed above and illustrated in FIG. 10 provide storage of computer readable instructions, data structures, program modules and other data for the computer 241. In FIG. 10, for example, hard disk drive 238 is illustrated as storing operating system 258, application programs 257, other program modules 256, and program data 255. Note that these components can either be the same as or different from operating system 225, application programs 226, other program modules 227, and program data 228. Operating system 258, application programs 257, other program modules 256, and program data 255 are given different numbers here to illustrate that, at a minimum, they are different copies. A user may enter commands and information into the computer 241 through input devices such as a keyboard 251 and pointing device 252, commonly referred to as a mouse, trackball or touch pad. Other input devices (not shown) may include a microphone, joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 259 through a user input interface 236 that is coupled to the system bus, but may be connected by other interface and bus structures, such as a parallel port, game port or a universal serial bus (USB). The cameras 34, 36 and capture device 20 of FIG. 2 may define additional input devices for the computer 241. A monitor 242 or other type of display device is also connected to the system bus 221 via an interface, such as a video interface 232. In addition to the monitor, computers may also include other peripheral output devices such as speakers 244 and printer 243, which may be connected through an output peripheral interface 233.
The computer 241 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 246. The remote computer 246 may be a personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 241, although only a memory storage device 247 has been illustrated in FIG. 10. The logical connections depicted in FIG. 10 include a local area network (LAN) 245 and a wide area network (WAN) 249, but may also include other networks. Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets and the Internet.
When used in a LAN networking environment, thecomputer241 is connected to theLAN245 through a network interface oradapter237. When used in a WAN networking environment, thecomputer241 typically includes amodem250 or other means for establishing communications over theWAN249, such as the Internet. Themodem250, which may be internal or external, may be connected to the system bus221 via theuser input interface236, or other appropriate mechanism. In a networked environment, program modules depicted relative to thecomputer241, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation,FIG. 10 illustratesremote application programs248 as residing onmemory device247. It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers may be used.
The disclosed technology is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well-known computing systems, environments, and/or configurations that may be suitable for use with the technology include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
The disclosed technology may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, software and program modules as described herein include routines, programs, objects, components, data structures, and other types of structures that perform particular tasks or implement particular abstract data types. Hardware or combinations of hardware and software may be substituted for software modules as described herein.
The disclosed technology may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.