Facial motion capture is the process of electronically converting the movements of a person's face into a digital database using cameras or laser scanners. This database may then be used to produce computer graphics (CG), computer animation for movies, games, or real-time avatars. Because the motion of CGI characters is derived from the movements of real people, it results in more realistic and nuanced computer character animation than if the animation were created manually.
A facial motion capture database describes the coordinates or relative positions of reference points on the actor's face. The capture may be in two dimensions, in which case the capture process is sometimes called "expression tracking", or in three dimensions. Two-dimensional capture can be achieved using a single camera and capture software. This produces less sophisticated tracking, and is unable to fully capture three-dimensional motions such as head rotation. Three-dimensional capture is accomplished using multi-camera rigs or laser marker systems. Such systems are typically far more expensive, complicated, and time-consuming to use. Two predominant technologies exist: marker and markerless tracking systems.
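As an illustration, the following sketch shows one way such a database of per-frame reference-point coordinates might be organized in code. The structure, point names, and coordinate values are hypothetical and do not reflect any particular commercial format.

```python
# A minimal sketch of a facial motion capture database: a sequence of frames,
# each holding the tracked positions of named reference points on the face.
# Names and values are illustrative only.
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class CaptureFrame:
    timestamp: float                      # seconds from the start of the take
    points: Dict[str, Tuple[float, ...]]  # point name -> (x, y) or (x, y, z)

@dataclass
class FacialCaptureDatabase:
    dimensions: int                       # 2 for "expression tracking", 3 for 3D capture
    frames: List[CaptureFrame]

    def trajectory(self, name: str) -> List[Tuple[float, ...]]:
        """Return the position of one reference point across all frames."""
        return [f.points[name] for f in self.frames if name in f.points]

# Example: two frames of a 2D capture with three tracked points.
db = FacialCaptureDatabase(
    dimensions=2,
    frames=[
        CaptureFrame(0.000, {"lip_corner_l": (412.0, 530.0),
                             "lip_corner_r": (508.0, 531.0),
                             "brow_inner_l": (430.0, 310.0)}),
        CaptureFrame(0.033, {"lip_corner_l": (410.5, 528.0),
                             "lip_corner_r": (509.2, 529.5),
                             "brow_inner_l": (430.4, 309.1)}),
    ],
)
print(db.trajectory("lip_corner_l"))
```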
Facial motion capture is related to body motion capture, but is more challenging due to the higher resolution required to detect and track the subtle expressions possible from small movements of the eyes and lips. These movements are often less than a few millimeters, requiring even greater resolution and fidelity, and different filtering techniques than those usually used in full-body capture. The additional constraints of the face also allow more opportunities for using models and rules.
Facial expression capture is similar to facial motion capture. It is a process of using visual or mechanical means to manipulate computer-generated characters with input from human faces, or to recognize emotions from a user.
One of the first papers discussing performance-driven animation was published by Lance Williams in 1990. There, he describes 'a means of acquiring the expressions of real faces, and applying them to computer-generated faces'.[1]
Traditional marker-based systems apply up to 350 markers to the actor's face and track the marker movement with high-resolution cameras. This has been used on movies such as The Polar Express and Beowulf to allow an actor such as Tom Hanks to drive the facial expressions of several different characters. Unfortunately, this is relatively cumbersome and makes the actor's expressions overly driven once the smoothing and filtering have taken place. Next-generation systems such as CaptiveMotion utilize offshoots of the traditional marker-based system with higher levels of detail.
Active LED Marker technology is currently being used to drive facial animation in real time to provide user feedback.
Markerless technologies use the features of the face such as nostrils, the corners of the lips and eyes, and wrinkles, and then track them. This technology is discussed and demonstrated at CMU,[2] IBM,[3] the University of Manchester (where much of this started with Tim Cootes,[4] Gareth Edwards and Chris Taylor) and other locations, using active appearance models, principal component analysis, eigen tracking, deformable surface models and other techniques to track the desired facial features from frame to frame. This technology is much less cumbersome, and allows greater expression for the actor.
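A greatly simplified sketch of the frame-to-frame tracking idea is shown below, using corner detection and optical flow rather than the active appearance models or deformable surface models mentioned above. The input file name is hypothetical, and the sketch assumes a face is visible in the first frame; it requires the opencv-python package.

```python
# Simplified markerless tracking sketch: detect corner-like features
# (lip/eye corners, wrinkles) inside the face region of the first frame,
# then follow them frame to frame with Lucas-Kanade optical flow.
# Production systems use richer statistical models; this only illustrates
# the frame-to-frame tracking step.
import cv2
import numpy as np

cap = cv2.VideoCapture("performance.mp4")   # hypothetical input clip
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

# Restrict feature detection to the face region using a stock Haar cascade.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
faces = face_cascade.detectMultiScale(prev_gray, scaleFactor=1.3, minNeighbors=5)
x, y, w, h = faces[0]                       # assumes at least one detected face

mask = np.zeros_like(prev_gray)
mask[y:y + h, x:x + w] = 255
points = cv2.goodFeaturesToTrack(prev_gray, maxCorners=80,
                                 qualityLevel=0.01, minDistance=7, mask=mask)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Track the same feature points into the new frame.
    new_points, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, points, None)
    points = new_points[status.flatten() == 1].reshape(-1, 1, 2)
    prev_gray = gray
    # `points` now holds the tracked facial feature positions for this frame.
```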
These vision-based approaches also have the ability to track pupil movement, eyelids, and occlusion of the teeth by the lips and tongue, which are obvious problems in most computer-animated features. Typical limitations of vision-based approaches are resolution and frame rate, both of which are becoming less of an issue as high-speed, high-resolution CMOS cameras become available from multiple sources.
The technology for markerless face tracking is related to that in a facial recognition system, since a facial recognition system can potentially be applied sequentially to each frame of video, resulting in face tracking. For example, the Neven Vision system[5] (formerly Eyematics, now acquired by Google) allowed real-time 2D face tracking with no person-specific training; their system was also amongst the best-performing facial recognition systems in the U.S. Government's 2002 Facial Recognition Vendor Test (FRVT). On the other hand, some recognition systems do not explicitly track expressions or even fail on non-neutral expressions, and so are not suitable for tracking. Conversely, systems such as deformable surface models pool temporal information to disambiguate and obtain more robust results, and thus could not be applied from a single photograph.
Markerless face tracking has progressed to commercial systems such as Image Metrics, which has been applied in movies such as The Matrix sequels[6] and The Curious Case of Benjamin Button. The latter used the Mova system to capture a deformable facial model, which was then animated with a combination of manual and vision tracking.[7] Avatar was another prominent motion capture movie; however, it used painted markers rather than being markerless. Dynamixyz is another commercial system currently in use.
Markerless systems can be classified according to several distinguishing criteria, such as whether the tracking is 2D or 3D, whether it operates in real time or offline, whether it requires person-specific training or calibration, and whether it relies on projected patterns or applied markers.
To date, no system is ideal with respect to all these criteria. For example, the Neven Vision system was fully automatic and required no hidden patterns or per-person training, but was 2D. The Face/Off system[8] is 3D, automatic, and real-time, but requires projected patterns.
Digital video-based methods are becoming increasingly preferred, as mechanical systems tend to be cumbersome and difficult to use.
Using digital cameras, the input user's expressions are processed to provide the head pose, which allows the software to then find the eyes, nose and mouth. The face is initially calibrated using a neutral expression. Then, depending on the architecture, the eyebrows, eyelids, cheeks, and mouth can be processed as differences from the neutral expression. This is done, for instance, by looking for the edges of the lips and recognizing them as a unique object. Often contrast-enhancing makeup or markers are worn, or some other method is used to make the processing faster. Like voice recognition, even the best techniques are only good about 90 percent of the time, requiring a great deal of tweaking by hand, or tolerance for errors.
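A minimal sketch of the neutral-expression calibration described above follows: landmark positions from the current frame are expressed as offsets from the calibrated neutral pose, from which simple derived controls such as mouth opening can be computed. The landmark names and pixel values are illustrative assumptions.

```python
# Sketch of neutral-pose calibration: store landmark positions for the
# neutral expression, then describe each tracked frame as differences
# (deltas) from that neutral pose.
import numpy as np

# Landmark positions captured during calibration (neutral face), in pixels.
neutral = {
    "brow_inner_l": np.array([430.0, 310.0]),
    "upper_lip":    np.array([460.0, 515.0]),
    "lower_lip":    np.array([460.0, 545.0]),
}

def expression_deltas(current: dict) -> dict:
    """Per-landmark offset from the neutral expression."""
    return {name: current[name] - neutral[name] for name in neutral}

def mouth_open_amount(current: dict) -> float:
    """Simple derived control: lip separation relative to neutral."""
    neutral_gap = neutral["lower_lip"][1] - neutral["upper_lip"][1]
    current_gap = current["lower_lip"][1] - current["upper_lip"][1]
    return float(current_gap - neutral_gap)

# One tracked frame: brows raised slightly, mouth opened.
frame = {
    "brow_inner_l": np.array([430.5, 302.0]),
    "upper_lip":    np.array([459.0, 512.0]),
    "lower_lip":    np.array([461.0, 560.0]),
}
print(expression_deltas(frame))
print("mouth open (px):", mouth_open_amount(frame))
```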
Since computer-generated characters don't actually have muscles, different techniques are used to achieve the same results. Some animators create bones or objects that are controlled by the capture software and move them accordingly, which, when the character is rigged correctly, gives a good approximation. Since faces are very elastic, this technique is often mixed with others, adjusting the weights differently for the skin elasticity and other factors depending on the desired expressions.
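The sketch below illustrates this idea: each captured feature value drives one or more rig controls (bones or blendshapes) through a weight table, with the weights standing in for the per-expression tuning an animator would perform. The control names and weights are hypothetical.

```python
# Illustrative sketch of driving a rigged character from captured motion.
# Captured feature values (normalized 0..1 by the tracking stage) are mapped
# to rig controls via weights that an animator would tune for skin elasticity
# and the desired expressions.
from typing import Dict

# How strongly each captured feature drives each rig control (hypothetical).
RIG_WEIGHTS: Dict[str, Dict[str, float]] = {
    "mouth_open":   {"jaw_bone_rotate": 0.8, "blendshape_mouth_open": 0.6},
    "brow_raise_l": {"blendshape_brow_up_l": 1.0},
}

def drive_rig(captured: Dict[str, float]) -> Dict[str, float]:
    """Convert normalized captured feature values into rig control values."""
    controls: Dict[str, float] = {}
    for feature, value in captured.items():
        for control, weight in RIG_WEIGHTS.get(feature, {}).items():
            # Accumulate contributions and clamp to the rig's 0..1 range.
            controls[control] = min(1.0, controls.get(control, 0.0) + weight * value)
    return controls

# Example captured frame: mouth half open, left brow slightly raised.
print(drive_rig({"mouth_open": 0.5, "brow_raise_l": 0.3}))
```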
Several commercial companies are developing products that have been used, but which remain rather expensive.[citation needed]
It is expected that this will become a major input device for computer games once the software becomes available in an affordable format, but the necessary hardware and software do not yet exist, despite the research of the last 15 years producing results that are almost usable.[citation needed]
The first application to gain wide adoption was communication: initially video telephony and multimedia messaging, and later 3D communication with mixed reality headsets.
With advances in machine learning, computing power, and sensors, especially on mobile phones, facial motion capture technology has become widely available. Two notable examples are Snapchat's lens feature and Apple's Memoji,[9] which can be used to record messages with avatars or live via the FaceTime app. With these applications (and many others), most modern mobile phones today are capable of performing real-time facial motion capture. More recently, real-time facial motion capture combined with realistic 3D avatars has been introduced to enable immersive communication in mixed reality (MR) and virtual reality (VR). Meta demonstrated its Codec Avatars, communicating via its MR headset Meta Quest Pro to record a podcast with two remote participants.[10] Apple's MR headset Apple Vision Pro also supports real-time facial motion capture that can be used with applications such as FaceTime. Real-time communication applications prioritize low latency to facilitate natural conversation and ease of use, aiming to make the technology accessible to a broad audience. These considerations may limit the achievable accuracy of the motion capture.