Disclosure of Invention
In view of the above technical problems, the invention provides a dynamic eye tracking method for naked eye 3D and a tablet computer, which address the problems that an existing naked eye 3D screen produces double images (ghosting) when the observer is not positioned accurately or moves, and that the naked eye 3D effect is lost entirely once the observer deviates beyond a certain range.
Other features and advantages of the present disclosure will be apparent from the following detailed description, or may be learned in part by the practice of the disclosure.
According to an aspect of the present invention, a dynamic eye tracking method for naked eye 3D is disclosed, the method comprising:
acquiring a video frame shot by a tablet computer, and detecting an eye-nose area in the video frame based on an error learning framework, wherein a detector of the error learning framework comprises an Adaboost algorithm with a local binary pattern;
performing content classification on the eye-nose area, wherein the content classification conditions comprise whether glasses are worn, the illumination environment type, glasses reflection, and thick glasses; extracting SIFT features from the eye-nose area based on a scale-invariant feature transform algorithm; selecting a corresponding supervised descent model according to the content classification result; and training a plurality of descent directions for the SIFT features to minimize the average value of the nonlinear square function of each landmark point, so that the SIFT features perform regression-based landmark point alignment, the eye-nose shape of the eye-nose area is aligned, and a first coordinate of the pupil center of the human eye is obtained;
and estimating the three-dimensional coordinate of the pupil center according to a general 3D face model and the first coordinate, and dynamically adjusting the pixel arrangement mode of the screen of the tablet computer and the position of the cylindrical lens grating based on the three-dimensional coordinate, so that the left eye image and the right eye image projected by the tablet computer match the three-dimensional coordinate of the pupil center.
Further, the error learning framework comprises an early stage, a middle stage and a mature stage during training, wherein:
in the early stage, detecting local texture information of the nose and eyes of a human face from a first sample image set based on the local binary pattern of the Adaboost algorithm, training a first cascade classifier based on the local texture information, applying the first cascade classifier to a second sample image set, and collecting the sample images misclassified by the first cascade classifier, wherein the first sample image set comprises positive sample images and negative sample images, a positive sample image being an image containing a human face and a negative sample image being an image without a human face;
in the middle stage, constructing and adjusting a second cascade classifier based on the sample images misclassified in the early stage, classifying those misclassified sample images with the second cascade classifier, and collecting the sample images that are misclassified again;
in the mature stage, setting a threshold value of classification accuracy, and continuously adjusting the second cascade classifier until its classification accuracy on the re-misclassified sample images reaches the threshold value.
Further, the method further comprises:
after the first coordinate of the pupil center is obtained, checking the aligned eye-nose area based on a tracking checker to verify that the alignment result correctly contains the pupil, wherein the tracking checker comprises SIFT features, a pre-trained support vector machine classifier, and an Adaboost algorithm;
when the tracking checker determines that the alignment result is incorrect, performing the calculation of the first coordinate of the pupil center on a subsequent frame of the video;
and when the tracking checker determines that the alignment result is correct, maintaining the alignment result, and tracking and checking the pupil center in subsequent frames of the video based on the tracking checker, with the first coordinate as the starting coordinate.
Further, the checking by the tracking checker comprises:
extracting the SIFT features from the aligned eye-nose area, classifying the extracted SIFT features with the support vector machine classifier, and judging whether the SIFT features comprise a pupil;
capturing local texture features in the eye-nose area based on the local binary pattern in the Adaboost algorithm, to judge whether the eye-nose area contains a pupil;
and performing weighted fusion of the judgment results of the support vector machine classifier and the Adaboost algorithm to obtain a final checking result.
Further, when the tracking checker is trained, the judgment results of the Adaboost algorithm are examined, and the samples misidentified by the Adaboost algorithm are input into the support vector machine classifier to train it.
Further, when estimating the three-dimensional coordinate of the pupil center, the calculation proceeds along the face normal direction in combination with a fixed interpupillary distance, wherein the face normal direction is estimated from the alignment of the 3D face model with preset landmark points of the eye-nose area.
Further, the method further comprises:
and when the tablet computer shoots the video frames with a near-infrared camera and the content classification result comprises eyeglass reflection, repairing the reflection area in the eye-nose area.
Further, the repairing the reflection area in the eye-nose area includes:
detecting the reflection region in the eye-nose region based on pixel gradient information;
performing edge minimization on the reflection area, wherein the edge minimization comprises a smoothing method and an interpolation method;
and minimizing the gradient difference of the reflection area based on the pixel gradient information.
According to another aspect of the present invention, a dynamic eye tracking tablet computer for naked eye 3D is disclosed, the tablet computer comprising:
The eye-nose detection module is used for acquiring a video frame shot by a front camera and detecting an eye-nose area in the video frame based on the error learning framework, wherein a detector of the error learning framework comprises an Adaboost algorithm with a local binary pattern;
The eye-nose alignment module is used for performing content classification on the eye-nose area, wherein the content classification conditions comprise whether glasses are worn, the illumination environment type, glasses reflection, and thick glasses; extracting SIFT features from the eye-nose area based on a scale-invariant feature transform algorithm; selecting a corresponding supervised descent model according to the content classification result; and training a plurality of descent directions on the SIFT features to minimize the average value of the nonlinear square function of each landmark point, so that the SIFT features perform regression-based landmark point alignment, the eye-nose shape of the eye-nose area is aligned, and a first coordinate of the pupil center of the human eye is obtained;
The control display module is used for estimating the three-dimensional coordinate of the pupil center according to the general 3D face model and the first coordinate, and dynamically adjusting the pixel arrangement mode of the screen and the position of the cylindrical lens grating based on the three-dimensional coordinate so that the left eye image and the right eye image projected by the screen are matched with the three-dimensional coordinate of the pupil center.
The technical scheme of the present disclosure has the following beneficial effects:
based on content classification, the method can track the pupil center under various environmental conditions, and the classification, detection, and tracking methods adopted require only CPU resources and do not depend on GPU computation, so the method can run under limited system resources;
according to the method and the device, the pupil position is identified in real time, the pixel arrangement of the video frame can be dynamically adjusted, and the projected left eye and right eye images are adjusted in real time, so that the observer obtains a good naked eye 3D experience even when changing position.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein, but rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the exemplary embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the present disclosure. One skilled in the relevant art will recognize, however, that the aspects of the disclosure may be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known technical solutions have not been shown or described in detail to avoid obscuring aspects of the present disclosure.
Furthermore, the drawings are only schematic illustrations of the present disclosure. The same reference numerals in the drawings denote the same or similar parts, and thus a repetitive description thereof will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in software or in one or more hardware modules or integrated circuits or in different networks and/or processor devices and/or microcontroller devices.
In an embodiment, as shown in fig. 1, the present disclosure provides a dynamic eye tracking method for naked eye 3D, where the method may be executed by a computer, a server, a mobile phone, a tablet computer, etc. The method specifically comprises the following steps S101-S103:
In step S101, a video frame shot by a tablet computer is acquired, and an eye-nose area is detected in the video frame based on an error learning framework, where a detector of the error learning framework comprises an Adaboost algorithm with a local binary pattern.
The video frame may be shot by a visible light camera or an infrared camera, and is one or more frames of the captured video stream, such as the first frame. The frame is then subjected to eye-nose area detection; if no eye-nose area can be detected through the error learning framework, the method skips to the next frame or another preset frame, until an eye-nose area is detected.
In step S102, content classification is performed on the eye-nose area, where the content classification conditions include whether glasses are worn, the illumination environment type, glasses reflection, and thick glasses. SIFT features are extracted from the eye-nose area based on a scale-invariant feature transform algorithm, a corresponding supervised descent model is selected according to the content classification result, and a plurality of descent directions are trained on the SIFT features to minimize the average value of the nonlinear square function of each landmark point, so that the SIFT features perform regression-based landmark point alignment, the eye-nose shape of the eye-nose area is aligned, and a first coordinate of the pupil center of the human eye is obtained.
Since the Supervised Descent Method (SDM) is a regression-based method that treats shape alignment as a general optimization problem, a single supervised descent model cannot easily optimize globally over images captured under widely varying conditions. To address this, the eye-nose area is first content classified, and a supervised descent model matched to the image quality and conditions (whether glasses are worn, whether the illumination environment is dark or bright, infrared versus visible-light capture, glasses reflection, thick glasses) is applied to align the eye-nose shape. "Thick glasses" refers to excessively large and thick lenses, identified by preset classification conditions. The content classification itself may be based on machine learning algorithms such as a Support Vector Machine (SVM), a decision tree, or a random forest. When building the classifier, suitable features are selected to represent the image content, that is, features that can effectively distinguish different image content categories, such as brightness, contrast, and edges; a suitable machine learning algorithm is then chosen, and the classifier is trained on a labeled image data set, ensuring that each category of image has enough samples.
The classifier learns a feature representation of the different image contents from the training data. The trained classifier is then applied to new video frames segmented into eye-nose areas, and the images are classified; the content categories of each image, such as "bright image", "dark image", or "eyeglass reflection image" (categories that may intersect and coexist), are obtained from the classifier output. Finally, the corresponding SDM aligner is selected according to the content category; for example, for a "dark image", a supervised descent model optimized for dark images is selected to process images under low illumination.
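As an illustration only, the content classification step above can be sketched with simple global features (brightness, contrast, edge density) and a nearest-centroid stand-in for the SVM/decision-tree classifier. The feature choices, thresholds, and centroid values here are hypothetical, not taken from the patent:

```python
import numpy as np

def image_features(img):
    """Simple global features for content classification: mean brightness,
    contrast (std), and edge density from gradient magnitude."""
    gy, gx = np.gradient(img.astype(float))
    edge_density = np.mean(np.hypot(gx, gy) > 20.0)  # illustrative threshold
    return np.array([img.mean(), img.std(), edge_density])

def classify_content(img, centroids):
    """Nearest-centroid stand-in for the trained classifier; `centroids`
    maps a content label to a feature vector learned from labeled data."""
    feats = image_features(img)
    return min(centroids, key=lambda k: np.linalg.norm(feats - centroids[k]))

# Hypothetical centroids learned from a labeled set of eye-nose crops.
centroids = {
    "dark image": np.array([40.0, 15.0, 0.02]),
    "bright image": np.array([180.0, 40.0, 0.10]),
}
dark = np.full((32, 32), 35.0)  # uniformly dim synthetic crop
print(classify_content(dark, centroids))  # → dark image
```

A real system would replace the centroids with an SVM, decision tree, or random forest trained as described above, and use the predicted label to pick the matching SDM aligner.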
Additionally, for the alignment of the eye-nose area, the Scale-Invariant Feature Transform (SIFT) algorithm is used for image feature detection and description: it identifies key points in an image and extracts descriptors for them, so that the SIFT features have good invariance to illumination changes, rotation, scaling, and the like. During SIFT feature extraction, key points (salient feature points such as edges and corner points) are detected through a Gaussian pyramid; for each detected key point, a descriptor of its local neighborhood is computed, which contains the gradient information of the surrounding region and forms a high-dimensional feature vector, i.e. the SIFT feature. The supervised descent method is a regression-based approach to shape alignment that learns the change from an initial estimate to the target shape so as to minimize an objective function. In this embodiment, when the supervised descent model is trained, an initial estimate is made first, namely the initial alignment result of the eye-nose area obtained from the error learning framework is used as the model input; a regression model is then constructed to learn the movement of the key points, and through multiple iterations the model trains a series of descent directions to adjust the key point positions. The objective function is set as the nonlinear square error of each key point, and the goal of the model is to minimize the error between the predicted and true coordinates by optimizing this objective function.
Specifically, during alignment the landmark points are aligned first. The landmark points are the positions of key points in the eye-nose area (such as the pupil centers, eye corners, and nasal tip). The SIFT features extracted from the eye-nose area serve as input and provide information about the key point positions; according to this information, the supervised descent model gradually optimizes the landmark point positions through regression learning to improve alignment accuracy. In each iteration, the supervised descent model computes the error for the current landmark points and adjusts their positions to reduce it, thereby forming a series of descent directions. The update rule at each descent is implemented by nonlinear least squares, and each iterative optimization proceeds toward reducing the nonlinear square function of each landmark point. After multiple iterations, the supervised descent model adjusts the landmark point positions to an optimal state, so that accurate pupil center coordinates are extracted; the final output includes the coordinates of the key points in the eye-nose area and the computed pupil center coordinates.
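The supervised-descent idea, learning a cascade of descent directions from training data rather than computing gradients at test time, can be shown on a heavily simplified 1-D toy "landmark". Here `tanh` stands in for the SIFT feature mapping and a scalar least-squares fit stands in for the learned regressor; nothing below reproduces the patent's actual model:

```python
import numpy as np

rng = np.random.default_rng(0)

def h(x):
    # nonlinear "appearance" function standing in for SIFT feature extraction
    return np.tanh(x)

# training set: true landmark positions and noisy initial estimates
x_true = rng.uniform(-1, 1, 500)
x0 = x_true + rng.normal(0, 0.3, 500)

# learn a cascade of K scalar descent directions r_k, each chosen by
# least squares to map the feature residual to the ideal shape update
K, x, descent = 4, x0.copy(), []
for k in range(K):
    phi = h(x) - h(x_true)          # feature residual at the current estimate
    dx = x_true - x                  # ideal update toward the true landmark
    r = (phi @ dx) / (phi @ phi)     # 1-D least-squares descent direction
    descent.append(r)
    x = x + r * phi                  # apply the learned update

err0 = np.sqrt(np.mean((x0 - x_true) ** 2))  # RMS error before the cascade
errK = np.sqrt(np.mean((x - x_true) ** 2))   # RMS error after the cascade
print(errK < err0)  # → True
```

Each `r` is the optimal linear map for its stage, so the squared landmark error is non-increasing across the cascade, which mirrors the "series of descent directions" trained in the text.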
In step S103, a three-dimensional coordinate of the pupil center is estimated according to the general 3D face model and the first coordinate, and a pixel arrangement mode of a screen of the tablet personal computer and a position of a lenticular lens grating are dynamically adjusted based on the three-dimensional coordinate, so that a left eye image and a right eye image projected by the tablet personal computer are matched with the three-dimensional coordinate of the pupil center.
The general 3D face model may be Candide-3. When estimating the three-dimensional coordinate of the pupil center, the calculation proceeds along the face normal direction in combination with a fixed interpupillary distance, where the face normal direction is estimated from the alignment of the 3D face model with preset landmark points of the eye-nose area.
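The fixed-interpupillary-distance depth estimate can be illustrated with a plain pinhole-camera model: the pupils' pixel separation plus a known physical IPD gives depth by similar triangles. The focal length, principal point, and the 63 mm IPD below are assumed values, and the patent's face-normal correction from the Candide-3 model is omitted:

```python
import numpy as np

def pupil_3d(left_px, right_px, f=1000.0, c=(640.0, 360.0), ipd_mm=63.0):
    """Estimate 3D pupil-center coordinates (camera frame, mm) from the 2D
    'first' pupil coordinates under a pinhole model with fixed IPD."""
    left, right = np.asarray(left_px, float), np.asarray(right_px, float)
    d = np.linalg.norm(left - right)    # pupil separation in pixels
    z = f * ipd_mm / d                  # depth by similar triangles
    to3d = lambda p: np.array([(p[0] - c[0]) * z / f,
                               (p[1] - c[1]) * z / f,
                               z])
    return to3d(left), to3d(right)

L, R = pupil_3d((600.0, 360.0), (680.0, 360.0))
print(round(L[2], 1))  # → 787.5  (depth in mm for these assumed values)
```

A fuller implementation would tilt the recovered eye baseline along the estimated face normal before back-projecting, as the text describes.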
In addition, computing the required pixel arrangement includes dynamically adjusting the pixel arrangement on the tablet computer screen according to the three-dimensional coordinate of the pupil center, specifically repositioning the screen's pixel arrangement by computing the deviation between the current pupil position and the ideal observation point; for example, if the pupil center is offset, the virtual pixel positions of the screen are adjusted so that the left eye and right eye images overlap correctly. Adjusting the position of the cylindrical lens grating includes moving the grating according to the three-dimensional coordinate of the pupil center, so that the grating properly guides light at different viewing angles, the image seen by each eye satisfies the stereoscopic effect, and, by computing the refraction and perspective relations of the light at different coordinates, the cylindrical lenses effectively separate the left eye and right eye images.
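A toy sketch of how a tracked pupil offset could shift the per-column view assignment under a lenticular sheet: each lens pitch covers an alternating left/right column pattern, and the phase of that pattern is slid according to the viewer's position. The pitch, shift values, and two-view layout are illustrative assumptions, not the patent's actual optics:

```python
def column_views(n_cols, lens_pitch_px, viewer_shift_px, n_views=2):
    """Assign each screen column to a view (0 = left, 1 = right) under a
    lenticular grating; viewer_shift_px is a phase shift derived from the
    tracked pupil position (hypothetical mapping)."""
    return [int((x + viewer_shift_px) // (lens_pitch_px / n_views)) % n_views
            for x in range(n_cols)]

# with no shift, views alternate every half lens pitch
print(column_views(8, lens_pitch_px=4, viewer_shift_px=0))
# → [0, 0, 1, 1, 0, 0, 1, 1]
# a half-pitch viewer shift swaps which columns feed each eye
print(column_views(8, lens_pitch_px=4, viewer_shift_px=2))
# → [1, 1, 0, 0, 1, 1, 0, 0]
```

The same idea generalizes to subpixel interleaving and to physically moving the grating, which is what the dynamic adjustment in the text achieves.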
In one embodiment, whereas detection of the eye-nose area in the prior art mostly depends on a neural network, which consumes substantial computing resources, this embodiment adopts an error-based learning (EBL) framework: a heuristic learning method that mimics how humans learn from mistakes in order to improve the performance of a machine learning model. The error learning framework comprises an early stage, a middle stage, and a mature stage during training, and the training steps include S201-S203:
S201, in the early stage, detecting local texture information of the nose and eyes of a human face from a first sample image set based on the local binary pattern of the Adaboost algorithm, training a first cascade classifier based on the local texture information, applying the first cascade classifier to a second sample image set, and collecting the sample images misclassified by the first cascade classifier, wherein the first sample image set comprises positive sample images and negative sample images, a positive sample image being an image containing a human face and a negative sample image being an image without a human face.
Here, an Adaboost algorithm framework is used for the basic training; this framework extracts features from images by means of Local Binary Patterns (LBP) and then trains a first cascade classifier on these features. The first cascade classifier is trained on a smaller first sample image set. After training is completed, it is applied to a larger second sample image set to identify the images that were misclassified. These misclassified images become the training samples for the next, middle stage, while the correctly classified images are discarded.
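The LBP features mentioned above can be illustrated with the classic 8-neighbour code for a single pixel: each neighbour at least as bright as the centre contributes one bit. This is the textbook LBP definition, not code from the patent:

```python
import numpy as np

def lbp_code(patch):
    """Classic 8-neighbour local binary pattern for the centre pixel of a
    3x3 patch: neighbours >= centre set a bit, read clockwise from the
    top-left corner."""
    c = patch[1, 1]
    order = [(0, 0), (0, 1), (0, 2), (1, 2),
             (2, 2), (2, 1), (2, 0), (1, 0)]
    bits = [1 if patch[r, col] >= c else 0 for r, col in order]
    return int("".join(map(str, bits)), 2)

patch = np.array([[9, 9, 9],
                  [1, 5, 1],
                  [1, 1, 1]])
print(lbp_code(patch))  # top row brighter than centre → 0b11100000 = 224
```

Histograms of such codes over the eye-nose region form the local texture features that the cascade classifier consumes.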
And S202, in the middle stage, constructing and adjusting a second cascade classifier based on the sample images misclassified in the early stage, classifying those misclassified sample images with the second cascade classifier, and collecting the sample images that are misclassified again.
The emphasis of training shifts to the samples misclassified in the early stage; these are generally harder to classify and are the most valuable for improving the model. The classifier is then redesigned and adjusted based on these misclassified samples to obtain the second cascade classifier, further improving its ability to handle complex cases. In this process, the classifier is optimized in a targeted manner, so that it gradually learns to correctly handle the cases it misrecognized early on.
And S203, in the mature stage, setting a threshold value of classification accuracy, and continuously adjusting the second cascade classifier until its classification accuracy on the re-misclassified sample images reaches the threshold value. The number of samples in the first sample image set is much smaller than in the second sample image set, e.g. by a factor of 10.
In the mature stage, the error learning framework focuses on the samples still misclassified in the middle stage; by continuing to train on these key samples, the final second cascade classifier is further refined, improving its accuracy on all input data. Training at this stage not only reduces training time but also achieves higher classification accuracy through repeated screening and adjustment.
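The three-stage error-based training loop can be sketched with a decision stump standing in for the cascade classifier, on synthetic 1-D data: train on a small first set, harvest misclassified samples from the larger second set, fold them into the training pool, and refit until an accuracy threshold is reached. The data distribution, the stump, and the roughly 10x size ratio are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_stump(x, y):
    """Stand-in weak classifier: the threshold with the highest training
    accuracy; positives are labelled 1, negatives 0."""
    cands = np.unique(x)
    return max(cands, key=lambda t: np.mean((x >= t) == y))

# positives ("faces") cluster high, negatives ("non-faces") low — synthetic
x_all = np.concatenate([rng.normal(2.0, 1.0, 600), rng.normal(-2.0, 1.0, 600)])
y_all = np.concatenate([np.ones(600), np.zeros(600)])

# early stage: small first sample set (~10x smaller, as in the text)
idx = rng.permutation(1200)[:120]
x_tr, y_tr = x_all[idx], y_all[idx]
t = fit_stump(x_tr, y_tr)

# middle/mature stages: harvest misclassified samples from the large set,
# add them to the training pool, and refit until the accuracy threshold
for _ in range(5):
    wrong = (x_all >= t) != (y_all == 1)
    if np.mean(~wrong) >= 0.95:
        break
    x_tr = np.concatenate([x_tr, x_all[wrong]])
    y_tr = np.concatenate([y_tr, y_all[wrong]])
    t = fit_stump(x_tr, y_tr)

print(np.mean((x_all >= t) == (y_all == 1)) >= 0.9)  # → True
```

Re-training on the harvested errors is the same hard-example-mining pattern the EBL stages describe, just without the cascade structure.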
In one embodiment, the method further comprises:
after the first coordinate of the pupil center is obtained, checking the aligned eye-nose area based on a tracking checker to verify that the alignment result correctly contains the pupil, wherein the tracking checker comprises SIFT features, a pre-trained support vector machine classifier, and an Adaboost algorithm;
when the tracking checker determines that the alignment result is incorrect, performing the calculation of the first coordinate of the pupil center on a subsequent frame of the video;
and when the tracking checker determines that the alignment result is correct, maintaining the alignment result, and tracking and checking the pupil center in subsequent frames of the video based on the tracking checker, with the first coordinate as the starting coordinate.
The tracking checker operates as follows: SIFT features are extracted from the aligned eye-nose area and classified by the support vector machine classifier to judge whether they include a pupil; local texture features in the eye-nose area are captured based on the local binary pattern in the Adaboost algorithm to judge whether the area contains a pupil; and the judgment results of the support vector machine classifier and the Adaboost algorithm are fused by weighting to obtain the final checking result.
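The weighted fusion of the two detectors' judgments might look like the following, with pupil-presence scores in [0, 1]; the weights and decision threshold are assumed values, not specified by the patent:

```python
def fused_check(svm_score, adaboost_score, w_svm=0.6, w_ada=0.4, thresh=0.5):
    """Weighted fusion of the SVM and Adaboost pupil-presence scores;
    returns True when the fused score passes the decision threshold."""
    return w_svm * svm_score + w_ada * adaboost_score >= thresh

print(fused_check(0.9, 0.2))  # SVM confident, Adaboost weak → True
print(fused_check(0.2, 0.3))  # both weak → False
```

In practice the weights would be tuned on a validation set so that the fused decision outperforms either detector alone.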
Additionally, when the tracking checker is trained, the judgment results of the Adaboost algorithm are examined, and the samples misidentified by the Adaboost algorithm are input into the support vector machine classifier to train it.
In an embodiment, the method further includes: the tablet computer shoots the video frames with a near-infrared camera, and during content classification, if the content classification result includes eyeglass reflection, the reflection area in the eye-nose area is repaired. Repairing the reflection area in the eye-nose area includes detecting the reflection area based on pixel gradient information; performing edge minimization on the reflection area, where the edge minimization includes a smoothing method and an interpolation method; and minimizing the gradient difference of the reflection area based on the pixel gradient information.
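A toy version of the reflection repair: flag bright pixels as specular highlights, then iteratively fill each flagged pixel from its non-flagged neighbours, which both interpolates and smooths away the gradient jump at the highlight boundary. The brightness threshold and 4-neighbour interpolation are illustrative; the patent's gradient-based detection and edge minimisation are simplified to this one test:

```python
import numpy as np

def repair_reflection(img, bright_thresh=200.0):
    """Flag saturated pixels as reflection (a real system would also use
    pixel-gradient information), then iteratively replace each flagged
    pixel with the mean of its non-flagged 4-neighbours."""
    img = img.astype(float).copy()
    mask = img > bright_thresh
    for _ in range(20):                        # iterate until the mask empties
        if not mask.any():
            break
        for r, c in zip(*np.nonzero(mask)):
            vals = [img[rr, cc] for rr, cc in
                    ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                    if 0 <= rr < img.shape[0] and 0 <= cc < img.shape[1]
                    and not mask[rr, cc]]
            if vals:                           # fill from known neighbours
                img[r, c] = np.mean(vals)
                mask[r, c] = False
    return img

eye = np.full((5, 5), 90.0)
eye[2, 2] = 255.0                              # single specular highlight
print(repair_reflection(eye)[2, 2])            # → 90.0
```

Filling from the outside in like this makes the repaired region's gradients match its surroundings, the same goal as the gradient-difference minimisation in the text.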
Based on the same idea, as shown in fig. 2, an exemplary embodiment of the present disclosure further provides a dynamic eye tracking tablet computer for naked eye 3D, where the tablet computer includes:
an eye-nose detection module 201, configured to acquire a video frame shot by a front camera and detect an eye-nose area in the video frame based on an error learning framework, where a detector of the error learning framework comprises an Adaboost algorithm with a local binary pattern;
the eye-nose alignment module 202, configured to perform content classification on the eye-nose area, where the content classification conditions include whether glasses are worn, the illumination environment type, glasses reflection, and thick glasses; extract SIFT features from the eye-nose area based on a scale-invariant feature transform algorithm; select a corresponding supervised descent model according to the content classification result; and train a plurality of descent directions on the SIFT features to minimize the average value of the nonlinear square function of each landmark point, so that the SIFT features perform regression-based landmark point alignment, the eye-nose shape of the eye-nose area is aligned, and a first coordinate of the pupil center of the human eye is obtained;
The control display module 203 is configured to estimate a three-dimensional coordinate of the pupil center according to the general 3D face model and the first coordinate, and dynamically adjust a pixel arrangement mode of the screen and a position of the lenticular lens grating based on the three-dimensional coordinate, so that a left eye image and a right eye image projected by the screen are matched with the three-dimensional coordinate of the pupil center.
In this embodiment, based on content classification, the pupil center can be tracked under various environmental conditions, and the classification, detection, and tracking methods adopted require only CPU resources and do not depend on GPU computation, so they can run under limited system resources. By identifying the pupil position in real time, the pixel arrangement of the video frames can be dynamically adjusted and the projected left eye and right eye images adjusted in real time, so that the observer obtains a good naked eye 3D experience even when changing position.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.) or on a network, including several instructions to cause a computing device (may be a personal computer, a server, a terminal device, or a network device, etc.) to perform the method according to the exemplary embodiments of the present disclosure.
Furthermore, the above-described figures are only schematic illustrations of processes included in the method according to the exemplary embodiments of the present disclosure, and are not intended to be limiting. It will be readily appreciated that the processes shown in the above figures do not indicate or limit the temporal order of these processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, for example, among a plurality of modules.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.