Pattern Recognition Letters

Volume 28, Issue 15, 1 November 2007, Pages 2116-2126

Rigid and non-rigid face motion tracking by aligning texture maps and stereo 3D models

https://doi.org/10.1016/j.patrec.2007.06.011

Abstract

Accurate rigid and non-rigid tracking of faces is a challenging task in computer vision. Recently, appearance-based 3D face tracking methods have been proposed. These methods can successfully tackle the image variability and drift problems. However, they may fail to provide accurate out-of-plane face motions since they are not very sensitive to out-of-plane motion variations. In this paper, we present a framework for fast and accurate 3D face and facial action tracking. Our proposed framework retains the strengths of both appearance-based and 3D data-based trackers. We combine an adaptive appearance model with an online stereo-based 3D model. We provide experiments and a performance evaluation that show the feasibility and usefulness of the proposed approach.

Introduction

The ability to detect and track human heads and faces in video sequences is useful in a great number of applications, such as human–computer interaction and gesture recognition. There are several commercial products capable of accurate and reliable 3D head position and orientation estimation (e.g., the acoustic tracker system Mouse [www.vrdepot.com/vrteclg.htm]). These are either based on magnetic sensors or on special markers placed on the face; both practices are encumbering, causing discomfort and limiting natural motion. Vision-based 3D head tracking provides an attractive alternative since vision sensors are not invasive and hence natural motions can be achieved (Moreno et al., 2002). However, detecting and tracking faces in video sequences is a challenging task due to the image variability caused by pose, expression, and illumination changes.

Recently, deterministic and statistical appearance-based 3D head tracking methods have been proposed and used by some researchers (Cascia et al., 2000, Ahlberg, 2002, Matthews and Baker, 2004). These methods can successfully tackle the image variability and drift problems by using deterministic or statistical models for the global appearance of a special object class: the face. However, appearance-based methods dedicated to full 3D head tracking may suffer from some inaccuracies since these methods are not very sensitive to out-of-plane motion variations. On the other hand, the use of dense 3D facial data provided by a stereo rig or a range sensor can provide very accurate 3D face motions. However, computing the 3D face motions from the stream of dense 3D facial data is not straightforward. Indeed, inferring the 3D face motion from the dense 3D data needs an additional process. This process can be the detection of some particular facial features in the range data/images from which the 3D head pose can be inferred. For example, in (Malassiotis and Strintzis, 2005), the 3D nose ridge is detected and then used for computing the 3D head pose. Alternatively, one can perform a registration between 3D data obtained at different time instants in order to infer the relative 3D motions. The most common registration technique is the iterative closest point (ICP) (Besl and McKay, 1992) algorithm. This algorithm and its variants can provide accurate 3D motions but their significant computational cost prohibits real-time performance. Moreover, this algorithm is intended for registering rigid objects.
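
For reference, the classical point-to-point ICP loop alternates a nearest-neighbour correspondence search with a closed-form rigid update. The following Python sketch (NumPy/SciPy) illustrates this generic scheme under function names of our own choosing; it is not the registration procedure used by the proposed tracker.

```python
import numpy as np
from scipy.spatial import cKDTree

def best_rigid_transform(src, dst):
    """Closed-form least-squares rigid transform (R, t) such that dst ≈ R @ src + t."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    H = (src - mu_s).T @ (dst - mu_d)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflections
    R = Vt.T @ D @ U.T
    return R, mu_d - R @ mu_s

def icp(source, target, n_iters=30, tol=1e-6):
    """Classical point-to-point ICP between two (N, 3) / (M, 3) point clouds."""
    tree = cKDTree(target)
    src = source.copy()
    R_total, t_total = np.eye(3), np.zeros(3)
    prev_err = np.inf
    for _ in range(n_iters):
        dists, idx = tree.query(src)                 # nearest-neighbour correspondences
        R, t = best_rigid_transform(src, target[idx])
        src = src @ R.T + t                          # apply the incremental motion
        R_total, t_total = R @ R_total, R @ t_total + t
        err = dists.mean()
        if abs(prev_err - err) < tol:                # stop when the error stagnates
            break
        prev_err = err
    return R_total, t_total
```

Each iteration requires a nearest-neighbour query over all points, which is part of what makes ICP-style registration of dense facial data costly.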

Classical 3D face tracking algorithms that are based on 3D facial features are subject to drift problems. Moreover, these algorithms cannot compute the facial actions due, for instance, to facial expressions.

The main contribution of this paper is a robust 3D face tracker that combines the advantages of both appearance-based trackers and 3D data-based trackers while keeping the CPU time very close to that required by real-time trackers. In our work, we use the Candide deformable 3D model (Ahlberg, 2001), a simple model that embeds non-rigid facial motion through the concept of facial actions. Our proposed framework for tracking faces in videos can be summarized as follows. First, the 3D head pose and some facial actions are estimated from the monocular image by registering the warped input texture with a shape-free facial texture map. Second, based on these current parameters, the 2D locations of the mesh vertices are inferred by projecting the current mesh onto the current video frame; the 3D coordinates of these vertices are then computed by stereo reconstruction. Third, the relative 3D face motion is obtained using a robust 3D-to-3D registration technique between the two meshes corresponding to the first video frame and the current video frame, respectively. Our framework attempts to reduce the number of outlier vertices by deforming the meshes according to the same current facial actions and by exploiting the symmetrical shape of the 3D mesh.
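
As an illustration of the second step, reconstructing a projected mesh vertex from a calibrated stereo pair amounts to standard linear triangulation. The Python sketch below shows that operation only; the projection matrices and pixel coordinates are assumed to be provided by the calibrated rig, and this is not the reconstruction routine used in our experiments.

```python
import numpy as np

def triangulate_vertex(P_left, P_right, x_left, x_right):
    """Linear (DLT) triangulation of one mesh vertex from its two image projections.

    P_left, P_right : (3, 4) projection matrices of the calibrated stereo rig.
    x_left, x_right : (2,) pixel coordinates of the same vertex in both views.
    Returns the 3D point expressed in the rig's reference frame.
    """
    A = np.vstack([
        x_left[0]  * P_left[2]  - P_left[0],
        x_left[1]  * P_left[2]  - P_left[1],
        x_right[0] * P_right[2] - P_right[0],
        x_right[1] * P_right[2] - P_right[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]   # dehomogenize
```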

The resulting 3D face and facial action tracker is accurate, fast, and drift insensitive. Moreover, unlike many proposed frameworks (e.g., Xiao et al., 2004), it does not require any learning stage since it is based on online facial appearances and online stereo 3D data.

The remainder of the paper proceeds as follows. Section 2 introduces our deformable 3D facial model. Section 3 states the problem we are focusing on, summarizes the appearance-based monocular tracker that estimates the 3D head pose and some facial actions in real time, and gives some evaluation results. Section 4 describes a robust 3D-to-3D registration that combines the monocular tracker’s results and the stereo-based reconstructed vertices. Section 5 gives some experimental results.

Modeling faces

In this section, we briefly describe our deformable face model and explain how to produce a shape-free facial texture map.
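
The detailed description is truncated in this excerpt. As a reminder, a Candide-style model deforms a base mesh with static shape units and dynamic animation (action) units; the sketch below assumes the standard linear formulation of Ahlberg (2001) and is only illustrative.

```python
import numpy as np

def candide_shape(g_bar, S, A, tau_s, tau_a):
    """Candide-style deformable shape (assumed standard parameterization).

    g_bar        : (3N,) base mesh vertex coordinates stacked as [x1, y1, z1, ...].
    S            : (3N, m) shape units (person-specific, static).
    A            : (3N, k) animation/action units encoding the facial actions.
    tau_s, tau_a : activation vectors of the shape and action units.
    """
    return g_bar + S @ tau_s + A @ tau_a   # g(tau_s, tau_a) = g_bar + S tau_s + A tau_a
```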

Problem formulation

Given a monocular video sequence depicting a moving head/face, we would like to recover, for each frame, the 3D head pose and the facial actions encoded by the control vector τ_a. In other words, we would like to estimate the vector b_t (Eq. (2)) at time t given all the observed data until time t, denoted y_{1:t} ≡ {y_1, …, y_t}. In a tracking context, the model parameters associated with the current frame will be handed over to the next frame.
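
Eq. (2) is not reproduced in this excerpt; assuming the usual parameterization of such trackers, the state vector stacks the three rotation and three translation parameters of the head with the facial action vector:

```latex
\mathbf{b}_t = \left[\,\theta_x \;\; \theta_y \;\; \theta_z \;\; t_x \;\; t_y \;\; t_z \;\; \boldsymbol{\tau}_a^{\top}\,\right]^{\top}
```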

In (Dornaika and Davoine, 2006), we have developed a fast

Tracking by aligning texture maps and stereo-based 3D models

In this section, we propose a novel tracking scheme that aims at computing fast and accurate 3D face motion. To this end, we exploit the tracking results provided by the appearance-based tracker (Section 3) and the availability of a stereo system for reconstructing the mesh vertices. The whole algorithm is outlined in Fig. 3. The three stages are applied to each video frame (stereo pair). Note that the facial actions are already computed using the monocular tracker described in Section 3.
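
Because the two meshes are instances of the same Candide model, their vertices are already in correspondence, so the rigid motion can be estimated in closed form and made robust by discarding vertices with large residuals (badly reconstructed or occluded points). The Python sketch below shows one simple trimmed variant of this idea; it illustrates the principle rather than the exact robust weighting used in our registration.

```python
import numpy as np

def rigid_fit(P, Q):
    """Closed-form least-squares rigid transform (R, t) with Q ≈ R @ P + t (P, Q in correspondence)."""
    mu_p, mu_q = P.mean(axis=0), Q.mean(axis=0)
    H = (P - mu_p).T @ (Q - mu_q)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflections
    R = Vt.T @ D @ U.T
    return R, mu_q - R @ mu_p

def robust_rigid_fit(P, Q, n_iters=5, keep=0.8):
    """Trimmed rigid registration of corresponding mesh vertices.

    At each iteration the transform is recomputed on the `keep` fraction of
    vertices with the smallest residuals, which discards badly reconstructed
    (outlier) vertices.
    """
    inliers = np.arange(len(P))
    for _ in range(n_iters):
        R, t = rigid_fit(P[inliers], Q[inliers])
        residuals = np.linalg.norm(P @ R.T + t - Q, axis=1)
        inliers = np.argsort(residuals)[: max(3, int(keep * len(P)))]
    return R, t
```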

Our

Experimental results

We use the commercial stereo camera system Bumblebee from Point Grey (http://www.ptgrey.com). It consists of two Sony ICX084 color CCDs with 6 mm focal length lenses. The monocular sequence is used by the monocular tracker (Section 3), while the stereo sequence is used by the 3D-to-3D registration technique (Section 4). Fig. 5 (top) shows the face and facial action tracking results associated with a 300-frame sequence (only three frames are shown). The tracking results were obtained using the

Conclusion

In this paper, we have proposed a robust 3D face and facial action tracker that combines the advantages of both appearance-based trackers and 3D data-based trackers while keeping the CPU time very close to that required by real-time trackers. Experiments on real video sequences indicate that the estimates of the out-of-plane motions of the head can be considerably improved by combining a robust 3D-to-3D registration with the appearance model.

Although the joint use of 3D facial data and the ICP

Acknowledgment

This work was supported in part by the MEC project TRA2004-06702/AUT and The Ramón y Cajal Program.
