A featureless and stochastic approach to on-board stereo vision system pose
Introduction
In recent years, several techniques for on-board vision pose estimation have been proposed [7], [10], [23], [26], [30], [32]. The main application domain was advanced driver assistance. The proposed approaches can be broadly classified into two categories: highway and urban. For each category, the vision sensor can be either a monocular camera or a stereo head. Most of the techniques proposed for highway environments focus on lane and car detection, aiming at an efficient driver assistance system. Techniques for urban environments, on the other hand, generally focus on collision avoidance or pedestrian detection. Although a similar objective is pursued in both domains, it is very challenging to develop a generic algorithm able to cope with both problems. The real-time estimation of on-board vision system pose—position and orientation—is a challenging task since (i) the sensor undergoes motions due to the vehicle dynamics and road imperfections, and (ii) the viewed scene is unknown and continuously changing.
Of particular interest is the estimation of the on-board camera’s position and orientation relative to the 3D road plane. Note that since the 3D plane parameters are expressed in the camera coordinate system, the camera’s position and orientation are equivalent to the 3D plane parameters. Algorithms for fast road plane estimation are very useful for driver assistance applications as well as for augmented reality applications. For the former, continuously updated plane parameters (the vehicle pose) make obstacle and object detection considerably more efficient [17], [33]. For the latter, one can for instance insert real or synthetic objects into the video captured by the on-board vision system based on the estimated road plane parameters. These continuously updated parameters provided by the vision sensor make the inserted objects appear to be a physical part of the scene. If constant road plane parameters are used instead, the inserted objects may suffer from misalignment whenever the actual plane parameters change due to the car’s dynamics and road imperfections. Dealing with an urban scenario is, however, more difficult than dealing with a highway scenario, since prior knowledge as well as visual features are not always available in such scenes.
In general, monocular vision systems avoid problems related to 3D Euclidean geometry by using prior knowledge of the environment as an extra source of information. For instance: (a) a road with a constant width is assumed [13], [12]; (b) the car is driven along two parallel lane markings, which are projected to the left and to the right of the image [25]; (c) after an initial calibration process, the camera’s position and pitch angle remain constant over time [28]; to mention a few.
Although prior knowledge has been extensively used to tackle the driver assistance problem, it may lead to wrong results. In particular, assuming a constant camera position and orientation is not valid in urban scenarios, since both are easily affected by road imperfections or artifacts (e.g., rough roads, speed bumps), the car’s accelerations, uphill/downhill driving, among others. To face this problem, [13] introduces a technique for estimating the vehicle’s yaw, pitch, and roll. However, since a single camera is used, that work relies on the assumption that some parts of the road have a constant width (e.g., lane markings). Similarly, Liang et al. [25] propose to estimate the camera’s orientation by assuming that the vehicle is driven along two parallel lane markings. Unfortunately, neither of these two approaches generalizes to urban scenarios, since in general lanes are not as well defined as those of highways. In [32], the authors use a single mounted camera. An extended Kalman filter infers a state vector including the vehicle rigid motion (six degrees of freedom) and the camera pose, where the measurements are given by the eight-parameter planar motion field and the readings of the velocity and yaw rate sensors.
The use of prior knowledge has also been considered by some stereo vision based techniques to simplify the problem and to speed up the whole processing by reducing the amount of information to be handled. For instance, some of the aforementioned assumptions are also considered when stereo systems are used: flat road [6], [20] or constant camera pose [5], among others.
In the literature, many application-oriented stereo systems have been proposed. For instance, the edge-based v-disparity approach proposed in [22] for automatic estimation of horizon lines, later used for applications such as obstacle or pedestrian detection (e.g., [4], [11], [21]), only computes 3D information over local maxima of the image gradient. A sparse disparity map is computed in order to obtain real-time performance. In addition to obstacle and pedestrian detection, the authors present an atmospheric visibility measurement system using the v-disparity information provided by stereo vision [18]. Recently, this v-disparity approach has been extended to a u–v-disparity concept in [19]. In that work, dense disparity maps are used instead of relying only on edge-based disparity maps. Working in the disparity space is an interesting idea that is gaining popularity in on-board stereo vision applications, since planes in the original Euclidean space become straight lines in the disparity space.
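As an illustration of the last point, consider a flat road viewed by a rectified stereo rig: every road point projects to a disparity that is an affine function of its image row, so the road plane collapses to a single line in v-disparity space. The short sketch below verifies this numerically; the rig parameters (focal length, baseline, camera height) are illustrative assumptions, not those of any particular system.

```python
import numpy as np

# Hypothetical rig parameters: focal length f (pixels), baseline b (m),
# camera height h (m); the camera looks along +Z with zero pitch.
f, b, h = 700.0, 0.12, 1.2

# Sample points on the flat road plane Y = h at increasing depth Z.
Z = np.linspace(5.0, 50.0, 200)
Y = np.full_like(Z, h)

v = f * Y / Z  # image row (principal point at 0 for simplicity)
d = f * b / Z  # stereo disparity

# The plane should map to the line d = (b/h) * v in v-disparity space.
coeffs = np.polyfit(v, d, 1)
print(coeffs)  # slope ≈ b/h = 0.1, intercept ≈ 0
```

Under this model the slope of the v-disparity line encodes the ratio of baseline to camera height, which is why tracking that line amounts to tracking the road plane parameters.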
In the computer vision community, many works have addressed the detection and estimation of planes from images [1], [9], [27]. However, these works rely on feature extraction. The same holds true for 3D camera motion methods (e.g., [24], [34]).
In [29], we proposed an approach for on-line vehicle pose estimation using a commercial stereo head. Although the proposed technique does not require the extraction of visual features in the images, it is based on dense depth maps and on the extraction of a dominant 3D plane that is assumed to be the road plane.
As can be seen, existing works follow a common scheme. First, features are extracted either in the image space (optical flow, edges, ridges, interest points) or in the 3D Euclidean space (assuming the 3D data are built online). Second, a deterministic estimation is invoked in order to recover the unknown parameters. In this paper, we propose a novel paradigm that is based on the raw stereo images provided by a stereo head. Moreover, the new paradigm is stochastic, since the aim is to track the vehicle pose parameters—the road plane parameters—given stereo pairs arriving in a sequential fashion.
The stochastic tracking relies on the particle filtering framework. The proposed technique can be used indistinctly for urban or highway environments, since it does not rely on the extraction of specific visual traffic features, either in 2D or in 3D. The use of particle filtering schemes is useful for maintaining a lock on the estimated parameters even when perturbing factors such as occlusions and video streaming discontinuities appear.
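To make the role of the particle filter concrete, the following minimal SIR (sampling importance resampling) sketch tracks a two-dimensional pose-like state (say, camera height and pitch) with a random-walk motion model and a Gaussian observation likelihood. It is an illustrative toy with assumed noise levels, not the paper's actual filter, whose likelihood is computed from image brightness.

```python
import numpy as np

rng = np.random.default_rng(0)

n_particles = 500
particles = np.tile([1.2, 0.0], (n_particles, 1))  # initial state guess
sigma_motion = np.array([0.01, 0.005])             # per-frame drift (assumed)
sigma_obs = np.array([0.05, 0.02])                 # observation noise (assumed)

true_state = np.array([1.2, 0.0])
for t in range(100):
    true_state = true_state + rng.normal(0, sigma_motion)  # simulated truth
    z = true_state + rng.normal(0, sigma_obs)              # noisy measurement

    # Predict: diffuse particles with the random-walk motion model.
    particles = particles + rng.normal(0, sigma_motion, particles.shape)
    # Weight: Gaussian observation likelihood, then normalize.
    w = np.exp(-0.5 * np.sum(((particles - z) / sigma_obs) ** 2, axis=1))
    w /= w.sum()
    # Estimate: weighted mean of the particle set.
    estimate = w @ particles
    # Resample: draw particles proportionally to their weights.
    particles = particles[rng.choice(n_particles, n_particles, p=w)]

print(np.abs(estimate - true_state))  # tracking error on the order of sigma_obs
```

The predict/weight/resample loop is what lets the filter recover a lock after occlusions or streaming gaps: the diffusion step keeps spreading hypotheses until the likelihood picks up the correct state again.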
Our proposed method has a significant advantage over existing methods since it requires neither road segmentation nor dense matching—two difficult and time-consuming tasks. Moreover, to the best of our knowledge, the work presented in this paper is the first to estimate road parameters directly from the raw brightness images using a particle filter.
The rest of the paper is organized as follows. Section 2 describes the problem we are focusing on as well as some background. Section 3 briefly describes a 3D data-based method. Section 4 presents the proposed stochastic technique. Section 5 gives some experimental results and method comparisons. Section 6 provides a performance study using synthesized stereo sequences. In the sequel, the “road plane parameters” and the “pose parameters” refer to the same entity.
Section snippets
Experimental setup
A commercial stereo vision system (Bumblebee from Point Grey1) was used. It consists of two Sony ICX084 color CCDs with 6 mm focal length lenses. Bumblebee is a pre-calibrated system that does not require in-field calibration. The baseline of the stereo head is 12 cm and it is connected to the computer by an IEEE-1394 connector. Right and left color images can be captured at a frame rate near 30 fps. This vision system includes software able to
3D data-based approach
In [29], we proposed an approach for on-line vehicle pose estimation using the above commercial stereo head. It aims to compute the camera’s position and orientation—the road plane parameters. The proposed technique consists of two stages. First, a dense depth map of the scene is computed by the provided reconstruction software, which utilizes a dense matching technique. Second, the parameters of a 3D plane fitted to the road are estimated using a RANSAC-based least squares fitting. Moreover,
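The second stage can be sketched as follows: a RANSAC loop hypothesizes planes from random point triplets, scores each by counting points within a distance threshold, and a least squares (SVD) refit on the winning inlier set yields the final plane. The point cloud, threshold, and iteration count below are illustrative assumptions, not the settings of [29].

```python
import numpy as np

rng = np.random.default_rng(1)

def ransac_plane(pts, n_iter=200, thresh=0.05):
    """Fit a plane n·x + d = 0 to pts, robust to outliers."""
    best_inliers = None
    for _ in range(n_iter):
        p = pts[rng.choice(len(pts), 3, replace=False)]
        n = np.cross(p[1] - p[0], p[2] - p[0])
        norm = np.linalg.norm(n)
        if norm < 1e-9:          # degenerate (collinear) sample
            continue
        n /= norm
        d = -n @ p[0]
        inliers = np.abs(pts @ n + d) < thresh
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Least squares refit on the inliers: the plane normal is the
    # singular vector of the centered points with smallest singular value.
    q = pts[best_inliers]
    c = q.mean(axis=0)
    n = np.linalg.svd(q - c)[2][-1]
    return n, -n @ c

# Synthetic cloud: road points near the plane Y = 1.2, plus 20% outliers.
road = rng.uniform([-5, 0, 5], [5, 0, 50], (400, 3))
road[:, 1] = 1.2 + rng.normal(0, 0.01, 400)
outliers = rng.uniform([-5, -3, 5], [5, 3, 50], (100, 3))
n, d = ransac_plane(np.vstack([road, outliers]))
print(n, d)  # normal ≈ ±(0, 1, 0), offset ≈ ∓1.2
```

Such a two-stage scheme is robust to off-road structure (vehicles, curbs, pedestrians), which simply ends up outside the inlier set of the dominant plane.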
A featureless and stochastic approach
Our aim is to estimate the pose parameters from the stream of stereo pairs. In other words, we track the stereo head pose over time.3 In this section, we propose a novel approach that directly infers the plane parameters from the stereo pair using a particle filtering framework.
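The key idea can be illustrated on a rectified stereo pair: a candidate road plane induces a row-dependent disparity d(v) = a·v + c, so one image can be warped toward the other and scored directly on raw brightness, with no feature extraction or dense matching. The toy example below uses synthetic images and assumed parameters, not the paper's data or its exact likelihood; it only shows that the true plane parameters score at least as well as perturbed ones.

```python
import numpy as np

rng = np.random.default_rng(2)
H, W = 120, 160
left = rng.uniform(0, 1, (H, W))     # synthetic left image
a_true, c_true = 0.08, 2.0           # true plane-induced disparity line

# Build the right image by shifting each left-image row by d(v).
cols = np.arange(W)
right = np.empty_like(left)
for v in range(H):
    d = a_true * v + c_true
    right[v] = np.interp(cols - d, cols, left[v])

def score(a, c):
    """Brightness-based plane score (negative SSD): higher is better."""
    err = 0.0
    for v in range(H):
        warped = np.interp(cols - (a * v + c), cols, left[v])
        err += np.sum((warped - right[v]) ** 2)
    return -err

# The true parameters score at least as well as perturbed candidates,
# so a particle filter weighting candidates this way locks onto them.
print(score(a_true, c_true) >= score(a_true + 0.02, c_true))  # True
```

In a particle filtering framework, each particle holds a candidate set of plane parameters and receives a weight from such a brightness-based likelihood, which is what makes the approach featureless.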
Experiments
The proposed technique has been tested on different urban environments. In this section, we will provide results obtained with three different videos associated with different road structures. Moreover, we provide a performance study using synthetic videos with ground-truth data.
Performance study
So far in this paper the evaluation of the proposed method has been carried out on real video sequences, including a comparison with a 3D data-based approach. However, it is very challenging to get ground-truth data for the on-board camera pose. In this section, we propose a simple scheme that can provide the ground-truth data for the road parameters. To this end, we use a 200-frame-long real video captured by the on-board stereo camera. For each stereo pair, we can fix the distance and the
Discussion and future work
A featureless and stochastic technique for real-time estimation of on-board stereo head pose has been presented. The method adopts a particle filtering scheme that uses image brightness in its observation likelihood. The advantages of the proposed technique are as follows. First, the technique does not need any feature extraction, either in the image domain or in 3D space. Second, the technique inherits the strengths of stochastic tracking approaches. A good performance has been shown in
Acknowledgments
This work was partially supported by the Government of Spain under MEC project TRA2007-62526/AUT and research programme Consolider Ingenio 2010: MIPRCV (CSD2007-00018).
References (34)
- K.R.T. Aires, H. Araujo, A.A.D. Medeiros, Plane detection from monocular image sequences, in: Alvey Vision Conference,...
- J.M. Alvarez, A. López, R. Baldrich, Illuminant-invariant model-based road segmentation, in: IEEE Intelligent Vehicles...
- et al., A tutorial on particle filters for on-line nonlinear/non-Gaussian Bayesian tracking, IEEE Transactions on Signal Processing (2002)
- M. Bertozzi, E. Binelli, A. Broggi, M. Del Rose, Stereo vision-based approaches for pedestrian detection, in:...
- et al., GOLD: A parallel real-time stereo vision system for generic obstacle and lane detection, IEEE Transactions on Image Processing (1998)
- M. Bertozzi, A. Broggi, R. Chapuis, F. Chausse, A. Fascioli, A. Tibaldi, Shape-based pedestrian detection and...
- et al., Pedestrian detection for driver assistance using multiresolution infrared vision, IEEE Transactions on Vehicular Technology (2004)
- et al., Active Contours (2000)
- et al., Identification and matching of planes in a pair of uncalibrated images, International Journal of Pattern Recognition and Artificial Intelligence (2003)
- A. Broggi, M. Bertozzi, A. Fascioli, M. Sechi, Shape-based pedestrian detection, in: Proceedings of the IEEE...
- Sequential Monte Carlo Methods in Practice
- Three-Dimensional Computer Vision: A Geometric Viewpoint
- The Geometry of Multiple Images
- Multi-cue pedestrian detection and tracking from a moving vehicle, International Journal of Computer Vision