
1 Introduction

Classifying human actions in real-world videos is an open research problem with many applications in multimedia, surveillance, and robotics [1]. Its complexity arises from the variability of imaging conditions, motion, appearance, context, and interactions with persons, objects, or the environment over different spatio-temporal extents. Current state-of-the-art algorithms for action recognition are based on statistical models learned from manually labeled videos. They belong to two main categories: models relying on features hand-crafted for action recognition (e.g., [2–10]), or more recent end-to-end deep architectures (e.g., [11–22]). These approaches have complementary strengths and weaknesses. Models based on hand-crafted features are data efficient, as they can easily incorporate structured prior knowledge (e.g., the importance of motion boundaries along dense trajectories [2]), but their lack of flexibility may impede their robustness or modeling capacity. Deep models make fewer assumptions and are learned end-to-end from data (e.g., using 3D-ConvNets [23]), but they rely on hand-crafted architectures and the acquisition of large manually labeled video datasets (e.g., Sports-1M [12]), a costly and error-prone process that poses optimization, engineering, and infrastructure challenges.

Although deep learning for videos has recently made significant improvements (e.g., [13, 14, 23]), models using hand-crafted features are the state of the art on many standard action recognition benchmarks (e.g., [7, 9, 10]). These models are generally based on improved Dense Trajectories (iDT) [3, 4] with Fisher Vector (FV) encoding [24, 25]. Recent deep models for action recognition therefore combine their predictions with complementary ones from iDT-FV for better performance [23, 26].

Fig. 1.

Our hybrid unsupervised and supervised deep multi-layer architecture. Hand-crafted features are extracted along optical flow trajectories from original and generated videos. Those features are then normalized using RootSIFT [29], PCA-transformed, and augmented with their (x, y, t) coordinates, forming our low-level descriptors. The descriptors for each feature channel are then encoded (\(\phi \)) as Fisher Vectors, separately aggregated (\(\varSigma \)) into a video-level representation, square-rooted, and \(\ell _2\)-normalized. These representations are then concatenated (\(\cup \)) and renormalized. A dimensionality reduction layer is learned either with or without supervision. Supervised layers are followed by Batch-Normalization (BN) [30], ReLU (RL) non-linearities [31], and Dropout (DO) [32] during training. The last layer uses sigmoid (multi-label datasets) or softmax (multi-class datasets) non-linearities to produce action-label estimates.

In this paper, we study an alternative strategy to combine the best of both worlds via a single hybrid classification architecture that sequentially chains the iDT hand-crafted features, the unsupervised FV representation, unsupervised or supervised dimensionality reduction, and a supervised deep network (cf. Fig. 1). This family of models was shown by Perronnin and Larlus [27] to perform on par with the deep convolutional network of Krizhevsky et al. [28] for large-scale image classification. We adapt this type of architecture differently for action recognition in videos, with particular care for data efficiency.

Our first contribution consists in a careful design of the first unsupervised part of our hybrid architecture, which even with a simple SVM classifier is already on par with the state of the art. We experimentally observe that showing sympathy for the details (e.g., spatio-temporal structure, normalization) and doing data augmentation by feature stacking (instead of duplicating training samples) are critical for performance, and that optimal design decisions generalize across datasets.

Our second contribution consists in a data efficient hybrid architecture combining unsupervised representation layers with a deep network of multiple fully connected layers. We show that supervised mid-to-end learning of a dimensionality reduction layer together with non-linear classification layers yields an excellent compromise between recognition accuracy, model complexity, and transferability of the model across datasets thanks to reduced risks of overfitting and modern optimization techniques.

The paper is organized as follows. Section 2 reviews the related works in action recognition. Section 3 presents the details of the first unsupervised part (based on iDT-FV) of our hybrid model, while Sect. 4 does so for the rest of the architecture and our learning algorithm. In Sect. 5 we report experimental conclusions from parametric studies and comparisons to the state of the art on five widely used action recognition datasets of different sizes. In particular, we show that our hybrid architecture improves significantly upon the current state of the art, including recent combinations of iDT-FV predictions with deep models trained on millions of images and videos.

2 Related Work

Existing action recognition approaches (cf. [1] for a recent survey) can be organized into four broad categories based on whether they involve hand-crafted vs. deep-based video features, and a shallow vs. deep classifier, as summarized in Table 1.

Table 1. Categorization of related recent action recognition methods

Hand-crafted features, shallow classifier. A significant part of the progress on action recognition is driven by the development of local hand-crafted spatio-temporal features encoded as bag-of-words representations classified by “shallow” classifiers such as SVMs [2–10]. Most successful approaches use improved Dense Trajectories (iDT) [3] to aggregate local appearance and motion descriptors into a video-level representation through the Fisher Vector (FV) encoding [24, 25]. Local descriptors such as HOG [36], HOF [37], and MBH [2] are extracted along dense point trajectories obtained from optical flow fields. There are several recent improvements to iDT, for instance, using motion compensation [5, 6, 38, 39] and stacking of FVs to obtain a multi-layer encoding similar to mid-level representations [40]. To include global spatio-temporal location information, Wang et al. [5] compute FVs on a spatio-temporal pyramid (STP) [41] and use Spatial Fisher Vectors (SFV) [42]. Fernando et al. [10] model the global temporal evolution over the entire video using ranking machines learned on time-varying average FVs. Another recent improvement is the Multi-skIp Feature Stacking (MIFS) technique [7], which stacks features extracted at multiple frame-skips for better invariance to speed variations. An extensive study of the different steps of this general iDT pipeline and various feature fusion methods is provided in [8].

End-to-end learning: deep-based features, deep classifier. The seminal supervised deep learning approach of Krizhevsky et al. [28] has enabled impressive performance improvements on large scale image classification benchmarks, such as ImageNet [43], using Convolutional Neural Networks (CNN) [44]. Consequently, several approaches explored deep architectures for action recognition. While earlier works resorted to unsupervised learning of 3D spatio-temporal features [45], supervised end-to-end learning has recently gained popularity [11–22]. Karpathy et al. [12] studied several architectures and fusion schemes to extend 2D CNNs to the time domain. Although trained on the very large Sports-1M dataset, their 3D networks performed only marginally better than single-frame models. To overcome the difficulty of learning spatio-temporal features jointly, the Two-Stream architecture [13] is composed of two CNNs trained independently, one for appearance modeling on RGB input, and another for temporal modeling on stacked optical flow. Sun et al. [14] factorize 3D CNNs into learning 2D spatial kernels, followed by 1D temporal ones. Alternatively, other recent works use recurrent neural networks (RNN) in conjunction with CNNs to encode the temporal evolution of actions [16, 17, 19]. Overall, due to the difficulty of training 3D-CNNs and the need for vast amounts of training videos (e.g., Sports-1M [12]), end-to-end methods report only marginal improvements over traditional baselines, and our experiments show that the iDT-FV often outperforms these approaches.

Deep-based features, shallow classifier. Several works [23, 26, 34, 35] explore the encoding of general-purpose deep-learned features in combination with “shallow” classifiers, transferring ideas from the iDT-FV algorithm. Zha et al. [34] combine CNN features trained on ImageNet [43] with iDT features through a Kernel SVM. The TDD approach [26] extracts per-frame convolutional feature maps from a two-stream CNN [13] and pools these over spatio-temporal cubes along extracted trajectories. Similar to [12], C3D [23] learns general-purpose features using a 3D-CNN, but the final action classifier is a linear SVM. Like end-to-end deep models, these methods rely on large datasets to learn generic useful features, which in practice perform on par with or worse than iDT.

Hybrid architectures: hand-crafted features, deep classifier. There is little work on using unsupervised encodings of hand-crafted local features in combination with a deep classifier. In early work, Baccouche et al. [33] learn temporal dynamics of traditional per-frame SIFT-BOW features using a RNN. The method, coupled with camera motion features, improves on BoW-SVM for a small set of soccer videos.

Our work lies in this category, as it combines the strengths of iDT-FV encodings and supervised deep multi-layer non-linear classifiers. Our method is inspired by the recently proposed hybrid image classification architecture of Perronnin and Larlus [27], who stack several unsupervised FV-based and supervised layers. Their hybrid architecture shows significant improvements over the standard FV pipeline, closing the gap on [28], which suggests there is still much to learn about FV-based methods.

Our work investigates this type of hybrid architectures, with several noticeable differences: (i) FV is on par with the current state of the art for action recognition, (ii) iDT features contain many different appearance and motion descriptors, which also results in more diverse and higher-dimensional FV, (iii) most action recognition training sets are small due to the cost of labeling and processing videos, so overfitting and data efficiency are major concerns. In this context, we adopt different techniques from modern hand-crafted and deep models, and perform a wide architecture and parameter study showing conclusions regarding many design choices specific to action recognition.

3 Fisher Vectors in Action: From Baseline to State of the Art

We first recall the iDT approach of Wang and Schmid [3], then describe the improvements that can be stacked together to transform this strong baseline into a state-of-the-art method for action recognition. In particular, we propose a data augmentation by feature stacking method motivated by MIFS [7] and data augmentation for deep models.

3.1 Improved Dense Trajectories

Local spatio-temporal features. The iDT approach used in many state-of-the-art action recognition algorithms (e.g., [3–5, 7, 8, 10, 40]) consists in first extracting dense trajectory video features [2] that efficiently capture appearance, motion, and spatio-temporal statistics. Trajectory shape (Traj) [2], HOG [36], HOF [37], and MBH [2] descriptors are extracted along trajectories obtained by median filtering dense optical flow. We extract dense trajectories from videos in the same way as in [3], applying RootSIFT normalization [29] (\(\ell _1\) normalization followed by square-rooting) to all descriptors.

Unsupervised representation learning. Before classification, we combine the multiple trajectory descriptors in a single video-level representation by accumulating their Fisher Vector encodings (FV) [24, 25], which was shown to be particularly effective for action recognition [5, 46]. This high-dimensional representation is based on the gradient of a generative model, a Gaussian Mixture Model (GMM), learned in an unsupervised manner on a large set of trajectory descriptors in our case. We use \(K=256\) Gaussians as a good compromise between accuracy and efficiency [3–5]. We randomly sample 256,000 trajectories from the pool of training videos, irrespective of their labels, to learn one GMM per descriptor channel using 10 iterations of EM. Before learning the GMMs, we apply PCA to the descriptors, reducing their dimensionality by a factor of two. After learning the GMMs, we extract FV encodings for all descriptors in each descriptor channel and combine these encodings into a per-channel, video-level representation using sum-pooling, i.e. by adding FVs together before normalization. In addition, we apply further post-processing and normalization steps, as discussed in the next subsection.
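As a rough illustration of this per-channel encoding step (a sketch, not the authors' implementation), the following Python snippet learns a diagonal-covariance GMM with scikit-learn and computes an unnormalized, sum-pooled FV from the gradients with respect to the GMM means and variances; the descriptor dimensionality and the random placeholder data are assumptions made only for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def fisher_vector(descriptors, gmm):
    """Sum-pooled FV: gradients w.r.t. GMM means and variances (2*K*D dimensions)."""
    X = np.atleast_2d(descriptors)
    Q = gmm.predict_proba(X)                          # (N, K) soft assignments
    mu, sig, w = gmm.means_, np.sqrt(gmm.covariances_), gmm.weights_
    parts = []
    for k in range(gmm.n_components):
        diff = (X - mu[k]) / sig[k]                   # whitened residuals
        q = Q[:, k:k + 1]
        parts.append((q * diff).sum(axis=0) / np.sqrt(w[k]))                 # d/d mu_k
        parts.append((q * (diff ** 2 - 1)).sum(axis=0) / np.sqrt(2 * w[k]))  # d/d sigma_k
    return np.concatenate(parts)

rng = np.random.default_rng(0)
train_pool = rng.standard_normal((25600, 96))   # placeholder for sampled trajectory descriptors
video_desc = rng.standard_normal((500, 96))     # placeholder for one video's descriptors

pca = PCA(n_components=48).fit(train_pool)      # halve the descriptor dimensionality
gmm = GaussianMixture(n_components=256, covariance_type="diag",
                      max_iter=10).fit(pca.transform(train_pool))
video_fv = fisher_vector(pca.transform(video_desc), gmm)   # one FV per descriptor channel
```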

Supervised classification. When using a linear classification model, we use a linear SVM. As is standard practice, and to ensure comparability with previous works [3, 7, 26, 47], we fix \(C = 100\) unless stated otherwise and use one-vs-rest for multi-class and multi-label classification. This forms a strong baseline for action recognition, as shown by previous works [5, 26] and confirmed in our experiments. We will now show how to make this baseline competitive with recent state-of-the-art methods.
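For reference, this classification stage with fixed C corresponds to a one-vs-rest linear SVM of the following form (a minimal scikit-learn sketch with placeholder data):

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
train_fvs = rng.standard_normal((100, 1024))     # placeholder video-level FVs
train_labels = rng.integers(0, 5, size=100)      # placeholder action labels

clf = OneVsRestClassifier(LinearSVC(C=100))      # one-vs-rest linear SVM with C = 100
clf.fit(train_fvs, train_labels)
scores = clf.decision_function(train_fvs)        # per-class scores, e.g. for mAP evaluation
```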

3.2 Bag of Tricks for Bag-of-Words

Incorporating global spatio-temporal structure. Incorporating the spatio-temporal position of local features can improve the FV representation. We do not use spatio-temporal pyramids (STP) [41], as they significantly increase both the dimensionality of the representation and its variance [48]. Instead, we simply concatenate the PCA-transformed descriptors with their respective \((x,y,t) \in \mathbb {R}^3\) coordinates, as in [7, 48]. We refer to this method as Spatio-Temporal Augmentation (STA). This approach is linked to the Spatial Fisher Vector (SFV) [42], a compact model related to soft-assign pyramids, in which the descriptor generative model is extended to explicitly accommodate the (x, y, t) coordinates of the local descriptors. When the SFV is created using Gaussian spatial models (cf. Eq. 18 in [42]), the model becomes equivalent to a GMM created from augmented descriptors (assuming diagonal covariance matrices).
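A minimal sketch of STA, assuming the per-trajectory positions are available and normalizing them by the video extents (the normalization convention is our assumption):

```python
import numpy as np

def spatio_temporal_augmentation(pca_descriptors, positions, width, height, length):
    """Concatenate PCA-reduced descriptors with their normalized (x, y, t) coordinates."""
    xyt = positions / np.array([width, height, length], dtype=np.float64)  # rescale to [0, 1]
    return np.hstack([pca_descriptors, xyt])     # (N, D + 3) augmented descriptors

# example with placeholder data: 500 descriptors of dimension 48 in a 320x240x120 video
desc = np.random.randn(500, 48)
pos = np.random.rand(500, 3) * [320, 240, 120]
desc_sta = spatio_temporal_augmentation(desc, pos, 320, 240, 120)
```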

Normalization. We apply signed-square-rooting followed by \(\ell _2\) normalization, then concatenate all descriptor-specific FVs and reapply this same normalization, following [7]. The double normalization re-applies square rooting, and is thus equivalent to using a smaller power normalization [25], which improves action recognition performance [49].
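Expressed as a short sketch (with channel_fvs standing in for the per-channel, sum-pooled FVs), the normalization chain reads:

```python
import numpy as np

def power_l2(v, eps=1e-12):
    """Signed square-rooting followed by l2 normalization."""
    v = np.sign(v) * np.sqrt(np.abs(v))
    return v / (np.linalg.norm(v) + eps)

# per-channel normalization, concatenation, then renormalization ("double" normalization)
channel_fvs = [np.random.randn(2 * 256 * 48) for _ in range(5)]  # placeholder FVs per channel
video_repr = power_l2(np.concatenate([power_l2(fv) for fv in channel_fvs]))
```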

Multi-Skip Feature Stacking (MIFS). MIFS [7] improves the robustness of FV to videos of different lengths by increasing the pool of features with frame-skipped versions of the same video. Standard iDT features are extracted from those frame-skipped versions and stacked together before descriptor encoding, decreasing the expectation and variance of the condition number [7, 50, 51] of the extracted feature matrices. We will now see that the mechanics of this technique can be expanded to other transformations.

3.3 Data Augmentation by Feature Stacking (DAFS)

Data augmentation is an important part of deep learning  [26, 52, 53], but it is rarely used with hand-crafted features and shallow classifiers, particularly for action recognition where duplicating training examples can vastly increase the computational cost. Common data augmentation techniques for images include the use of random horizontal flipping [26, 52], random cropping [52], and even automatically determined transformations [54]. For video classification, [9, 10] duplicate the training set by mirroring.

Instead, we propose to generalize MIFS to arbitrary transformations, an approach we call Data Augmentation by Feature Stacking (DAFS). First, we extract features from multiple transformations of an input video (frame-skipping, mirroring, etc.) that do not change its semantic category. Second, we obtain a large feature matrix by stacking the obtained spatio-temporal features prior to encoding. Third, we encode the feature matrix, pool the resulting encodings, and apply the aforementioned normalization steps along this pipeline to obtain a single augmented video-level representation.
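The three steps can be sketched as follows; extract_idt_descriptors is a hypothetical helper standing in for the iDT extraction of Sect. 3.1, and fisher_vector / power_l2 refer to the earlier sketches, so this is an illustration rather than the exact implementation:

```python
import numpy as np

def dafs_encode(video, transforms, pca, gmm):
    """Stack descriptors from all transformed versions of a video, then encode once."""
    # extract_idt_descriptors: hypothetical helper returning an (N, D) descriptor matrix
    stacked = np.vstack([extract_idt_descriptors(t(video)) for t in transforms])
    return power_l2(fisher_vector(pca.transform(stacked), gmm))

# a few label-preserving transformations (video as a (T, H, W, 3) array):
transforms = [
    lambda v: v,                 # original
    lambda v: v[:, :, ::-1],     # horizontal mirroring
    lambda v: v[::2],            # frame-skipping, level 2
    lambda v: v[::3],            # frame-skipping, level 3
]
```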

This approach yields a representation that simplifies the learning problem, as it can improve the condition number of the feature matrix beyond what MIFS alone achieves, by leveraging data augmentation techniques traditionally used for deep learning. In contrast to data augmentation for deep approaches, however, we build a single, more robust and useful representation instead of duplicating training examples. Note also that DAFS is particularly suited to FV-based representations of videos, as pooling FVs from a much larger set of features decreases one of the sources of variance of the FV [55].

4 Hybrid Classification Architecture for Action Recognition

4.1 System Architecture

Our hybrid action recognition model combining FV with neural networks (cf. Fig. 1) starts with the previously described steps of our iDT-DAFS-FV pipeline, which can be seen as a set of unsupervised layers. The next part of our architecture consists of a set of L fully connected supervised layers, each comprising a dot-product followed by a non-linearity. Let \(h_{0}\) denote the FV output from the last unsupervised layer in our hybrid architecture, \(h_{j-1}\) the input of layer \(j \in \{1, ..., L\}\), and \(h_{j} = g(W_j h_{j-1})\) its output, with \(W_j\) the corresponding parameter matrix to be learned. For intermediate hidden layers we use the Rectified Linear Unit (ReLU) non-linearity [31] for g. For the final output layer we use different non-linearity functions depending on the task. For multi-class classification over c classes, we use the softmax function \(g(z_i) = \exp (z_i)/\sum _{k=1}^c \exp (z_k)\). For multi-label tasks we consider the sigmoid function \(g(z_i) = 1/(1 + \exp (-z_i))\).
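In plain NumPy terms, and omitting the batch normalization and dropout introduced in Sect. 4.2, the supervised stack computes the following (a sketch under the notation above):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(h0, weights):
    """h0: video-level FV; weights = [W_1, ..., W_L], one matrix per supervised layer."""
    h = h0
    for W in weights[:-1]:
        h = relu(W @ h)              # hidden layers: dot-product followed by ReLU
    return softmax(weights[-1] @ h)  # output layer: softmax (sigmoid for multi-label tasks)
```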

Connecting the last unsupervised layer to the first supervised layer can result in a much higher number of weights in this section than in all other layers of the architecture. Since this might be an issue for small datasets due to the higher risk of overfitting, we study the impact of different ways to learn the weights of this dimensionality reduction layer: either with unsupervised learning (e.g., using PCA as in [27]), or by learning a low-dimensional projection end-to-end with the next layers of the architecture.

4.2 Learning

Unsupervised layers. Our unsupervised layers are learned as described in Sect. 3.1.

Supervised layers. We use the standard cross-entropy between the network output \(\hat{y}\) and the ground-truth label vectors y as loss function. For multi-class classification problems, we minimize the categorical cross-entropy cost function over all n samples:

$$\begin{aligned} C_{cat}(y, \hat{y}) = - \sum _{i=1}^n \sum _{k=1}^{c} y_{ik} \log (\hat{y}_{ik}), \end{aligned}$$
(1)

whereas for multi-label problems we minimize the binary cross-entropy:

$$\begin{aligned} C_{bin}(y, \hat{y}) = - \sum _{i=1}^n \sum _{k=1}^{c} \left[ y_{ik} \log (\hat{y}_{ik}) + (1-y_{ik}) \log (1-\hat{y}_{ik}) \right] . \end{aligned}$$
(2)

Optimization. For parameter optimization we use the recently introduced Adam algorithm [56]. Since Adam automatically computes individual adaptive learning rates for the different parameters of our model, this alleviates the need for fine-tuning of the learning rate with a costly grid-search or similar methods. Adam uses estimates of the first and second-order moments of the gradients in the update rule:

$$\begin{aligned} \theta _t \leftarrow \theta _{t-1} - \alpha \cdot \frac{m_t}{(1-\beta _1^t)\sqrt{\frac{v_t}{1-\beta _2^t}}+\epsilon } \quad \text {where}\quad \begin{matrix} g_t \leftarrow \nabla _{\theta }f(\theta _{t-1}) \\ m_t \leftarrow \beta _1 \cdot m_{t-1} + (1 - \beta _1) \cdot g_t \\ v_t \leftarrow \beta _2 \cdot v_{t-1} + (1 - \beta _2) \cdot {g_t}^2 \\ \end{matrix} \end{aligned}$$
(3)

and where \(f(\theta )\) is the function with parameters \(\theta \) to be optimized, t is the index of the current iteration, \(m_0 = 0\), \(v_0 = 0\), and \(\beta _1^t\) and \(\beta _2^t\) denote \(\beta _1\) and \(\beta _2\) to the power of t, respectively. We use the default values of its parameters \(\alpha = 0.001\), \(\beta _1 = 0.9\), \(\beta _2~=~0.999\), and \(\epsilon = 10^{-8}\) proposed in [56] and implemented in Keras [57].
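Up to where \(\epsilon \) enters the denominator, Eq. 3 is the usual bias-corrected Adam step; a minimal NumPy version with these default hyper-parameters would be:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (cf. Eq. 3); t is the 1-based iteration index."""
    m = beta1 * m + (1 - beta1) * grad           # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2      # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                 # bias corrections
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```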

Batch normalization and regularization. During learning, we use batch normalization (BN) [30] and dropout (DO) [32]. Each BN layer is placed immediately before the ReLU non-linearity and parametrized by two vectors \(\gamma \) and \(\beta \) learned alongside each fully-connected layer. Given a training set \(X = \{x_1, x_2, ..., x_n\}\) of n training samples, the transformation learned by BN for each input vector \(x \in X\) is given by:

$$\begin{aligned} BN(x; \gamma , \beta ) = \gamma \frac{x - \mu _{B}}{\sqrt{\sigma _{B}^2 + \epsilon }} + \beta \ \ \text {where} \ \ \ \mu _{B} \leftarrow \frac{1}{n} \sum _{i=1}^n x_i \ , \quad \sigma _{B}^2 \leftarrow \frac{1}{n} \sum _{i=1}^n (x_i - \mu _{B})^2 \end{aligned}$$
(4)

Together with DO, the operation performed by hidden layer j can now be expressed as \(h_{j} = r \odot g(BN(W_j h_{j-1}; \gamma _j, \beta _j))\), where r is a vector of Bernoulli-distributed variables with probability p and \(\odot \) denotes the element-wise product. We use the same DO rate p for all layers. The last output layer is not affected by this modification.

Dimensionality reduction layer. When unsupervised, we fix the weights of the dimensionality reduction layer to the projection matrices learned by PCA, followed by whitening and \(\ell _2\) normalization [27]. When it is supervised, it is treated as the first fully-connected layer, to which we apply BN and DO as with the rest of the supervised layers.
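Putting the supervised part together, a minimal sketch in current Keras (the original implementation used an earlier Keras version [57], so the API below is not the authors' code; the width, depth, and dropout values are placeholders):

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_hybrid_head(input_dim, n_classes, width=4096, depth=2, dropout=0.5):
    """Supervised layers: Dense -> BN -> ReLU -> Dropout blocks, then a softmax output."""
    model = keras.Sequential()
    model.add(keras.Input(shape=(input_dim,)))
    for _ in range(depth - 1):                            # hidden layers; the first one can
        model.add(layers.Dense(width, use_bias=False))    # act as the supervised reduction layer
        model.add(layers.BatchNormalization())            # BN placed before the non-linearity
        model.add(layers.Activation("relu"))
        model.add(layers.Dropout(dropout))
    model.add(layers.Dense(n_classes, activation="softmax"))  # sigmoid for multi-label tasks
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3),
                  loss="categorical_crossentropy")
    return model
```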

Bagging. Since our first unsupervised layers can be fixed, we can train ensemble models and average their predictions very efficiently for bagging purposes [27, 58, 59] by caching the output of the unsupervised layers and reusing it in the subsequent models.
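A bagging sketch under these assumptions (build_hybrid_head from the sketch above; the cached FV arrays, labels, and number of epochs are placeholders):

```python
import numpy as np

# The unsupervised layers are fixed, so their output is computed once and cached to disk.
train_fvs = np.load("train_fvs.npy")             # placeholder: cached unsupervised output
test_fvs = np.load("test_fvs.npy")
train_labels = np.load("train_labels_onehot.npy")

models = [build_hybrid_head(train_fvs.shape[1], train_labels.shape[1]) for _ in range(8)]
for m in models:                                 # each model starts from a new random init
    m.fit(train_fvs, train_labels, batch_size=128, epochs=20, verbose=0)
avg_probs = np.mean([m.predict(test_fvs) for m in models], axis=0)
predictions = avg_probs.argmax(axis=1)           # average the ensemble's predictions
```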

5 Experiments

We first describe the datasets used in our experiments, then provide a detailed analysis of the iDT-FV pipeline and our proposed improvements. Based on our observations, we then perform an ablative analysis of our proposed hybrid architecture. Finally, we study the transferability of our hybrid models, and compare to the state of the art.

5.1 Datasets

We use five publicly available and commonly used datasets for action recognition. We briefly describe these datasets and their evaluation protocols.

The Hollywood2 [60] dataset contains 1,707 videos extracted from 69 Hollywood movies, distributed over 12 overlapping action classes. As one video can have multiple class labels, results are reported using the mean average precision (mAP).

The HMDB-51 [61] dataset contains 6,849 videos distributed over 51 distinct action categories. Each class contains at least 101 videos and presents a high intra-class variability. The evaluation protocol is the average accuracy over three fixed splits [61].

The UCF-101 [62] dataset contains 13,320 video clips distributed over 101 distinct classes. This is the same dataset used in the THUMOS’13 challenge [63]. The performance is again measured as the average accuracy on three fixed splits.

The Olympics [64] dataset contains 783 videos of athletes performing 16 different sport actions, with 50 sequences per class. Some actions include interactions with objects, such as Throwing, Bowling, and Weightlifting. Following [3, 7], we report mAP over the train/test split released with the dataset.

The High-Five (a.k.a. TVHI) [65] dataset contains 300 videos from 23 different TV shows distributed over four different human interactions and a negative (no-interaction) class. As in [5, 6, 65, 66], we report mAP for the positive classes (mAP+) using the train/test split provided by the dataset authors.

5.2 Detailed Study of Dense Trajectory Baselines for Action Recognition

Table 2 reports our results comparing the iDT baseline (Sect. 3.1), its improvements discussed in Sect. 3.2, and our proposed data augmentation strategy (Sect. 3.3).

Reproducibility. We first note that there are multiple differences in the iDT pipelines used across the literature. While [3] applies RootSIFT only on HOG, HOF, and MBH, in [7] this normalization is also applied to the Traj descriptor. While [3] includes Traj in their pipeline, [5] omits it. Additionally, person bounding boxes are used to ignore human motions when doing camera motion compensation in [5], but are not publicly available for all datasets. Therefore, we reimplemented the main baselines and compare our results to the officially published ones. As shown in Table 2, we successfully reproduce the original iDT results from [3, 47], as well as the MIFS results of [7].

Improvements of iDT. Table 2 shows that double-normalization (DN) alone improves performance over iDT on most datasets without the help of STA. We show that STA gives comparable results to SFV+STP, as hypothesized in Sect. 3.2. Given that STA and DN are both beneficial for performance, we combine them with our own method.

Data Augmentation by Feature Stacking (DAFS). Although more sophisticated transformations can be used, we found that combining a limited number of simple transformations already significantly improves the iDT-based methods in conjunction with the aforementioned improvements, as shown in the “iDT+STA+DAFS+DN” line of Table 2. In practice, we generate 7 different versions of each video on the fly, considering the possible combinations of frame-skipping up to level 3 and horizontal flipping.

Fine-tuned and non-linear SVMs. Attempting to improve our best results, we also performed experiments fine-tuning C, as well as using a Gaussian kernel while fine-tuning \(\gamma \). However, we found that neither led to significant improvements. As DAFS already brings results competitive with the current state of the art, we set those results with fixed C as our shallow baseline (FV-SVM). We will now incorporate those techniques in the first unsupervised layers of our hybrid models.

Table 2. Analysis of iDT baselines and several improvements
Table 3. Top-5 best performing hybrid architectures with consistent improvements
Fig. 2.
figure 2

Parallel coordinates plots showing the impact of multiple parameters. Each line represents one combination of parameters, and color indicates the performance of our hybrid architectures with unsupervised dimensionality reduction. Depth 2 correlates with high-performing architectures, whereas a small width combined with a large depth is suboptimal. (Color figure online)

5.3 Analysis of Hybrid Models

In this section, we start from hybrid architectures with unsupervised dimensionality reduction learned by PCA. For UCF-101 (the largest dataset) we initialize \(W_1\) with \(r=4096\) dimensions, whereas for all other datasets we use the number of dimensions responsible for \(99\,\%\) of the variance (yielding fewer dimensions than training samples).

We study the interactions between four parameters that can influence the performance of our hybrid models: the output dimension of the intermediate fully connected layers (width), the number of layers (depth), the dropout rate, and the mini-batch size of Adam (batch). We systematically evaluate all possible combinations and rank the architectures by the average relative improvement w.r.t. the best FV-SVM model. Training all 480 combinations for one split of UCF-101 can be accomplished in less than two days with a single Tesla K80 GPU. We report the top results in Table 3 and visualize all results using the parallel coordinates plot in Fig. 2. Our observations are as follows.

Unsupervised dimensionality reduction. Performing dimensionality reduction using the weight matrix from PCA is beneficial for all datasets, and using this layer alone achieves a 1.28 % average improvement (Table 3, depth 1) over our best SVM baseline.

Width. We consider networks with fully connected layers of size 512, 1024, 2048, and 4096. We find that a large width (4096) gives the best results on 4 of the 5 datasets.

Depth. We consider hybrid architectures with depth between 1 and 4. Most well-performing models have depth 2, as shown in Fig. 2, but a single layer is sufficient for the larger datasets.

Dropout rate. We consider dropout rates from 0 to 0.9. We find the optimal dropout rate to be dependent on both the architecture and the dataset. A high dropout rate significantly impairs classification results when combined with a small width and a large depth.

Mini-batch size. We consider mini-batch sizes of 128, 256, and 512. We find that smaller batch sizes give the best results, with 128 being the most consistent across all datasets. We observed that large batch sizes were detrimental to networks with a small width.

Best configuration with unsupervised dimensionality reduction. We find the following parameters to work best: small batch sizes, a large width, moderate depth, and dataset-dependent dropout rates. The most consistent improvements across datasets are obtained with a network of batch size 128, width 4096, and depth 2.
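To make the scope of this search concrete, the 480 evaluated configurations correspond to the full Cartesian product of the four hyper-parameters; in the sketch below, the exact spacing of the dropout grid is our assumption (chosen to match that count), and input_dim, n_classes, and train_and_score are hypothetical placeholders for the dataset-dependent setup and a helper that fits one configuration on a split and returns its accuracy or mAP.

```python
from itertools import product

widths   = [512, 1024, 2048, 4096]
depths   = [1, 2, 3, 4]
dropouts = [i / 10 for i in range(10)]   # 0.0, 0.1, ..., 0.9 (assumed grid)
batches  = [128, 256, 512]

configs = list(product(widths, depths, dropouts, batches))
assert len(configs) == 480               # 4 x 4 x 10 x 3 combinations

results = {}
for w, d, p, b in configs:
    model = build_hybrid_head(input_dim, n_classes, width=w, depth=d, dropout=p)
    results[(w, d, p, b)] = train_and_score(model, batch_size=b)  # hypothetical helper
best = max(results, key=results.get)     # rank configurations by performance
```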

Supervised dimensionality reduction. Our previous findings indicate that the dimensionality reduction layer can have a large influence on the overall classification results. Therefore, we investigate whether a supervised dimensionality reduction layer trained mid-to-end with the rest of the architecture could improve results further. Due to memory limitations imposed by the higher number of weights to be learned between our 116K-dimensional input FV representation and the intermediate fully-connected layers, we decrease the maximum network width to 1024. In spite of this limitation, our results in Table 4 show that much smaller hybrid architectures with supervised dimensionality reduction improve (on the larger UCF-101 and HMDB-51 datasets) or maintain (on the other smaller datasets) recognition performance.

Table 4. Supervised dimensionality reduction hybrid architecture evaluation

Comparison to hybrid models for image recognition. Our experimental conclusions and optimal model differ from [27], both on unsupervised and supervised learning details (e.g., dropout rate, batch size, learning algorithm), and in the usefulness of a supervised dimensionality reduction layer trained mid-to-end (not explored in [27]).

5.4 Transferability of Hybrid Models

In this section, we study whether the first layers of our architecture can be transferred across datasets. As a reference point, we use the first split of UCF-101 to create a base model and transfer elements from it to other datasets. We chose UCF-101 for the following reasons: it is the largest dataset, has the largest diversity in number of actions, and contains multiple categories of actions, including human-object interaction, human-human interaction, body-motion interaction, and practicing sports.

Unsupervised representation layers. We start by replacing the dataset-specific GMMs with the GMMs from the base model. Our results in the second row of Table 5 show that the transferred GMMs give similar performance to the ones using dataset-specific GMMs. This, therefore, greatly simplifies the task of learning a new model for a new dataset. We keep the transferred GMMs fixed in the next experiments.

Unsupervised dimensionality reduction layer. Instead of configuring the unsupervised dimensionality reduction layer with weights from the PCA learned on its own dataset, we configure it with the weights learned on UCF-101. Our results are in the third row of Table 5. This time we observe a different behavior: for Hollywood2 and HMDB-51, the best models were found without transfer, whereas for Olympics it did not have any measurable impact. However, transferring PCA weights brings a significant improvement on High-Five. One reason for this improvement is the much smaller training set of High-Five (150 samples) compared to the other datasets. The fact that the improvement becomes less visible as the number of samples in each dataset increases (before eventually degrading performance) indicates there is a threshold below which transferring starts to be beneficial (around a few hundred training videos).

Supervised layers after unsupervised reduction. We also study the transferability of further layers in our architecture, after the unsupervised dimensionality reduction transfer. We take the base model learned in the first split of UCF-101, remove its last classification layer, re-insert a classification layer with the same number of classes as the target dataset, and fine-tune this new model in the target dataset, using an order of magnitude lower learning rate. The results can be seen in the last row of Table 5. The same behavior is observed for HMDB-51 and Hollywood2. However, we notice a decrease in performance for High-Five and a performance increase for Olympics. We attribute this to the presence of many sports-related classes in UCF-101.

Mid-to-end reduction and supervised layers. Finally, we study whether the architecture with supervised dimensionality reduction layer transfers across datasets, as we did for the unsupervised layers. We again replace the last classification layer from the corresponding model learned on the first split of UCF-101, and fine-tune the whole architecture on the target dataset. Our results in the second and third rows of Table 6 show that transferring this architecture brings improvements for Olympics and HMDB-51, but performs worse than transferring unsupervised layers only on High-Five.

Table 5. Transferability experiments involving unsupervised dimensionality reduction
Table 6. Transferability experiments involving supervised dimensionality reduction

5.5 Comparison to the State of the Art

In this section, we compare our best models found previously to the state of the art.

Best models. For UCF-101, the most effective model leverages its large training set using supervised dimensionality reduction (cf. Table 4). For HMDB-51 and Olympics, the best models result from transferring the supervised dimensionality reduction models from the related UCF-101 dataset (cf. Table 6). Due to its specificity, the best architecture for Hollywood2 is based on unsupervised dimensionality reduction learned on its own data (cf. Table 3), although there are similarly-performing end-to-end transferred models (cf. Table 6). For High-Five, the best model is obtained by transferring the unsupervised dimensionality reduction models from UCF-101 (cf. Table 5).

Bagging. As is standard practice [27], we take the best models and perform bagging with 8 models initialized with distinct random initializations. This improves results by around one point on average; our final results are in Table 7.

Discussion. In contrast to [27], our models outperform the state of the art, including methods trained on massive labeled datasets like ImageNet or Sports-1M, confirming both the excellent performance and the data efficiency of our approach. Table 8 illustrates some failure cases of our methods. Confusion matrices and precision-recall curves for all datasets are available in the supplementary material for fine-grained analysis.

Table 7. Comparison against the state of the art in action recognition
Table 8. Top-5 most confused classes for our best FV-SVM and Hybrid models

6 Conclusion

We investigate hybrid architectures for action recognition, effectively combining hand-crafted spatio-temporal features, unsupervised representation learning based on the FV encoding, and deep neural networks. In addition to paying attention to important details like normalization and spatio-temporal structure, we integrate data augmentation at the feature level, end-to-end supervised dimensionality reduction, and modern optimization and regularization techniques. We perform an extensive experimental analysis on a variety of datasets, showing that our hybrid architecture yields data efficient, transferable models of small size that yet outperform much more complex deep architectures trained end-to-end on millions of images and videos. We believe our results open interesting new perspectives to design even more advanced hybrid models, e.g., using recurrent neural networks, targeting better accuracy, data efficiency, and transferability.