
Coloring Action Recognition in Still Images

Published in: International Journal of Computer Vision

Abstract

In this article we investigate the problem of human action recognition in static images. By action recognition we mean a class of problems that includes both action classification and action detection (i.e. simultaneous localization and classification). Bag-of-words image representations yield promising results for action classification, and deformable part models perform very well for object detection. However, the representations typically used for action recognition rely on shape cues alone and ignore color information. Inspired by the recent success of color in image classification and object detection, we investigate the potential of color for action classification and detection in static images. We perform a comprehensive evaluation of color descriptors and fusion approaches for action recognition. Experiments were conducted on the three datasets most used for benchmarking action recognition in still images: Willow, PASCAL VOC 2010 and Stanford-40. Our experiments demonstrate that incorporating color information considerably improves recognition performance, and that a descriptor based on color names outperforms pure color descriptors. We further show that late fusion of color and shape information outperforms other fusion approaches for action recognition. Finally, we show that the different color–shape fusion approaches produce complementary information, and combining them yields state-of-the-art performance for action classification.
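The color names descriptor referred to above maps pixel values to the eleven basic color terms of Berlin and Kay (1969) and histograms the result. As a rough, self-contained sketch only (the actual descriptor of van de Weijer et al. (2009) learns a probabilistic RGB-to-color-name mapping from real-world images; the fixed RGB prototypes below are purely illustrative assumptions), a nearest-prototype version looks like this:

```python
# Toy color-name histogram. The RGB prototypes are hypothetical stand-ins;
# the published descriptor learns a soft RGB-to-name mapping instead.
COLOR_NAMES = {
    "black":  (0, 0, 0),
    "blue":   (0, 0, 255),
    "brown":  (139, 69, 19),
    "grey":   (128, 128, 128),
    "green":  (0, 255, 0),
    "orange": (255, 165, 0),
    "pink":   (255, 192, 203),
    "purple": (128, 0, 128),
    "red":    (255, 0, 0),
    "white":  (255, 255, 255),
    "yellow": (255, 255, 0),
}

def nearest_color_name(rgb):
    """Assign a pixel to the closest prototype (squared Euclidean in RGB)."""
    return min(COLOR_NAMES,
               key=lambda n: sum((a - b) ** 2 for a, b in zip(rgb, COLOR_NAMES[n])))

def color_name_histogram(pixels):
    """L1-normalized 11-bin histogram of color-name assignments."""
    counts = {name: 0 for name in COLOR_NAMES}
    for p in pixels:
        counts[nearest_color_name(p)] += 1
    total = sum(counts.values()) or 1
    return {name: c / total for name, c in counts.items()}

# Example: two reddish pixels, one bluish, one near-black.
hist = color_name_histogram([(250, 5, 5), (0, 10, 240), (255, 0, 0), (20, 20, 20)])
```

Under this toy mapping, an image region is summarized by a compact 11-bin histogram that can then be combined with a shape representation such as a SIFT or HOG bag-of-words.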


Notes

  1. As evidenced by the First Workshop on Action Recognition and Pose Estimation in Still Images held in conjunction with ECCV 2012: http://vision.stanford.edu/apsi2012/index.html.

  2. Only in experiment 6.2.3 do we add additional information from the background.

  3. Note that the terminology of early and late fusion varies. In some communities early fusion refers to combination before the classifier and late fusion to combination after the classifier (Lan et al. 2012).

  4. The combined vocabulary \(sc\) is constructed by concatenating the shape and color features before constructing the vocabulary in the combined feature-space.

  5. Due to the absence of a vocabulary stage, several of the fusion methods explained in Sect. 4 cannot be applied to part-based object detection.

  6. The Willow dataset is available at: http://www.di.ens.fr/willow/research/stillactions/.

  7. PASCAL 2010 is available at: http://www.pascal-network.org/challenges/VOC/voc2010/.

  8. The Stanford-40 dataset is available at http://vision.stanford.edu/Datasets/40actions.html.

  9. We also performed experiments replacing HOG with pure color descriptors but significantly inferior results were obtained.

  10. The confusion matrix is constructed by assigning each image to the class for which it gets the highest classification score.

  11. Top ranked detections are the top \(N_j\) detections of a class, where \(N_j\) is equal to the number of positive examples for that class.
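Note 3 above distinguishes combining cues before the classifier (early fusion) from combining per-cue classifier scores (late fusion). A minimal sketch of the two schemes, assuming linear classifiers represented as toy, hand-picked weight vectors (the paper trains SVMs on bag-of-words histograms; the vectors and histograms below are illustrative only):

```python
# Early vs. late fusion of a shape cue and a color cue, in miniature.
# Each "classifier" here is just a dot product with a toy weight vector.

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def early_fusion_score(shape_hist, color_hist, w_joint):
    """Early fusion: concatenate the cues into a single representation,
    then apply one classifier to the combined vector."""
    return dot(w_joint, shape_hist + color_hist)

def late_fusion_score(shape_hist, color_hist, w_shape, w_color, alpha=0.5):
    """Late fusion: score each cue with its own classifier,
    then combine the per-cue scores."""
    return alpha * dot(w_shape, shape_hist) + (1 - alpha) * dot(w_color, color_hist)

shape = [0.5, 0.5]   # toy 2-bin shape histogram
color = [1.0, 0.0]   # toy 2-bin color histogram
score_early = early_fusion_score(shape, color, w_joint=[1.0, 0.0, 0.0, 2.0])
score_late = late_fusion_score(shape, color, w_shape=[1.0, 0.0], w_color=[0.0, 2.0])
```

The weight alpha controls the relative trust in each cue at score level; in the early-fusion case the single classifier has to learn any such trade-off implicitly from the concatenated representation.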

References

  • Benavente, R., Vanrell, M., & Baldrich, R. (2008). Parametric fuzzy sets for automatic color naming. Journal of the Optical Society of America A, 25(10), 2582–2593.


  • Berlin, B., & Kay, P. (1969). Basic color terms: Their universality and evolution. Berkeley, CA: University of California Press.


  • Bosch, A., Zisserman, A., & Munoz, X. (2006). Scene classification via pLSA. In Proceedings of the European conference on computer vision.

  • Bosch, A., Zisserman, A., & Munoz, X. (2008). Scene classification using a hybrid generative/discriminative approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(4), 712–727.


  • Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In Conference on computer vision and pattern recognition.

  • Delaitre, V., Laptev, I., & Sivic, J. (2010). Recognizing human actions in still images: a study of bag-of-features and part-based representations. In Proceedings of the British machine vision conference.

  • Delaitre, V., Sivic, J., & Laptev, I. (2011). Learning person-object interactions for action recognition in still images. In Advances in neural information processing systems.

  • Desai, C., & Ramanan, D. (2012). Detecting actions, poses, and objects with relational phraselets. In Proceedings of the European conference on computer vision.

  • Elfiky, N., Khan, F. S., van de Weijer, J., & Gonzalez, J. (2012). Discriminative compact pyramids for object and scene recognition. Pattern Recognition, 45(4), 1627–1636.


  • Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2009). The PASCAL visual object classes challenge 2009 (VOC2009) results.

  • Everingham, M., Van Gool, L. J., Williams, C. K. I., Winn, J. M., & Zisserman, A. (2010). The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2), 303–338.


  • Felsberg, M., & Hedborg, J. (2007). Real-time view-based pose recognition and interpolation for tracking initialization. Journal of Real-Time Image Processing, 2(3), 103–115.


  • Felzenszwalb, P. F., Girshick, R. B., McAllester, D. A., & Ramanan, D. (2010). Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9), 1627–1645.


  • Gaidon, A., Harchaoui, Z., & Schmid, C. (2011). Actom sequence models for efficient action detection. In Conference on computer vision and pattern recognition.

  • Gehler, P. V., & Nowozin, S. (2009). On feature combination for multiclass object classification. In Proceedings of IEEE international conference on computer vision.

  • Geusebroek, J. M., van den Boomgaard, R., Smeulders, A. W. M., & Geerts, H. (2001). Color invariance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(12), 1338–1350.


  • Hoiem, D., Chodpathumwan, Y., & Dai, Q. (2012). Diagnosing error in object detectors. In European conference on computer vision.

  • Hu, Y., Cao, L., Lv, F., Yan, S., Gong, Y., & Huang, T. S. (2009). Action detection in complex scenes with spatial and temporal ambiguities. In Proceedings of IEEE international conference on computer vision.

  • Khan, F. S., van de Weijer, J., Bagdanov, A. D., & Vanrell, M. (2011). Portmanteau vocabularies for multi-cue image representations. In Advances in neural information processing systems.

  • Khan, F. S., Anwer, R. M., van de Weijer, J., Bagdanov, A. D., Vanrell, M., & Lopez, A. M. (2012a). Color attributes for object detection. In Conference on computer vision and pattern recognition.

  • Khan, F. S., van de Weijer, J., & Vanrell, M. (2012b). Modulating shape features by color attention for object recognition. International Journal of Computer Vision, 98(1), 49–64.


  • Lan, Z. Z., Bao, L., Yu, S. I., Liu, W., & Hauptmann, A. G. (2012). Double fusion for multimedia event detection. In Multimedia Modeling.

  • Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In IEEE conference on computer vision & pattern recognition.

  • Lenz, R., Bui, T. H., & Hernandez-Andres, J. (2005). Group theoretical structure of spectral spaces. Journal of Mathematical Imaging and Vision, 23(3), 297–313.


  • Li, L. J., Su, H., Xing, E. P., & Li, F. F. (2010). Object bank: A high-level image representation for scene classification and semantic feature sparsification. In Advances in neural information processing systems.

  • Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 91–110.


  • Maji, S., Bourdev, L. D., & Malik, J. (2011). Action recognition from a distributed representation of pose and appearance. In Computer vision and pattern recognition.

  • Mullen, K. T. (1985). The contrast sensitivity of human colour vision to red–green and blue–yellow chromatic gratings. The Journal of Physiology, 359, 381–400.


  • Pagani, A., Stricker, D., & Felsberg, M. (2009). Integral p-channels for fast and robust region matching. In Proceedings of the IEEE international conference on image processing.

  • Prest, A., Schmid, C., & Ferrari, V. (2012). Weakly supervised learning of interactions between humans and objects. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(3), 601–614.


  • van de Sande, K. E. A., Gevers, T., & Snoek, C. G. M. (2010). Evaluating color descriptors for object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9), 1582–1596.


  • Shapovalova, N., Gong, W., Pedersoli, M., Roca, F. X., & Gonzalez, J. (2011). On importance of interactions and context in human action recognition. In Iberian conference on pattern recognition and image analysis.

  • Sharma, G., Jurie, F., & Schmid, C. (2012). Discriminative spatial saliency for image classification. In Conference on computer vision and pattern recognition.

  • Sharma, G., Jurie, F., & Schmid, C. (2013). Expanded parts model for human attribute and action recognition in still images. In Conference on computer vision and pattern recognition.

  • Tran, D., & Yuan, J. (2012). Max-margin structured output regression for spatio-temporal action localization. In Advances in neural information processing systems.

  • Vedaldi, A., Gulshan, V., Varma, M., & Zisserman, A. (2009). Multiple kernels for object detection. In Proceedings of IEEE international conference on computer vision.

  • Vigo, D. A. R., Khan, F. S., van de Weijer, J., & Gevers, T. (2010). The impact of color on bag-of-words based object recognition. In Proceedings of the international conference on pattern recognition.

  • Wang, J., Yang, J., Yu, K., Lv, F., Huang, T. S., & Gong, Y. (2010). Locality-constrained linear coding for image classification. In Conference on computer vision and pattern recognition.

  • van de Weijer, J., & Schmid, C. (2006). Coloring local feature extraction. In Proceedings of the European conference on computer vision.

  • van de Weijer, J., & Schmid, C. (2007). Applying color names to image description. In Proceedings of the IEEE international conference on image processing.

  • van de Weijer, J., Schmid, C., Verbeek, J. J., & Larlus, D. (2009). Learning color names for real-world applications. IEEE Transactions on Image Processing, 18(7), 1512–1524.


  • Yao, B., & Li, F. F. (2012). Recognizing human-object interactions in still images by modeling the mutual context of objects and human poses. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(9), 1691–1703.


  • Yao, B., Jiang, X., Khosla, A., Lin, A. L., Guibas, L. J., & Li, F. F. (2011). Human action recognition by learning bases of action attributes and parts. In Proceedings of IEEE international conference on computer vision.

  • Yuan, J., Liu, Z., & Wu, Y. (2011). Discriminative video pattern search for efficient action detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(9), 1728–1743.


  • Zhang, J., Marszalek, M., Lazebnik, S., & Schmid, C. (2007). Local features and kernels for classification of texture and object categories: A comprehensive study. International Journal of Computer Vision, 73(2), 213–218.


  • Zhang, J., Huang, K., Yu, Y., & Tan, T. (2010). Boosted local structured HOG-LBP for object localization. In IEEE conference on computer vision & pattern recognition.


Acknowledgments

We gratefully acknowledge the support of MNEMOSYNE (POR-FSE 2007-2013, A.IV-OB.2), Collaborative Unmanned Aerial Systems (within the Linnaeus environment CADICS), ELLIIT, the Strategic Area for ICT research, funded by the Swedish Government, and Spanish projects TRA2011-29454-C03-01 and TIN2009-14173.

Author information

Correspondence to Fahad Shahbaz Khan.


Cite this article

Khan, F.S., Muhammad Anwer, R., van de Weijer, J. et al. Coloring Action Recognition in Still Images. Int J Comput Vis 105, 205–221 (2013). https://doi.org/10.1007/s11263-013-0633-0

