
Generating Human Action Videos by Coupling 3D Game Engines and Probabilistic Graphical Models

International Journal of Computer Vision

Abstract

Deep video action recognition models have been highly successful in recent years but require large quantities of manually annotated data, which are expensive and laborious to obtain. In this work, we investigate the generation of synthetic training data for video action recognition, as synthetic data have been successfully used to supervise models for a variety of other computer vision tasks. We propose an interpretable parametric generative model of human action videos that relies on procedural generation, physics models and other components of modern game engines. With this model we generate a diverse, realistic, and physically plausible dataset of human action videos, called PHAV for “Procedural Human Action Videos”. PHAV contains a total of 39,982 videos, with more than 1000 examples for each of 35 action categories. Our video generation approach is not limited to existing motion capture sequences: 14 of these 35 categories are procedurally defined synthetic actions. In addition, each video is represented with 6 different data modalities, including RGB, optical flow and pixel-level semantic labels. These modalities are generated almost simultaneously using the Multiple Render Targets feature of modern GPUs. In order to leverage PHAV, we introduce a deep multi-task representation learning architecture (i.e., one that considers action classes from multiple datasets) that is able to simultaneously learn from synthetic and real video datasets, even when their action categories differ. Our experiments on the UCF-101 and HMDB-51 benchmarks suggest that combining our large set of synthetic videos with small real-world datasets can boost recognition performance. Our approach also significantly outperforms video representations produced by fine-tuning state-of-the-art unsupervised generative models of videos.
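
To make the multi-task idea concrete, the sketch below shows one way a shared video representation can feed separate classification heads for a real and a synthetic label space and be trained on both jointly. It is a minimal illustration only, not the paper's actual architecture; the class counts mirror UCF-101 (101) and PHAV (35), while all other names, feature dimensions, and the toy training step are assumptions made for the example.

```python
# Minimal sketch (not the paper's implementation): a shared backbone
# over precomputed video features with one classification head per
# dataset, so a real label space (e.g. 101 UCF-101 classes) and the
# synthetic PHAV label space (35 classes) can be trained jointly.
import torch
import torch.nn as nn

class MultiTaskActionNet(nn.Module):
    def __init__(self, in_dim=2048, feat_dim=256,
                 num_real_classes=101, num_synth_classes=35):
        super().__init__()
        # Placeholder backbone; a real system would use a two-stream
        # or 3D CNN over the video frames instead.
        self.backbone = nn.Sequential(
            nn.Linear(in_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
        )
        # One head per label space.
        self.head_real = nn.Linear(feat_dim, num_real_classes)
        self.head_synth = nn.Linear(feat_dim, num_synth_classes)

    def forward(self, x, domain):
        z = self.backbone(x)
        return self.head_real(z) if domain == "real" else self.head_synth(z)

# Toy training step mixing one real and one synthetic mini-batch.
model = MultiTaskActionNet()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

real_x, real_y = torch.randn(8, 2048), torch.randint(0, 101, (8,))
synth_x, synth_y = torch.randn(8, 2048), torch.randint(0, 35, (8,))

loss = (criterion(model(real_x, "real"), real_y)
        + criterion(model(synth_x, "synth"), synth_y))
optimizer.zero_grad()
loss.backward()
optimizer.step()
```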



Notes

  1. Dataset and tools are available for download at http://adas.cvc.uab.es/phav/.

  2. RootMotion’s PuppetMaster is an advanced active ragdoll physics asset for Unity®. For more details, please see http://root-motion.com.

  3. The Accord.NET Framework is a framework for image processing, computer vision, machine learning, statistics, and general scientific computing in .NET. It is available for most .NET platforms, including Unity®. For more details, see http://accord-framework.net.

  4. Please note that a base motion can be assigned to more than one category, and therefore the columns of this matrix do not necessarily sum to one (see the toy illustration after these notes). An example is “car hit”, which could use motions that may belong to almost any other category (e.g., “run”, “walk”, “clap”) as long as the character gets hit by a car during its execution.

  5. http://github.com/yjxiong/temporal-segment-networks.
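
As a toy illustration of note 4, the snippet below uses made-up numbers, not the actual PHAV assignment matrix. It assumes rows index action categories and columns index base motions, with each row normalized as a sampling distribution; a motion reused by several categories then makes its column sum exceed one.

```python
# Hypothetical category-to-motion assignment matrix: rows are action
# categories ("run", "walk", "car hit"), columns are base motions
# ("run_cycle", "walk_cycle", "clap").  Rows are normalized sampling
# distributions; columns need not sum to one because a base motion can
# be reused by several categories.
import numpy as np

P = np.array([
    [1.0, 0.0, 0.0],   # "run" uses only the running motion
    [0.0, 1.0, 0.0],   # "walk" uses only the walking motion
    [0.5, 0.3, 0.2],   # "car hit" can reuse almost any base motion
])

print(P.sum(axis=1))  # per-category rows sum to 1:  [1.  1.  1. ]
print(P.sum(axis=0))  # per-motion columns may not:  [1.5 1.3 1.2]
```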


Acknowledgements

Antonio M. López acknowledges the financial support of the Spanish project TIN2017-88709-R (MINECO/AEI/FEDER, UE) and of ICREA under the ICREA Academia Program. As a CVC/UAB researcher, he also acknowledges the Generalitat de Catalunya CERCA Program and its ACCIO agency.

Author information


Corresponding author

Correspondence to César Roberto de Souza.

Additional information

Communicated by Xavier Alameda-Pineda, Elisa Ricci, Albert Ali Salah, Nicu Sebe, Shuicheng Yan.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

In this appendix, we include random frames (Figs. 20, 21, 22, 23, and 24) for a subset of the action categories in PHAV, followed by a table of pixel colors (Table 10) used in our semantic segmentation ground-truth.

The frames below show the effect of the different variables and motion variations used (cf. Table 4). Each frame is marked with a label indicating the values of these variables during the execution of the video, following the legend shown in Fig. 19.

Fig. 20 Changing environments. Top: kick ball; bottom: synthetic car hit

Fig. 21 Changing phases of the day. Top: run; bottom: golf

Fig. 22 Changing weather. Top: walk; bottom: kick ball

Fig. 23 Changing motion variations. Top: kick ball; bottom: synthetic car hit

Fig. 24 Changing human models. Top: walk; bottom: golf

Table 10 Pixel-wise object-level classes in PHAV
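
For readers who want to use the segmentation ground truth programmatically, the sketch below shows one way to map a color-coded segmentation frame to integer class IDs. The color table here is a hypothetical placeholder; the actual class names and RGB values are those listed in Table 10.

```python
# Sketch of decoding a color-coded segmentation frame into integer
# class IDs.  COLOR_TO_CLASS is a hypothetical placeholder; the real
# mapping comes from the pixel colors listed in Table 10.
import numpy as np

COLOR_TO_CLASS = {
    (0, 0, 0): 0,        # e.g. void / unlabeled
    (128, 64, 128): 1,   # e.g. ground
    (220, 20, 60): 2,    # e.g. person
}

def decode_segmentation(rgb_frame: np.ndarray) -> np.ndarray:
    """Map an (H, W, 3) uint8 frame to an (H, W) array of class IDs."""
    label_map = np.zeros(rgb_frame.shape[:2], dtype=np.int64)
    for color, class_id in COLOR_TO_CLASS.items():
        mask = np.all(rgb_frame == np.array(color, dtype=np.uint8), axis=-1)
        label_map[mask] = class_id
    return label_map

# Toy usage on a synthetic frame whose top half is "ground" colored.
frame = np.zeros((4, 4, 3), dtype=np.uint8)
frame[:2] = (128, 64, 128)
print(decode_segmentation(frame))
```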


About this article


Cite this article

de Souza, C.R., Gaidon, A., Cabon, Y. et al. Generating Human Action Videos by Coupling 3D Game Engines and Probabilistic Graphical Models. Int J Comput Vis 128, 1505–1536 (2020). https://doi.org/10.1007/s11263-019-01222-z

