DOI: 10.1145/3512527.3531395
Research Article
Open Access

Relevance-based Margin for Contrastively-trained Video Retrieval Models

Published: 27 June 2022

ABSTRACT

Video retrieval using natural language queries has attracted increasing interest due to its relevance in real-world applications, from intelligent access in private media galleries to web-scale video search. Learning the cross-modal similarity of video and text in a joint embedding space is the dominant approach. To do so, a contrastive loss is usually employed because it organizes the embedding space by putting similar items close and dissimilar items far apart. This framework yields competitive recall rates, since these metrics focus solely on the rank of the ground-truth items. Yet, assessing the quality of the whole ranking list is of utmost importance for intelligent retrieval systems, since multiple items may share similar semantics and hence a high relevance to the same query. Moreover, the aforementioned framework uses a fixed margin to separate similar and dissimilar items, treating all non-ground-truth items as equally irrelevant. In this paper we propose to use a variable margin: we argue that varying the margin used during training based on how relevant an item is to a given query, i.e. a relevance-based margin, easily improves the quality of the ranking lists measured through nDCG and mAP. We demonstrate the advantages of our technique using different models on EPIC-Kitchens-100 and YouCook2. We show that even if the fixed margin were carefully tuned, our technique (which does not have the margin as a hyper-parameter) would still achieve better performance. Finally, extensive ablation studies and qualitative analysis support the robustness of our approach. Code will be released at https://github.com/aranciokov/RelevanceMargin-ICMR22.
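To make the core idea concrete, below is a minimal sketch of how a relevance-based margin could replace the fixed margin of a standard triplet-style contrastive loss. This is an illustrative assumption, not the authors' released code: the function names, the use of cosine similarity, and the choice of using the relevance gap between the positive and the negative item directly as the margin are hypothetical, and the relevance scores are assumed to be precomputed by an external semantic-relevance measure over the captions.

```python
import torch
import torch.nn.functional as F


def relevance_margin_triplet_loss(q, v_pos, v_neg, rel_pos, rel_neg):
    """Triplet-style loss with a relevance-based (variable) margin.

    q, v_pos, v_neg : (B, D) embeddings of the query, a ground-truth video,
                      and a non-ground-truth ("negative") video.
    rel_pos, rel_neg: (B,) relevance scores in [0, 1] of the two videos with
                      respect to the query (assumed to be precomputed from an
                      external semantic-relevance function).
    """
    sim_pos = F.cosine_similarity(q, v_pos)   # similarity to the ground truth, (B,)
    sim_neg = F.cosine_similarity(q, v_neg)   # similarity to the negative, (B,)
    # Variable margin: the less relevant the negative is compared to the
    # positive, the larger the required separation; highly relevant
    # "negatives" are pushed away only slightly.
    margin = (rel_pos - rel_neg).clamp(min=0.0)
    return F.relu(sim_neg - sim_pos + margin).mean()


# Toy usage with random embeddings and relevance scores.
if __name__ == "__main__":
    B, D = 8, 256
    q, v_pos, v_neg = (F.normalize(torch.randn(B, D), dim=-1) for _ in range(3))
    rel_pos = torch.ones(B)   # ground-truth item is fully relevant
    rel_neg = torch.rand(B)   # negatives have varying relevance
    print(relevance_margin_triplet_loss(q, v_pos, v_neg, rel_pos, rel_neg))
```

In this sketch the margin shrinks to zero for negatives that are almost as relevant as the ground truth, which is one way to avoid treating all non-ground-truth items as equally irrelevant while removing the margin hyper-parameter altogether.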


Supplemental Material

ICMR22-fp192.mp4 (mp4, 66.6 MB)


        Published in

        ICMR '22: Proceedings of the 2022 International Conference on Multimedia Retrieval
        June 2022
        714 pages
        ISBN: 9781450392389
        DOI: 10.1145/3512527

        Copyright © 2022 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery, New York, NY, United States

        Publication History

        • Published: 27 June 2022


        Qualifiers

        • research-article

        Acceptance Rates

        Overall Acceptance Rate: 254 of 830 submissions, 31%

