Multi-scale hybrid vision transformer and Sinkhorn tokenizer for sewer defect classification

https://doi.org/10.1016/j.autcon.2022.104614
Under a Creative Commons license
open access

Highlights

  • The Multi-Scale Hybrid Vision Transformer is proposed for sewer defect classification.

  • The Sinkhorn tokenizer is proposed for non-local feature aggregation.

  • MSHViT outperforms baseline methods on the Sewer-ML sewer defect dataset.

  • The MSHViT architecture is analyzed in terms of accuracy and efficiency.

  • The non-local interactions are verified visually, which can help inform sewer inspectors.

Abstract

A crucial part of image classification is capturing the non-local spatial semantics of image content. This paper describes the multi-scale hybrid vision transformer (MSHViT), an extension of the classical convolutional neural network (CNN) backbone, for multi-label sewer defect classification. To better model spatial semantics in the images, features are aggregated non-locally at different scales by a lightweight vision transformer, and a smaller set of tokens is produced by a novel Sinkhorn clustering-based tokenizer using distinct cluster centers. The proposed MSHViT and Sinkhorn tokenizer were evaluated on the Sewer-ML multi-label sewer defect classification dataset, showing consistent performance improvements of up to 2.53 percentage points.
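
To make the idea of a Sinkhorn clustering-based tokenizer more concrete, the following is a minimal sketch and not the authors' implementation. It assumes cosine similarity between features and cluster centers, a fixed number of Sinkhorn-Knopp iterations, and weighted-average pooling into tokens; the function name `sinkhorn_tokenize` and the parameters `n_iters` and `eps` are hypothetical and chosen only for illustration.

```python
import numpy as np


def sinkhorn_tokenize(features, centers, n_iters=5, eps=0.05):
    """Softly assign N feature vectors to K cluster centers and pool them into K tokens.

    features: (N, D) array, e.g. a flattened CNN feature map (assumed input).
    centers:  (K, D) array of cluster centers (learned in practice; supplied here).
    Returns the (K, D) tokens and the (N, K) soft assignment matrix.
    """
    # Assumption: cosine similarity, so scores stay in [-1, 1] and exp() is stable.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    c = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    scores = f @ c.T  # (N, K)

    # Temperature-scaled kernel; subtracting the max avoids overflow.
    q = np.exp((scores - scores.max()) / eps)
    n, k = q.shape

    # Sinkhorn-Knopp: alternately rescale columns and rows.
    for _ in range(n_iters):
        q /= q.sum(axis=0, keepdims=True) * k  # each center gets total mass 1/K
        q /= q.sum(axis=1, keepdims=True) * n  # each feature gets total mass 1/N

    # Each token is a weighted average of the original features.
    weights = q / q.sum(axis=0, keepdims=True)
    tokens = weights.T @ features  # (K, D)

    # Rescale so each row of the assignment sums to 1.
    return tokens, q * n
```

Under these assumptions, a 14 x 14 feature map with 64 channels flattened to shape (196, 64) and 8 cluster centers would be reduced to 8 tokens of dimension 64, which a transformer could then process at far lower cost than the 196 original positions.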

Keywords

Sewer Defect Classification
Vision Transformers
Sinkhorn-Knopp
Convolutional Neural Networks
Closed-Circuit Television
Sewer Inspection

Data availability

A link to the code and model weights is available at https://vap.aau.dk/mshvit/. The Sewer-ML dataset is already freely available.
