TAM-VT: Transformation-Aware Multi-scale Video Transformer for
Segmentation and Tracking

  • \(^1\) University of British Columbia
  • \(^2\) Vector Institute for AI
  • \(^3\) CIFAR AI Chair
\(^*\) Equal Contribution

WACV 2025
Ranked 5th in the VOTS Challenge at ECCV 2024

TL;DR

Given a video and an initial segmentation mask of the target object, TAM-VT accurately tracks and segments that object throughout the video, even as it undergoes significant transformations.

Abstract

Video Object Segmentation (VOS) has emerged as an increasingly important problem with the availability of larger datasets and more complex and realistic settings, which involve long videos with global motion (e.g., in egocentric settings) depicting small objects undergoing both rigid and non-rigid (including state) deformations. While a number of recent approaches have been explored for this task, these data characteristics still present challenges. In this work we propose a novel, clip-based DETR-style encoder-decoder architecture that systematically analyzes and addresses the aforementioned challenges. Specifically, we propose a novel transformation-aware loss that focuses learning on the portions of the video where an object undergoes significant deformations, a form of “soft” hard-example mining. Further, we propose a multiplicative time-coded memory, beyond vanilla additive positional encoding, which helps propagate context across long videos. Finally, we incorporate these components into our proposed holistic multi-scale video transformer for tracking, which performs multi-scale memory matching and decoding to ensure sensitivity and accuracy for long videos and small objects. Our model enables on-line inference on long videos in a windowed fashion, by breaking the video into clips and propagating context among them. We show that a short clip length and a longer memory with learned time-coding are important design choices for improved performance. Collectively, these technical contributions enable our model to achieve new state-of-the-art (SoTA) performance on two complex egocentric datasets, VISOR and VOST, while achieving results comparable to SoTA on the conventional VOS benchmark, DAVIS’17. Detailed ablations validate our design choices and provide insights into the importance of parameter choices and their impact on performance.
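As a concrete, simplified illustration of the “soft” hard-example mining idea, the sketch below reweights per-frame segmentation losses by how far each ground-truth mask has deformed from the initial reference mask. This is only a plausible PyTorch sketch under stated assumptions: the scoring function (1 - IoU with the reference mask), the weight floor `alpha`, and the function names are illustrative, not the paper's exact formulation.

```python
import torch

def mask_iou(a: torch.Tensor, b: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """IoU between binary masks of shape (T, H, W), computed per frame."""
    inter = (a * b).flatten(1).sum(-1)
    union = ((a + b) > 0).float().flatten(1).sum(-1)
    return inter / (union + eps)

def transformation_aware_loss(per_frame_loss, gt_masks, ref_mask, alpha=0.5):
    """Reweight per-frame segmentation losses so that frames where the object
    has deformed strongly w.r.t. the reference mask contribute more, i.e. a
    "soft" form of hard-example mining.
      per_frame_loss: (T,) unweighted segmentation loss per frame
      gt_masks:       (T, H, W) binary ground-truth masks
      ref_mask:       (H, W) binary reference (first-frame) mask
      alpha:          assumed floor on the per-frame weight
    """
    ref = ref_mask.unsqueeze(0).expand_as(gt_masks).float()
    # Low IoU with the reference mask -> strong transformation -> higher weight.
    transform_score = 1.0 - mask_iou(gt_masks.float(), ref)   # (T,)
    weights = alpha + (1.0 - alpha) * transform_score         # in [alpha, 1]
    weights = weights / weights.mean().clamp_min(1e-6)        # keep overall loss scale
    return (weights.detach() * per_frame_loss).mean()
```

The intent, per the abstract, is to concentrate the training signal on the frames where the object no longer resembles its initial appearance, which are exactly the frames that make these benchmarks hard.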

Methodology

Overview of the TAM-VT architecture.

We divide an input video into non-overlapping clips of length L. For a query clip, we retrieve information from previous clips stored in our (a) Clip-based Memory, in the form of frames and their predicted (or initial reference) masks. A 2D-CNN backbone extracts features Xq for the query frames, and features XM and YM for the memory frames and masks, respectively.

Our (b) Multi-Scale Matching Encoder then performs dense matching at multiple scales between the query clip's frame features Xq and the memory frame features XM, and uses the resulting similarities to obtain the query clip's mask features Yq,enc as a weighted combination of the memory mask features YM. In doing so, we modulate the similarity with our proposed Relative-Time Encoding (RTE), which learns the recency of information in memory and thereby facilitates propagation over long time spans.

Our (c) Multi-Scale Decoder aggregates the resulting query mask features Yq,enc with the clip's frame features Xq using a Pixel Decoder, yielding a contextualized feature pyramid Yq,fpn. A Space-Time Decoder then decodes the mask predictions Yq by refining learned time embeddings over the contextualized feature pyramid Yq,fpn. Finally, we update the memory, implemented as a FIFO queue, with the predictions for the last (L-th) frame of the query clip. During training, we apply our transformation-aware loss Ltr to form the segmentation loss over the entire video.
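The clip-based memory and the RTE-modulated matching can be pictured with a minimal PyTorch sketch. All names here (`ClipMemory`, `rte_matched_mask_features`, the learnable `rte_weights` table indexed by relative clip offset) are illustrative assumptions rather than the actual implementation, and the real encoder performs this matching at multiple feature scales; the sketch shows a single scale with features flattened into tokens.

```python
import torch
from collections import deque

class ClipMemory:
    """FIFO memory over previous clips: stores per-frame feature tokens,
    mask-feature tokens, and the clip timestamp; oldest entries drop first."""
    def __init__(self, max_frames: int = 16):
        self.buffer = deque(maxlen=max_frames)

    def update(self, frame_tokens, mask_tokens, t):
        # frame_tokens, mask_tokens: (HW, C) tokens of the clip's last frame.
        self.buffer.append((frame_tokens, mask_tokens, t))

    def read(self):
        feats, masks, times = [], [], []
        for f, m, t in self.buffer:
            feats.append(f)
            masks.append(m)
            times.append(torch.full((f.shape[0],), t, dtype=torch.long))
        return torch.cat(feats), torch.cat(masks), torch.cat(times)

def rte_matched_mask_features(x_q, x_m, y_m, t_q, t_m, rte_weights):
    """Dense matching at one scale: query frame tokens x_q (Nq, C) attend to
    memory frame tokens x_m (Nm, C); the similarity is *multiplied* by a learned
    relative-time code before the softmax, and the query mask features are read
    out as a weighted combination of the memory mask tokens y_m (Nm, C)."""
    sim = (x_q @ x_m.t()) / x_q.shape[-1] ** 0.5             # (Nq, Nm) similarity
    rel = (t_q - t_m).clamp(0, rte_weights.shape[0] - 1)     # relative clip offsets
    sim = sim * rte_weights[rel].unsqueeze(0)                # multiplicative time-coding
    return sim.softmax(dim=-1) @ y_m                         # (Nq, C) query mask features

# Example (shapes only): match the current clip, index t_q, against the memory.
memory = ClipMemory(max_frames=8)
rte_weights = torch.nn.Parameter(torch.ones(8))              # one weight per offset
memory.update(torch.randn(64, 256), torch.randn(64, 256), t=0)
x_m, y_m, t_m = memory.read()
y_q_enc = rte_matched_mask_features(torch.randn(64, 256), x_m, y_m, t_q=1,
                                    t_m=t_m, rte_weights=rte_weights)
```

Because the time code multiplies the matching scores rather than being added to the features, older memory entries can be directly down- or up-weighted, which is what lets the model learn the recency of memory information and propagate context over long time spans.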

Demo Videos

Results on VOST

The videos below qualitatively compare TAM-VT with AOT. Please allow a moment for the videos to load.

GT

AOT

Ours

Ex. 1

Ex. 2

Ex. 3

Ex. 4

Results on VISOR

The videos below qualitatively compare TAM-VT with STM. Please allow a moment for the videos to load.

GT

STM

Ours

Ex. 1

Ex. 2

Ex. 3

Ex. 4

Related links

TAM-VT is implemented on top of the TubeDETR and Mask2Former codebases, as well as the AOT implementation provided with VOST.

Citation

Acknowledgements

This work was funded, in part, by the Vector Institute for AI, Canada CIFAR AI Chairs, NSERC CRC, and NSERC DGs. Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, the Digital Research Alliance of Canada, companies sponsoring the Vector Institute, and Advanced Research Computing at the University of British Columbia. Additional hardware support was provided by a John R. Evans Leaders Fund CFI grant and by Compute Canada under a Resource Allocation Competition award. We would also like to thank Pavel Tokmakov, the author of the VOST dataset, for his invaluable assistance with the experimental results related to VOST. The website template was borrowed from Michaël Gharbi.