M3T: Multi-Scale Memory Matching for Video Object
Segmentation and Tracking

  • \(^1\) University of British Columbia
  • \(^2\) Vector Institute for AI
  • \(^3\) CIFAR AI Chair
  • \(^4\) Ontario Tech University
\(^*\) Equal Contribution

Under submission

TL;DR

M3T is a video object segmentation and tracking model: given a video and an initial segmentation mask, it accurately tracks the specified object throughout the video, even as the object undergoes transformations.

Abstract

Video Object Segmentation (VOS) has become increasingly important with the availability of larger datasets and more complex and realistic settings, which involve long videos with global motion (e.g., in egocentric settings) depicting small objects undergoing both rigid and non-rigid (including state) deformations. While a number of recent approaches have been explored for this task, these data characteristics still present challenges. In this work we propose a novel, DETR-style encoder-decoder architecture that systematically analyzes and addresses the aforementioned challenges. Specifically, our model enables online inference on long videos in a windowed fashion, by breaking the video into clips and propagating context among them using time-coded memory. We illustrate that a short clip length and a longer memory with learned time-coding are important design choices for achieving state-of-the-art (SoTA) performance. Further, we propose multi-scale matching and decoding to ensure sensitivity and accuracy for small objects. Finally, we propose a novel training strategy that focuses learning on portions of the video where an object undergoes significant deformations -- a form of "soft" hard-negative mining, implemented as loss re-weighting. Collectively, these technical contributions allow our model to achieve SoTA performance on two complex datasets -- VISOR and VOST. A series of detailed ablations validate our design choices and provide insights into the importance of parameter choices and their impact on performance.
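To make the "soft" hard-negative mining concrete, the snippet below sketches one way such loss re-weighting could look: frames whose ground-truth mask changes substantially relative to the previous frame (a simple proxy for deformation) receive a larger weight. This is a minimal PyTorch sketch under that assumption; the names `transformation_weights` and `reweighted_segmentation_loss`, the IoU-based change measure, and the `alpha` scale are illustrative and not taken from the paper.

```python
import torch

def transformation_weights(gt_masks: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """Illustrative per-frame weights that emphasize frames where the ground-truth
    mask changes a lot between consecutive frames (a rough proxy for deformation).

    gt_masks: (T, H, W) binary ground-truth masks for one video.
    Returns:  (T,) weights, normalized to mean 1 so the overall loss scale is preserved.
    """
    prev, curr = gt_masks[:-1].bool(), gt_masks[1:].bool()
    inter = (prev & curr).flatten(1).sum(-1).float()
    union = (prev | curr).flatten(1).sum(-1).float().clamp(min=1)
    change = 1.0 - inter / union                 # high when the mask deforms
    w = torch.cat([change[:1], change])          # first frame reuses the first delta
    w = 1.0 + alpha * w                          # baseline weight of 1 for every frame
    return w * (w.numel() / w.sum())             # renormalize to mean 1

def reweighted_segmentation_loss(per_frame_loss: torch.Tensor,
                                 gt_masks: torch.Tensor) -> torch.Tensor:
    """Combine per-frame segmentation losses (shape (T,)) with the deformation weights."""
    return (transformation_weights(gt_masks) * per_frame_loss).mean()
```

Any per-frame segmentation loss (e.g., per-frame cross-entropy or Dice over the predicted masks) could be plugged in as `per_frame_loss`; the exact form of the paper's transformation-aware loss Ltr may differ.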

Methodology

Overview

We divide an input video into non-overlapping clips of length L. For a query clip, we retrieve information about previous clips from our (a) Clip-based Memory, in the form of frames and predicted (or initial reference) masks. We use a 2D-CNN backbone to obtain features Xq for the query frames, and features XM and YM for the memory frames and masks, respectively. Our proposed (b) Multi-Scale Matching Encoder then performs dense matching at multiple scales between the query clip's frame features Xq and the memory frame features XM, and uses the resulting frame-to-frame similarity to obtain the query clip's mask features Yq,enc as a weighted combination of the memory mask features YM. In doing so, we modulate the similarity with our proposed Relative-Time Encoding (RTE), which learns the recency of information in memory and thereby facilitates propagation over long time spans. Our (c) Multi-Scale Decoder then aggregates the resulting query mask features Yq,enc with the clip's frame features Xq using a Pixel Decoder, yielding a contextualized feature pyramid Yq,fpn. Finally, a Space-Time Decoder decodes the mask predictions Yq by refining learned time embeddings on the contextualized feature pyramid Yq,fpn. We update the memory, implemented as a FIFO queue, with the predictions for the last (L-th) frame of the query clip. During training, we use our transformation-aware loss Ltr to form the segmentation loss over the entire video.
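To illustrate the memory read-out, below is a minimal PyTorch sketch of a clip-based FIFO memory together with a single-scale matching step: query pixels attend to memory pixels and read out mask features as a weighted combination of the memory mask features, with the similarity modulated by a learned relative-time bias. The class names (`ClipMemory`, `TimeCodedMatcher`), the additive-bias form of the Relative-Time Encoding, and the tensor shapes are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn
from collections import deque


class ClipMemory:
    """FIFO memory of (frame features, mask features, timestamp) triples from past clips."""
    def __init__(self, max_size: int):
        self.buffer = deque(maxlen=max_size)   # oldest entries are evicted first

    def write(self, frame_feat: torch.Tensor, mask_feat: torch.Tensor, t: int):
        self.buffer.append((frame_feat, mask_feat, t))

    def read(self):
        frames, masks, times = zip(*self.buffer)
        return torch.stack(frames), torch.stack(masks), torch.tensor(times)


class TimeCodedMatcher(nn.Module):
    """Single-scale dense matching: query pixels attend to all memory pixels, with a
    learned relative-time bias, and read out mask features from memory."""
    def __init__(self, dim: int, max_rel_time: int = 32):
        super().__init__()
        self.scale = dim ** -0.5
        self.rel_time_bias = nn.Embedding(max_rel_time, 1)  # learned recency coding

    def forward(self, xq: torch.Tensor, xm: torch.Tensor, ym: torch.Tensor,
                rel_t: torch.Tensor) -> torch.Tensor:
        # xq: (Nq, C) query-frame features; xm, ym: (T, Nm, C) memory frame/mask features;
        # rel_t: (T,) integer "how many clips ago" each memory entry was written.
        sim = torch.einsum('qc,tmc->qtm', xq, xm) * self.scale
        sim = sim + self.rel_time_bias(rel_t).view(1, -1, 1)  # modulate by recency
        attn = sim.flatten(1).softmax(-1)                     # over all memory pixels
        return attn @ ym.flatten(0, 1)                        # (Nq, C) query mask features
```

In the full model this matching would be repeated at each level of the feature pyramid, and the resulting mask features Yq,enc would then be passed through the Pixel Decoder and Space-Time Decoder; those stages are omitted here.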

Demo Video

Results on VOST

The videos below qualitatively compare M3T with AOT. Please allow a moment for the videos to load.

Video comparison grid (columns: GT, AOT, Ours; rows: Ex. 1-4).

Results on VISOR

The videos below qualitatively compare M3T with STM. Please allow a moment for the videos to load.

Video comparison grid (columns: GT, STM, Ours; rows: Ex. 1-4).

Related links

M3T is implemented on top of the TubeDETR and Mask2Former codebases, as well as the AOT implementation in the VOST codebase.

Citation

Acknowledgements

This work was funded, in part, by the Vector Institute for AI, Canada CIFAR AI Chairs, an NSERC CRC, and NSERC Discovery Grants. Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, the Digital Research Alliance of Canada, companies sponsoring the Vector Institute, and Advanced Research Computing at the University of British Columbia. Additional hardware support was provided by a John R. Evans Leaders Fund CFI grant and by Compute Canada under a Resource Allocation Competition award. We would also like to thank Pavel Tokmakov, the author of the VOST dataset, for his invaluable assistance with the experimental results related to VOST. The website template was borrowed from Michaël Gharbi.