M3T: Multi-Scale Memory Matching for Video Object
Segmentation and Tracking

  • \(^1\) University of British Columbia
  • \(^2\) Vector Institute for AI
  • \(^3\) CIFAR AI Chair
  • \(^4\) Ontario Tech University
\(^*\) Equal Contribution

Under submission

TL;DR

M3T is a video object segmentation and tracking model: given a video and an initial segmentation mask, it accurately tracks the specified object throughout the video, even as the object undergoes transformations.

Abstract

Video Object Segmentation (VOS) has become increasingly important with the availability of larger datasets and more complex and realistic settings, which involve long videos with global motion (e.g., in egocentric settings) depicting small objects undergoing both rigid and non-rigid (including state) deformations. While a number of recent approaches have been explored for this task, these data characteristics still present challenges. In this work we propose a novel, DETR-style encoder-decoder architecture that systematically analyzes and addresses the aforementioned challenges. Specifically, our model enables online inference on long videos in a windowed fashion, by breaking the video into clips and propagating context among them using time-coded memory. We illustrate that a short clip length and a longer memory with learned time-coding are important design choices for achieving state-of-the-art (SoTA) performance. Further, we propose multi-scale matching and decoding to ensure sensitivity and accuracy for small objects. Finally, we propose a novel training strategy that focuses learning on portions of the video where an object undergoes significant deformations -- a form of "soft" hard-negative mining, implemented as loss re-weighting. Collectively, these technical contributions allow our model to achieve SoTA performance on two complex datasets -- VISOR and VOST. A series of detailed ablations validate our design choices and provide insights into the importance of parameter choices and their impact on performance.
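To make the "soft" hard-negative mining concrete, the snippet below sketches one way such loss re-weighting could look: frames whose ground-truth mask changes substantially relative to the previous frame (a simple proxy for deformation) receive a larger weight. This is a minimal PyTorch sketch under that assumption; the names `transformation_weights` and `reweighted_segmentation_loss`, the IoU-based change measure, and the `alpha` scale are illustrative and not taken from the paper.

```python
import torch

def transformation_weights(gt_masks: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """Illustrative per-frame weights that emphasize frames where the ground-truth
    mask changes a lot between consecutive frames (a rough proxy for deformation).

    gt_masks: (T, H, W) binary ground-truth masks for one video.
    Returns:  (T,) weights, normalized to mean 1 so the overall loss scale is preserved.
    """
    prev, curr = gt_masks[:-1].bool(), gt_masks[1:].bool()
    inter = (prev & curr).flatten(1).sum(-1).float()
    union = (prev | curr).flatten(1).sum(-1).float().clamp(min=1)
    change = 1.0 - inter / union                 # high when the mask deforms
    w = torch.cat([change[:1], change])          # first frame reuses the first delta
    w = 1.0 + alpha * w                          # baseline weight of 1 for every frame
    return w * (w.numel() / w.sum())             # renormalize to mean 1

def reweighted_segmentation_loss(per_frame_loss: torch.Tensor,
                                 gt_masks: torch.Tensor) -> torch.Tensor:
    """Combine per-frame segmentation losses (shape (T,)) with the deformation weights."""
    return (transformation_weights(gt_masks) * per_frame_loss).mean()
```

Any per-frame segmentation loss (e.g., per-frame cross-entropy or Dice over the predicted masks) could be plugged in as `per_frame_loss`; the exact form of the paper's transformation-aware loss Ltr may differ.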

Methodology

Overview

We divide an input video into non-overlapping clips of length L. For a query clip, we retrieve information about previous clips from our (a) Clip-based Memory, in the form of frames and predicted (or initial reference) masks. We use a 2D-CNN backbone to obtain features Xq for the query frames, and features XM and YM for the memory frames and masks, respectively. Our proposed (b) Multi-Scale Matching Encoder then performs dense matching at multiple scales between the query clip's frame features Xq and the memory frame features XM, and uses the resulting frame-to-frame similarity to obtain the query clip's mask features Yq,enc as a weighted combination of the memory mask features YM. In doing so, we modulate the similarity with our proposed Relative-Time Encoding (RTE), which learns the recency of information in memory and thereby facilitates propagation over long time spans. Our (c) Multi-Scale Decoder then aggregates the resulting query mask features Yq,enc with the clip's frame features Xq using a Pixel Decoder, yielding a contextualized feature pyramid Yq,fpn. Finally, a Space-Time Decoder decodes the mask predictions Yq by refining learned time embeddings on the contextualized feature pyramid Yq,fpn. We update the memory, implemented as a FIFO queue, with the predictions for the last (L-th) frame of the query clip. During training, we use our transformation-aware loss Ltr to form the segmentation loss over the entire video.
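To illustrate the memory read-out, below is a minimal PyTorch sketch of a clip-based FIFO memory together with a single-scale matching step: query pixels attend to memory pixels and read out mask features as a weighted combination of the memory mask features, with the similarity modulated by a learned relative-time bias. The class names (`ClipMemory`, `TimeCodedMatcher`), the additive-bias form of the Relative-Time Encoding, and the tensor shapes are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn
from collections import deque


class ClipMemory:
    """FIFO memory of (frame features, mask features, timestamp) triples from past clips."""
    def __init__(self, max_size: int):
        self.buffer = deque(maxlen=max_size)   # oldest entries are evicted first

    def write(self, frame_feat: torch.Tensor, mask_feat: torch.Tensor, t: int):
        self.buffer.append((frame_feat, mask_feat, t))

    def read(self):
        frames, masks, times = zip(*self.buffer)
        return torch.stack(frames), torch.stack(masks), torch.tensor(times)


class TimeCodedMatcher(nn.Module):
    """Single-scale dense matching: query pixels attend to all memory pixels, with a
    learned relative-time bias, and read out mask features from memory."""
    def __init__(self, dim: int, max_rel_time: int = 32):
        super().__init__()
        self.scale = dim ** -0.5
        self.rel_time_bias = nn.Embedding(max_rel_time, 1)  # learned recency coding

    def forward(self, xq: torch.Tensor, xm: torch.Tensor, ym: torch.Tensor,
                rel_t: torch.Tensor) -> torch.Tensor:
        # xq: (Nq, C) query-frame features; xm, ym: (T, Nm, C) memory frame/mask features;
        # rel_t: (T,) integer "how many clips ago" each memory entry was written.
        sim = torch.einsum('qc,tmc->qtm', xq, xm) * self.scale
        sim = sim + self.rel_time_bias(rel_t).view(1, -1, 1)  # modulate by recency
        attn = sim.flatten(1).softmax(-1)                     # over all memory pixels
        return attn @ ym.flatten(0, 1)                        # (Nq, C) query mask features
```

In the full model this matching would be repeated at each level of the feature pyramid, and the resulting mask features Yq,enc would then be passed through the Pixel Decoder and Space-Time Decoder; those stages are omitted here.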

Demo Video

Results on VOST

The videos below qualitatively compare M3T with AOT. Please allow a moment for the videos to load.

Video comparison grid (columns: GT, AOT, Ours; rows: Ex. 1-4).

Results on VISOR

The videos below qualitatively compare M3T with STM. Please allow a moment for the videos to load.

Video comparison grid (columns: GT, STM, Ours; rows: Ex. 1-4).

Related links

M3T is implemented on top of the TubeDETR and Mask2Former codebases, as well as the AOT implementation in the VOST codebase.

Citation

Acknowledgements

This work was funded, in part, by the Vector Institute for AI, Canada CIFAR AI Chairs, an NSERC CRC, and NSERC Discovery Grants. Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, the Digital Research Alliance of Canada, companies sponsoring the Vector Institute, and Advanced Research Computing at the University of British Columbia. Additional hardware support was provided by a John R. Evans Leaders Fund CFI grant and by Compute Canada under a Resource Allocation Competition award. We would also like to thank Pavel Tokmakov, the author of the VOST dataset, for his invaluable assistance with the experimental results related to VOST. The website template was borrowed from Michaël Gharbi.