To Sink or Not to Sink: Visual Information Pathways
in Large Vision-Language Models

Jiayun Luo1,2*, Wan-Cyuan Fan1,2*, Lyuyang Wang1, Xiangteng He1,2, Tanzila Rahman1,2,
Purang Abolmaesumi1, Leonid Sigal1,2
1University of British Columbia
2Vector Institute for AI
*Equal contribution. Listed in alphabetical order.

TL;DR

We discover that sink tokens in LVLMs can originate from both the Vision Transformer (ViT) and the Large Language Model (LLM) itself. Critically, we find that ViT sinks not only carry global visual information but are essential for the LVLM to perform effective reasoning.


Figure: Illustration of ViT and LLM attention sinks in LLaVA-v1.5-7B. Given an image (A), ViT sinks (B) partially propagate into the LLM (C), alongside sinks that emerge within the LLM itself (D); together they outline all sinks within the LVLM (E).

Abstract

Large Vision-Language Models (LVLMs) combine Vision Transformers (ViTs) and Large Language Models (LLMs) to understand both visual and textual information. While existing work focuses on attention sinks within the LLM, we identify a critical class of high-norm visual tokens from the ViT, termed ViT attention sinks. These tokens encapsulate high-level semantic concepts that enable more effective reasoning by the LLM. We present comprehensive qualitative and quantitative analyses of these sink tokens and propose both training-free and training-based approaches to better leverage their information. Our methods demonstrate substantial improvements across multiple LVLMs and visual reasoning tasks, including mathematical problem solving, logical inference, and geometric understanding, highlighting the untapped potential of ViT attention sinks in enhancing visual reasoning.

Key Findings

Finding 1: Tokens with higher norms in the ViT are more likely to receive higher attention weights and become sinks in the LLM.

Finding 2: Visual sink tokens propagate into the LLM as distinct sink tokens, activating different hidden dimensions from the sinks that emerge within the LLM itself.

Finding 3: ViT sinks capture coarse-grained, high-level contextual features aligned with the specific focus of each attention head.

Finding 4: ViT sink tokens encode semantic summaries that are useful only under the right conditions: they benefit tasks with low visual complexity and global semantics, but may degrade performance on tasks that demand localized, detail-rich visual processing.
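
Findings 1 and 2 both hinge on locating the high-norm ViT tokens in the first place. The snippet below is a minimal sketch of norm-based sink detection in PyTorch, assuming only a generic tensor of ViT patch-token features; the z-score threshold and the find_vit_sinks helper are illustrative choices, not the paper's exact criterion.

import torch

def find_vit_sinks(patch_feats: torch.Tensor, z: float = 3.0) -> torch.Tensor:
    """Flag ViT patch tokens whose L2 norm is unusually high (candidate sinks).

    patch_feats: (num_patches, hidden_dim) features from the ViT layer fed to the LLM.
    z:           how many standard deviations above the mean norm counts as a sink
                 (an illustrative outlier rule, not the paper's exact criterion).
    Returns a boolean mask of shape (num_patches,) marking candidate sink tokens.
    """
    norms = patch_feats.norm(dim=-1)            # per-token L2 norm
    threshold = norms.mean() + z * norms.std()  # simple outlier threshold
    return norms > threshold

# Toy usage with random features standing in for real ViT outputs.
feats = torch.randn(576, 1024)                  # e.g., 24x24 patches, CLIP-L hidden size
feats[[3, 117]] *= 12.0                         # inject two artificial high-norm tokens
sink_mask = find_vit_sinks(feats)
print(sink_mask.nonzero(as_tuple=True)[0])      # -> tensor([  3, 117])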

Method and Experimental Results

We propose two approaches to leverage ViT attention sinks for enhanced visual reasoning: a training-free method that repositions sink tokens, and DIYSink, a training-based framework that learns to optimally project and select visual tokens. Both methods demonstrate substantial improvements across multiple vision-language benchmarks.

Method 1: Training-Free Sink Repositioning

Our training-free approach identifies ViT attention sinks and moves them to the front of the token sequence before feeding into the LLM. This simple yet effective method ensures that high-importance visual tokens are prioritized during the LLM's reasoning process, without requiring any model retraining or fine-tuning.

  • No additional training required
  • Zero-shot deployment on pretrained LVLMs
  • More effective token ordering for LVLM reasoning
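
Below is a minimal sketch of this repositioning step, assuming the sink mask comes from a norm-based detector like the one sketched under Key Findings; the reposition_sinks helper and the tensor shapes are hypothetical illustrations, not the released implementation.

import torch

def reposition_sinks(visual_tokens: torch.Tensor, sink_mask: torch.Tensor) -> torch.Tensor:
    """Move sink tokens to the front of the visual token sequence fed to the LLM.

    visual_tokens: (num_tokens, hidden_dim) projected visual tokens.
    sink_mask:     (num_tokens,) boolean mask of ViT sink positions.
    Returns the tokens reordered so sinks come first; other tokens keep their relative order.
    """
    sink_idx = sink_mask.nonzero(as_tuple=True)[0]
    other_idx = (~sink_mask).nonzero(as_tuple=True)[0]
    return visual_tokens[torch.cat([sink_idx, other_idx])]

# Toy usage: pretend tokens 3 and 117 were flagged as ViT sinks.
tokens = torch.randn(576, 4096)                 # e.g., LLaVA-style projected visual tokens
mask = torch.zeros(576, dtype=torch.bool)
mask[[3, 117]] = True
reordered = reposition_sinks(tokens, mask)
assert torch.equal(reordered[0], tokens[3]) and torch.equal(reordered[1], tokens[117])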

Method 2: DIYSink (Training from Scratch)

DIYSink trains a specialized architecture from scratch that learns to optimally handle sink and non-sink tokens. The framework uses a Dual-MLP Projector to separately project sink and non-sink tokens, ensuring each token type is appropriately transformed for the LLM. Additionally, dynamic token selection modules (CoT-Reweighting or MLP-Reweighting) adaptively choose the best token set based on the input.

  • Dual-MLP Projector for separate sink/non-sink processing
  • CoT-Reweighting and MLP-Reweighting modules
  • Input-adaptive token selection strategy
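
The snippet below sketches a dual-projector in this spirit, assuming hypothetical ViT/LLM hidden sizes and a LLaVA-style two-layer GELU MLP; it illustrates the idea of routing sink and non-sink tokens through separate projectors, not the released DIYSink code, and it omits the CoT-/MLP-Reweighting selection modules.

import torch
import torch.nn as nn

class DualMLPProjector(nn.Module):
    """Sketch: one MLP for sink tokens, another for non-sink tokens (hypothetical sizes)."""

    def __init__(self, vit_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        def mlp() -> nn.Sequential:  # two-layer GELU projector, LLaVA-style
            return nn.Sequential(nn.Linear(vit_dim, llm_dim), nn.GELU(),
                                 nn.Linear(llm_dim, llm_dim))
        self.sink_proj = mlp()   # transforms high-norm (sink) tokens
        self.other_proj = mlp()  # transforms the remaining visual tokens

    def forward(self, vit_tokens: torch.Tensor, sink_mask: torch.Tensor) -> torch.Tensor:
        # vit_tokens: (num_tokens, vit_dim); sink_mask: (num_tokens,) bool.
        # For clarity, project every token through both MLPs and pick the output per token.
        sink_out = self.sink_proj(vit_tokens)
        other_out = self.other_proj(vit_tokens)
        return torch.where(sink_mask.unsqueeze(-1), sink_out, other_out)

# Toy usage with the same hypothetical sink positions as above.
projector = DualMLPProjector()
vit_tokens = torch.randn(576, 1024)
mask = torch.zeros(576, dtype=torch.bool)
mask[[3, 117]] = True
llm_tokens = projector(vit_tokens, mask)        # (576, 4096), ready for the LLM

In the full framework, the reweighting modules then score the projected tokens and adaptively select which ones to pass to the LLM for each input.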

Results for DIYSink

We evaluate both methods on multiple vision-language benchmarks including general VQA, cognition reasoning, and math reasoning tasks. Our experiments demonstrate consistent improvements in both efficiency and accuracy compared to baseline methods.

BibTeX

@article{luofan2025sink,
  title={To Sink or Not to Sink: Visual Information Pathways in Large Vision-Language Models},
  author={Luo, Jiayun and Fan, Wan-Cyuan and Wang, Lyuyang and He, Xiangteng and Rahman, Tanzila and Abolmaesumi, Purang and Sigal, Leonid},
  journal={arXiv preprint arXiv:2510.08510},
  year={2025}
}

Acknowledgements

This work was funded, in part, by the Vector Institute for AI, Canada CIFAR AI Chairs, the NSERC Canada Research Chair (CRC) program, and NSERC Discovery and Discovery Accelerator Supplement Grants. Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, the Digital Research Alliance of Canada, companies sponsoring the Vector Institute, and Advanced Research Computing at the University of British Columbia. Additional hardware support was provided by a John R. Evans Leaders Fund CFI grant and by Compute Canada under a Resource Allocation Competition award.