To Sink or Not to Sink: Visual Information Pathways
in Large Vision-Language Models

Jiayun Luo1,2*, Wan-Cyuan Fan1,2*, Lyuyang Wang1, Xiangteng He1,2, Tanzila Rahman1,2,
Purang Abolmaesumi1, Leonid Sigal1,2
1University of British Columbia
2Vector Institute for AI
*Equal contribution. Listed in alphabetical order.

TL;DR

We discover that sink tokens in LVLMs can originate from both the Vision Transformer (ViT) and the Large Language Model (LLM) itself. Critically, we find that ViT sinks not only carry global visual information but are essential for the LVLM to perform effective reasoning.


Figure: Illustration of ViT and LLM attention sinks in LLaVA-v1.5-7B. Given an image (A), ViT sinks (B) partially propagate into the LLM (C), alongside sinks that emerge within the LLM itself (D); together they outline all sinks within the LVLM (E).

Abstract

Large Vision-Language Models (LVLMs) combine Vision Transformers (ViTs) and Large Language Models (LLMs) to understand both visual and textual information. While existing work focuses on attention sinks within the LLM, we identify a critical class of high-norm visual tokens from the ViT, termed ViT attention sinks. These tokens encapsulate high-level semantic concepts that enable more effective reasoning by the LLM. We present comprehensive qualitative and quantitative analyses of these sink tokens and propose both training-free and training-based approaches to better leverage their information. Our methods demonstrate substantial improvements across multiple LVLMs and visual reasoning tasks, including mathematical problem solving, logical inference, and geometric understanding, highlighting the untapped potential of ViT attention sinks in enhancing visual reasoning.

Key Findings

Finding 1: Tokens with higher norms in the ViT are more likely to receive higher attention weights and become sinks in the LLM.

Finding 2: Visual sink tokens propagate into the LLM as distinct sink tokens, activating different hidden dimensions from the sinks that emerge within the LLM itself.

Finding 3: ViT sinks capture coarse-grained, high-level contextual features aligned with the specific focus of each attention head.

Finding 4: ViT sink tokens encode semantic summaries that are useful only under the right conditions: they benefit tasks with low visual complexity and global semantics, but may degrade performance on tasks that demand localized, detail-rich visual processing.
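
Findings 1 and 2 both hinge on locating the high-norm ViT tokens in the first place. The snippet below is a minimal sketch of norm-based sink detection in PyTorch, assuming only a generic tensor of ViT patch-token features; the z-score threshold and the find_vit_sinks helper are illustrative choices, not the paper's exact criterion.

import torch

def find_vit_sinks(patch_feats: torch.Tensor, z: float = 3.0) -> torch.Tensor:
    """Flag ViT patch tokens whose L2 norm is unusually high (candidate sinks).

    patch_feats: (num_patches, hidden_dim) features from the ViT layer fed to the LLM.
    z:           how many standard deviations above the mean norm counts as a sink
                 (an illustrative outlier rule, not the paper's exact criterion).
    Returns a boolean mask of shape (num_patches,) marking candidate sink tokens.
    """
    norms = patch_feats.norm(dim=-1)            # per-token L2 norm
    threshold = norms.mean() + z * norms.std()  # simple outlier threshold
    return norms > threshold

# Toy usage with random features standing in for real ViT outputs.
feats = torch.randn(576, 1024)                  # e.g., 24x24 patches, CLIP-L hidden size
feats[[3, 117]] *= 12.0                         # inject two artificial high-norm tokens
sink_mask = find_vit_sinks(feats)
print(sink_mask.nonzero(as_tuple=True)[0])      # -> tensor([  3, 117])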

Method and Experimental Results

We propose two approaches to leverage ViT attention sinks for enhanced visual reasoning: a training-free method that repositions sink tokens, and DIYSink, a training-based framework that learns to optimally project and select visual tokens. Both methods demonstrate substantial improvements across multiple vision-language benchmarks.

Method 1: Training-Free Sink Repositioning

Our training-free approach identifies ViT attention sinks and moves them to the front of the token sequence before feeding into the LLM. This simple yet effective method ensures that high-importance visual tokens are prioritized during the LLM's reasoning process, without requiring any model retraining or fine-tuning.

  • No additional training required
  • Zero-shot deployment on pretrained LVLMs
  • More effective token ordering for LVLM reasoning
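
Below is a minimal sketch of this repositioning step, assuming the sink mask comes from a norm-based detector like the one sketched under Key Findings; the reposition_sinks helper and the tensor shapes are hypothetical illustrations, not the released implementation.

import torch

def reposition_sinks(visual_tokens: torch.Tensor, sink_mask: torch.Tensor) -> torch.Tensor:
    """Move sink tokens to the front of the visual token sequence fed to the LLM.

    visual_tokens: (num_tokens, hidden_dim) projected visual tokens.
    sink_mask:     (num_tokens,) boolean mask of ViT sink positions.
    Returns the tokens reordered so sinks come first; other tokens keep their relative order.
    """
    sink_idx = sink_mask.nonzero(as_tuple=True)[0]
    other_idx = (~sink_mask).nonzero(as_tuple=True)[0]
    return visual_tokens[torch.cat([sink_idx, other_idx])]

# Toy usage: pretend tokens 3 and 117 were flagged as ViT sinks.
tokens = torch.randn(576, 4096)                 # e.g., LLaVA-style projected visual tokens
mask = torch.zeros(576, dtype=torch.bool)
mask[[3, 117]] = True
reordered = reposition_sinks(tokens, mask)
assert torch.equal(reordered[0], tokens[3]) and torch.equal(reordered[1], tokens[117])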

Method 2: DIYSink (Training from Scratch)

DIYSink trains a specialized architecture from scratch that learns to optimally handle sink and non-sink tokens. The framework uses a Dual-MLP Projector to separately project sink and non-sink tokens, ensuring each token type is appropriately transformed for the LLM. Additionally, dynamic token selection modules (CoT-Reweighting or MLP-Reweighting) adaptively choose the best token set based on the input.

  • Dual-MLP Projector for separate sink/non-sink processing
  • CoT-Reweighting and MLP-Reweighting modules
  • Input-adaptive token selection strategy
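
The snippet below sketches a dual-projector in this spirit, assuming hypothetical ViT/LLM hidden sizes and a LLaVA-style two-layer GELU MLP; it illustrates the idea of routing sink and non-sink tokens through separate projectors, not the released DIYSink code, and it omits the CoT-/MLP-Reweighting selection modules.

import torch
import torch.nn as nn

class DualMLPProjector(nn.Module):
    """Sketch: one MLP for sink tokens, another for non-sink tokens (hypothetical sizes)."""

    def __init__(self, vit_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        def mlp() -> nn.Sequential:  # two-layer GELU projector, LLaVA-style
            return nn.Sequential(nn.Linear(vit_dim, llm_dim), nn.GELU(),
                                 nn.Linear(llm_dim, llm_dim))
        self.sink_proj = mlp()   # transforms high-norm (sink) tokens
        self.other_proj = mlp()  # transforms the remaining visual tokens

    def forward(self, vit_tokens: torch.Tensor, sink_mask: torch.Tensor) -> torch.Tensor:
        # vit_tokens: (num_tokens, vit_dim); sink_mask: (num_tokens,) bool.
        # For clarity, project every token through both MLPs and pick the output per token.
        sink_out = self.sink_proj(vit_tokens)
        other_out = self.other_proj(vit_tokens)
        return torch.where(sink_mask.unsqueeze(-1), sink_out, other_out)

# Toy usage with the same hypothetical sink positions as above.
projector = DualMLPProjector()
vit_tokens = torch.randn(576, 1024)
mask = torch.zeros(576, dtype=torch.bool)
mask[[3, 117]] = True
llm_tokens = projector(vit_tokens, mask)        # (576, 4096), ready for the LLM

In the full framework, the reweighting modules then score the projected tokens and adaptively select which ones to pass to the LLM for each input.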

Results for DIYSink

We evaluate both methods on multiple vision-language benchmarks including general VQA, cognition reasoning, and math reasoning tasks. Our experiments demonstrate consistent improvements in both efficiency and accuracy compared to baseline methods.

BibTeX

@article{luofan2025sink,
  title={To Sink or Not to Sink: Visual Information Pathways in Large Vision-Language Models},
  author={Luo, Jiayun and Fan, Wan-Cyuan and Wang, Lyuyang and He, Xiangteng and Rahman, Tanzila and Abolmaesumi, Purang and Sigal, Leonid},
  journal={arXiv preprint arXiv:2510.08510},
  year={2025}
}

Acknowledgements

This work was funded, in part, by the Vector Institute for AI, Canada CIFAR AI Chairs, the NSERC Canada Research Chair (CRC) program, and NSERC Discovery and Discovery Accelerator Supplement Grants. Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, the Digital Research Alliance of Canada, companies sponsoring the Vector Institute, and Advanced Research Computing at the University of British Columbia. Additional hardware support was provided by a John R. Evans Leaders Fund CFI grant and by Compute Canada under a Resource Allocation Competition award.