Tinted Frames:
Question Framing Blinds Vision-Language Models
VLMs are selectively blind: they modulate how much they look at an image based on question framing (open-ended vs. Yes/No vs. MCQ), even when the same visual reasoning is required. In this paper, we analyze this phenomenon and propose a lightweight method to mitigate it.
Abstract
Vision-Language Models (VLMs) have been shown to be blind, often underutilizing their visual inputs even on tasks that require visual reasoning. In this work, we demonstrate that VLMs are selectively blind. They modulate the amount of attention applied to visual inputs based on linguistic framing, even when alternative framings demand identical visual reasoning. Using visual attention as a probe, we quantify how framing alters both the amount and distribution of attention over the image. Constrained framings, such as multiple choice and yes/no, induce substantially lower attention to image context compared to open-ended framing, reduce focus on task-relevant regions, and shift attention towards uninformative tokens. We further demonstrate that this attention misallocation is the principal cause of degraded accuracy and cross-framing inconsistency. Building on this mechanistic insight, we introduce a lightweight prompt-tuning method using learnable tokens that encourages the robust, visually grounded attention patterns observed in open-ended settings, improving both visual grounding and performance across framings.
Hypothesis
We hypothesize that question framing (F) affects model output (Y) indirectly through visual attention (A). The framing modulates how much and where the model attends to the image, and this attention shift is the principal mediator of accuracy degradation. We validate each link of this F→A→Y pathway in the findings below.
Key Findings
Does framing affect predictions? (F→Y)
We first test for cross-framing inconsistency, using open-ended generation as an anchor. If a model answers correctly in open-ended form but fails under Yes/No or MCQ, the error is likely driven by framing, not a lack of visual understanding.
The results reveal a surprising degree of inconsistency across all tested VLMs. On GQA, multiple models exhibit over 15% cross-framing inconsistency (nearly one in six correct answers lost by rephrasing). The task-level breakdown on SeedBench is particularly revealing, leading to our Finding 1.
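The anchoring logic above can be sketched as follows. This is a minimal illustration in our own notation, not the paper's evaluation code: a sample counts as inconsistent when the model answers correctly in the open-ended anchor framing but fails under the constrained one, and the rate is taken over anchored (open-ended-correct) samples.

```python
def inconsistency_rate(records, constrained="yesno"):
    """Cross-framing inconsistency relative to the open-ended anchor.

    records: list of dicts with per-framing correctness flags
             (field names here are illustrative, not from the paper).
    """
    # Keep only samples the model already solves in open-ended form.
    anchored = [r for r in records if r["open_ended"]]
    if not anchored:
        return 0.0
    # Count anchored answers that are lost under the constrained framing.
    lost = sum(1 for r in anchored if not r[constrained])
    return lost / len(anchored)

samples = [
    {"open_ended": True,  "yesno": True},
    {"open_ended": True,  "yesno": False},  # correct open-ended, lost by rephrasing
    {"open_ended": False, "yesno": True},   # excluded: anchor already wrong
    {"open_ended": True,  "yesno": True},
]
rate = inconsistency_rate(samples)  # 1 of 3 anchored answers lost
```

Note that samples the model fails in open-ended form are excluded entirely, so the metric isolates errors attributable to framing rather than to missing visual understanding.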
Finding 1
Tasks requiring object grounding exhibit the highest inconsistency rates, with multiple-object grounding tasks such as spatial relation and counting being the most affected, suggesting that constrained framing is most damaging where visual grounding matters most.
How does framing reshape attention? (F→A)
Using attention rollout, we measure how framing changes the model's visual attention along three dimensions: overall visual engagement (visual energy), spatial allocation relative to task-relevant regions (box attention), and attention dispersion (entropy).
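The three metrics above can be computed directly from a rolled-out attention vector over image patches. The sketch below is our simplification, not the paper's code: we assume `rollout` holds each image token's attention mass and `box_mask` marks the patches inside the task-relevant region.

```python
import numpy as np

def attention_metrics(rollout, box_mask):
    """Compute (visual energy, box attention, entropy) from a rollout vector."""
    energy = rollout.sum()                    # overall visual engagement
    p = rollout / energy                      # distribution over image tokens
    box_attn = p[box_mask].sum()              # mass on the task-relevant region
    entropy = -(p * np.log(p + 1e-12)).sum()  # dispersion of attention
    return energy, box_attn, entropy

rollout = np.array([0.05, 0.30, 0.40, 0.05])     # toy 4-patch image
box_mask = np.array([False, True, True, False])  # patches 1-2 are the target box
energy, box_attn, entropy = attention_metrics(rollout, box_mask)
```

Under this toy input, lower `energy` corresponds to reduced visual engagement, lower `box_attn` to attention drifting off the target region, and higher `entropy` to more dispersed attention, matching the three dimensions measured in the analysis.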
Finding 2
Constrained framings reduce overall visual energy, redirect attention from target regions to uninformative tokens, and produce more dispersed attention patterns, confirming that question framing reshapes visual attention.
Does attention actually drive the errors? (A→Y)
To test causation, we perform attention steering: intervening on the model's attention under constrained framings to restore it toward open-ended levels, then measuring whether accuracy recovers. We study two complementary interventions: scaling total visual energy (how much the model looks) and redistributing attention toward target regions (where it looks).
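The two interventions can be sketched on a single attention distribution. The functions below are hedged simplifications under our own assumptions (the names, the joint renormalization, and the proportional redistribution are ours, not the paper's implementation):

```python
import numpy as np

def steer_energy(attn_img, attn_text, scale):
    """Scale attention on image tokens, then renormalize jointly with text
    so the result remains a valid distribution ('how much the model looks')."""
    boosted = attn_img * scale
    z = boosted.sum() + attn_text.sum()
    return boosted / z, attn_text / z

def steer_spatial(attn_img, box_mask, alpha):
    """Move a fraction alpha of off-box attention onto the target region,
    distributed proportionally to the existing in-box pattern
    ('where the model looks'). Total attention mass is preserved."""
    moved = alpha * attn_img[~box_mask].sum()
    out = attn_img.copy()
    out[~box_mask] *= (1.0 - alpha)
    out[box_mask] += moved * attn_img[box_mask] / attn_img[box_mask].sum()
    return out

attn = np.array([0.10, 0.30, 0.40, 0.20])        # toy image-token attention
box_mask = np.array([False, True, True, False])  # task-relevant patches
img, txt = steer_energy(attn, np.array([0.30, 0.20]), scale=1.5)
steered = steer_spatial(attn, box_mask, alpha=0.5)
```

`steer_energy` changes the image/text balance without altering the spatial pattern, while `steer_spatial` changes the spatial pattern without altering total visual mass, which is what makes the two interventions complementary probes of the A→Y link.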
Finding 3
Steering attention back to open-ended levels recovers accuracy, confirming the A→Y link. Spatial allocation (where the model looks) yields universal gains, while visual energy magnitude (how much it looks) primarily benefits grounding-heavy tasks, revealing that framing-induced errors stem from these attention failures.
Method
Building on the findings, we introduce a lightweight prompt-tuning approach to fix the attention misallocation.
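The core mechanism of prompt tuning can be sketched in a few lines. This is our minimal numpy illustration of the general technique, not the released implementation: a small set of learnable token embeddings is prepended to the question embeddings, and during training only those tokens would be updated while the VLM stays frozen.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_prompt = 8, 4  # toy sizes for illustration

# Trainable soft-prompt embeddings (the only parameters that would be updated).
prompt_tokens = rng.normal(scale=0.02, size=(n_prompt, d_model))

# Frozen question embeddings as produced by the VLM's embedding layer (toy stand-in).
question_emb = rng.normal(size=(6, d_model))

# The model consumes [prompt ; question] in place of the question alone.
inputs = np.concatenate([prompt_tokens, question_emb], axis=0)
```

In the paper's setting, the training signal for these tokens comes from re-aligning constrained-framing attention with the open-ended patterns, rather than from the task loss alone.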
Discussion
Does lightweight prompt-tuning recover attention?
Our learnable tokens are optimized to re-align the attention patterns of constrained framings with those of open-ended settings. We measure whether this intervention actually restores the visual engagement that framing suppresses. The results show that our method recovers visual attention, especially box attention, and further reduces the inconsistency rate.
Does it improve performance across benchmarks?
Beyond recovering attention, we evaluate whether the restored visual grounding translates into accuracy gains across diverse VLMs and benchmarks. With the improved attention patterns, our method improves performance overall, with especially strong gains on grounding-heavy tasks (Vstar).
Which training objective plays the major role?
We ablate the loss components to understand which objective contributes most to the improvements. We find that without the attention alignment loss, performance drops significantly, confirming that the improvements stem from better attention rather than merely more parameters or training data.
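The objective being ablated can be sketched as a task loss plus an alignment term pulling the constrained framing's attention toward the open-ended reference. The KL form, the epsilon smoothing, and the weighting `lam` below are our assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def attention_alignment_loss(attn_constrained, attn_open):
    """KL divergence from the open-ended reference attention to the
    constrained-framing attention (both assumed normalized)."""
    p, q = attn_open, attn_constrained
    return float((p * np.log((p + 1e-12) / (q + 1e-12))).sum())

def total_loss(task_loss, attn_constrained, attn_open, lam=1.0):
    """Combined objective: answer loss + weighted attention alignment."""
    return task_loss + lam * attention_alignment_loss(attn_constrained, attn_open)

open_attn = np.array([0.10, 0.60, 0.20, 0.10])  # open-ended reference pattern
flat_attn = np.array([0.25, 0.25, 0.25, 0.25])  # dispersed constrained pattern
```

Dropping the alignment term (`lam=0`) reduces this to plain fine-tuning of the prompt tokens, which is exactly the ablation that tests whether the gains come from better attention.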
Qualitative Results
BibTeX
Acknowledgements
This work was funded, in part, by the Vector Institute for AI, Canada CIFAR AI Chairs, NSERC Canada Research Chair (CRC), AML-TN UBC, and NSERC Discovery and Discovery Accelerator Supplement Grants. Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, the Digital Research Alliance of Canada (alliancecan.ca), companies sponsoring the Vector Institute, and Advanced Research Computing at the University of British Columbia. Additional hardware support was provided by John R. Evans Leaders Fund CFI grant and Compute Canada under the Resource Allocation Competition award. Ritwik Gupta and Declan Kutscher were supported in part by funding from the Department of Defense, The House Fund, and BAIR's industrial alliance programs. Additional compute was provided by the Department of Defense's High Performance Computing Modernization Program. We are immensely grateful to Bicheng Xu from UBC and Stephanie Fu from UCB for sharing their valuable suggestions in paper writing and experiments.