Tinted Frames:
Question Framing Blinds Vision-Language Models

Wan-Cyuan Fan1,2, Jiayun Luo1,2†, Declan Kutscher3†, Leonid Sigal1,2, Ritwik Gupta3
1University of British Columbia    2Vector Institute for AI    3UC Berkeley
†Equal contribution
TL;DR

VLMs are selectively blind — they decide how much to look at an image based on question framing (open-ended vs. Yes/No vs. MCQ), even when the same visual reasoning is required. In this paper, we analyze this phenomenon and propose a lightweight mitigation.

Teaser: attention maps showing grounding changes across framings
Figure 1. VLM grounding changes as a function of question framing. Attention rollouts reveal that while the model actively attends to the target object during open-ended generation, it exhibits disengagement and misallocation when the same question is posed as a Yes/No or MCQ task. The top 3 tokens with the highest attention are highlighted in red boxes; the linear colormap uses the same minimum and maximum across all images, so the maps are directly comparable.

3 min read

Abstract

Vision-Language Models (VLMs) have been shown to be blind, often underutilizing their visual inputs even on tasks that require visual reasoning. In this work, we demonstrate that VLMs are selectively blind: they modulate the amount of attention applied to visual inputs based on linguistic framing, even when alternative framings demand identical visual reasoning. Using visual attention as a probe, we quantify how framing alters both the amount and the distribution of attention over the image. Constrained framings, such as multiple-choice and Yes/No, induce substantially lower attention to image context than open-ended framing, reduce focus on task-relevant regions, and shift attention towards uninformative tokens. We further demonstrate that this attention misallocation is the principal cause of degraded accuracy and cross-framing inconsistency. Building on this mechanistic insight, we introduce a lightweight prompt-tuning method using learnable tokens that encourages the robust, visually grounded attention patterns observed in open-ended settings, improving visual grounding and performance across framings.

Hypothesis

We hypothesize that question framing (F) affects model output (Y) indirectly through visual attention (A). The framing modulates how much and where the model attends to the image, and this attention shift is the principal mediator of accuracy degradation. We validate each link of this F→A→Y pathway in the findings below.

Causal graph: F to A to Y
Figure 2. Graph illustrating the F→A→Y pathway.

Key Findings

Does framing affect predictions? (F→Y)

We first test for cross-framing inconsistency. We use open-ended generation as an anchor. If a model answers correctly in open-ended form but fails under Yes/No or MCQ, the error is likely driven by framing, not a lack of visual understanding.

Cross-framing inconsistency evaluation protocol
Figure 3 (left). Evaluation protocol: we retain correctly answered open-ended questions, rephrase them as Yes/No and MCQ, and measure whether the correct answer is preserved.
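The protocol above reduces to a simple conditional error rate. The sketch below is illustrative, not the authors' evaluation code; the function and argument names (`open_correct`, `reframed_correct`) are assumptions.

```python
def inconsistency_rate(open_correct, reframed_correct):
    """Fraction of questions answered correctly in open-ended form that are
    answered incorrectly once rephrased (as Yes/No or MCQ).

    open_correct, reframed_correct: parallel lists of per-question booleans.
    """
    # Anchor on questions the model already gets right in open-ended form.
    anchored = [i for i, ok in enumerate(open_correct) if ok]
    if not anchored:
        return 0.0
    # Count how many of those correct answers are lost after rephrasing.
    lost = sum(1 for i in anchored if not reframed_correct[i])
    return lost / len(anchored)

# Example: 4 of 5 open-ended answers correct; 1 of those 4 lost after rephrasing.
rate = inconsistency_rate(
    open_correct=[True, True, False, True, True],
    reframed_correct=[True, False, False, True, True],
)
print(rate)  # 0.25
```

Under this definition, a 15% inconsistency rate means roughly one in six open-ended-correct answers is lost to rephrasing alone.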

The results reveal a surprising degree of inconsistency across all tested VLMs. On GQA, multiple models exhibit over 15% cross-framing inconsistency (nearly one in six correct answers lost by rephrasing). The task-level breakdown on SeedBench is particularly revealing and motivates Finding 1.

Cross-framing inconsistency bar charts on GQA and SeedBench
Figure 3 (right). Cross-framing inconsistency rates across VLMs (GQA) and task categories (SeedBench). Grounding-heavy tasks suffer the most.

Finding 1

Tasks requiring object grounding exhibit the highest inconsistency rates, with multiple-object grounding tasks such as spatial relation and counting being the most affected, suggesting that constrained framing is most damaging where visual grounding matters most.

How does framing reshape attention? (F→A)

Using attention rollout, we measure how framing changes the model's visual attention along three dimensions: overall visual engagement (visual energy), spatial allocation relative to task-relevant regions (box attention), and attention dispersion (entropy).
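Given a rollout attention vector from the answer token over all input tokens, the three measurements can be sketched as below. This is a minimal illustration of the three quantities, assuming a flat attention vector and boolean token masks; it is not the authors' implementation, and the exact normalizations may differ.

```python
import numpy as np

def visual_attention_stats(attn, image_mask, box_mask):
    """Compute the three framing-sensitive attention measurements.

    attn: 1-D rollout attention from the answer token over all input
          tokens (non-negative, sums to 1).
    image_mask: boolean mask selecting image tokens.
    box_mask:   boolean mask selecting image tokens inside the
                task-relevant (target) region.
    """
    img = attn[image_mask]
    # Visual energy: total attention mass on the image (how much it looks).
    visual_energy = float(img.sum())
    # Box attention: share of visual attention on the target region (where).
    box_attention = float(attn[box_mask].sum()) / max(visual_energy, 1e-12)
    # Entropy: dispersion of the attention distribution over image tokens.
    p = img / max(visual_energy, 1e-12)
    entropy = float(-(p * np.log(p + 1e-12)).sum())
    return visual_energy, box_attention, entropy
```

Under the paper's findings, constrained framings would show lower visual energy, lower box attention, and higher entropy than open-ended framing.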

Visual energy, box attention, sink attention, entropy across framings
Figure 4. Visual energy, box attention, sink attention, and entropy across framings, with layer-wise analysis. The divergence emerges in middle layers (12–22) where cross-modal interaction occurs.

Finding 2

Constrained framings reduce overall visual energy, redirect attention from target regions to uninformative tokens, and produce more dispersed attention patterns, confirming that question framing reshapes visual attention.

Does attention actually drive the errors? (A→Y)

To test causation, we perform attention steering: we intervene on the model's attention under constrained framings to restore it toward open-ended levels, then measure whether accuracy recovers. We study two complementary interventions: scaling total visual energy (how much the model looks) and redistributing attention toward target regions (where it looks).
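The two interventions can be sketched as simple operations on an attention row. This is an illustrative simplification over a single post-softmax attention vector (the actual interventions may act on pre-softmax logits inside the model); the function names and the `alpha`/`multiplier` parameters are assumptions.

```python
import numpy as np

def steer_visual_energy(attn, image_mask, multiplier):
    """Scale the attention mass on image tokens (how much the model looks),
    then renormalize so the distribution still sums to 1."""
    steered = attn.copy()
    steered[image_mask] *= multiplier
    return steered / steered.sum()

def steer_toward_box(attn, image_mask, box_mask, alpha):
    """Move a fraction `alpha` of the image attention mass into the target
    box (where the model looks), spread uniformly over box tokens."""
    steered = attn.copy()
    img_mass = steered[image_mask].sum()
    steered[image_mask] *= (1.0 - alpha)
    steered[box_mask] += alpha * img_mass / box_mask.sum()
    return steered / steered.sum()
```

Sweeping `multiplier` (or `alpha`) toward the open-ended statistics and re-decoding at each setting is what produces the monotonic accuracy curves in Figure 6.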

Finding 3

Steering attention back to open-ended levels recovers accuracy, confirming the A→Y link. Spatial reallocation (where the model looks) yields universal gains, while scaling visual energy (how much it looks) primarily benefits grounding-heavy tasks, revealing that framing-induced errors stem from these attention failures.

Attention steering: multiplier plots and accuracy table
Figure 6. Accuracy improves monotonically as attention is steered closer to open-ended levels, with high Spearman correlations confirming the link.

Method

Building on the findings, we introduce a lightweight prompt-tuning approach to fix the attention misallocation.

Method overview: learning to re-align attention via prompt tuning
Figure 7. Method overview. Learnable tokens are appended to constrained framings and optimized to re-align visual attention patterns with those of open-ended settings.
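The core of the method, learnable tokens plus an attention-alignment objective, can be sketched as follows. This is a minimal PyTorch sketch under stated assumptions: the frozen VLM is treated as a black box that returns answer logits and answer-to-image attention, and `num_tokens`, `attn_weight`, and the interface names are illustrative, not the paper's exact hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FramingPromptTuner(nn.Module):
    """Learnable soft tokens appended to a constrained-framing prompt.

    Only these token embeddings are trained; the VLM stays frozen.
    """

    def __init__(self, embed_dim, num_tokens=8):
        super().__init__()
        self.soft_tokens = nn.Parameter(torch.randn(num_tokens, embed_dim) * 0.02)

    def forward(self, prompt_embeds):
        # prompt_embeds: (batch, seq_len, embed_dim) constrained-framing prompt.
        batch = prompt_embeds.size(0)
        soft = self.soft_tokens.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt_embeds, soft], dim=1)

def total_loss(task_loss, attn_constrained, attn_open, attn_weight=1.0):
    """Task loss plus alignment of the constrained-framing attention to the
    open-ended attention pattern, which is treated as a fixed target."""
    align = F.mse_loss(attn_constrained, attn_open.detach())
    return task_loss + attn_weight * align
```

At inference time, only the learned tokens are appended to Yes/No or MCQ prompts; no model weights change, which is what makes the method lightweight.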

Discussion

Does lightweight prompt-tuning recover attention?

Our learnable tokens are optimized to re-align the attention patterns of constrained framings with those of open-ended settings. We measure whether this intervention actually restores the visual engagement that framing suppresses. The results show that our method recovers visual attention, especially box attention, and further reduces the inconsistency rate.

Attention recovery and cross-framing inconsistency reduction
Figure 8. Our prompt-tuning approach recovers visual attention and reduces cross-framing inconsistency.

Does it improve performance across benchmarks?

Beyond recovering attention, we evaluate whether the restored visual grounding translates into accuracy gains across diverse VLMs and benchmarks. With the improved attention patterns, our method improves performance overall; note the strong gains on grounding-heavy tasks (Vstar).

Main results: performance across 5 VLMs and 7 benchmarks
Table 1. Main results. Our method consistently improves accuracy.

Which training objective plays the major role?

We ablate the loss components to understand which objective contributes most to the improvements. Without the attention alignment loss, performance drops significantly, confirming that the improvements come from better attention rather than merely more parameters or training data.

Ablation: loss functions
Table 2. Ablation on loss functions.

Qualitative Results

Qualitative: attention maps baseline vs ours
Figure A6. Qualitative comparison of attention maps: baseline vs. our method.

BibTeX

@article{fan2025tintedframes,
  title={Tinted Frames: Question Framing Blinds Vision-Language Models},
  author={Fan, Wan-Cyuan and Luo, Jiayun and Kutscher, Declan and Sigal, Leonid and Gupta, Ritwik},
  journal={arXiv preprint},
  year={2026}
}

Acknowledgements

This work was funded, in part, by the Vector Institute for AI, Canada CIFAR AI Chairs, NSERC Canada Research Chair (CRC), AML-TN UBC, and NSERC Discovery and Discovery Accelerator Supplement Grants. Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, the Digital Research Alliance of Canada (alliancecan.ca), companies sponsoring the Vector Institute, and Advanced Research Computing at the University of British Columbia. Additional hardware support was provided by John R. Evans Leaders Fund CFI grant and Compute Canada under the Resource Allocation Competition award. Ritwik Gupta and Declan Kutscher were supported in part by funding from the Department of Defense, The House Fund, and BAIR's industrial alliance programs. Additional compute was provided by the Department of Defense's High Performance Computing Modernization Program. We are immensely grateful to Bicheng Xu from UBC and Stephanie Fu from UCB for sharing their valuable suggestions in paper writing and experiments.