In-Depth and In-Breadth:

Pre-training Multimodal Language Models Customized for Comprehensive Chart Understanding

1UBC, 2Microsoft, 3Vector Institute for AI, 4CIFAR AI Chair

Abstract

Recent methods for customizing Large Vision Language Models (LVLMs) for domain-specific tasks have shown promising results in scientific chart comprehension. However, existing approaches face two major limitations: first, they rely on paired data from only a few chart types, limiting generalization to a wide range of chart types; second, they lack targeted pre-training for chart-data alignment, which hampers the model's understanding of the underlying data. In this paper, we introduce ChartScope, an LVLM optimized for in-depth chart comprehension across diverse chart types. We propose an efficient data generation pipeline that synthesizes paired data for a wide range of chart types, along with a novel Dual-Path training strategy that enables the model to succinctly capture essential data details while preserving robust reasoning capabilities by incorporating reasoning over the underlying data. Lastly, we establish ChartDQA, a new benchmark that evaluates not only question answering at different levels but also understanding of the underlying data. Experimental results demonstrate that ChartScope significantly enhances comprehension across a wide range of chart types.

ChartScope

Quadratic-scale data generation pipeline

Framework Diagram Placeholder

Our data generation leverages the strong text generation and coding abilities of current large language models (LLMs), e.g., GPT, to generate chart images and data. Specifically, LLMs allow us to synthesize raw data for charts, and the generated Python scripts then turn the raw data into chart images. In this way, we can produce image data without accessing costly multimodal LLMs. Unlike previous works that prompt LLMs to iteratively generate CSV data, QAs, and a Python script for each chart image -- a process that is costly to scale massively -- our pipeline generates code and data in parallel through shared templates and READMEs, ensuring consistent definitions and formats across charts of the same type.
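The sketch below illustrates the pairing idea under simplifying assumptions: the GPT calls that would synthesize raw data and per-type rendering scripts are replaced with hard-coded examples, and matplotlib stands in for the generated plotting code. The point is the cross product: every dataset is rendered by every compatible chart type, so the number of chart images grows with (chart types x datasets) while the generation effort grows only with (chart types + datasets).

```python
# Minimal sketch of the quadratic-scale pairing idea: chart-type renderers and
# raw datasets are produced independently (here mocked with hard-coded values
# instead of GPT calls), then crossed so every dataset is rendered by every
# compatible chart type.
import json
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # headless rendering
import matplotlib.pyplot as plt

# Raw data that the LLM would normally synthesize from a shared README/template.
DATASETS = [
    {"title": "Quarterly revenue", "labels": ["Q1", "Q2", "Q3", "Q4"],
     "values": [12.0, 15.5, 14.2, 18.9]},
    {"title": "Monthly users", "labels": ["Jan", "Feb", "Mar", "Apr"],
     "values": [320, 410, 390, 505]},
]

# Per-type renderers standing in for the LLM-generated Python scripts.
def render_bar(ax, data):
    ax.bar(data["labels"], data["values"])

def render_line(ax, data):
    ax.plot(data["labels"], data["values"], marker="o")

RENDERERS = {"bar": render_bar, "line": render_line}

out_dir = Path("charts")
out_dir.mkdir(exist_ok=True)

# Quadratic pairing: |chart types| x |datasets| images from linear LLM effort.
for chart_type, render in RENDERERS.items():
    for i, data in enumerate(DATASETS):
        fig, ax = plt.subplots(figsize=(4, 3))
        render(ax, data)
        ax.set_title(data["title"])
        stem = out_dir / f"{chart_type}_{i}"
        fig.savefig(stem.with_suffix(".png"), dpi=150)
        plt.close(fig)
        # Store the underlying data alongside the image for later QA generation.
        stem.with_suffix(".json").write_text(json.dumps(data, indent=2))
```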

Dual-Path training with augmented QAs

Unlike generic image understanding, chart image understanding requires the model not only to comprehend the underlying data of the chart but also to perform reasoning to obtain the final answer. To deepen the model's understanding, we introduce Dual-Path training, which builds on the general chart QA pairs by adding two types of augmented QAs: Data-driven QAs and JSON-only QAs. Data-driven QAs are multi-turn QAs that first prompt the model to extract the raw data as JSON from a chart and then answer the question based on the extracted JSON and the chart. JSON-only QAs are pure-text QAs. Our goal is to preserve the reasoning ability of LLMs when extending them to the chart domain.

Training Strategy Placeholder
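The snippet below is a minimal sketch of how the two augmented QA formats could be assembled from a chart image and its raw JSON. The LLaVA-style conversation schema and field names are assumptions for illustration, not the released training format.

```python
# Sketch (not the released training code) of assembling Dual-Path samples.
import json

def build_data_driven_qa(image_path, raw_data, question, answer):
    """Multi-turn sample: extract the JSON first, then answer using it."""
    data_str = json.dumps(raw_data)
    return {
        "image": image_path,
        "conversations": [
            {"from": "human",
             "value": "<image>\nExtract the underlying data of this chart as JSON."},
            {"from": "gpt", "value": data_str},
            {"from": "human",
             "value": f"Using the chart and the extracted data, answer: {question}"},
            {"from": "gpt", "value": answer},
        ],
    }

def build_json_only_qa(raw_data, question, answer):
    """Pure-text sample: no image, reasoning over the JSON alone."""
    data_str = json.dumps(raw_data)
    return {
        "conversations": [
            {"from": "human",
             "value": f"Given the data below, answer the question.\n"
                      f"Data: {data_str}\nQuestion: {question}"},
            {"from": "gpt", "value": answer},
        ],
    }

if __name__ == "__main__":
    data = {"labels": ["Q1", "Q2", "Q3", "Q4"], "values": [12.0, 15.5, 14.2, 18.9]}
    print(build_data_driven_qa("charts/bar_0.png", data,
                               "Which quarter had the highest revenue?", "Q4"))
    print(build_json_only_qa(data, "Which quarter had the highest revenue?", "Q4"))
```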

ChartDQA

Deep-dive QA for every chart, from literal to reasoned insights

Framework Diagram Placeholder

We propose ChartDQA, a benchmark derived from the aforementioned synthetic dataset. It includes 20 different chart types, three levels of question–answer pairs (literal, inferential, and reasoning), and provides both long and short answers. Notably, by including three levels of QAs for each chart image, ChartDQA enables assessment of a model's ability to understand charts at varying depths—much like a human would.
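For illustration, the snippet below shows what a single ChartDQA record might look like, with all three QA levels and both long and short answer forms, along with a simple per-level short-answer accuracy computation. The field names and the exact-match scoring are assumptions for this sketch and may differ from the released benchmark.

```python
# Hypothetical illustration of one ChartDQA record; field names are illustrative.
record = {
    "image": "charts/bar_0.png",
    "chart_type": "bar",
    "qas": {
        "literal":     {"question": "What is the value of Q2?",
                        "short_answer": "15.5",
                        "long_answer": "The Q2 bar reaches 15.5."},
        "inferential": {"question": "Which quarter has the highest value?",
                        "short_answer": "Q4",
                        "long_answer": "Q4 is the tallest bar at 18.9."},
        "reasoning":   {"question": "How much did the value grow from Q1 to Q4?",
                        "short_answer": "6.9",
                        "long_answer": "Q4 (18.9) minus Q1 (12.0) gives 6.9."},
    },
}

def accuracy_by_level(records, predict):
    """Per-level short-answer accuracy; `predict(image, question)` is a model call."""
    totals, correct = {}, {}
    for rec in records:
        for level, qa in rec["qas"].items():
            totals[level] = totals.get(level, 0) + 1
            pred = predict(rec["image"], qa["question"])
            if pred.strip().lower() == qa["short_answer"].strip().lower():
                correct[level] = correct.get(level, 0) + 1
    return {lvl: correct.get(lvl, 0) / totals[lvl] for lvl in totals}
```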

Assessing the comprehension of underlying data across all chart images

Beyond question answering, we also evaluate the model's understanding of the chart's underlying data. Specifically, for each chart image, we provide the raw data in both JSON and CSV formats, allowing assessment of the model's ability to recover the underlying data across various chart types. This ability is crucial for chart comprehension, as reasoning over the recovered data leads to more accurate answers.

Training Strategy Placeholder
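The sketch below shows one way the underlying-data evaluation could be scored: parse the model's extracted JSON and compare it against the ground-truth label-value pairs within a relative tolerance. The matching scheme and the 5% tolerance are assumptions for illustration, not the benchmark's official metric.

```python
# Minimal sketch of scoring underlying-data extraction against ground truth.
import json

def data_extraction_score(predicted_json_str, ground_truth, rel_tol=0.05):
    """Fraction of ground-truth (label, value) pairs recovered by the model."""
    try:
        pred = json.loads(predicted_json_str)
    except json.JSONDecodeError:
        return 0.0  # unparseable extraction counts as a miss
    gt_pairs = dict(zip(ground_truth["labels"], ground_truth["values"]))
    pred_pairs = dict(zip(pred.get("labels", []), pred.get("values", [])))
    hits = 0
    for label, value in gt_pairs.items():
        if label in pred_pairs:
            try:
                if abs(float(pred_pairs[label]) - float(value)) <= rel_tol * abs(float(value)):
                    hits += 1
            except (TypeError, ValueError):
                pass  # non-numeric prediction for this cell
    return hits / len(gt_pairs)

if __name__ == "__main__":
    gt = {"labels": ["Q1", "Q2"], "values": [12.0, 15.5]}
    print(data_extraction_score('{"labels": ["Q1", "Q2"], "values": [12.1, 15.0]}', gt))
```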

Acknowledgements

This work was funded, in part, by the Vector Institute for AI, Canada CIFAR AI Chairs, the NSERC Canada Research Chair (CRC) program, and NSERC Discovery and Discovery Accelerator Supplement Grants. Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, the Digital Research Alliance of Canada, companies sponsoring the Vector Institute, and Advanced Research Computing at the University of British Columbia. Additional hardware support was provided by a John R. Evans Leaders Fund CFI grant and by Compute Canada under a Resource Allocation Competition award.