MathVerse

Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?

1CUHK MMLab, 2Shanghai AI Laboratory, 3CUHK MiuLar Lab,
4University of California, Los Angeles

Introduction

The remarkable progress of Multi-modal Large Language Models (MLLMs) has garnered unparalleled attention due to their superior performance in visual contexts. However, their capabilities in visual math problem-solving remain insufficiently evaluated and understood. We find that current benchmarks incorporate excessive visual content within their textual questions, which may assist MLLMs in deducing answers without truly interpreting the input diagrams.

To this end, we introduce MathVerse, an all-around visual math benchmark designed for an equitable and in-depth evaluation of MLLMs. We meticulously collect 2,612 high-quality, multi-subject math problems with diagrams from publicly available sources. Each problem is then transformed by human annotators into six distinct versions, each offering a different degree of information content across modalities, contributing 15K test samples in total. This approach allows MathVerse to comprehensively assess whether, and how much, MLLMs can truly understand visual diagrams for mathematical reasoning. In addition, we propose a Chain-of-Thought (CoT) evaluation strategy for a fine-grained assessment of the output answers. Rather than naively judging True or False, we employ GPT-4(V) to adaptively extract crucial reasoning steps and then score each step with a detailed error analysis, revealing the quality of the intermediate CoT reasoning produced by MLLMs.
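
To make the CoT evaluation strategy concrete, below is a minimal sketch of how such a two-stage, step-wise judgment could be wired up with the OpenAI Python client. This is only an illustration, not the exact prompts or pipeline used by MathVerse; the judge model name, prompt wording, and 0-to-1 scoring scale are assumptions.

```python
# Minimal sketch of a CoT-style, step-wise evaluation loop.
# Prompts, judge model, and scoring scale are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract_key_steps(question: str, model_answer: str) -> str:
    """Ask the judge model to pull out the crucial reasoning steps."""
    prompt = (
        "Extract the crucial reasoning steps from the answer below, "
        "one step per line.\n\n"
        f"Question: {question}\n\nAnswer: {model_answer}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumption: any GPT-4-class judge model
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def score_steps(question: str, ground_truth: str, steps: str) -> str:
    """Ask the judge to score each step with a short error analysis."""
    prompt = (
        "For each reasoning step below, give a score in [0, 1] and a one-line "
        "error analysis, then an overall score.\n\n"
        f"Question: {question}\nGround truth: {ground_truth}\n\nSteps:\n{steps}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```

A real pipeline would additionally parse the judge's output into numeric per-step scores and aggregate them into a single score per answer.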

With MathVerse, we unveil that most existing MLLMs struggle to understand math diagrams and rely heavily on the textual questions. Surprisingly, some of them even achieve over 5% higher accuracy without the visual input, e.g., Qwen-VL-Max and InternLM-XComposer2. In contrast, GPT-4V and ShareGPT4V demonstrate relatively better comprehension of the visual content for mathematical reasoning. We hope the MathVerse benchmark provides unique insights to guide the future development of MLLMs.

Leaderboard

Accuracy scores on the testmini subset of MathVerse.

# Model Method Source Date ALL Text Dominant Text Lite Text Only Vision Intensive Vision Dominant Vision Only
Each score column reports two numbers: CoT-E (with the CoT evaluation strategy) and w/o (without it).
1 GPT-4V 🥇 MLLM 🖼️ Link 2023-12-26 53.6 38.9 63.1 52.1 56.6 40.9 60.3 46.1 51.4 34.9 50.8 33.6 50.3 29.8
2 Qwen-VL-Max 🥈 MLLM 🖼️ Link 2023-12-26 37.5 24.0 42.8 30.3 37.7 24.8 47.9 32.2 33.6 20.6 35.9 23.3 35.9 25.1
3 LLaVA-NeXT-34B 🥉 MLLM 🖼️ Link 2024-01-30 34.6 23.8 49.0 33.8 37.6 25.5 30.1 21.3 35.2 23.5 28.9 20.3 22.4 15.7
4 Gemini-Pro MLLM 🖼️ Link 2023-12-26 34.5 23.7 39.8 27.6 34.7 23.7 44.5 27.9 32.0 19.4 36.8 20.3 33.3 20.5
5 InternLM-XComposer2-VL-7B MLLM 🖼️ Link 2024-01-22 27.4 19.2 36.9 20.2 28.3 14.3 42.5 24.5 20.1 14.2 24.4 17.5 19.8 15.2
6 SPHINX-MoE MoE 🤖 Link 2023-10-03 26.4 18.4 33.3 26.2 21.9 17.4 40.7 26.7 21.1 16.7 19.6 12.5 18.3 11.1
7 Qwen-VL-Plus MLLM 🖼️ Link 2023-12-26 21.3 11.8 26.0 15.7 21.2 11.1 25.2 14.5 18.5 9.0 19.1 13.0 21.8 10.0
8 ShareGPT4V-13B MLLM 🖼️ Link 2023-12-26 17.4 13.1 21.8 16.2 20.6 16.2 14.6 6.6 18.6 15.5 16.2 13.8 9.7 3.7
9 LLaVA-NeXT-13B MLLM 🖼️ Link 2024-01-13 17.2 10.3 21.6 12.8 19.7 12.0 25.1 9.9 17.6 10.7 14.9 9.7 12.1 6.3
10 G-LLaVA-7B MLLM 🖼️ Link 2023-12-26 15.7 16.6 22.2 20.9 20.4 20.7 21.6 21.1 16.5 17.2 12.7 14.6 6.6 9.4
11 SPHINX-Plus MLLM 🖼️ Link 2023-12-26 14.0 12.2 16.3 13.9 12.8 11.6 15.8 14.9 12.9 11.6 14.7 13.5 13.2 10.4
12 LLaVA-1.5-13B MLLM 🖼️ Link 2023-12-26 12.7 7.6 17.1 8.8 12.0 7.6 22.6 11.5 12.6 7.4 12.7 7.4 9.0 6.9
13 MiniGPT-v2-7B MLLM 🖼️ Link 2023-12-26 10.9 11.0 13.2 12.1 12.7 12.0 15.3 11.7 11.1 13.1 11.3 10.3 6.4 7.4
14 mPLUG-Owl2-7B MLLM 🖼️ Link 2023-12-26 10.3 4.6 11.6 6.6 11.4 6.3 13.8 6.1 11.1 6.3 9.4 5.6 8.0 4.9
15 ImageBind-LLM MLLM 🖼️ Link 2023-12-26 10.0 9.3 13.2 11.4 11.6 11.3 12.9 11.7 9.8 8.9 11.8 11.2 3.5 3.4
16 LLaMA-Adapter V2 MLLM 🖼️ Link 2023-12-26 5.8 5.7 7.8 6.2 6.3 5.9 3.9 2.7 6.2 6.1 4.5 4.2 4.4 6.1
- ChatGPT LLM 📄 Link 2023-10-03 - - 51.3 33.3 38.5 18.9 51.3 33.3 - - - - - -
- GPT-4 LLM 📄 Link 2023-10-03 - - 63.4 46.5 40.7 20.7 63.4 46.5 - - - - - -
- Human Performance* - Link 2023-10-03 - 64.9 - 71.2 - 70.9 - 41.7 - 61.4 - 68.3 - 66.7
- Random Chance - Link 2023-10-03 - 12.4 - 12.4 - 12.4 - 12.4 - 12.4 - 12.4 - 12.4
Human Performance*: Average performance of human annotators who are college students.
Method types: MLLM 🖼️: Multi-modal Large Language Model; MoE 🤖: Mixture of Experts; LLM 📄: Large Language Model.
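
For readers who want to build this kind of per-version comparison from their own model outputs, the sketch below shows one way to aggregate per-sample judgments into CoT-E and w/o scores for each problem version. The results file name and record fields (`problem_version`, `cot_e_score`, `correct`) are assumed for illustration, not an official MathVerse schema.

```python
# Sketch: aggregate per-sample judgments into per-version scores.
# Field names and file layout are illustrative assumptions.
import json
from collections import defaultdict

with open("results.json") as f:          # hypothetical per-sample results file
    records = json.load(f)

cot_e = defaultdict(list)   # CoT evaluation scores per problem version
w_o = defaultdict(list)     # plain True/False accuracy per problem version

for r in records:
    cot_e[r["problem_version"]].append(r["cot_e_score"])        # e.g. in [0, 1]
    w_o[r["problem_version"]].append(1.0 if r["correct"] else 0.0)

for version in sorted(cot_e):
    print(
        f"{version:17s} "
        f"CoT-E {100 * sum(cot_e[version]) / len(cot_e[version]):.1f}  "
        f"w/o {100 * sum(w_o[version]) / len(w_o[version]):.1f}"
    )
```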

MathVerse Dataset

Overview

MathVerse is a holistic, specialized visual math benchmark crafted to evaluate the multi-modal mathematical reasoning skills of MLLMs. The benchmark comprises a meticulously collected set of 2,612 visual math problems, with 1,236 newly acquired from public question repositories and 1,376 selected from existing benchmarks, ensuring a diverse range of challenges. To specialize in mathematical reasoning, MathVerse spans three primary subjects: plane geometry, solid geometry, and functions. Each problem has been rigorously reviewed by expert annotators and classified into one of 12 fine-grained categories, each emphasizing a different problem-solving capability. Notably, MathVerse distinguishes itself by introducing two novel strategies for evaluating MLLMs.

You can download the dataset from Hugging Face.
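
A minimal way to load the benchmark with the Hugging Face datasets library is sketched below; the dataset id and configuration/split names are assumptions here, so check the Hugging Face page for the exact identifiers.

```python
# Sketch: load the MathVerse testmini split via the Hugging Face datasets library.
# The dataset id, config, and split names are assumptions; verify them on the hub page.
from datasets import load_dataset

ds = load_dataset("AI4Math/MathVerse", "testmini", split="testmini")
print(len(ds))          # number of test samples
print(ds[0].keys())     # inspect the available fields for one sample
```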


Key statistics of MathVerse.


Subject distribution of MathVerse.
Solid G: Solid Geometry, Plane G: Plane Geometry.

Examples

One example for each subfield in MathVerse



Comparison of the six problem versions in MathVerse

Visualization

Experiment Results

Results on Existing Foundation Models

Visualization Examples