MathVerse

Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?

1CUHK MMLab, 2Shanghai AI Laboratory, 3CUHK MiuLar Lab,
4University of California, Los Angeles

Introduction

The remarkable progress of Multi-modal Large Language Models (MLLMs) has garnered unparalleled attention due to their superior performance in visual contexts. However, their capabilities in visual math problem-solving remain insufficiently evaluated and understood. We find that current benchmarks incorporate excessive visual content within their textual questions, which may assist MLLMs in deducing answers without truly interpreting the input diagrams.

To this end, we introduce MathVerse, an all-around visual math benchmark designed for an equitable and in-depth evaluation of MLLMs. We meticulously collect 2,612 high-quality, multi-subject math problems with diagrams from publicly available sources. Each problem is then transformed by human annotators into six distinct versions, each offering a different degree of information content across modalities, contributing 15K test samples in total. This approach allows MathVerse to comprehensively assess whether, and how much, MLLMs can truly understand visual diagrams for mathematical reasoning. In addition, we propose a Chain-of-Thought (CoT) evaluation strategy for a fine-grained assessment of the output answers. Rather than naively judging True or False, we employ GPT-4(V) to adaptively extract crucial reasoning steps and then score each step with a detailed error analysis, revealing the quality of the intermediate CoT reasoning produced by MLLMs.
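
To make the CoT evaluation strategy concrete, below is a minimal sketch of how such a two-stage, step-wise judgment could be wired up with the OpenAI Python client. This is only an illustration, not the exact prompts or pipeline used by MathVerse; the judge model name, prompt wording, and 0-to-1 scoring scale are assumptions.

```python
# Minimal sketch of a CoT-style, step-wise evaluation loop.
# Prompts, judge model, and scoring scale are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract_key_steps(question: str, model_answer: str) -> str:
    """Ask the judge model to pull out the crucial reasoning steps."""
    prompt = (
        "Extract the crucial reasoning steps from the answer below, "
        "one step per line.\n\n"
        f"Question: {question}\n\nAnswer: {model_answer}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumption: any GPT-4-class judge model
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def score_steps(question: str, ground_truth: str, steps: str) -> str:
    """Ask the judge to score each step with a short error analysis."""
    prompt = (
        "For each reasoning step below, give a score in [0, 1] and a one-line "
        "error analysis, then an overall score.\n\n"
        f"Question: {question}\nGround truth: {ground_truth}\n\nSteps:\n{steps}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```

A real pipeline would additionally parse the judge's output into numeric per-step scores and aggregate them into a single score per answer.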

With MathVerse, we unveil that most existing MLLMs struggle to understand math diagrams and rely heavily on the textual questions. Surprisingly, some of them even achieve over 5% higher accuracy without the visual input, e.g., Qwen-VL-Max and InternLM-XComposer2. In contrast, GPT-4V and ShareGPT4V demonstrate relatively better comprehension of the visual content for mathematical reasoning. We hope the MathVerse benchmark provides unique insights to guide the future development of MLLMs.

Leaderboard

Accuracy scores on the testmini subset of MathVerse.

# Model Method Source Date ALL Text Dominant Text Lite Text Only Vision Intensive Vision Dominant Vision Only
Each score column reports two numbers: CoT-E (with the CoT evaluation strategy) and w/o (without it).
1 GPT-4V 🥇 MLLM 🖼️ Link 2023-12-26 53.6 38.9 63.1 52.1 56.6 40.9 60.3 46.1 51.4 34.9 50.8 33.6 50.3 29.8
2 Qwen-VL-Max 🥈 MLLM 🖼️ Link 2023-12-26 37.5 24.0 42.8 30.3 37.7 24.8 47.9 32.2 33.6 20.6 35.9 23.3 35.9 25.1
3 LLaVA-NeXT-34B 🥉 MLLM 🖼️ Link 2024-01-30 34.6 23.8 49.0 33.8 37.6 25.5 30.1 21.3 35.2 23.5 28.9 20.3 22.4 15.7
4 Gemini-Pro MLLM 🖼️ Link 2023-12-26 34.5 23.7 39.8 27.6 34.7 23.7 44.5 27.9 32.0 19.4 36.8 20.3 33.3 20.5
5 InternLM-XComposer2-VL-7B MLLM 🖼️ Link 2024-01-22 27.4 19.2 36.9 20.2 28.3 14.3 42.5 24.5 20.1 14.2 24.4 17.5 19.8 15.2
6 SPHINX-MoE MoE 🤖 Link 2023-10-03 26.4 18.4 33.3 26.2 21.9 17.4 40.7 26.7 21.1 16.7 19.6 12.5 18.3 11.1
7 Qwen-VL-Plus MLLM 🖼️ Link 2023-12-26 21.3 11.8 26.0 15.7 21.2 11.1 25.2 14.5 18.5 9.0 19.1 13.0 21.8 10.0
8 ShareGPT4V-13B MLLM 🖼️ Link 2023-12-26 17.4 13.1 21.8 16.2 20.6 16.2 14.6 6.6 18.6 15.5 16.2 13.8 9.7 3.7
9 LLaVA-NeXT-13B MLLM 🖼️ Link 2024-01-13 17.2 10.3 21.6 12.8 19.7 12.0 25.1 9.9 17.6 10.7 14.9 9.7 12.1 6.3
10 G-LLaVA-7B MLLM 🖼️ Link 2023-12-26 15.7 16.6 22.2 20.9 20.4 20.7 21.6 21.1 16.5 17.2 12.7 14.6 6.6 9.4
11 SPHINX-Plus MLLM 🖼️ Link 2023-12-26 14.0 12.2 16.3 13.9 12.8 11.6 15.8 14.9 12.9 11.6 14.7 13.5 13.2 10.4
12 LLaVA-1.5-13B MLLM 🖼️ Link 2023-12-26 12.7 7.6 17.1 8.8 12.0 7.6 22.6 11.5 12.6 7.4 12.7 7.4 9.0 6.9
13 MiniGPT-v2-7B MLLM 🖼️ Link 2023-12-26 10.9 11.0 13.2 12.1 12.7 12.0 15.3 11.7 11.1 13.1 11.3 10.3 6.4 7.4
14 mPLUG-Owl2-7B MLLM 🖼️ Link 2023-12-26 10.3 4.6 11.6 6.6 11.4 6.3 13.8 6.1 11.1 6.3 9.4 5.6 8.0 4.9
15 ImageBind-LLM MLLM 🖼️ Link 2023-12-26 10.0 9.3 13.2 11.4 11.6 11.3 12.9 11.7 9.8 8.9 11.8 11.2 3.5 3.4
16 LLaMA-Adapter V2 MLLM 🖼️ Link 2023-12-26 5.8 5.7 7.8 6.2 6.3 5.9 3.9 2.7 6.2 6.1 4.5 4.2 4.4 6.1
- ChatGPT LLM 📄 Link 2023-10-03 - - 51.3 33.3 38.5 18.9 51.3 33.3 - - - - - -
- GPT-4 LLM 📄 Link 2023-10-03 - - 63.4 46.5 40.7 20.7 63.4 46.5 - - - - - -
- Human Performance* - Link 2023-10-03 - 64.9 - 71.2 - 70.9 - 41.7 - 61.4 - 68.3 - 66.7
- Random Chance - Link 2023-10-03 - 12.4 - 12.4 - 12.4 - 12.4 - 12.4 - 12.4 - 12.4
Human Performance*: Average performance of human annotators who are college students.
Method types: MLLM 🖼️: Multi-modal Large Language Model; MoE 🤖: Mixture of Experts; LLM 📄: Large Language Model.
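
For readers who want to build this kind of per-version comparison from their own model outputs, the sketch below shows one way to aggregate per-sample judgments into CoT-E and w/o scores for each problem version. The results file name and record fields (`problem_version`, `cot_e_score`, `correct`) are assumed for illustration, not an official MathVerse schema.

```python
# Sketch: aggregate per-sample judgments into per-version scores.
# Field names and file layout are illustrative assumptions.
import json
from collections import defaultdict

with open("results.json") as f:          # hypothetical per-sample results file
    records = json.load(f)

cot_e = defaultdict(list)   # CoT evaluation scores per problem version
w_o = defaultdict(list)     # plain True/False accuracy per problem version

for r in records:
    cot_e[r["problem_version"]].append(r["cot_e_score"])        # e.g. in [0, 1]
    w_o[r["problem_version"]].append(1.0 if r["correct"] else 0.0)

for version in sorted(cot_e):
    print(
        f"{version:17s} "
        f"CoT-E {100 * sum(cot_e[version]) / len(cot_e[version]):.1f}  "
        f"w/o {100 * sum(w_o[version]) / len(w_o[version]):.1f}"
    )
```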

MathVerse Dataset

Overview

MathVerse is a holistic, specialized visual math benchmark crafted to evaluate the multi-modal mathematical reasoning skills of MLLMs. The benchmark comprises a meticulously collected set of 2,612 visual math problems, with 1,236 newly acquired from public question repositories and 1,376 selected from existing benchmarks, ensuring a diverse range of challenges. To specialize in mathematical reasoning, MathVerse spans three primary subjects: plane geometry, solid geometry, and functions. Each problem has been rigorously reviewed by expert annotators and classified into one of 12 fine-grained categories, each emphasizing a different problem-solving capability. Notably, MathVerse distinguishes itself by introducing two novel strategies for evaluating MLLMs.

You can download the dataset from Hugging Face.
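
A minimal way to load the benchmark with the Hugging Face datasets library is sketched below; the dataset id and configuration/split names are assumptions here, so check the Hugging Face page for the exact identifiers.

```python
# Sketch: load the MathVerse testmini split via the Hugging Face datasets library.
# The dataset id, config, and split names are assumptions; verify them on the hub page.
from datasets import load_dataset

ds = load_dataset("AI4Math/MathVerse", "testmini", split="testmini")
print(len(ds))          # number of test samples
print(ds[0].keys())     # inspect the available fields for one sample
```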


Key statistics of MathVerse.


Subject distribution of MathVerse.
Solid G: Solid Geometry, Plane G: Plane Geometry.

Examples

One example for each subfield in MathVerse



Comparison of the six problem versions in MathVerse

Visualization

Experiment Results

Results on Existing Foundation Models

Visualization Examples