OmniGUI

A Step-Level GUI Agent Benchmark in Omni-Modal Smartphone Environments

Introduction

Current benchmarks for graphical user interface (GUI) agents predominantly rely on static screenshots. However, real-world smartphone interaction routinely requires agents to process transient audio cues and temporal video dynamics that are tightly coupled with the moment of action. To bridge this gap, we introduce OmniGUI, the first step-level benchmark designed to evaluate GUI agents in omni-modal smartphone environments. OmniGUI provides continuous, interleaved multimodal inputs (comprising static images, synchronous audio, and video clips) at every action step. The dataset encompasses 709 expert-demonstrated episodes (2,579 action steps) across 29 applications, systematically annotated with objective multimodal dependency levels. Because dedicated omni-modal GUI agent frameworks are currently in their nascent stage, we select foundational omni-modal models capable of natively processing interleaved inputs to serve as agent proxies for our initial baselines. Our empirical evaluation reveals that while current models exhibit competency on visually static tasks, their action prediction performance degrades significantly in environments requiring synchronous temporal and auditory signals. Furthermore, ablation studies isolate specific operational bottlenecks, notably cross-modal interference when processing task-irrelevant environmental noise.
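To make the step-level setup concrete, the sketch below shows one plausible way to represent a single OmniGUI action step as a data record. The field names and structure are illustrative assumptions, not the released dataset format.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class OmniGUIStep:
    """Hypothetical record for one action step; field names are
    illustrative assumptions, not the official dataset schema."""
    episode_id: str
    step_index: int
    instruction: str                    # natural-language task goal
    screenshot_path: str                # static screen image at this step
    audio_path: Optional[str] = None    # synchronous audio clip, if any
    video_path: Optional[str] = None    # short temporal video clip, if any
    action_history: list = field(default_factory=list)  # prior actions
    gold_action: dict = field(default_factory=dict)     # e.g. {"type": "TAP", "point": [540, 1200]}
```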

Leaderboard

#  Model               | Localization        | Semantic Underst.   | Cross-modal Disc.   | Temporal Reasoning  | Instant Response    | Overall
                       |   TM   EM   SR   GP |   TM   EM   SR   GP |   TM   EM   SR   GP |   TM   EM   SR   GP |   TM   EM   SR   GP |   TM   EM   SR   GP
1  Gemini 3 Pro        | 86.3 76.2 55.9 62.6 | 77.4 61.1 31.4 42.0 | 76.6 59.1 30.1 41.3 | 78.9 61.0 22.7 36.9 | 81.8 62.7 27.6 35.6 | 80.0 63.6 33.4 43.6
2  Gemini 3 Flash      | 85.0 75.6 53.1 63.1 | 75.3 58.5 25.5 41.1 | 72.8 56.0 23.5 38.7 | 80.0 60.3 25.3 39.4 | 79.2 57.9 22.8 34.2 | 78.3 61.3 30.3 43.5
3  Gemini 2.5 Pro      | 86.1 58.1 31.7 41.5 | 72.8 37.7 11.7 22.4 | 70.6 40.1 13.2 25.1 | 73.8 44.3  9.7 22.5 | 76.6 42.1 11.0 19.5 | 75.7 44.1 15.5 26.3
4  Gemini 2.5 Flash    | 75.1 50.9 29.0 42.6 | 70.4 34.3  8.0 18.2 | 64.9 35.7 11.8 25.3 | 67.7 35.1  9.1 21.8 | 71.0 34.5  3.9 13.7 | 69.5 37.8 12.4 24.5
5  Qwen3-Omni          | 65.7 42.4 10.3 28.5 | 58.3 29.6  2.9 14.0 | 57.9 26.2  2.2 13.2 | 66.2 31.1  5.8 16.8 | 67.4 33.7  3.9 13.7 | 63.1 32.3  5.1 17.4
6  VITA-1.5            | 48.4 14.8  2.8  3.9 | 43.4 16.4  2.2  3.2 | 33.9 11.5  0.0  0.8 | 35.4  7.7  0.6  2.0 | 36.9 10.3  0.0  0.8 | 39.3 12.1  1.1  2.2
7  MiniCPM-o-4.5       | 34.8  7.4  0.7  2.2 | 34.7  5.5  0.0  1.0 | 25.2  4.4  0.0  2.2 | 34.8  3.9  0.0  0.6 | 33.3  3.2  0.0  0.8 | 32.8  4.8  0.1  1.4
8  Baichuan-Omni-1.5   | 19.5  4.9  0.0  1.0 | 16.2  4.0  0.0  0.5 | 12.9  1.4  0.0  0.0 | 18.2  2.3  0.0  0.2 | 18.2  4.1  0.0  0.5 | 17.0  3.3  0.0  0.4

* TM: Type Match, EM: Exact Match, SR: Success Rate, GP: Goal Progress.
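As a reading aid, here is a minimal sketch of how these four metrics could be computed from predicted and gold actions. The matching rules (typed-text equality, tap point inside the gold bounding box) and the prefix-based reading of Goal Progress are assumptions; the paper's official scoring may differ.

```python
def type_match(pred: dict, gold: dict) -> bool:
    """TM: the predicted action type (TAP, TYPE, ...) equals the gold type."""
    return pred.get("type") == gold.get("type")

def exact_match(pred: dict, gold: dict) -> bool:
    """EM: type and arguments both match. The per-type rules below are
    assumptions, not the official matching criteria."""
    if not type_match(pred, gold):
        return False
    if gold["type"] == "TYPE":
        return pred.get("text", "").strip() == gold.get("text", "").strip()
    if gold["type"] == "TAP":
        x, y = pred.get("point", (-1, -1))
        x0, y0, x1, y1 = gold["bbox"]
        return x0 <= x <= x1 and y0 <= y <= y1
    return pred == gold

def episode_scores(preds: list, golds: list) -> tuple:
    """SR: every step of the episode exactly matches.
    GP: fraction of consecutive correct steps before the first error
    (one common definition of goal progress; an assumption here)."""
    if not golds:
        return False, 0.0
    correct_prefix = 0
    for p, g in zip(preds, golds):
        if not exact_match(p, g):
            break
        correct_prefix += 1
    return correct_prefix == len(golds), correct_prefix / len(golds)
```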

OmniGUI

Example

OmniGUI Framework Overview: A GUI agent completes smartphone tasks across multiple steps based on multimodal instructions. At each step, it processes synchronous inputs—static screenshots, real-time audio, temporal video clips, and action history—to predict the next action (e.g., TYPE, TAP). Performance is evaluated using Type Match (TM) and Exact Match (EM) against human demonstrations.
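The per-step interaction can be pictured as the loop sketched below, with a foundational omni-modal model serving as the agent proxy and consuming per-step fields like those in the record sketched earlier. `call_omni_model` is a placeholder for any provider API that accepts interleaved parts; the prompt layout and JSON action format are illustrative assumptions.

```python
import json

def predict_next_action(step, call_omni_model) -> dict:
    """One agent step: pack the interleaved inputs (text, image, and any
    synchronous audio/video) into a single request and parse the model's
    JSON action. `call_omni_model` stands in for a provider API."""
    parts = [
        {"text": f"Task: {step.instruction}"},
        {"text": f"Action history: {step.action_history}"},
        {"image": step.screenshot_path},        # static screenshot
    ]
    if step.audio_path:                         # real-time audio cue, when present
        parts.append({"audio": step.audio_path})
    if step.video_path:                         # temporal video clip, when present
        parts.append({"video": step.video_path})
    parts.append({"text": 'Answer with JSON only, e.g. {"type": "TAP", "point": [540, 1200]}.'})
    return json.loads(call_omni_model(parts))
```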

Statistics


(a) Application & Language Distribution: Overview of the 29 apps, language split (Chinese/English), and data scale (episodes/steps).
(b) Task Dimension Distribution: Distribution of episodes and steps across five key task dimensions.
(c) Modality Dependency: Proportion of AV-Critical, AV-Supportive, and AV-Present tasks.

Dataset Breakdown


Comprehensive breakdown of the OmniGUI dataset per application: The table reports the volume of episodes (Ep.) and total steps (Stp.) alongside their exact distribution across the five Task Dimensions (Loc. = Localization, Sem. = Semantic Understanding, Cro. = Cross-modal Discrimination, Tem. = Temporal Reasoning, Ins. = Instant Response) and the three Multimodal Dependency levels (Cri. = AV-Critical, Sup. = AV-Supportive, Pre. = AV-Present).

Benchmark Comparison


Comparison of GUI agent benchmarks: OmniGUI distinguishes itself by supporting comprehensive multimodal inputs (Image, Video, Audio) at a per-step level, along with manual annotations, unlike previous vision-only or partially multimodal datasets.

Action Space


Action space of our benchmark.
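For illustration, a GUI action space of this kind is often encoded as a small enum plus per-action arguments. TAP and TYPE appear in the framework overview above; the remaining members below are common GUI-agent actions included as assumptions, not the benchmark's official list (see the figure for that).

```python
from enum import Enum

class Action(Enum):
    """Hypothetical encoding of the action space. Only TAP and TYPE are
    confirmed by the framework overview; the rest are assumed."""
    TAP = "tap"        # tap at a screen coordinate
    TYPE = "type"      # enter text into the focused field
    SWIPE = "swipe"    # directional swipe/scroll (assumed)
    BACK = "back"      # system back navigation (assumed)
    WAIT = "wait"      # idle until the UI settles (assumed)
    DONE = "done"      # declare the task complete (assumed)
```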

Experimental Result Analysis

Model Capability Fingerprints


Model capabilities across the five task dimensions.

Performance of the Strongest Baseline (Gemini 3 Pro) Across Applications

Performance of Gemini 3 Pro disaggregated by application: The horizontal bar chart reports the Exact Match (EM), Goal Progress (GP), and Success Rate (SR) metrics, sorted in ascending order of EM from bottom to top. The overall benchmark averages (EM=66.4%, GP=43.3%) are indicated by vertical dotted lines.

Citation