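| # | Model | Localization TM | Localization EM | Localization SR | Localization GP | Semantic Understanding TM | Semantic Understanding EM | Semantic Understanding SR | Semantic Understanding GP | Cross-modal Discrimination TM | Cross-modal Discrimination EM | Cross-modal Discrimination SR | Cross-modal Discrimination GP | Temporal Reasoning TM | Temporal Reasoning EM | Temporal Reasoning SR | Temporal Reasoning GP | Instant Response TM | Instant Response EM | Instant Response SR | Instant Response GP | Overall TM | Overall EM | Overall SR | Overall GP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|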
| 1 | Gemini 3.1 Pro | 88.1 | 78.6 | 58.3 | 66.5 | 80.2 | 62.9 | 31.9 | 41.8 | 81.3 | 64.1 | 39.1 | 46.6 | 82.9 | 62.3 | 24.7 | 36.6 | 85.4 | 65.1 | 32.0 | 40.5 | 83.6 | 66.6 | 37.2 | 46.4 |
| 2 | Gemini 3.0 Pro | 88.9 | 79.9 | 55.9 | 62.6 | 79.6 | 67.0 | 31.2 | 41.7 | 75.0 | 59.9 | 28.3 | 39.5 | 78.3 | 61.8 | 22.9 | 36.3 | 81.9 | 63.3 | 27.3 | 35.8 | 80.7 | 66.4 | 33.1 | 43.3 |
| 3 | Gemini 3.0 Flash | 86.8 | 79.7 | 53.1 | 63.1 | 76.3 | 63.4 | 25.4 | 40.8 | 71.1 | 56.6 | 22.5 | 37.1 | 77.3 | 60.2 | 25.5 | 38.3 | 79.3 | 58.7 | 23.4 | 34.9 | 78.2 | 63.9 | 30.2 | 43.0 |
| 4 | Gemini 2.5 Pro | 87.7 | 62.7 | 31.7 | 41.5 | 72.7 | 42.9 | 10.9 | 21.5 | 68.7 | 41.7 | 12.3 | 24.1 | 71.9 | 45.9 | 10.2 | 22.3 | 76.0 | 42.8 | 11.7 | 20.1 | 75.4 | 47.4 | 15.4 | 26.0 |
| 5 | Gemini 2.5 Flash | 77.8 | 57.0 | 29.0 | 42.6 | 69.3 | 36.9 | 8.0 | 18.1 | 62.7 | 38.0 | 11.6 | 24.6 | 65.2 | 36.2 | 8.9 | 20.8 | 70.9 | 35.0 | 3.9 | 13.9 | 69.1 | 40.7 | 12.5 | 24.2 |
| 6 | Qwen3-Omni | 65.4 | 43.9 | 10.3 | 28.5 | 61.5 | 32.1 | 2.9 | 13.9 | 52.2 | 24.6 | 2.2 | 12.5 | 63.8 | 32.4 | 6.4 | 16.5 | 67.5 | 33.3 | 3.9 | 14.0 | 62.1 | 33.4 | 5.2 | 17.2 |
| 7 | VITA-1.5 | 51.1 | 16.5 | 2.8 | 3.9 | 44.3 | 19.0 | 2.2 | 3.2 | 33.9 | 13.4 | 0.0 | 0.0 | 33.5 | 8.1 | 0.6 | 1.9 | 37.2 | 11.8 | 0.0 | 0.8 | 40.0 | 13.7 | 1.1 | 2.2 |
| 8 | MiniCPM-o-4.5 | 34.1 | 7.7 | 0.7 | 2.2 | 30.8 | 4.7 | 0.0 | 1.0 | 23.5 | 3.9 | 0.0 | 2.1 | 33.7 | 4.4 | 0.0 | 0.8 | 32.4 | 3.3 | 0.0 | 0.8 | 31.0 | 4.8 | 0.1 | 1.4 |
| 9 | Baichuan-Omni-1.5 | 17.6 | 4.7 | 0.0 | 1.0 | 14.7 | 4.0 | 0.0 | 0.5 | 11.5 | 1.0 | 0.0 | 0.0 | 17.2 | 2.8 | 0.0 | 0.0 | 18.0 | 4.0 | 0.0 | 0.2 | 15.8 | 3.3 | 0.0 | 0.4 |
* TM: Type Match, EM: Exact Match, SR: Success Rate, GP: Goal Progress.