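| # | Model | Localization TM | Localization EM | Localization SR | Localization GP | Semantic Understanding TM | Semantic Understanding EM | Semantic Understanding SR | Semantic Understanding GP | Cross-modal Discrimination TM | Cross-modal Discrimination EM | Cross-modal Discrimination SR | Cross-modal Discrimination GP | Temporal Reasoning TM | Temporal Reasoning EM | Temporal Reasoning SR | Temporal Reasoning GP | Instant Response TM | Instant Response EM | Instant Response SR | Instant Response GP | Overall TM | Overall EM | Overall SR | Overall GP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|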
| 1 | Gemini 3.1 Pro | 87.9 | 77.8 | 57.9 | 66.2 | 80.0 | 61.9 | 31.4 | 41.4 | 81.1 | 63.0 | 38.3 | 46.1 | 82.9 | 61.8 | 24.4 | 36.4 | 85.7 | 64.8 | 31.8 | 40.3 | 83.3 | 65.4 | 36.7 | 46.1 |
| 2 | Gemini 3.0 Pro | 86.3 | 76.2 | 55.9 | 62.6 | 77.4 | 61.1 | 31.4 | 42.0 | 76.8 | 59.3 | 29.8 | 41.2 | 79.1 | 60.7 | 23.1 | 36.2 | 81.4 | 63.1 | 27.1 | 35.7 | 80.0 | 63.7 | 33.5 | 43.6 |
| 3 | Gemini 3.0 Flash | 85.0 | 75.6 | 53.1 | 63.1 | 75.3 | 58.5 | 25.5 | 41.1 | 73.2 | 56.4 | 23.4 | 39.1 | 80.4 | 60.0 | 25.6 | 38.6 | 78.7 | 58.4 | 23.3 | 34.8 | 78.4 | 61.4 | 30.4 | 43.5 |
| 4 | Gemini 2.5 Pro | 86.1 | 58.1 | 31.7 | 41.5 | 72.8 | 37.7 | 11.7 | 22.4 | 70.6 | 39.5 | 12.8 | 24.5 | 73.9 | 44.1 | 10.3 | 22.5 | 77.2 | 43.1 | 11.6 | 20.0 | 75.7 | 44.1 | 15.7 | 26.3 |
| 5 | Gemini 2.5 Flash | 75.1 | 50.9 | 29.0 | 42.6 | 70.4 | 34.3 | 8.0 | 18.2 | 64.8 | 35.6 | 11.3 | 24.9 | 68.0 | 35.1 | 9.0 | 21.1 | 71.2 | 35.4 | 4.7 | 14.8 | 69.7 | 37.8 | 12.6 | 24.6 |
| 6 | Qwen3-Omni | 65.7 | 42.4 | 10.3 | 28.5 | 58.3 | 29.6 | 2.9 | 14.0 | 57.8 | 26.1 | 2.1 | 13.2 | 66.7 | 31.5 | 6.4 | 16.8 | 67.6 | 34.3 | 3.9 | 14.1 | 63.2 | 32.4 | 5.2 | 17.5 |
| 7 | VITA-1.5 | 48.4 | 14.8 | 2.8 | 3.9 | 43.4 | 16.4 | 2.2 | 3.2 | 33.5 | 11.7 | 0.0 | 1.0 | 35.9 | 7.7 | 0.6 | 1.9 | 37.1 | 10.9 | 0.0 | 0.8 | 39.4 | 12.1 | 1.1 | 2.2 |
| 8 | MiniCPM-o-4.5 | 34.8 | 7.4 | 0.7 | 2.2 | 34.7 | 5.5 | 0.0 | 1.0 | 25.9 | 4.1 | 0.0 | 1.8 | 35.7 | 4.2 | 0.6 | 1.2 | 33.0 | 3.4 | 0.0 | 1.0 | 32.9 | 4.9 | 0.3 | 1.5 |
| 9 | Baichuan-Omni-1.5 | 19.5 | 4.9 | 0.0 | 1.0 | 16.2 | 4.0 | 0.0 | 0.5 | 12.6 | 1.4 | 0.0 | 0.0 | 18.9 | 2.4 | 0.0 | 0.2 | 17.9 | 4.1 | 0.0 | 0.5 | 17.0 | 3.3 | 0.0 | 0.4 |
* TM: Type Match, EM: Exact Match, SR: Success Rate, GP: Goal Progress.