Phone AI Agent Benchmarks Overstate Real Performance: New Study Exposes CLI, API Gap

TechTheDay

A complacent story has taken over the phone-agent race. Models score high on leaderboards, press releases hum, and the public hears a comforting claim: the agent can “use a phone.” What that claim actually signals is narrower: the agent can look at a screen, pick a button, and tap it. Useful. Not the job. Real mobile work leaks beneath the icons into files, permissions, system state, and services. PhoneHarness, a new framework posted on arXiv, drags that hidden layer into evaluation. It tests GUI actions alongside shell commands and API calls. The results don’t flatter today’s models. Good. A benchmark that flatters everything turns into marketing copy with a stopwatch.

Three Modalities, One Reality Check

PhoneHarness treats phone automation as three distinct modes. GUI actions cover tapping, swiping, scrolling, and reading screenshots or accessibility trees. Older suites like AndroidWorld, AndroidLab, MAS-Bench, and AmbiBench mostly grade this mode, which quietly teaches everyone to equate screen skill with full competence. Shell actions, the CLI side of the gap, target the device OS through commands that touch files, processes, and system state. API actions bypass the screen with programmatic calls that can compress many GUI steps into one. PhoneHarness runs all three inside one harness on real Android environments across 14 task categories. Same device context, different output demands. That design forces an apples-to-apples view of where agents actually break.
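
To make the three-mode split concrete, here is a minimal sketch of how a harness might tag each action by modality and report success per mode instead of one blended score. The class names, fields, and the `success_by_modality` helper are illustrative assumptions, not PhoneHarness's actual schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple, Union


@dataclass
class GuiAction:
    kind: str       # e.g. "tap", "swipe", "scroll"
    x: int
    y: int


@dataclass
class ShellAction:
    command: str    # a command line run against the device OS


@dataclass
class ApiAction:
    endpoint: str   # hypothetical service or tool name
    params: dict


# Hypothetical union type: every agent step is exactly one of the three modes.
Action = Union[GuiAction, ShellAction, ApiAction]


def modality(action: Action) -> str:
    """Map an action to the mode it belongs to."""
    if isinstance(action, GuiAction):
        return "gui"
    if isinstance(action, ShellAction):
        return "shell"
    if isinstance(action, ApiAction):
        return "api"
    raise TypeError(f"unsupported action type: {type(action).__name__}")


def success_by_modality(results: Iterable[Tuple[Action, bool]]) -> Dict[str, float]:
    """Compute the success rate separately for each modality."""
    totals = defaultdict(int)
    wins = defaultdict(int)
    for action, succeeded in results:
        mode = modality(action)
        totals[mode] += 1
        wins[mode] += int(succeeded)
    return {mode: wins[mode] / totals[mode] for mode in totals}


# Made-up outcomes, purely for illustration.
sample_results = [
    (GuiAction("tap", 120, 640), True),
    (GuiAction("swipe", 300, 900), True),
    (ShellAction("ls /sdcard"), True),
    (ShellAction("pm list packages"), False),
    (ApiAction("set_alarm", {"hour": 7}), False),
]
rates = success_by_modality(sample_results)
```

Reporting the three rates side by side is the point: a single averaged number would hide the fact that, in this made-up sample, the GUI rate is perfect while the API rate is zero.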

GUI Skill Doesn’t Transfer, It Misleads

GUI agents often look smarter than they are. They receive a screenshot, then output a tap tied to coordinates or an element id, executed through ADB or accessibility services. This rewards visual grounding and short-horizon planning. Shell work asks for something else: commands that run. Models can know what should happen and still fail because a flag, quote, or path goes wrong. APIs create a third failure pattern. Tool calls tempt models into confident nonsense: wrong parameter names, wrong types, endpoints that don’t exist. PhoneHarness makes the uncomfortable point measurable. Strong GUI scores don’t predict shell or API success.
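
The three failure patterns are easier to see side by side. The sketch below is an illustration under stated assumptions, not PhoneHarness code: `adb shell input tap` is a real ADB invocation, `shlex.quote` targets POSIX-style shells (assumed here to match the device shell), and the `set_alarm` tool schema is entirely hypothetical.

```python
import shlex
from typing import Dict, List

# GUI mode: a tap reduces to two coordinates handed to ADB.
def adb_tap_argv(x: int, y: int) -> List[str]:
    """Build the argument list for `adb shell input tap X Y`."""
    return ["adb", "shell", "input", "tap", str(x), str(y)]


# Shell mode: knowing the intent is not enough; the command must parse.
def shell_ls_command(path: str) -> str:
    """Quote the path so spaces and metacharacters survive the shell."""
    return "ls -l " + shlex.quote(path)


# API mode: a hypothetical tool schema mapping parameter names to types.
TOOL_SCHEMAS: Dict[str, Dict[str, type]] = {
    "set_alarm": {"hour": int, "minute": int, "label": str},
}


def validate_tool_call(name: str, args: dict) -> List[str]:
    """Return a list of problems with a proposed tool call (empty if none).

    Only checks for unknown tools, unknown parameters, and wrong types;
    every parameter is treated as optional in this sketch.
    """
    schema = TOOL_SCHEMAS.get(name)
    if schema is None:
        return [f"unknown tool: {name}"]
    errors = []
    for key, value in args.items():
        if key not in schema:
            errors.append(f"unknown parameter: {key}")
        elif not isinstance(value, schema[key]):
            errors.append(
                f"wrong type for {key}: expected {schema[key].__name__}"
            )
    return errors


tap = adb_tap_argv(540, 1200)
listing = shell_ls_command("/sdcard/My Files/log.txt")
problems = validate_tool_call("set_alarm", {"hour": "7", "min": 30})
missing = validate_tool_call("open_wifi_panel", {})
```

The tap needs nothing beyond two integers. The shell command fails without the quoting step, because the space in the path splits it into two arguments. The tool call is rejected for exactly the errors the paragraph above describes: a string where an integer belongs, a parameter name that was never defined, and a tool that does not exist.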

Benchmarks Saturate, Then They Lie

Benchmarks saturate when a community optimizes for them. AndroidWorld, published in 2024 and shown at ICLR 2025, offered reproducible emulation and clean scoring. Agents now report over 90 percent success on its 116 tasks, which sounds triumphant until you notice that the suite no longer separates strong models from tuned ones. Newer work tried to fix this by making GUI chains nastier. MobileWorld at ACL 2026 added cross-app workflows and MCP-augmented tasks, with the best model around 51.7 percent. AndroidDaily pushed into closed-source apps and the strongest model reached about 62 percent. PhoneHarness takes a sharper approach. It changes the modality mix. Scores drop when CLI and API actions appear because the benchmark starts measuring neglected skills.
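
A rough calculation shows why a 90 percent score on 116 tasks stops separating models. The sketch below uses the textbook binomial standard error and assumes every task is independent and equally weighted, which is a simplification rather than a description of how AndroidWorld reports results.

```python
import math


def success_rate_standard_error(rate: float, num_tasks: int) -> float:
    """Binomial standard error of a success rate measured on num_tasks tasks."""
    return math.sqrt(rate * (1.0 - rate) / num_tasks)


# Figures from the article: roughly 90 percent success on 116 tasks.
standard_error = success_rate_standard_error(0.90, 116)

# Approximate 95 percent interval half-width under a normal approximation.
half_width = 1.96 * standard_error

# One task on a 116-task suite is worth this many percentage points.
points_per_task = 100.0 / 116
```

Under these assumptions the standard error is a little under 3 percentage points and each task is worth just under 1 point, so two agents that differ by two or three tasks near the ceiling are hard to tell apart from noise.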

The Commercial Pitch Now Looks Risky

Labs want phone agents in the market now, and reports even hint at phones designed around agents. The sales pitch leans on benchmark wins because numbers travel well and caveats don’t. PhoneHarness makes that habit look risky. A product can shine at menu-tapping and still fail the moment it must change a permission, query system state, or call an app service with correct arguments. Users won’t care about a GUI leaderboard if the agent can’t fetch a log, can’t run a diagnostic command, or invents tool schemas. This isn’t a moral failing. It’s an engineering reality about output formats and training exposure. GUI datasets overflow with screenshot-action pairs. Shell and API supervision stays thin.

PhoneHarness should embarrass anyone who sold “phone use” as one skill. Phone automation splits into at least three action languages, and competence in one doesn’t grant competence in the others. GUI work rewards perception and pointing. Shell work punishes sloppy syntax. API work punishes invented structure. Lower scores on a benchmark that includes CLI and API actions don’t mean the benchmark broke. They mean it started measuring the job that real devices demand. The field now faces a productive choice. Keep polishing the screen-tapper and call it solved, or train and test agents that cross the boundary between glass and guts without collapsing.