Meet SCUBA: The Next Frontier in Enterprise-Agent Evaluation

October 30, 2025

In the world of AI agents that click, scroll, execute and automate — we’re moving fast from “just understand text” to “actually use software for you.” The new benchmark SCUBA tackles exactly that: how well can agents do real enterprise workflows inside the Salesforce platform?

What makes SCUBA stand out:

It’s built around the actual workflows inside the Salesforce platform.
It covers 300 task instances derived from real user interviews (platform admins, sales reps, and service agents).
The tasks test not just “does the model answer the question” but “can the model use the UI, manipulate data, trigger workflows, troubleshoot issues.”
It addresses a gap: current benchmarks often focus on web navigation and software manipulation — but enterprise-software “computer use” is hard to measure. SCUBA aims to fill that.

Key Takeaway: If you want agents that don’t just chat, but act in business software, this is a big step.

The Business Impact

Imagine an AI assistant that can navigate your CRM, update records, launch workflows, interpret dashboard failures, and help your service team get unstuck. That’s the vision this paper leans into.

Here’s why it’s compelling:

Enterprise alignment: Many benchmarks are academic or consumer-web oriented. SCUBA puts the spotlight on business-critical environments (admin, sales, and service).
Realistic tasks: By deriving tasks from user interviews and genuine personas, it bridges the gap between “toy benchmark” and “live user situation.”
Measurable agent performance in context: It enables evaluation of how well an agent operates inside software systems, not just via text.
Roadmap for future AI assistants: As more organizations adopt AI to automate software use (not just analysis), benchmarks like this set expectations, highlight challenges, and direct progress.

For businesses like Salesforce (and their customers) the implications are clear: better agent tooling, fewer manual clicks, faster issue resolution, more efficient sales/service teams. For the AI community: a new frontier of “task execution in UI” rather than “just text reasoning”.

Key Insights:

1. Real-world domain shift is hard

The performance drop when moving from the more generic OSWorld benchmark (which covers desktop applications) to SCUBA (CRM, enterprise workflows) is significant. The experiment shows a chart of drop in success rates when shifting the benchmark.

2. Demonstrations Help

Knowledge articles and tutorials on how to use salesforce platforms are easily accessible. One natural question is whether AI agents can leverage this information effectively like humans do. The experiment results reveal that

Human demonstrations (showing the agent how to do a similar task) improved performance across most agents: higher success rates, lower time, lower token usage （please see the technical report for more details). But, some agents did not benefit as much.
Also, some ended up using more steps in demonstration-augmented mode (for example due to discovering “shortcuts” that the human demo didn’t show). So the design of demonstrations still matters.

3. Cost, latency, and practical deployment matter

Success rate is not the only metric; latency (time to complete tasks) and cost (API/token costs, number of steps) are also reported. For instance, browser-use agents had high success rates but higher latency (due to API service response time & multi-agent framework design).
Demonstration augmentation not only improves success but can reduce time and costs (the paper reports ~13% lower time, ~16% lower cost in the demonstration-augmented setting).
For enterprise adoption, this matters: an agent that succeeds but is too slow or too costly may be less useful in practice.

Implications for the Future of CRM Automation:

Ran Xu
Director, AI Research

Ran Xu received his Ph.D. in computer science from University at Buffalo from 2015. Currently, he leads a group of exceptional computer vision and multimodal AI researchers at Salesforce to push the boundary of research and productive AI for CRM.

More by Ran

Zeyuan Chen
Senior Manager, Research

Zeyuan Chen is a Senior Manager of Research at Salesforce AI Research, where he has been contributing since 2019. His work focuses on advancing computer vision, machine learning, multimodal AI, AI agents, and workflow automation through code generation and data visualization. He holds a Bachelor’s
Read More

More by Zeyuan

Share this post :

Create a new perspective on life

Your Ads Here (365 x 270 area)

Purchase Now

Hackers Are Using WhatsApp to Deliver Malware to Windows PCs

Subscribe our newsletter

Stay updated with the latest tools and insights—straight from ToolRelay.

Meet SCUBA: The Next Frontier in Enterprise-Agent Evaluation

What makes SCUBA stand out:

The Business Impact

Key Insights:

1. Real-world domain shift is hard

2. Demonstrations Help

3. Cost, latency, and practical deployment matter

Implications for the Future of CRM Automation:

Share this post :

Create a new perspective on life

Hackers Are Using WhatsApp to Deliver Malware to Windows PCs

The new Supergirl trailer is getting some surprising comparisons

Best POD Sites for Influencer Merch in 2026

Iran Threatens to Attack Apple, Google, and Other US Tech Firms in Middle East

Subscribe our newsletter

Quick Links

Legal Pages

Newsletter

Copyright © 2025 ToolRelay, All rights reserved. Build by AQIVSOL.COM