In the world of AI agents that click, scroll, execute and automate — we’re moving fast from “just understand text” to “actually use software for you.” The new benchmark SCUBA tackles exactly that: how well can agents do real enterprise workflows inside the Salesforce platform?
What makes SCUBA stand out:

- It’s built around the actual workflows inside the Salesforce platform.
- It covers 300 task instances derived from real user interviews (platform admins, sales reps, and service agents).
- The tasks test not just “does the model answer the question” but “can the model use the UI, manipulate data, trigger workflows, troubleshoot issues.”
- It addresses a gap: current benchmarks often focus on web navigation and software manipulation — but enterprise-software “computer use” is hard to measure. SCUBA aims to fill that.
Key Takeaway: If you want agents that don’t just chat, but act in business software, this is a big step.
The Business Impact
Imagine an AI assistant that can navigate your CRM, update records, launch workflows, interpret dashboard failures, and help your service team get unstuck. That’s the vision this paper leans into.
Here’s why it’s compelling:
- Enterprise alignment: Many benchmarks are academic or consumer-web oriented. SCUBA puts the spotlight on business-critical environments (admin, sales, and service).
- Realistic tasks: By deriving tasks from user interviews and genuine personas, it bridges the gap between “toy benchmark” and “live user situation.”
- Measurable agent performance in context: It enables evaluation of how well an agent operates inside software systems, not just via text.
- Roadmap for future AI assistants: As more organizations adopt AI to automate software use (not just analysis), benchmarks like this set expectations, highlight challenges, and direct progress.
For businesses like Salesforce (and their customers) the implications are clear: better agent tooling, fewer manual clicks, faster issue resolution, more efficient sales/service teams. For the AI community: a new frontier of “task execution in UI” rather than “just text reasoning”.

Key Insights:
1. Real-world domain shift is hard
The performance drop when moving from the more generic OSWorld benchmark (which covers desktop applications) to SCUBA (CRM, enterprise workflows) is significant. The experiment shows a chart of drop in success rates when shifting the benchmark.

2. Demonstrations Help
Knowledge articles and tutorials on how to use salesforce platforms are easily accessible. One natural question is whether AI agents can leverage this information effectively like humans do. The experiment results reveal that
- Human demonstrations (showing the agent how to do a similar task) improved performance across most agents: higher success rates, lower time, lower token usage (please see the technical report for more details). But, some agents did not benefit as much.
- Also, some ended up using more steps in demonstration-augmented mode (for example due to discovering “shortcuts” that the human demo didn’t show). So the design of demonstrations still matters.

3. Cost, latency, and practical deployment matter
- Success rate is not the only metric; latency (time to complete tasks) and cost (API/token costs, number of steps) are also reported. For instance, browser-use agents had high success rates but higher latency (due to API service response time & multi-agent framework design).
- Demonstration augmentation not only improves success but can reduce time and costs (the paper reports ~13% lower time, ~16% lower cost in the demonstration-augmented setting).
- For enterprise adoption, this matters: an agent that succeeds but is too slow or too costly may be less useful in practice.
Implications for the Future of CRM Automation:




Ran Xu
Director, AI Research
Ran Xu received his Ph.D. in computer science from University at Buffalo from 2015. Currently, he leads a group of exceptional computer vision and multimodal AI researchers at Salesforce to push the boundary of research and productive AI for CRM.
More by Ran

Zeyuan Chen
Senior Manager, Research
Zeyuan Chen is a Senior Manager of Research at Salesforce AI Research, where he has been contributing since 2019. His work focuses on advancing computer vision, machine learning, multimodal AI, AI agents, and workflow automation through code generation and data visualization. He holds a Bachelor’s…
Read More
degree from Huazhong University of Science and Technology, a Master’s from Cornell University, and a Ph.D. from North Carolina State University, experiences that have shaped his journey in AI research.
More by Zeyuan




