
IBM’s new benchmark puts industrial agents to the test

AssetOpsBench is a first-of-its-kind open-source framework for developers to build, evaluate, and improve asset management agents in scenarios that closely mimic real-world enterprise conditions.

Industrial machines can break down in hundreds of different ways that not even the experts may see coming. For enterprises trying to automate the monitoring and maintenance of their most expensive assets, the dream is they wouldn’t need to.

An AI agent, or several agents working together, could respond quickly to alerts and catch problems before they spiral into something serious, averting costly shutdowns and potentially adding years of life to equipment with large upfront replacement costs.

Agents for managing industrial assets are on the horizon, and the new AssetOpsBench framework born out of IBM Research provides a peek at what’s coming. Similar to ITBench, IBM’s sister framework for evaluating IT operation agents, AssetOpsBench lays out realistic problem scenarios for an LLM agent to solve either alone or by collaborating with other agents.

True to life, the 141 problems in AssetOpsBench require agents to call tools and coordinate with one another, interpreting raw sensor data for individual machines, along with their failure mode and work order histories, to diagnose and remediate problems.

AssetOpsBench also includes an automated evaluation agent that grades the orchestrator’s final answer and each step it took to get there, using a rubric that weighs accuracy, logic, and thoroughness, among other criteria.
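
As a rough illustration, such a rubric-driven grader might be organized along the lines below, with per-step grades rolled up into an overall trajectory score. The criterion names, weights, and 50/50 split are assumptions made for this sketch, not AssetOpsBench’s actual scoring scheme.

```python
# Hypothetical sketch of a rubric-based trajectory grader, loosely modeled on the
# evaluation agent described above. Criteria, weights, and the aggregation rule
# are illustrative, not the benchmark's real schema.
from dataclasses import dataclass

@dataclass
class CriterionScore:
    name: str      # e.g. "accuracy", "logic", "thoroughness"
    weight: float  # relative importance within the rubric
    score: float   # 0.0-1.0 grade assigned by the evaluation LLM

def grade_step(criteria: list[CriterionScore]) -> float:
    """Weighted average over rubric criteria for a single agent step."""
    total = sum(c.weight for c in criteria)
    return sum(c.weight * c.score for c in criteria) / total

def grade_trajectory(steps: list[list[CriterionScore]], final_answer_score: float) -> float:
    """Combine per-step grades with the grade given to the final answer."""
    step_score = sum(grade_step(s) for s in steps) / max(len(steps), 1)
    return 0.5 * step_score + 0.5 * final_answer_score  # illustrative 50/50 split
```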

The benchmark is open to any orchestration architecture that developers might choose to run their agents. IBM researchers tested two popular paradigms — the “plan-and-execute” approach, in which an LLM orchestrator drafts a plan and delegates its execution to agents, tools, or a software system; and the “agents-as-tools” approach, in which an orchestrator synthesizes feedback from a team of specialized agents and executes their suggested plan.
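
In outline, the two paradigms differ mainly in when the orchestrator commits to a plan, as the minimal Python sketch below tries to show. The llm callable, the agent registry, and the structured step format are placeholders for illustration, not the framework’s real interfaces.

```python
# Minimal sketch contrasting the two orchestration paradigms. `llm` is a placeholder
# callable assumed to return output already parsed into the shape each prompt asks for;
# `agents` maps names to objects with a .run() method. Both are stand-ins, not
# AssetOpsBench APIs.

def plan_and_execute(task: str, agents: dict, llm) -> str:
    # The orchestrator drafts the whole plan up front...
    plan = llm(f"Break this task into steps and name an agent for each: {task}")
    results = []
    for step in plan:  # e.g. {"agent": "iot", "instruction": "..."}
        # ...then delegates each step's execution to the named agent or tool.
        results.append(agents[step["agent"]].run(step["instruction"]))
    return llm(f"Summarize these results for the user: {results}")

def agents_as_tools(task: str, agents: dict, llm) -> str:
    # The orchestrator treats each specialized agent as a callable tool, folding
    # its feedback back into context before deciding on the next call.
    context = []
    while True:
        action = llm(f"Task: {task}\nFindings so far: {context}\n"
                     "Name the next agent to call, or FINISH with an answer.")
        if action["name"] == "FINISH":
            return action["answer"]
        context.append(agents[action["name"]].run(action["input"]))
```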

Though the agents-as-tools approach took more time and computation, the team found it ultimately delivered better results, regardless of the orchestration model’s size. They hypothesize, however, that the more efficient plan-and-execute approach would have likely performed better, had the models been trained on task-specific knowledge.

Regardless of the architecture used, AssetOpsBench proved challenging for even the largest, newest models. OpenAI’s GPT-4, a frontier model with an estimated 1.8 trillion parameters, completed just 65% of tasks under the agents-as-tools approach, followed by Meta’s 17-billion-parameter Llama 4 Maverick with a 59% task-completion rate.

IBM's Granite 3.3 8B model was the smallest to be evaluated, but it held its own against Meta’s older Llama 3.3 70B model, with a 35% completion rate to the Llama model’s 40%.

AssetOpsBench comes with 141 problem scenarios and four built-in AI agents that can analyze and act on raw sensor and time-series data, work orders, and machine failure mode updates.

Transparency by design

Despite its name, AssetOpsBench was designed to be much more than a benchmark. It was meant to also help researchers isolate where agents get stuck in complex, multi-step reasoning problems so they can retrace their steps, correct mistakes, and iteratively make them better at their jobs.

The researchers used a visualization tool developed at IBM, called Agent Trajectory Explorer, to essentially export instant replays of how their agents tackled different problems (other open-source replay tools are available from AgentOps or LangSmith).

Researchers used a built-in AssetOpsBench module to analyze their agents’ missteps, borrowing a multi-agent failure taxonomy devised by researchers at the University of California, Berkeley. The Berkeley team identified 14 failure modes in a large-scale dataset of multi-agent interactions. But using AssetOpsBench and its evaluation agent, the IBM team unearthed a class of “emerging” failures that varied slightly from those flagged by the Berkeley researchers.

These nuanced failure patterns show just how complex agent collaboration is becoming, and why any good Industry 4.0 agent benchmark should include failure analysis. “The ability to detect emergent, intersectional failures is a foundational requirement for reliable, multi-agent orchestration,” said Dhaval Patel, an IBM researcher who focuses on industrial automation and led the team behind AssetOpsBench.

Benchmarking LLM agents for enterprise applications

Today’s revolution in AI and natural language processing owes much of its rapid progress to rigorous, standardized benchmarks that give researchers not only a means to measure and compare competing methods, but a common vision of what success looks like.

While earlier benchmarks tested LLMs on their general knowledge and reasoning capabilities, a new class of multi-agent benchmarks is emerging that evaluates models on their ability to solve practical, real-life problems. IBM’s new IT and asset management benchmarks are the first in a series of tests tied to enterprise applications that IBM Research has planned.

Inspired by popular LLM benchmarks like MMLU and HellaSwag, AssetOpsBench has a multiple-choice format. But it goes further, probing agents on their ability to marshal sensor data, fault information, work orders, and operation manuals to resolve problems that could stump even seasoned technicians.

A typical AssetOpsBench problem might involve predicting energy usage for a given machine. For example: “What is the predicted energy consumption for Chiller 9 in the week of 2020-04-27 based on data from the MAIN site?”

To solve the problem, the framework’s built-in IoT agent would locate and retrieve historical data and sensor readings for the right machine, at the right site, and pass them to the built-in time series agent for analysis. The time series agent, using IBM’s efficient time-series models, would pass its estimate to the LLM orchestrator to convey to the user.
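
A rough sketch of that handoff is below. The agent names and method signatures are assumptions made for illustration; in the framework itself, these steps are routed through the LLM orchestrator rather than hard-wired function calls.

```python
# Hypothetical end-to-end flow for the energy-forecast scenario above.
def forecast_energy(iot_agent, ts_agent, asset: str, site: str, week_start: str) -> str:
    # 1. The IoT agent locates the right machine at the right site and pulls its history.
    history = iot_agent.get_sensor_history(asset=asset, site=site)
    # 2. The time series agent forecasts energy use over the requested horizon.
    forecast_kwh = ts_agent.predict_energy(history, horizon_start=week_start, horizon_days=7)
    # 3. The orchestrator turns the numeric estimate into an answer for the user.
    return (f"Predicted energy consumption for {asset} at {site}, "
            f"week of {week_start}: {forecast_kwh:.1f} kWh")

# e.g. forecast_energy(iot, ts, asset="Chiller 9", site="MAIN", week_start="2020-04-27")
```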

Other scenarios could invoke different AssetOpsBench agents. For example, “Is the chiller compressor overheating, and if so, can you generate a work order?” For this type of problem, the IoT agent would gather sensor data for the chiller compressor and pass it to the built-in failure analysis agent that’s been trained on all the things that can go wrong with an industrial chiller.

The failure analysis agent can also connect failure modes like an overheating compressor with sensor data that makes the symptoms of an impending failure visible to the monitoring system. For example, an overheating compressor could be inferred from sensor data via changes in chiller efficiency, evaporator temperature, or power input.

If the agent detects signs of trouble, it can pass its analysis to the built-in work order agent to write a ticket that will summon a technician to the site to fix the problem before it becomes something more serious.
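
As a toy version of that detect-then-ticket flow: the thresholds, field names, and work-order call below are invented for illustration, and the benchmark’s actual failure-mode mappings and agents are far richer.

```python
# Illustrative rule-of-thumb check a failure analysis agent might apply before
# escalating to the work order agent. All numbers are made up for this sketch.
RATED_POWER_KW = 250.0  # hypothetical nameplate rating for the chiller

def compressor_overheating(efficiency_drop_pct: float,
                           evaporator_temp_c: float,
                           power_input_kw: float) -> bool:
    return (efficiency_drop_pct > 10.0                 # chiller efficiency falling
            or evaporator_temp_c > 7.0                 # evaporator running warm
            or power_input_kw > 1.2 * RATED_POWER_KW)  # drawing excess power

def maybe_open_ticket(asset: str, readings: dict, work_order_agent) -> None:
    if compressor_overheating(**readings):
        # Hand the finding to the work order agent so a technician gets dispatched.
        work_order_agent.create(asset=asset,
                                issue="Suspected compressor overheating",
                                priority="high")
```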

The four agents built into AssetOpsBench use a ReAct architecture and are ready to tackle common asset management tasks. Developers can also port in their own specialized agents to evaluate alongside their LLM orchestrator.
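
For reference, the loop those agents run is the standard ReAct pattern of alternating reasoning with tool calls. The sketch below shows that loop plus a hypothetical registration call for a developer-supplied agent; neither is AssetOpsBench’s actual plug-in interface.

```python
# Generic ReAct (reason + act) loop of the kind the built-in agents follow.
# `llm` is assumed to return parsed, structured output; tools are plain callables.
class ReActAgent:
    def __init__(self, llm, tools: dict):
        self.llm, self.tools = llm, tools

    def run(self, task: str, max_steps: int = 8) -> str:
        scratchpad = []
        for _ in range(max_steps):
            # Thought -> Action -> Observation, repeated until the agent answers.
            step = self.llm(f"Task: {task}\nHistory: {scratchpad}\n"
                            "Give a thought, then either a tool call or a FINAL answer.")
            if step.get("final"):
                return step["final"]
            observation = self.tools[step["tool"]](**step["args"])
            scratchpad.append({"thought": step["thought"], "observation": observation})
        return "No answer reached within the step budget."

# A developer-supplied agent could then sit beside the four built-in ones, e.g.
# orchestrator.register_agent("vibration", ReActAgent(llm, vibration_tools))  # hypothetical call
```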

Separately, the team has put together several datasets to test LLMs in narrow areas of industry expertise. FailureSensorIQ is among the most challenging, with more than 8,000 questions focused on how well LLMs can link machine breakdown codes, known as Failure Mode and Effects Analysis (FMEA) codes, with the raw sensor data corresponding to the issue.

The questions are so difficult that most of the frontier models they tested got only about half correct, and their five human experts (in industrial engineering and data science) averaged just 60%.

Multi-agent benchmarks and beyond

A team of AI researchers at IBM recently surveyed more than 120 AI agent benchmarks and identified several areas for improvement. They called out the need to break down agent evaluations into intermediate steps to make it easier to find and fix mistakes.

They also recommended putting more emphasis on safety as well as automating the evaluation process to speed up reviews. AssetOpsBench succeeds on all three counts, said Michal Shmueli-Scheuer, a distinguished engineer who leads IBM’s AI evaluation team and co-authored the survey. “This new benchmark is both more realistic and more challenging than other industrial automation benchmarks out there,” she said.

Future versions of AssetOpsBench will factor computation and tool-use costs into their evaluations — another survey recommendation. “Our current setup assumes that API access to the environment is cost-free and unconstrained,” said Patel. “For AI agents to succeed in the business world, they need to be reliable, but also cost-effective.”

IBM recently added an LLM-powered interface to its Maximo Application Suite and is planning to bring in reinforcements, starting with a new Condition Insights agent that can size up an asset’s current state and recommend a replacement date.

The team behind AssetOpsBench hopes that the framework can help developers build, evaluate, and improve agents for Industry 4.0, including agents trained for new use cases. “We invite the community to use these real-life scenarios and the agent trajectories they generate to advance multi-agent systems to the point where they can deliver real value to enterprises,” said Jayant Kalagnanam, director of AI applications at IBM Research.
