
IBM and Kaggle launch new AI leaderboards for enterprise tasks

The new leaderboards, built atop benchmarks initiated by IBM Research, are designed to accelerate progress in building and evaluating AI models and agents that can solve real-world enterprise problems.

In a few short years, we’ve seen a proliferation of AI use cases that are already upending how the world lives. And in the enterprise, we’re beginning to see that revolution take hold.

Much like any other software, though, an AI system integrated into a business’s workflow needs to work reliably at scale. But for today’s AI tasks, it isn’t always easy to verify that the models and agents being implemented are doing exactly what engineers expect them to do.

It’s something IBM has been working to resolve. Earlier this year, IBM Research released a new set of benchmarks, called ITBench, aimed at bringing scientific rigor to evaluating how IT automation agents make work easier for enterprises. More recently, researchers at IBM did the same with AssetOpsBench, a set of tools to help developers and AI practitioners build and evaluate AI agents for asset management software.

Today, IBM is partnering with Kaggle to bring those realistic enterprise benchmarks to leaderboards that will be accessible to the thousands of AI engineers, developers, and scientists who visit the site every day. This will help those AI practitioners evaluate models and agents on the realistic, multi-step operational tasks they could face in real-world enterprise settings. The goal is to provide standardized measurements that reflect the actual conditions enterprises face every day, so that developers at these organizations can make the most informed choices about which models and agents to use.

As the world has grown more connected over the years, IT systems have in turn become increasingly complex. The number of points of failure that a software stack or data center could have is so high that even teams of people working around the clock couldn’t possibly keep tabs on everything — let alone determine whether systems might be showing signs of failure in the future. This has led to a boom in automation software to track bugs, system failures, and other issues, and even remediate them.

In the world of asset lifecycle management, the challenge comes from the diversity of data types, from work orders and failure mode descriptions to per-asset alerts and sensor data from IoT devices. Fusing these sources to understand context, orchestrate the relevant analysis, estimate an asset’s health, and determine the corresponding actions is a challenge now being tackled with AI and agents.
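
To make that concrete, the short sketch below shows one simplified form this fusion can take: joining IoT sensor readings with maintenance work orders to flag assets worth a closer look. The file names, column names, and thresholds are illustrative assumptions, not part of AssetOpsBench.

    # Illustrative sketch only: made-up files, columns, and thresholds, showing the
    # kind of data fusion described above (sensor readings joined with work orders
    # to flag assets that may need attention).
    import pandas as pd

    sensors = pd.read_csv("sensor_readings.csv")   # asset_id, timestamp, temperature_c
    work_orders = pd.read_csv("work_orders.csv")   # asset_id, opened_at, failure_mode

    # Keep the most recent reading per asset
    latest = (sensors.sort_values("timestamp")
                     .groupby("asset_id", as_index=False)
                     .last())

    # Count maintenance history per asset and join it to the sensor summary
    history = work_orders.groupby("asset_id", as_index=False).size()
    fused = latest.merge(history, on="asset_id", how="left").fillna({"size": 0})

    # A crude health flag: hot assets with repeated work orders get reviewed first
    fused["needs_review"] = (fused["temperature_c"] > 80) & (fused["size"] >= 3)
    print(fused[["asset_id", "temperature_c", "size", "needs_review"]])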

In the case of ITBench, that could mean tasks like diagnosing faulty services in Kubernetes clusters, assessing compliance against CIS benchmarks, or explaining cost anomalies in cloud environments. For AssetOpsBench, it could mean assessing an asset's condition, identifying how a system could potentially fail, or recommending sensors to help detect future failures.
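
To give a sense of what one of those diagnostic steps looks like in code, here is a minimal sketch, using the official Kubernetes Python client, of the kind of check an SRE-style agent might automate when diagnosing faulty services. It is illustrative only, not part of ITBench; the namespace is hypothetical, and it assumes a reachable cluster with local kubeconfig credentials.

    # Illustrative sketch: flag pods that are not in a healthy phase, a simplified
    # version of the kind of check an SRE-style agent might automate.
    from kubernetes import client, config

    config.load_kube_config()   # or config.load_incluster_config() when running in-cluster
    v1 = client.CoreV1Api()

    unhealthy = []
    for pod in v1.list_namespaced_pod(namespace="checkout").items:   # hypothetical namespace
        # Anything not Running or Succeeded is worth surfacing for diagnosis
        if pod.status.phase not in ("Running", "Succeeded"):
            unhealthy.append((pod.metadata.name, pod.status.phase))

    for name, phase in unhealthy:
        print(f"{name}: {phase}")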

Agentic systems have taken this idea to the next level: they can potentially discover an issue before it happens, recommend the actions needed to fix it, and even implement the fixes themselves. The benchmarks IBM Research has been working on can help developers determine which agentic models can best address their needs. Popular general-purpose benchmarks have driven remarkable advances, but they don’t reflect the demands of real enterprise workflows, where models and agents must reason over incidents, diagnose failures, navigate complex infrastructure, and support high-stakes decision-making.

This is where Kaggle comes in. It’s the place online where AI practitioners convene to compare models, explore the latest datasets, and stress-test the software they’re looking to implement. Through Kaggle’s software development kit, the team made it simple to turn these benchmarks into leaderboards. “They remove the operational complexity of building and maintaining high-quality benchmarks,” said Ayhan Sebin, AI ecosystem and partnership leader at IBM Research.
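
As a rough illustration of how a practitioner might pull standings programmatically, the sketch below uses the long-established Kaggle public API client. The benchmark slug is hypothetical, and the new benchmark leaderboards may expose a different interface than classic competitions do.

    # Illustrative sketch: read a leaderboard through the Kaggle public API client.
    # The slug "itbench-sre-diagnosis" is hypothetical.
    from kaggle.api.kaggle_api_extended import KaggleApi

    api = KaggleApi()
    api.authenticate()   # reads credentials from ~/.kaggle/kaggle.json

    entries = api.competition_leaderboard_view("itbench-sre-diagnosis")
    for entry in entries[:10]:
        print(entry.teamName, entry.score)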

IBM sees the partnership with Kaggle as a way to bring the broader ecosystem together to make these benchmarks stronger and, ultimately, help solve real-world problems. “We want to bring academia, clients, startups and evaluators together to collectively innovate and contribute to enterprise-grade impactful benchmarks,” said Dhaval Patel, senior technical staff member at IBM Research.

While these enterprise leaderboards on Kaggle are designed for ease of use and accessibility, they don’t yet capture all the complexities of a real production environment. Real-world IT systems involve noisy signals, massive scale, and unpredictable behavior. And in domains like site reliability engineering, agents often need to remediate incidents in real time, which falls outside the scope of these offline benchmarks. The current focus is on providing a reproducible, simplified starting point for research and evaluation.

These IBM enterprise leaderboards mark the beginning of a broader effort by IBM Research, partner organizations, and the open community to help solve the kinds of problems that enterprises experience every day. The team will continue expanding these benchmarks, deepening the sorts of tasks they cover, and plans to introduce agentic evaluation to the leaderboards, to help share learnings with the community and contextualize the next wave of enterprise automation.
