IBM Research Releases VAKRA Benchmark to Test Real-World Agent Reasoning
IBM Research has launched VAKRA, an executable benchmark that evaluates AI agents across 62 enterprise domains with 8,000+ APIs. The benchmark reveals significant gaps in current agent capabilities, with models struggling on multi-hop reasoning and policy adherence tasks that require combining API calls with document retrieval.
The Benchmark Gap
On April 15, 2026, IBM Research released VAKRA, a tool-grounded executable benchmark designed to evaluate how well AI agents reason and act in enterprise-like environments. Unlike traditional benchmarks that test isolated skills, VAKRA measures compositional reasoning across APIs and documents using full execution traces.
The release addresses a structural problem in agent evaluation: most benchmarks test abstract reasoning or single-turn tool use, while production agents must chain multiple API calls, retrieve information from documents, and follow usage policies across multi-turn conversations.
VAKRA Architecture
VAKRA provides an executable environment where agents interact with over 8,000 locally hosted APIs backed by real databases spanning 62 domains, along with domain-aligned document collections. Tasks require 3-7 step reasoning chains that combine structured API interaction with unstructured retrieval under natural-language tool-use constraints.
The benchmark comprises four capability tiers:
Capability 1: API Chaining (2,077 instances)
Tests agents' ability to chain 1-12 tool calls across 54 domains using the SLOT-BIRD and SEL-BIRD tool collections. Agents must initialize data sources via a get_data call, then filter, sort, and aggregate results through sequential tool invocations.
Example task: "Which football team has a build-up play speed of 31, build-up play dribbling of 53, and build-up play passing of 32?" requires four sequential filter operations followed by a team name extraction.
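The chained pattern in that example can be sketched in a few lines of Python. This is a minimal illustration rather than VAKRA's actual tool interface: only the get_data name comes from the description above, while filter_eq, select_column, the column names, and the in-memory rows are hypothetical stand-ins for database-backed tools.

```python
from typing import Any

Row = dict[str, Any]

# Hypothetical rows standing in for a database-backed data source.
TEAM_ATTRIBUTES: list[Row] = [
    {"team": "Team A", "speed": 31, "dribbling": 53, "passing": 32},
    {"team": "Team B", "speed": 31, "dribbling": 40, "passing": 32},
    {"team": "Team C", "speed": 60, "dribbling": 53, "passing": 32},
]


def get_data(source: list[Row]) -> list[Row]:
    """Initialize the working set (name from the article, body assumed)."""
    return list(source)


def filter_eq(rows: list[Row], column: str, value: Any) -> list[Row]:
    """Hypothetical filter tool: keep rows where column equals value."""
    return [row for row in rows if row[column] == value]


def select_column(rows: list[Row], column: str) -> list[Any]:
    """Hypothetical extraction tool: project a single column."""
    return [row[column] for row in rows]


def answer_query() -> list[str]:
    """Each call consumes the previous call's output, forming the chain."""
    rows = get_data(TEAM_ATTRIBUTES)
    rows = filter_eq(rows, "speed", 31)
    rows = filter_eq(rows, "dribbling", 53)
    rows = filter_eq(rows, "passing", 32)
    return select_column(rows, "team")
```

An agent that drops or reorders an intermediate result breaks every later step, which is why longer chains are harder to complete.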
Capability 2: Tool Selection (1,597 instances)
Agents must select the correct APIs from domain-specific tool sets containing 6-328 tools (average 116). Because the largest sets exceed the 128-tool limit in the OpenAI API specification, this capability also tests whether an agent has a shortlisting mechanism for managing large tool universes.
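One simple way to stay under such a limit is to rank tools by lexical overlap with the query and keep the top entries. The sketch below is an assumption for illustration only: the article does not say how agents should shortlist, and the shortlist_tools function, the tool dictionary shape, and the scoring rule are all hypothetical. Real systems more often use embedding-based retrieval.

```python
import re

MAX_TOOLS = 128  # tool limit cited in the article


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def shortlist_tools(query: str, tools: list[dict], limit: int = MAX_TOOLS) -> list[dict]:
    """Return up to `limit` tools, ranked by token overlap with the query.

    Each tool is assumed to be a dict with "name" and "description" keys.
    Ties keep their original order because sorted() is stable.
    """
    query_tokens = tokenize(query)

    def overlap(tool: dict) -> int:
        return len(query_tokens & tokenize(tool["name"] + " " + tool["description"]))

    return sorted(tools, key=overlap, reverse=True)[:limit]
```

With a 328-tool domain, this returns the 128 best-matching tools, which can then be passed to the model in a single request.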
Capability 3: Multi-Hop Reasoning (869 instances)
This tier adds multi-hop reasoning across 38 domains. Questions require 1-5 logical hops, with evidence extracted and combined from multiple API calls. Query types include comparison, aggregation, and cross-domain inference.
Capability 4: Multi-Source Reasoning with Policy Adherence (644 instances)
The most complex tier combines:
- Multi-source retrieval: Information from APIs and document indexes must be combined, with decontamination ensuring each hop requires a specific source
- Multi-turn conversations: Agents handle dialog context across multiple turns
- Tool-usage policies: Plain-text constraints specifying which sources agents may access for different query types
Example policy: "If a user's query pertains to Technology & Software, make sure you try answering them by only using document retrievers. Do not use other types of tools."
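A policy like this can be verified programmatically once it is expressed as a rule over the tool trajectory. The following sketch assumes the plain-text policy has already been translated by hand into a mapping from query category to allowed tool kinds. The ToolCall type, the "api" and "retriever" labels, and the POLICY table are hypothetical and do not reflect VAKRA's internal representation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    name: str
    kind: str  # assumed labels: "api" or "retriever"


# Hypothetical structured form of the example policy quoted above.
POLICY: dict[str, set[str]] = {
    "Technology & Software": {"retriever"},
}


def policy_violations(
    category: str,
    trajectory: list[ToolCall],
    policy: dict[str, set[str]] = POLICY,
) -> list[ToolCall]:
    """Return the calls in the trajectory that the policy does not allow.

    Categories without a policy entry are treated as unconstrained.
    """
    allowed = policy.get(category)
    if allowed is None:
        return []
    return [call for call in trajectory if call.kind not in allowed]
```

An empty result means the trajectory complied. Any returned call identifies exactly where the agent took a disallowed shortcut.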
Evaluation Framework
VAKRA introduces execution-centric evaluation that assesses both final outputs and full tool-execution trajectories:
| Stage | Purpose | Method |
|---|---|---|
| Policy adherence | Verify constraint compliance | Programmatic check (Capability 4 only) |
| Tool sequence comparison | Validate reasoning path | Execute predicted tools, compare response sets to ground truth |
| Final response evaluation | Check answer correctness | LLM-based judge for grounding and factual consistency |
The evaluation uses a waterfall pipeline in which each stage runs only if the earlier stages succeed. Notably, VAKRA accepts alternative valid tool invocations rather than enforcing strict step-level matching: an agent can reach the correct answer through a different sequence if all required information is recovered.
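The waterfall logic can be sketched as a short function that stops at the first failing stage. This is an illustrative reading of the table above, not IBM's implementation: the function names, the use of strings for tool responses, and the judge callable standing in for the LLM-based judge are all assumptions. The set comparison reflects the idea that any tool sequence is acceptable as long as it recovers all the ground-truth information.

```python
from typing import Callable, Optional


def covers_ground_truth(predicted: list[str], expected: list[str]) -> bool:
    """True if every expected tool response appears among the predicted ones.

    Order and extra calls are ignored, so alternative sequences can pass.
    """
    return set(expected) <= set(predicted)


def evaluate(
    policy_ok: Optional[bool],
    predicted_responses: list[str],
    expected_responses: list[str],
    judge: Callable[[], bool],
) -> str:
    """Run the stages in order and report the first failure.

    policy_ok is None when no policy applies (assumed for Capabilities 1-3).
    judge is a stand-in for the LLM-based final-response check.
    """
    if policy_ok is False:
        return "failed: policy adherence"
    if not covers_ground_truth(predicted_responses, expected_responses):
        return "failed: tool sequence"
    if not judge():
        return "failed: final response"
    return "passed"
```

Because later stages are skipped after a failure, a sample that violates its policy is never sent to the judge.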
Scoring
Leaderboard scores weight all four capabilities equally:
Leaderboard_Score = (Capability1 + Capability2 + Capability3 + Capability4) / 4
Capabilities 1-3 use simple accuracy (correct queries / total queries). Capability 4 weights multi-source queries 2x higher than API-only or RAG-only queries, reflecting their increased complexity.
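The arithmetic can be made concrete with a short sketch. The leaderboard average and the simple accuracy follow the description above. The Capability 4 function is an assumption: the article states only that multi-source queries are weighted 2x, so treating the score as a weighted average, and the query-type labels used here, are hypothetical choices.

```python
def accuracy(correct: int, total: int) -> float:
    """Simple accuracy used for Capabilities 1-3."""
    return correct / total


def capability4_score(
    results: list[tuple[str, bool]],
    multi_source_weight: float = 2.0,
) -> float:
    """Weighted accuracy over (query_type, is_correct) pairs.

    Queries labeled "multi_source" count multi_source_weight times;
    all other types (for example API-only or RAG-only) count once.
    """
    total_weight = 0.0
    earned = 0.0
    for query_type, is_correct in results:
        weight = multi_source_weight if query_type == "multi_source" else 1.0
        total_weight += weight
        if is_correct:
            earned += weight
    return earned / total_weight


def leaderboard_score(c1: float, c2: float, c3: float, c4: float) -> float:
    """Unweighted mean of the four capability scores."""
    return (c1 + c2 + c3 + c4) / 4
```

For instance, one correct multi-source query, one incorrect API-only query, and one correct RAG-only query give a Capability 4 score of (2 + 0 + 1) / 4 = 0.75 under this weighting.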
Early Results
According to IBM Research, models perform poorly on VAKRA overall. The benchmark reveals specific failure modes:
- API chaining: Agents struggle with long chains (8+ calls), often losing track of intermediate results
- Tool selection: Performance degrades significantly when tool lists exceed ~50 options
- Multi-hop reasoning: Error rates compound with each additional hop; 4-5 hop queries show steep accuracy drops
- Policy adherence: Models frequently violate tool-usage constraints, especially when policies conflict with apparent shortcuts
The live leaderboard is available at ibm-research-vakra.hf.space.
Industry Context
VAKRA joins a growing ecosystem of agent evaluation frameworks:
- Stanford HELM expanded to agent scenarios in early 2026, assessing task completion, efficiency, robustness, and safety
- Berkeley Function-Calling Leaderboard focuses on tool selection accuracy and multi-tool orchestration
- AgentBench evaluates agents in interactive environments including OS, database, and digital commerce tasks
- MLCommons Agent Working Group is developing standardized evaluation protocols for industry adoption
VAKRA differentiates through its executable environment and emphasis on enterprise-like workflows with policy constraints.
Availability
VAKRA is available as an open-source benchmark:
- Dataset: huggingface.co/datasets/ibm-research/VAKRA
- Code: github.com/IBM/vakra
- Leaderboard: ibm-research-vakra.hf.space
- Submission: Teams can submit results via the GitHub repository
What to Watch
- Leaderboard rankings: How frontier models from OpenAI, Anthropic, Google, and others perform on enterprise reasoning tasks
- Error analysis: IBM Research has published detailed failure mode categorization that may guide agent architecture improvements
- Adoption: Whether other evaluation frameworks incorporate VAKRA-style executable environments
- Policy enforcement: Development of better mechanisms for ensuring agents follow tool-usage constraints
Sources
- Hugging Face Blog — "Inside VAKRA: Reasoning, Tool Use, and Failure Modes of Agents" (April 15, 2026) https://huggingface.co/blog/ibm-research/vakra-benchmark-analysis
- IBM Research — "Introducing VAKRA Benchmark" https://www.ibm.com/new/announcements/introducing-vakra-benchmark
- VAKRA GitHub Repository https://github.com/IBM/vakra
- VAKRA Dataset https://huggingface.co/datasets/ibm-research/VAKRA
- Elder et al. — "Invocable APIs derived from NL2SQL datasets for LLM Tool-Calling Evaluation" (2025) https://arxiv.org/pdf/2506.11266