IBM Research Releases VAKRA Benchmark to Test Real-World Agent Reasoning

The Benchmark Gap

On April 15, 2026, IBM Research released VAKRA, a tool-grounded, executable benchmark designed to evaluate how well AI agents reason and act in enterprise-like environments. Unlike traditional benchmarks that test isolated skills, VAKRA measures compositional reasoning across APIs and documents using full execution traces.

The release addresses a structural problem in agent evaluation: most benchmarks test abstract reasoning or single-turn tool use, while production agents must chain multiple API calls, retrieve information from documents, and follow usage policies across multi-turn conversations.

VAKRA Architecture

VAKRA provides an executable environment where agents interact with over 8,000 locally hosted APIs backed by real databases spanning 62 domains, along with domain-aligned document collections. Tasks require 3-7 step reasoning chains that combine structured API interaction with unstructured retrieval under natural-language tool-use constraints.

The benchmark comprises four capability tiers:

Capability 1: API Chaining (2,077 instances)

Tests agents' ability to chain 1-12 tool calls across 54 domains using the SLOT-BIRD and SEL-BIRD tool collections. Agents must initialize data sources via a get_data call, then filter, sort, and aggregate results through sequential tool invocations.

Example task: "Which football team has a build-up play speed of 31, build-up plan dribbling of 53, and build-up play passing of 32?" requires four sequential filter operations followed by a team name extraction.
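
The chaining pattern in this example can be sketched in a few lines of Python. This is a minimal illustration, not VAKRA's actual tool interface: only the get_data name comes from the description above, while filter_equal, get_column, the column names, and the in-memory rows are hypothetical stand-ins chosen for readability.

```python
from typing import Any

Row = dict[str, Any]

# Hypothetical in-memory stand-in for one of the benchmark's databases.
TEAMS: list[Row] = [
    {"team": "Team A", "speed": 31, "dribbling": 53, "passing": 32},
    {"team": "Team B", "speed": 31, "dribbling": 40, "passing": 32},
    {"team": "Team C", "speed": 60, "dribbling": 53, "passing": 45},
]


def get_data() -> list[Row]:
    """Initialize the data source (name from the article; body is assumed)."""
    return list(TEAMS)


def filter_equal(rows: list[Row], column: str, value: Any) -> list[Row]:
    """Hypothetical filter tool: keep rows whose column equals value."""
    return [row for row in rows if row[column] == value]


def get_column(rows: list[Row], column: str) -> list[Any]:
    """Hypothetical extraction tool: project a single column."""
    return [row[column] for row in rows]


def answer_query() -> list[Any]:
    # Each tool call consumes the result of the previous one.
    rows = get_data()
    for column, value in [("speed", 31), ("dribbling", 53), ("passing", 32)]:
        rows = filter_equal(rows, column, value)
    return get_column(rows, "team")


print(answer_query())  # ['Team A']
```

The point of the sketch is that every step depends on the intermediate result of the step before it, so a single wrong argument early in the chain invalidates everything downstream.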

Capability 2: Tool Selection (1,597 instances)

Agents must select the correct APIs from domain-specific tool sets containing 6-328 tools (average 116). Because the largest sets exceed the 128-tool limit in the OpenAI API specification, agents need a shortlisting mechanism to manage large tool universes.
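
One simple way to stay under a 128-tool cap is to rank tools against the query and keep only the top candidates. The sketch below uses plain word overlap between the query and each tool description; this scoring rule, the shortlist_tools helper, and the sample tools are all assumptions made for illustration, and a real agent might well use embeddings or a learned retriever instead.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def shortlist_tools(query: str, tools: dict[str, str], limit: int = 128) -> list[str]:
    """Rank tool names by query/description word overlap and keep the top `limit`.

    `tools` maps a tool name to its description. Ties are broken by name
    so the result is deterministic.
    """
    query_tokens = tokenize(query)
    ranked = sorted(
        tools,
        key=lambda name: (-len(query_tokens & tokenize(tools[name])), name),
    )
    return ranked[:limit]


# Hypothetical tool universe, far smaller than the benchmark's.
tools = {
    "get_team_name": "Return the name of a football team",
    "get_player_height": "Return the height of a player",
    "get_weather": "Return the weather forecast for a city",
}
query = "Which football team has the fastest build-up play?"
print(shortlist_tools(query, tools, limit=2))
# ['get_team_name', 'get_player_height']
```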

Capability 3: Multi-Hop Reasoning (869 instances)

This tier adds multi-hop reasoning across 38 domains. Questions require 1-5 logical hops, with evidence extracted and combined from multiple API calls. Query types include comparison, aggregation, and cross-domain inference.

Capability 4: Multi-Source Reasoning with Policy Adherence (644 instances)

The most complex tier combines:

Multi-source retrieval: Information from APIs and document indexes must be combined, with decontamination ensuring each hop requires a specific source
Multi-turn conversations: Agents handle dialog context across multiple turns
Tool-usage policies: Plain-text constraints specifying which sources agents may access for different query types

Example policy: "If a user's query pertains to Technology & Software, make sure you try answering them by only using document retrievers. Do not use other types of tools."
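
A policy like this one can be checked programmatically once it is expressed in structured form. The following is a rough sketch under stated assumptions: the ToolPolicy class, the tool-type labels, and the idea that each query carries a topic label are hypothetical, and VAKRA's own checker may represent policies differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolPolicy:
    """Hypothetical structured form of a plain-text tool-usage policy."""
    topic: str
    allowed_tool_types: frozenset[str]


def violates_policy(policy: ToolPolicy, query_topic: str, used_tool_types: list[str]) -> bool:
    """Return True if the trajectory used a disallowed tool type for this topic."""
    if query_topic != policy.topic:
        return False  # the policy does not apply to this query
    return any(tool not in policy.allowed_tool_types for tool in used_tool_types)


policy = ToolPolicy("Technology & Software", frozenset({"document_retriever"}))

print(violates_policy(policy, "Technology & Software", ["document_retriever"]))         # False
print(violates_policy(policy, "Technology & Software", ["document_retriever", "api"]))  # True
print(violates_policy(policy, "Sports", ["api"]))                                       # False
```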

Evaluation Framework

VAKRA introduces execution-centric evaluation that assesses both final outputs and full tool-execution trajectories:

Stage	Purpose	Method
Policy adherence	Verify constraint compliance	Programmatic check (Capability 4 only)
Tool sequence comparison	Validate reasoning path	Execute predicted tools, compare response sets to ground truth
Final response evaluation	Check answer correctness	LLM-based judge for grounding and factual consistency

The evaluation uses a waterfall pipeline in which each stage runs only if the previous one succeeded. Notably, VAKRA accepts alternative valid tool invocations rather than enforcing strict step-level matching: an agent can reach the correct answer through a different sequence if all required information is recovered.
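
The waterfall logic can be sketched as a chain of early exits. This is an illustrative reading of the description, not VAKRA's evaluation code: the function names are hypothetical, the subset test is one plausible interpretation of "all required information is recovered", and the judge argument is a stand-in for the LLM-based judge.

```python
from typing import Callable, Hashable, Iterable


def covers_ground_truth(predicted: Iterable[Hashable], ground_truth: Iterable[Hashable]) -> bool:
    """True if every ground-truth tool response also appears among the predicted ones.

    Order and extra calls are ignored, so alternative tool sequences can pass.
    """
    return set(ground_truth) <= set(predicted)


def evaluate(
    policy_ok: bool,
    predicted_responses: Iterable[Hashable],
    ground_truth_responses: Iterable[Hashable],
    judge: Callable[[], bool],
) -> str:
    """Run the three stages in order and stop at the first failure."""
    if not policy_ok:
        return "failed: policy"
    if not covers_ground_truth(predicted_responses, ground_truth_responses):
        return "failed: tool sequence"
    if not judge():  # stand-in for the LLM-based final-response check
        return "failed: final response"
    return "passed"


# A different call order with one extra call still passes the sequence stage.
print(evaluate(True, ["c", "b", "a"], ["a", "b"], lambda: True))  # passed
print(evaluate(True, ["a"], ["a", "b"], lambda: True))            # failed: tool sequence
```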

Scoring

Leaderboard scores weight all four capabilities equally:

Leaderboard_Score = (Capability1 + Capability2 + Capability3 + Capability4) / 4

Capabilities 1-3 use simple accuracy (correct queries / total queries). Capability 4 weights multi-source queries twice as heavily as API-only or RAG-only queries, reflecting their increased complexity.
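
As a worked example, the scoring rules can be written out as follows. The leaderboard formula and the 2x weight come from the description above; normalizing the Capability 4 score by total weight, the query-type labels, and the sample numbers are assumptions made only to keep the example concrete.

```python
def accuracy(results: list[bool]) -> float:
    """Simple accuracy: correct queries / total queries (Capabilities 1-3)."""
    return sum(results) / len(results)


def weighted_accuracy(results: list[tuple[str, bool]]) -> float:
    """Capability 4 style score: multi-source queries count double.

    Dividing by the total weight is an assumed normalization.
    """
    weights = {"multi_source": 2.0}  # every other query type has weight 1.0
    total = sum(weights.get(kind, 1.0) for kind, _ in results)
    earned = sum(weights.get(kind, 1.0) for kind, correct in results if correct)
    return earned / total


def leaderboard_score(c1: float, c2: float, c3: float, c4: float) -> float:
    """Unweighted mean of the four capability scores."""
    return (c1 + c2 + c3 + c4) / 4


# Made-up results for illustration only.
c4 = weighted_accuracy([("api_only", True), ("rag_only", False), ("multi_source", True)])
print(c4)                                      # 0.75
print(leaderboard_score(0.5, 0.25, 0.5, c4))   # 0.5
```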

Early Results

According to IBM Research, models perform poorly on VAKRA overall. The benchmark reveals specific failure modes:

API chaining: Agents struggle with long chains (8+ calls), often losing track of intermediate results
Tool selection: Performance degrades significantly when tool lists exceed ~50 options
Multi-hop reasoning: Error rates compound with each additional hop; 4-5 hop queries show steep accuracy drops
Policy adherence: Models frequently violate tool-usage constraints, especially when policies conflict with apparent shortcuts
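
To see why errors compound over hops, consider a toy model in which each hop succeeds independently with the same probability. Both the independence assumption and the 0.9 per-hop figure are illustrative choices, not VAKRA measurements.

```python
def chain_accuracy(per_hop_accuracy: float, hops: int) -> float:
    """Probability that every hop succeeds, assuming independent hops."""
    return per_hop_accuracy ** hops


# Illustrative per-hop accuracy of 0.9 (not a benchmark result).
for hops in range(1, 6):
    print(hops, round(chain_accuracy(0.9, hops), 3))
```

Under this model a query with five hops succeeds about 59% of the time even though each individual hop succeeds 90% of the time.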

The live leaderboard is available at ibm-research-vakra.hf.space.

Industry Context

VAKRA joins a growing ecosystem of agent evaluation frameworks:

Stanford HELM expanded to agent scenarios in early 2026, assessing task completion, efficiency, robustness, and safety
Berkeley Function-Calling Leaderboard focuses on tool selection accuracy and multi-tool orchestration
AgentBench evaluates agents in interactive environments including OS, database, and digital commerce tasks
MLCommons Agent Working Group is developing standardized evaluation protocols for industry adoption

VAKRA differentiates through its executable environment and emphasis on enterprise-like workflows with policy constraints.

Availability

VAKRA is available as an open-source benchmark:

Dataset: huggingface.co/datasets/ibm-research/VAKRA
Code: github.com/IBM/vakra
Leaderboard: ibm-research-vakra.hf.space
Submission: Teams can submit results via the GitHub repository

What to Watch

Leaderboard rankings: How frontier models from OpenAI, Anthropic, Google, and others perform on enterprise reasoning tasks
Error analysis: IBM Research has published detailed failure mode categorization that may guide agent architecture improvements
Adoption: Whether other evaluation frameworks incorporate VAKRA-style executable environments
Policy enforcement: Development of better mechanisms for ensuring agents follow tool-usage constraints

Sources

Hugging Face Blog — "Inside VAKRA: Reasoning, Tool Use, and Failure Modes of Agents" (April 15, 2026) https://huggingface.co/blog/ibm-research/vakra-benchmark-analysis
IBM Research — "Introducing VAKRA Benchmark" https://www.ibm.com/new/announcements/introducing-vakra-benchmark
VAKRA GitHub Repository https://github.com/IBM/vakra
VAKRA Dataset https://huggingface.co/datasets/ibm-research/VAKRA
Elder et al. — "BIRD: Business Intelligence Reasoning Dataset" (2026) https://arxiv.org/pdf/2506.11266