Horizon — building agents that learn

Introducing Horizon

We're releasing Horizon, a benchmark that measures an agent's ability to learn from past experience. Each task requires understanding months of real interactions with customers across millions of tokens to succeed.

Bryan Houlton & Aayush Gupta | June 17, 2026

Horizon is derived from failures we saw running Claw-like agents with real customers over the last year. We believe the future will be full of persistent background agents that act by themselves, but the clear adoption bottleneck is that agents still cannot reliably learn over time.

Horizon makes no distinction between models and harnesses, aiming instead to measure the learning ability of the agent.

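| Harness | Best Model | Overall | Easy | Medium | Hard |
|---|---|---|---|---|---|
| OpenClaw (LCM) | claude-opus-4.8 | 113/195 (57.9%) | 56/65 (86.2%) | 46/65 (70.8%) | 11/65 (16.9%) |
| RLM | claude-opus-4.8 | 109/195 (55.9%) | 57/65 (87.7%) | 38/65 (58.5%) | 14/65 (21.5%) |
| Codex | gpt-5-codex | 90/195 (46.2%) | 51/65 (78.5%) | 30/65 (46.2%) | 9/65 (13.8%) |
| Claude Code | claude-opus-4.8 | 88/195 (45.1%) | 51/65 (78.5%) | 27/65 (41.5%) | 10/65 (15.4%) |
| RAG | gpt-5.5 | 77/195 (39.5%) | 53/65 (81.5%) | 24/65 (36.9%) | 0/65 (0.0%) |
| Hermes | gpt-5.5 | 71/195 (36.4%) | 47/65 (72.3%) | 20/65 (30.8%) | 4/65 (6.2%) |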

Preview run, subject to change.

Example Task

Every Horizon task has a long historical trace that the agent must learn from to complete the task correctly. Each trace is real and months-long, pulled from one of our products, Acadia Learning.

[Figure: A historical trace of millions of tokens over months, paired with a single new task. One slice of the trace is magnified to show a worksheet sent as a .docx failing to open on the student's Chromebook before a PDF link works; the task is a new request to send practice materials.]

For example: months ago in this trace, the agent sent a worksheet as a .docx attachment, the student could not open it on a school-issued Chromebook, and part of the session was lost before a PDF link worked. Nothing in that exchange is marked as a preference or a rule; it is one failed handoff inside months of routine activity. When a new request to send practice materials arrives, the task tests whether the agent sends a PDF link on the first try. Lessons can be anywhere in the trace, occur multiple times, or require multiple data points to extract the required pattern.

Horizon contains 195 tasks, but is private to prevent overfitting and keep user data secure*. We have included a few example eval cases in our public repo, including a public HuggingFace dataset of traces, to show how the benchmark is structured.

Each task runs in an environment with real tools (email and SMS inboxes, and more) and is graded on completion, cost, and speed, judged from the final environment state by an LLM plus deterministic checks.
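To make the grading description concrete, here is a minimal sketch of what one deterministic check on the final environment state might look like for the practice-materials example above. The `Message` schema and the function names are hypothetical; Horizon's actual rubrics are private and also involve an LLM judge, which is not modeled here.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    """One outbound message in the final environment state (hypothetical schema)."""
    body: str
    attachments: list[str] = field(default_factory=list)


def is_pdf_link(word: str) -> bool:
    """True if the token looks like an http(s) URL ending in .pdf."""
    cleaned = word.lower().rstrip(".,;:)")
    return cleaned.startswith(("http://", "https://")) and cleaned.endswith(".pdf")


def first_send_is_pdf_link(outbox: list[Message]) -> bool:
    """Pass only if the first outbound message links to a PDF and attaches no .docx."""
    if not outbox:
        return False  # nothing was sent, so the request was not fulfilled
    first = outbox[0]
    has_docx = any(name.lower().endswith(".docx") for name in first.attachments)
    has_pdf_link = any(is_pdf_link(word) for word in first.body.split())
    return has_pdf_link and not has_docx


good = [Message("Here is your practice set: https://example.com/fractions.pdf")]
bad = [Message("Worksheet attached.", attachments=["fractions.docx"])]
print(first_send_is_pdf_link(good), first_send_is_pdf_link(bad))  # True False
```

Checking only the first outbound message mirrors the "on the first try" requirement: an agent that sends a .docx and then recovers with a PDF link would still fail this check.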

Learning is unpredictable

We categorize tasks based on three dimensions: predictability, burial depth, and number of learnings required. Predictability is how easy it is to predict what the agent will need to learn, manually categorized. Burial depth is how far back from the present task the required fact sits, as a percentage of the trace. Number of learnings required is how many separate facts from the trace must be learned and combined to pass the task.
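As a rough illustration of the burial depth definition, the sketch below computes it from a position in the trace and groups tasks into equal-sized bins. Measuring positions in tokens, and the function names themselves, are assumptions; the post does not specify the unit or the implementation.

```python
def burial_depth_pct(fact_offset: int, trace_length: int) -> float:
    """Percent of the trace lying between the required fact and the present task.

    0% means the fact is the most recent item in the trace; 100% means it sits
    at the very start. Offsets are counted in tokens here (an assumption).
    """
    if trace_length <= 0:
        raise ValueError("trace_length must be positive")
    if not 0 <= fact_offset <= trace_length:
        raise ValueError("fact_offset must lie within the trace")
    return 100.0 * (trace_length - fact_offset) / trace_length


def equal_sized_bins(depths: list[float], n_bins: int) -> list[list[float]]:
    """Split depths into n_bins groups of (nearly) equal size, shallowest first."""
    if n_bins <= 0:
        raise ValueError("n_bins must be positive")
    ordered = sorted(depths)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        end = start + size + (1 if i < extra else 0)  # spread the remainder over the first bins
        bins.append(ordered[start:end])
        start = end
    return bins


# A fact 500k tokens into a 2M-token trace sits 75% of the way back from the task.
print(burial_depth_pct(500_000, 2_000_000))  # 75.0
```

Equal-sized bins put the same number of tasks in each group, so pass rates per bin are compared on equal sample sizes rather than on equal-width depth ranges.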

[Figure: Pass rate by harness across each axis level (easiest → hardest, left to right), pooled across all models on that harness. Harnesses shown: RLM, Claude Code, Codex, RAG, Hermes, OpenClaw (LCM). Panels: Predictability (how predictable the needed learning is); Burial depth (% back from the present task to where the required fact sits); Number of learnings required (how many learnings must be combined to pass). Burial depth bins are equal-sized groups of tasks by how far back from the present task the required fact sits, as a % of the trace (deeper = older memory).]

All harnesses show a similar pattern: pass rates fall most sharply as predictability decreases and as the number of learnings required increases. This makes sense, as Horizon's hardest tasks tend to be the least predictable and to require the most learnings. The data also suggests that agents learn better from their early and late experiences than from those in the middle of the trace, similar to in-context rot.

Future work will prioritize tasks that are less predictable and require more learnings, such as testing whether the agent can recognize implicit but unexpected patterns in realistic traces.

Learning scales slowly

Models are slowly getting better at Horizon, suggesting that scaling pretraining and reinforcement learning improves a model's ability to learn from long-horizon traces. However, the improvement rate of models is much slower on hard tasks, suggesting that intelligence alone may not be enough to learn effectively from long-horizon traces.

Scaling test-time compute is only weakly correlated with pass rate, and the correlation varies widely between harnesses. Harnesses like OpenClaw and Hermes primarily rely on accumulating learnings over time for fast access at test time, while harnesses like RLM and RAG spend more tokens searching the trace during the task. Our sample is small, but harnesses that accumulate do not seem to benefit from additional test-time scaling, while harnesses that search do improve.
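One way to quantify a per-harness relationship like this is a Pearson correlation between tokens per task and completion rate across the models run on a harness. The sketch below applies that to the five RLM rows of the All Results table. Treating tokens per task as the measure of test-time compute is an assumption, since the post does not state exactly how its r values were computed.

```python
from math import sqrt


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / sqrt(var_x * var_y)


# (tokens per task in thousands, completion %) for the RLM rows of the results table.
rlm_runs = [(376, 55.9), (473, 50.8), (1100, 49.7), (516, 38.5), (340, 24.6)]
tokens = [t for t, _ in rlm_runs]
completion = [c for _, c in rlm_runs]
print(round(pearson_r(tokens, completion), 2))  # 0.31
```

For the RLM rows this gives about 0.31, in line with the r reported for RLM. With only five points per harness, a single model can move the coefficient substantially, which is consistent with the caveat that the sample is small.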

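[Figure panel: "Models are only slowly improving on hard tasks", with separate series for Easy, Medium, and Hard tasks.]
[Figure panel: "Harnesses scale reasoning differently", with a correlation per harness: OpenClaw r = -0.76, RLM r = +0.31, RAG r = -0.03, Hermes r = -0.49.]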

Importantly, none of the tasks in Horizon are challenging to reason through. When we ran an agent with perfect context (the PerfectContext agent described under Integrity), it used only a few thousand tokens to complete each task successfully. This suggests that test-time scaling may not be necessary with the right harness.

Takeaways

The clearest takeaway from Horizon is that when learnings are not predictable, both search and accumulation strategies fail. None of Horizon's tasks are cognitively challenging, and while models are getting better at searching traces, we are also excited about representation learning and harness research as potential solutions.

Future versions of Horizon will focus on low-predictability pattern-matching tasks, as we believe this is the most important remaining capability for agents to operate autonomously in the messy real world. If you're interested in working on this with us, we're hiring.

Integrity

To ensure that each task is fair, we built four test agents.

  1. Oracle: a script to deterministically solve each task. We made sure that this reliably scored 100% with low variance, showing that our rubrics are consistent.
  2. Anti-Oracle: a script that does nothing. We made sure that this reliably scored 0% with low variance, showing that our rubrics are consistent.
  3. PerfectContext: for each task, we manually fed the agent the important lines from the trace. We made sure that this reliably scored 100% with low variance, showing that each task is easily solvable with the right context.
  4. EnvironmentOnly: an agent that has no way to access the trace and can only interact with the task's environment. Since each environment is stateful (email inboxes, SMS inboxes, etc.), ensuring a 0% score with low variance verifies that the solution cannot be derived from the environment.

All 195 tasks passed these tests with low variance, showing that they are solvable, the judges are fair, and the tasks do not leak information.
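The four checks above can be sketched as a single per-task validation. The data layout (a dict of repeated-run scores per control agent) and the `tolerance` parameter are hypothetical; the post does not describe how the checks were implemented or how "low variance" was thresholded.

```python
# Score each control agent is expected to reach on every task (from the list above).
EXPECTED_SCORES = {
    "Oracle": 1.0,           # deterministic solver: always passes
    "Anti-Oracle": 0.0,      # does nothing: never passes
    "PerfectContext": 1.0,   # given the key trace lines: always passes
    "EnvironmentOnly": 0.0,  # no trace access: never passes
}


def task_is_valid(scores: dict[str, list[float]], tolerance: float = 0.0) -> bool:
    """Return True if every control agent hits its expected score on every run.

    `scores` maps a control-agent name to its scores in [0, 1] over repeated
    runs of one task. Requiring every run to land within `tolerance` of the
    target is a strict stand-in for "low variance".
    """
    for agent, expected in EXPECTED_SCORES.items():
        runs = scores.get(agent, [])
        if not runs:
            return False  # a missing control run means the task is unverified
        if any(abs(score - expected) > tolerance for score in runs):
            return False
    return True


clean_task = {
    "Oracle": [1.0, 1.0, 1.0],
    "Anti-Oracle": [0.0, 0.0, 0.0],
    "PerfectContext": [1.0, 1.0, 1.0],
    "EnvironmentOnly": [0.0, 0.0, 0.0],
}
leaky_task = {**clean_task, "EnvironmentOnly": [0.0, 1.0, 0.0]}
print(task_is_valid(clean_task), task_is_valid(leaky_task))  # True False
```

In this sketch, a task where EnvironmentOnly passes even once is rejected, because that would mean the answer can sometimes be recovered from the environment without reading the trace.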

Testing a human baseline is not feasible (reading the traces alone is equivalent to reading 1,300 Harry Potter books), but each task has been reviewed by a human and deemed reasonable. Agent implementations are available in our public repo.

We did not get a chance to benchmark Anthropic's Fable 5 before it was removed.

Thank Yous

Thank you to Dr. Furong Huang, Mehul Arora, Sean McLeish, Hamidah Oderinwale and others for reviewing this post. We are also grateful to the teams at Daytona, Harbor, OpenAI, and Anthropic for their support.

All Results

[Figure: Score vs. Cost on Horizon (195 tasks), preview run, one point per model. Harnesses shown: OpenClaw (LCM), RLM, Codex, Claude Code, RAG, Hermes.]
| Agent | Model | Released | Completion | Cost / task | Time / task | Tokens / task |
|---|---|---|---|---|---|---|
| OpenClaw (LCM) | claude-opus-4.8 | May 28, 2026 | 57.9% | $1.429 | 2m 43s | 184k |
| OpenClaw (LCM) | gemini-3.5-flash | May 19, 2026 | 57.4% | $0.992 | 3m 51s | 447k |
| OpenClaw (LCM) | gpt-5.5 | Apr 24, 2026 | 56.9% | $0.798 | 3m 45s | 60k |
| RLM | claude-opus-4.8 | May 28, 2026 | 55.9% | $0.785 | 3m 32s | 376k |
| OpenClaw (LCM) | gemini-3.1-pro-preview | Feb 19, 2026 | 52.6% | $0.857 | 2m 52s | 331k |
| RLM | gpt-5 | Aug 7, 2025 | 50.8% | $0.400 | 6m 14s | 473k |
| RLM | claude-sonnet-4.6 | Feb 17, 2026 | 49.7% | $0.954 | 5m 53s | 1.1M |
| Codex | gpt-5-codex | Sep 15, 2025 | 46.2% | $0.342 | 5m 50s | 1.0M |
| Claude Code | claude-opus-4.8 | May 28, 2026 | 45.1% | $2.519 | 2m 03s | 1.0M |
| OpenClaw (LCM) | claude-sonnet-4.5 | Sep 29, 2025 | 42.1% | $1.564 | 2m 41s | 458k |
| OpenClaw (LCM) | deepseek-v4-pro | Apr 24, 2026 | 40.2% | $0.148 | 5m 17s | 328k |
| RAG | gpt-5.5 | Apr 24, 2026 | 39.5% | $0.672 | 3m 22s | 215k |
| RLM | claude-haiku-4.5 | Oct 15, 2025 | 38.5% | $0.157 | 3m 11s | 516k |
| RAG | claude-opus-4.8 | May 28, 2026 | 36.9% | $1.016 | 3m 04s | 191k |
| Hermes | gpt-5.5 | Apr 24, 2026 | 36.4% | $1.527 | 2m 25s | ~100k |
| Hermes | claude-opus-4.8 | May 28, 2026 | 35.9% | $1.095 | 2m 27s | ~207k |
| OpenClaw (LCM) | claude-haiku-4.5 | Oct 15, 2025 | 33.3% | $1.022 | 2m 35s | 932k |
| RAG | claude-sonnet-4.5 | Sep 29, 2025 | 33.3% | $0.529 | 2m 35s | 169k |
| RAG | claude-haiku-4.5 | Oct 15, 2025 | 31.8% | $0.189 | 2m 15s | 181k |
| Claude Code | claude-sonnet-4.5 | Sep 29, 2025 | 31.8% | $0.454 | 2m 10s | 907k |
| Hermes | claude-sonnet-4.5 | Sep 29, 2025 | 29.2% | $0.629 | 2m 10s | ~198k |
| RLM | gpt-5-mini | Aug 7, 2025 | 24.6% | $0.076 | 3m 23s | 340k |
| RAG | gpt-5-mini | Aug 7, 2025 | 19.5% | $0.023 | 2m 32s | 83k |
| RAG | gemini-3.5-flash | May 19, 2026 | 12.8% | $0.273 | 2m 27s | 269k |

Preview run, subject to change. All data was collected with proper user permissions. Download the raw results (JSON).