Researchers from Renmin University of China and Microsoft Research have introduced Arbor, a framework designed to help AI agents improve complex engineering systems through cumulative learning rather than repeated trial and error. The framework organizes hypotheses, experiments, and findings in a persistent tree. This allows the system to learn from earlier successes and failures while making verified improvements over time.

In practical testing, Arbor delivered more than 2.5 times the verifiable performance gains achieved by standard AI coding agents across real-world engineering tasks under the same resource budget.

For enterprise AI teams, the approach could automate the continuous improvement of complex systems such as internal AI assistants, data pipelines, agent frameworks, and model-training processes.

An AI agent deployed to search internal company documents may perform well during development but later hallucinate or overlook important restrictions in production.

Correcting the system can require repeated changes to document chunking, retrieval methods, and system prompts.

When one agent changes several components at once, teams cannot easily identify which adjustment improved performance or which one caused a new problem.

Arbor addresses this by separating each proposed change into an independent hypothesis that can be tested and measured in isolation.

The researchers describe this process as autonomous optimization. An AI agent begins with an editable artifact, such as a machine-learning codebase or data pipeline, and receives a defined objective. It then attempts to improve the artifact through repeated experiments and feedback without step-by-step human supervision.

However, giving an agent more time or computing resources does not automatically produce better results.

Jiajie Jin, a co-author of the paper, said automation can keep an AI working for a long time, but repeated activity does not necessarily equal progress.

If the objective is unclear or the metric can be manipulated, long-running agents may produce changes that appear successful without delivering improvements that users actually want.

Complex tasks also require many attempts, while standard agent designs lack a reliable structure for preserving evidence and insights from each experiment.

Without durable memory, agents can repeat earlier mistakes instead of using past results to guide future work.

Existing coding agents can edit software, use tools, and run tests for hours against a defined objective.

However, they usually treat each experiment separately and cannot maintain several competing research directions at the same time.

General coding agents often store their memory in conversation transcripts, but autonomous optimization tasks can span hundreds of interactions and exceed context-window limits.

As a result, agents may lose factual evidence, forget the broader research process, become stuck on early failures, or chase small changes in evaluation scores.

Existing systems can also overfit to development metrics or exploit weaknesses in an evaluation system, creating the appearance of progress without improving real-world performance.

General-purpose coding agents commonly use a single shared working tree as well. This prevents them from safely testing several hypotheses in parallel and makes it harder to determine which change caused a particular result.

Arbor separates research strategy from individual coding work through two main components: a coordinator and executors. The coordinator is a long-running AI agent that acts like a principal investigator.

It does not edit the target codebase directly. Instead, it monitors the overall state of the research, reviews accumulated evidence, proposes new hypotheses, and decides how to use experimental results.

Executors are short-lived and focused AI agents. When the coordinator wants to test an idea, it creates an executor inside an isolated environment using a fresh Git worktree.

Each executor receives one hypothesis, implements the proposed change, runs evaluations, fixes errors, and reports the results and produced artifacts to the coordinator.
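
The isolation step can be sketched with Git's own `git worktree add` command. The Python below is a minimal illustration, not Arbor's implementation: the function names, the `hypothesis/<id>` branch naming and the sibling-directory layout are assumptions made for this example.

```python
import subprocess
from pathlib import Path


def worktree_add_command(worktree_dir: Path, branch: str, base: str) -> list[str]:
    """Build the `git worktree add` call for one hypothesis."""
    # -b creates a new branch starting at `base` and checks it out in worktree_dir.
    return ["git", "worktree", "add", "-b", branch, str(worktree_dir), base]


def create_executor_worktree(repo: Path, hypothesis_id: str, base: str = "HEAD") -> Path:
    """Create an isolated checkout for a single hypothesis (hypothetical layout)."""
    # Assumption: worktrees live next to the repository, one directory per hypothesis.
    worktree_dir = repo.parent / f"{repo.name}-wt-{hypothesis_id}"
    branch = f"hypothesis/{hypothesis_id}"
    subprocess.run(worktree_add_command(worktree_dir, branch, base), cwd=repo, check=True)
    return worktree_dir
```

Because every hypothesis gets its own directory and branch, an executor can edit and run code there without touching the checkout that other executors, or the developers, are using.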

The coordinator and executors work through a mechanism called Hypothesis Tree Refinement.

The system represents the research process as a persistent, branching tree.

Each node links four elements: a hypothesis, an executable artifact, the factual evidence produced by the experiment, and a condensed insight.

Broad ideas appear near the root of the tree, while more specific refinements develop through branches and leaves.

This structure allows Arbor to explore several competing approaches without losing earlier evidence.

When an experiment fails, the system records the reason as a negative constraint. This helps prevent future agents from repeating the same mistake.
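
One way to picture such a tree is as a small recursive data structure. The sketch below is an assumption-laden illustration rather than Arbor's actual schema: the class name, field names and the `failed` flag are hypothetical, chosen only to mirror the four elements and the negative constraints described above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class HypothesisNode:
    hypothesis: str                                # the idea being tested
    artifact: Optional[str] = None                 # e.g. the Git branch holding the code
    evidence: dict = field(default_factory=dict)   # measured results from the experiment
    insight: str = ""                              # condensed takeaway
    failed: bool = False                           # hypothetical flag for failed experiments
    children: list["HypothesisNode"] = field(default_factory=list)

    def refine(self, hypothesis: str) -> "HypothesisNode":
        """Add a more specific hypothesis beneath this one."""
        child = HypothesisNode(hypothesis)
        self.children.append(child)
        return child


def negative_constraints(node: HypothesisNode) -> list[str]:
    """Collect insights from failed experiments anywhere in the subtree."""
    found = [node.insight] if node.failed and node.insight else []
    for child in node.children:
        found.extend(negative_constraints(child))
    return found
```

A coordinator built this way could pass the output of `negative_constraints(root)` to each new executor so that known dead ends are not tried again.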

The researchers used the example of optimizing a Retrieval-Augmented Generation pipeline for an internal AI assistant.

A general coding agent asked to improve accuracy may change the chunking method, system prompt, and retrieval process in one attempt.

These combined changes make it difficult to determine which adjustment produced the improvement. The agent may also directly modify the main repository without isolating its experiments.

Arbor treats every change as a separate hypothesis.

Chunking, retrieval, and prompt changes become different branches, with each one implemented and tested in its own Git worktree.

This allows teams to identify the exact impact of each change, including cases where one method improves performance and another makes it worse.

When an executor finishes an experiment, the coordinator records the evidence in the tree and passes the resulting insight back to parent nodes.

A finding from one experiment can therefore become a broader constraint that shapes future hypotheses.
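
The upward flow of insights can be sketched as a walk from a node to the root. This is a simplified illustration with hypothetical names. It copies the insight verbatim to every ancestor, which is an assumption: the article does not say whether Arbor rewrites or summarizes insights as they move up.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    hypothesis: str
    parent: Optional["Node"] = None
    insights: list[str] = field(default_factory=list)


def propagate_insight(node: Node, insight: str) -> None:
    """Record an insight on the node and on each of its ancestors."""
    current: Optional[Node] = node
    while current is not None:
        current.insights.append(insight)
        current = current.parent
```

Sibling branches do not receive the insight directly in this sketch. They would see it only through a shared ancestor when the coordinator plans the next hypothesis.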

Arbor also uses a strict merge gate to prevent reward hacking and development-data overfitting.

Even when an executor reports a strong development score, the coordinator creates another isolated worktree and tests the candidate against a held-out evaluator.

The proposed change is merged into the current best version only when it improves the held-out test score.
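
The gate's decision rule is simple enough to sketch in a few lines. The function below is a hypothetical illustration, not Arbor's code: `held_out_score` stands in for whatever held-out evaluator a team supplies, and the version identifiers are placeholders.

```python
from typing import Callable, Tuple


def merge_gate(
    candidate: str,
    current_best: str,
    held_out_score: Callable[[str], float],
) -> Tuple[str, bool]:
    """Return the version to keep and whether the candidate was merged."""
    # The executor's development score is deliberately not an input:
    # only the held-out evaluator decides whether the candidate is merged.
    if held_out_score(candidate) > held_out_score(current_best):
        return candidate, True
    return current_best, False
```

Keeping the development score out of the function signature makes it harder for a candidate that merely games the development metric to reach the trunk.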

Arbor fits within the wider concept of loop engineering, which has been promoted by figures including OpenClaw creator Peter Steinberger and Claude Code lead Boris Cherny.

The approach moves beyond single prompts and focuses on repeated cycles of observation, reasoning, action, and verification.

However, Jin warned that a loop without proper structure can fill up with untraceable attempts, leaving teams unable to determine what changed or what produced the result.

The researchers evaluated Arbor on an autonomous optimization task suite based on real-world research settings and the MLE-Bench Lite machine-learning engineering benchmark.

The task suite covered several areas of AI development, including model training, agent-harness engineering, and data synthesis.

The researchers used Claude Opus 4.6, GPT-5.5, and Gemini-3-Flash as backbone models for coordinator and executor agents.

They compared Arbor with Codex and Claude Code while giving all systems the same resources.

For MLE-Bench Lite, Arbor was also tested against agentic research systems, including AI-Scientist, ML-Master, and AIDE.

Arbor achieved the strongest held-out test result across all tasks.

Its average relative improvement was more than 2.5 times higher than the gains produced by Codex and Claude Code.

On BrowseComp, which involved improving a search agent, Arbor increased held-out accuracy from 45.33% to 67.67%.

Codex reached 50%, while Claude Code reached 53.33%.
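
To put the BrowseComp figures on a common scale, the short calculation below converts each held-out score into a relative gain. It assumes all three systems started from the same 45.33% baseline, which the reported numbers imply but do not state outright. The function name is illustrative.

```python
def relative_gain(baseline: float, final: float) -> float:
    """Improvement expressed as a fraction of the starting score."""
    return (final - baseline) / baseline


BASELINE = 45.33  # assumption: shared starting accuracy for all three systems
held_out = {"Arbor": 67.67, "Codex": 50.00, "Claude Code": 53.33}

for system, score in held_out.items():
    points = score - BASELINE
    print(f"{system}: +{points:.2f} points, {relative_gain(BASELINE, score):.1%} relative")
```

Under that assumption, the gains work out to roughly 49.3% for Arbor, 10.3% for Codex and 17.6% for Claude Code on this one task. The 2.5x figure is an average across all tasks, so it is not expected to match a single-task ratio.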

On MLE-Bench Lite, Arbor produced the strongest result among all tested systems when paired with GPT-5.5.

Arbor also showed greater resistance to overfitting.

During experiments involving Terminal-Bench 2.0, Claude Code achieved a development score of 75 but fell to 71 on held-out data.

Arbor recorded a lower development score of 72.22 but reached the highest held-out score of 77.36.

The result showed that Arbor’s improvements transferred more effectively to unseen data.
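
The difference between the two runs is easiest to see as a gap between held-out and development scores. The snippet below is a small worked calculation using only the Terminal-Bench 2.0 numbers quoted above. The function and variable names are illustrative.

```python
def generalization_gap(dev_score: float, held_out_score: float) -> float:
    """Positive means the held-out score beat the development score."""
    return held_out_score - dev_score


# (development score, held-out score) as reported in the article
runs = {"Claude Code": (75.0, 71.0), "Arbor": (72.22, 77.36)}
gaps = {name: generalization_gap(dev, held) for name, (dev, held) in runs.items()}

best_on_dev = max(runs, key=lambda name: runs[name][0])
best_on_held_out = max(runs, key=lambda name: runs[name][1])
```

Claude Code leads on the development score but loses 4 points on held-out data, while Arbor gains about 5.14 points. The ranking flips depending on which score is used, which is why a held-out check matters.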

The researchers also tested whether Arbor’s improvements could transfer to tasks it had not been optimized for.

After Arbor optimized a search harness for BrowseComp, they tested the resulting codebase on HLE and DeepSearchQA.

The optimized code significantly improved performance on both unseen search-agent tasks.

Arbor is designed to operate on top of existing Git workflows rather than replace them.

Its final output is a standard Git branch that developers can inspect through existing code review, continuous integration, and human-review processes.

Only verified improvements are merged into a separate trunk for each run.

The main repository remains unchanged until a developer chooses to promote the code manually.

Deploying Arbor comes with additional costs.

The largest expense is token usage because the long-running coordinator must continuously manage the hypothesis tree and assign work to executors.

Running several isolated worktrees at the same time also requires computing and storage resources for real experiments.

According to Jin, Arbor works best when a task has a clear and reliable metric, can tolerate a longer optimization period, and offers several reasonable directions to explore.

Suitable tasks include pipeline optimization, improving data-synthesis quality, and refining model-training recipes.

Teams should avoid using Arbor for tasks that need real-time responses, for obvious one-line fixes, or in situations where the evaluation metric is unreliable.

The quality of the result remains limited by the quality of the evaluator.

If the metric is unreliable, Arbor will simply optimize toward an unreliable result more quickly.

Jin said a future version could evaluate several objectives instead of relying on a single score.

Each artifact in the hypothesis tree could carry a set of measurements covering factors such as accuracy, latency, and cost.

This would allow Arbor to move from single-score optimization toward a multi-objective Pareto search.
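
A Pareto search keeps every candidate that no other candidate beats on all objectives at once. The sketch below shows the standard dominance test. It does not describe any planned Arbor feature, and the metric names, their directions and the function names are assumptions for illustration.

```python
from typing import Dict, List

Metrics = Dict[str, float]

# True means higher is better; False means lower is better (assumed metric set).
DIRECTIONS: Dict[str, bool] = {"accuracy": True, "latency_ms": False, "cost_usd": False}


def dominates(a: Metrics, b: Metrics) -> bool:
    """a dominates b if it is no worse on every metric and strictly better on one."""
    no_worse = all(
        (a[m] >= b[m]) if higher_is_better else (a[m] <= b[m])
        for m, higher_is_better in DIRECTIONS.items()
    )
    better = any(
        (a[m] > b[m]) if higher_is_better else (a[m] < b[m])
        for m, higher_is_better in DIRECTIONS.items()
    )
    return no_worse and better


def pareto_front(candidates: Dict[str, Metrics]) -> List[str]:
    """Names of candidates that no other candidate dominates."""
    return [
        name
        for name, metrics in candidates.items()
        if not any(
            dominates(other, metrics)
            for other_name, other in candidates.items()
            if other_name != name
        )
    ]
```

Instead of a single winner, the result is a set of trade-offs, for example one version that is more accurate and another that is cheaper and faster, leaving the final choice to the team.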
