Harness design for long-running application development

长运行应用程序开发的线束设计

https://www.anthropic.com/engineering/harness-design-long-running-apps

Harness design is key to performance at the frontier of agentic coding. Here's how we pushed Claude further in frontend design and long-running autonomous software engineering.

Written by Prithvi Rajasekaran, a member of our Labs team.

线束设计是智能体编码前沿性能的关键。以下是我们在前端设计和长期自主软件工程领域进一步推动Claude发展的方法。


Over the past several months I’ve been working on two interconnected problems: getting Claude to produce high-quality frontend designs, and getting it to build complete applications without human intervention. This work originated with earlier efforts on our frontend design skill and long-running coding agent harness, where my colleagues and I were able to improve Claude’s performance well above baseline through prompt engineering and harness design—but both eventually hit ceilings.

过去几个月,我一直在研究两个相互关联的问题:如何让Claude生成高质量的前端设计,以及如何让它无需人工干预就能构建完整的应用程序。这项工作源于我们早期在[前端设计技能](frontend design skill)和[长期运行的编码代理框架](long-running coding agent harness)上的努力,我和同事们通过提示工程和框架设计,成功将Claude的表现提升到远超基准水平——但两者最终都遇到了瓶颈。

To break through, I sought out novel AI engineering approaches that held across two quite different domains, one defined by subjective taste, the other by verifiable correctness and usability. Taking inspiration from Generative Adversarial Networks (GANs), I designed a multi-agent structure with a generator and evaluator agent. Building an evaluator that graded outputs reliably—and with taste—meant first developing a set of criteria that could turn subjective judgments like “is this design good?” into concrete, gradable terms.

为了突破,我寻找了在两种截然不同的领域中都能适用的新型AI工程方法:一种由主观品味定义,另一种由可验证的正确性和可用性定义。受生成对抗网络(GANs)的启发,我设计了一个包含生成器和评估器智能体的多智能体结构。要构建一个能可靠且富有品味地评估输出的评估器,首先需要开发一套标准,将“这个设计好吗?”之类的主观判断转化为具体的、可分级的标准。

I then applied these techniques to long-running autonomous coding, carrying over two lessons from our earlier harness work: decomposing the build into tractable chunks, and using structured artifacts to hand off context between sessions. The final result was a three-agent architecture—planner, generator, and evaluator—that produced rich full-stack applications over multi-hour autonomous coding sessions.

随后,我将这些技术应用于长期自主编码工作,并延续了早期工具开发中的两条经验:将构建过程分解为可处理的模块,以及利用结构化构件在会话间传递上下文。最终成果是一个由规划器、生成器和评估器组成的三智能体架构,该架构能在持续数小时的自主编码会话中产出功能完善的全栈应用程序。

Why naive implementations fall short

为什么原生的实现会失败

We've previously shown that harness design has a substantial impact on the effectiveness of long running agentic coding. In an earlier experiment, we used an initializer agent to decompose a product spec into a task list, and a coding agent that implemented the tasks one feature at a time before handing off artifacts to carry context across sessions. The broader developer community has converged on similar insights, with approaches like the "Ralph Wiggum" method using hooks or scripts to keep agents in continuous iteration cycles.

我们此前已经证明,线束设计对长期运行的自主编码效能具有重大影响。在早期实验中,我们采用初始化代理将产品规格分解为任务列表,并由编码代理逐个功能实现任务,最后通过移交工件实现跨会话的上下文传递。开发社区已普遍认同类似理念,例如采用"拉尔夫·维古姆"方法,通过钩子或脚本使代理保持持续迭代循环。

But some problems remained persistent. For more complex tasks, the agent still tends to go off the rails over time. While decomposing this issue, we observed two common failure modes with agents executing these sorts of tasks.

但一些顽固问题依然存在。面对更复杂的任务时,智能体仍会随时间推移逐渐偏离正轨。在拆解该问题时,我们观察到智能体执行此类任务存在两种常见故障模式。

First is that models tend to lose coherence on lengthy tasks as the context window fills (see our post on context engineering). Some models also exhibit "context anxiety," in which they begin wrapping up work prematurely as they approach what they believe is their context limit. Context resets—clearing the context window entirely and starting a fresh agent, combined with a structured handoff that carries the previous agent's state and the next steps—addresses both these issues.

首先是模型在长任务中随着上下文窗口填满容易失去连贯性(参见我们关于上下文工程的帖子)。某些模型还会表现出"上下文焦虑"——当接近其自认为的上下文容量极限时,它们会过早地结束工作。通过上下文重置(完全清空上下文窗口并启动新代理,同时采用结构化交接来传递前代理的状态和后续步骤)可以同时解决这两个问题。

||参考链接

||https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents

This differs from compaction, where earlier parts of the conversation are summarized in place so the same agent can keep going on a shortened history. While compaction preserves continuity, it doesn't give the agent a clean slate, which means context anxiety can still persist. A reset provides a clean slate, at the cost of the handoff artifact having enough state for the next agent to pick up the work cleanly. In our earlier testing, we found Claude Sonnet 4.5 exhibited context anxiety strongly enough that compaction alone wasn't sufficient to enable strong long task performance, so context resets became essential to the harness design. This solves the core issue, but adds orchestration complexity, token overhead, and latency to each harness run.

这与压缩不同,在压缩过程中,对话的早期部分会被就地总结,以便同一个代理可以继续处理缩短后的历史记录。虽然压缩保持了连续性,但它并没有给代理一个全新的开始,这意味着上下文焦虑仍然可能存在。而重置提供了一个全新的开始,代价是交接时需要携带足够的状态,以便下一个代理能够顺利接手工作。在我们早期的测试中,发现Claude Sonnet 4.5表现出强烈的上下文焦虑,仅靠压缩无法确保长期任务的出色表现,因此上下文重置成为测试框架设计的关键。这虽然解决了核心问题,但每次运行框架都会增加编排复杂度、令牌开销和延迟。

A second issue, which we haven’t previously addressed, is self-evaluation. When asked to evaluate work they've produced, agents tend to respond by confidently praising the work—even when, to a human observer, the quality is obviously mediocre. This problem is particularly pronounced for subjective tasks like design, where there is no binary check equivalent to a verifiable software test. Whether a layout feels polished or generic is a judgment call, and agents reliably skew positive when grading their own work.

我们之前未曾讨论的第二个问题是自我评估。当被要求评价自己完成的作品时,智能体往往会自信满满地给予赞美——即便在人类观察者眼中,其质量明显平庸。这种问题在主观性任务(例如设计)中尤为突出,因为这类任务不像可验证的软件测试那样存在二元判断标准。某个布局究竟是精致还是平庸属于主观判断,而智能体在自我评分时总会偏向积极评价。

However, even on tasks that do have verifiable outcomes, agents still sometimes exhibit poor judgment that impedes their performance while completing the task. Separating the agent doing the work from the agent judging it proves to be a strong lever to address this issue. The separation doesn't immediately eliminate that leniency on its own; the evaluator is still an LLM that is inclined to be generous towards LLM-generated outputs. But tuning a standalone evaluator to be skeptical turns out to be far more tractable than making a generator critical of its own work, and once that external feedback exists, the generator has something concrete to iterate against.

然而,即便在执行可验证结果的任务时,智能体仍时常表现出阻碍任务完成的错误判断。事实证明,将执行工作的智能体与评估工作的智能体分离,是解决该问题的有效手段。这种分离本身并不能立即消除评估的宽容倾向——评估者仍是倾向于对LLM生成结果持宽容态度的大语言模型。但调整独立评估器使其保持怀疑态度,远比让生成器自我批判更易实现。而一旦存在外部反馈机制,生成器便获得了可具体迭代改进的参照依据。

Frontend design: making subjective quality gradable

I started by experimenting on frontend design, where the self-evaluation issue was most visible. Absent any intervention, Claude normally gravitates toward safe, predictable layouts that are technically functional but visually unremarkable.

前端设计:让主观质量可分级
我首先从前端设计入手,因为这个问题在自我评估中最为明显。在没有任何干预的情况下,Claude通常倾向于采用安全、可预测的布局,这些布局在技术上功能完善,但在视觉上平淡无奇。

Two insights shaped the harness I built for frontend design. First, while aesthetics can’t be fully reduced to a score—and individual tastes will always vary—they can be improved with grading criteria that encode design principles and preferences. "Is this design beautiful?" is hard to answer consistently, but "does this follow our principles for good design?" gives Claude something concrete to grade against. Second, by separating frontend generation from frontend grading, we can create a feedback loop that drives the generator toward stronger outputs.

构建前端设计工具时,我获得了两点核心认知:其一,虽然审美无法完全量化评分(个人品味始终存在差异),但通过制定包含设计原则与偏好的评分标准,能有效提升美学表现。"这个设计美吗?"难以获得一致答案,但"是否符合优质设计准则?"为AI评审提供了具体依据。其二,将前端生成与评审环节分离,可形成驱动输出质量持续提升的反馈闭环。

With this in mind, I wrote four grading criteria that I gave to both the generator and evaluator agents in their prompts:

  • Design quality: Does the design feel like a coherent whole rather than a collection of parts? Strong work here means the colors, typography, layout, imagery, and other details combine to create a distinct mood and identity.
  • Originality: Is there evidence of custom decisions, or is this template layouts, library defaults, and AI-generated patterns? A human designer should recognize deliberate creative choices. Unmodified stock components—or telltale signs of AI generation like purple gradients over white cards—fail here.
  • Craft: Technical execution: typography hierarchy, spacing consistency, color harmony, contrast ratios. This is a competence check rather than a creativity check. Most reasonable implementations do fine here by default; failing means broken fundamentals.
  • Functionality: Usability independent of aesthetics. Can users understand what the interface does, find primary actions, and complete tasks without guessing?

鉴于此,我编写了四条评分标准,并将其写入生成器和评估器智能体的提示词中:

设计质量:作品是否呈现有机整体感而非零散部件拼凑?优秀表现为色彩、字体、布局、图像等细节共同营造出鲜明的氛围与个性。

原创性:是否存在定制化设计决策?还是仅套用模板布局、默认库元素或AI生成图案?合格作品应体现人类设计师的刻意创意选择。直接使用现成素材(或出现AI生成典型特征如白卡配紫色渐变)将在此项失分。

工艺水准:技术执行层面——字体层级、间距一致性、色彩协调性、对比度等。此项评估专业能力而非创意性,多数合理方案可达标,不合格意味着基础规范存在缺陷。

功能性:与美学无关的可用性评估。用户能否理解界面功能、快速定位主要操作、无需猜测即可完成任务?

I emphasized design quality and originality over craft and functionality. Claude already scored well on craft and functionality by default, as the required technical competence tended to come naturally to the model. But on design and originality, Claude often produced outputs that were bland at best. The criteria explicitly penalized highly generic “AI slop” patterns, and by weighting design and originality more heavily it pushed the model toward more aesthetic risk-taking.

我强调设计质量和原创性胜过工艺和功能性。克劳德在工艺和功能性方面已经默认表现良好,因为所需的技术能力往往自然而然地出现在模型中。但在设计和原创性方面,克劳德产生的输出往往最多只能算是平淡无奇。评判标准明确惩罚高度通用的"AI垃圾"模式,通过更重视设计和原创性,推动模型进行更具审美风险的尝试。

I calibrated the evaluator using few-shot examples with detailed score breakdowns. This ensured the evaluator’s judgment aligned with my preferences, and reduced score drift across iterations.

我通过使用带有详细分数分解的小样本示例来校准评估器。这确保了评估者的判断与我的偏好一致,并减少了迭代过程中的分数漂移。

I built the loop on the Claude Agent SDK, which kept the orchestration straightforward. A generator agent first created an HTML/CSS/JS frontend based on a user prompt. I gave the evaluator the Playwright MCP, which let it interact with the live page directly before scoring each criterion and writing a detailed critique. In practice, the evaluator would navigate the page on its own, screenshotting and carefully studying the implementation before producing its assessment. That feedback flowed back to the generator as input for the next iteration. I ran 5 to 15 iterations per generation, with each iteration typically pushing the generator in a more distinctive direction as it responded to the evaluator's critique. Because the evaluator was actively navigating the page rather than scoring a static screenshot, each cycle took real wall-clock time. Full runs stretched up to four hours. I also instructed the generator to make a strategic decision after each evaluation: refine the current direction if scores were trending well, or pivot to an entirely different aesthetic if the approach wasn't working.

我在Claude Agent SDK上构建了这个循环系统,使得整个编排流程保持简洁高效。首先由生成器智能体根据用户提示创建HTML/CSS/JS前端页面。我为评估器配备了Playwright MCP模块,使其能够在实时页面上直接交互,逐项评分并撰写详细评审意见。实际运作中,评估器会自主浏览页面,截取屏幕截图并仔细研究实现细节后才形成评估报告。这些反馈意见会作为下一轮迭代的输入参数回流给生成器。每代产品会进行5到15次迭代,每次迭代通常都会推动生成器朝着更具特色的方向发展以响应评估意见。由于评估器是在动态操作页面而非静态截图评分,每个周期都需要耗费实际时间。完整流程可能长达四小时。我还特别设定生成器在每次评估后做出战略决策:当评分呈现向好趋势时继续优化当前方向,若当前方案效果不佳则彻底转向全新的美学风格。

Across runs, the evaluator's assessments improved over iterations before plateauing, with headroom still remaining. Some generations refined incrementally. Others took sharp aesthetic turns between iterations.

The wording of the criteria steered the generator in ways I didn't fully anticipate. Including phrases like "the best designs are museum quality" pushed designs toward a particular visual convergence, suggesting that the prompting associated with the criteria directly shaped the character of the output.

在多次运行中,评估者的判断随着迭代逐步提升后趋于稳定,但仍有改进空间。部分生成结果呈现渐进式优化,而另一些则在迭代间发生了显著的美学转向。

评价标准的措辞以我未能完全预料的方式引导着生成器。诸如"最佳设计应达到博物馆藏品级别"这类表述将设计推向特定的视觉趋同,这表明与标准相关的提示词直接塑造了输出成果的特质。

While scores generally improved over iterations, the pattern was not always cleanly linear. Later implementations tended to be better as a whole, but I regularly saw cases where I preferred a middle iteration over the last one. Implementation complexity also tended to increase across rounds, with the generator reaching for more ambitious solutions in response to the evaluator’s feedback. Even on the first iteration, outputs were noticeably better than a baseline with no prompting at all, suggesting the criteria and associated language themselves steered the model away from generic defaults before any evaluator feedback led to further refinement.

In one notable example, I prompted the model to create a website for a Dutch art museum. By the ninth iteration, it had produced a clean, dark-themed landing page for a fictional museum. The page was visually polished but largely in line with my expectations. Then, on the tenth cycle, it scrapped the approach entirely and reimagined the site as a spatial experience: a 3D room with a checkered floor rendered in CSS perspective, artwork hung on the walls in free-form positions, and doorway-based navigation between gallery rooms instead of scroll or click. It was the kind of creative leap that I hadn't seen before from a single-pass generation.

随着迭代次数的增加,评分总体呈上升趋势,但改进过程并非总是呈现清晰的线性发展。虽然后期版本整体表现更优,但实践中我经常发现某些中间版本反而优于最终迭代。同时方案复杂度往往逐轮递增——生成器会根据评估者的反馈追求更宏大的解决方案。值得注意的是,首次迭代的输出质量已显著优于零提示基线,这说明评估标准及其配套表述本身就能引导模型突破通用模板的局限,无需等待评估反馈即可实现初步优化。

有个典型案例尤其令人印象深刻:当我要求模型为荷兰艺术博物馆设计网站时,经过九轮迭代后生成了一版符合预期的暗色调着陆页,视觉效果精致但中规中矩。然而在第十轮迭代中,它彻底颠覆原有思路,将网站重构为三维空间体验:运用CSS透视渲染出棋盘格地板,艺术品以自由布局悬置于虚拟墙面上,展厅之间采用门廊式导航替代传统滚动点击。这种突破常规的创意跃迁,在单次生成任务中实属罕见。

(原文,有一个视频;请点开链接查看。)

Scaling to full-stack coding

With these findings in hand, I applied this GAN-inspired pattern to full-stack development. The generator-evaluator loop maps naturally onto the software development lifecycle, where code review and QA serve the same structural role as the design evaluator.

扩展至全栈编码

基于这些发现,我将这种受GAN启发的模式应用于全栈开发。生成器-评估器循环自然地映射到软件开发生命周期中,其中代码审查和质检扮演着与设计评估器相同的结构性角色。

The architecture

In our earlier long-running harness, we had solved for coherent multi-session coding with an initializer agent, a coding agent that worked one feature at a time, and context resets between sessions. Context resets were a key unlock: the harness used Sonnet 4.5, which exhibited the “context anxiety” tendency mentioned earlier. Creating a harness that worked well across context resets was key to keeping the model on task. Opus 4.5 largely removed that behavior on its own, so I was able to drop context resets from this harness entirely. The agents were run as one continuous session across the whole build, with the Claude Agent SDK's automatic compaction handling context growth along the way.

架构
在我们早期长期运行的测试框架中,我们通过初始化代理、逐功能处理的编码代理以及会话间的上下文重置,解决了连贯多会话编码问题。上下文重置曾是关键突破口:该框架使用Sonnet 4.5模型时存在前文所述的"上下文焦虑"现象。构建能跨重置保持良好表现的框架,是维持模型任务专注力的关键。而Opus 4.5模型已自主消除了该行为特征,因此新框架完全移除了上下文重置机制。所有代理在整个构建过程中以连续会话模式运行,期间依赖Claude Agent SDK的自动压缩功能管理上下文增长。

For this work I built on the foundation from the original harness with a three-agent system, with each agent addressing a specific gap I'd observed in prior runs. The system contained the following agent personas:

在这项工作中,我在原有框架的基础上构建了一个三智能体系统,每个智能体都针对我在先前运行中观察到的特定不足进行改进。该系统包含以下智能体角色:

Planner: Our previous long-running harness required the user to provide a detailed spec upfront. I wanted to automate that step, so I created a planner agent that took a simple 1-4 sentence prompt and expanded it into a full product spec. I prompted it to be ambitious about scope and to stay focused on product context and high level technical design rather than detailed technical implementation. This emphasis was due to the concern that if the planner tried to specify granular technical details upfront and got something wrong, the errors in the spec would cascade into the downstream implementation. It seemed smarter to constrain the agents on the deliverables to be produced and let them figure out the path as they worked. I also asked the planner to find opportunities to weave AI features into the product specs. (See example in the Appendix at the bottom.)

规划师:我们之前长期使用的框架要求用户预先提供详细规范。我希望自动化这一步骤,因此创建了一个规划代理,它能将简单的1-4句话提示扩展为完整的产品规范。我要求该代理在范围设定上保持雄心,并专注于产品上下文和高级技术设计,而非具体技术实现。这样强调是因为担心如果规划师试图预先规定技术细节并出现错误,规范中的错误会级联影响到下游实施。更明智的做法似乎是限定代理需要交付的成果,让它们在工作过程中自行探索实现路径。我还要求规划师在产品规范中寻找融入AI功能的机会。(具体示例参见文末附录。)

Generator: The one-feature-at-a-time approach from the earlier harness worked well for scope management. I applied a similar model here, instructing the generator to work in sprints, picking up one feature at a time from the spec. Each sprint implemented the app with a React, Vite, FastAPI, and SQLite (later PostgreSQL) stack, and the generator was instructed to self-evaluate its work at the end of each sprint before handing off to QA. It also had git for version control.

生成器:早期测试框架中采用的"一次实现一个功能"的方法在范围管理上效果良好。我在此应用了类似模式,指示生成器以迭代方式工作,每次从规范中选取一个功能进行实现。每个迭代周期使用React、Vite、FastAPI和SQLite(后期改用PostgreSQL)技术栈来构建应用,并要求生成器在每个迭代结束时先进行自我评估,再交付给质量保证团队。该系统还配备了Git进行版本控制。

Evaluator: Applications from earlier harnesses often looked impressive but still had real bugs when you actually tried to use them. To catch these, the evaluator used the Playwright MCP to click through the running application the way a user would, testing UI features, API endpoints, and database states. It then graded each sprint against both the bugs it had found and a set of criteria modeled on the frontend experiment, adapted here to cover product depth, functionality, visual design, and code quality. Each criterion had a hard threshold, and if any one fell below it, the sprint failed and the generator got detailed feedback on what went wrong.

评估员:早期的应用程序看起来往往令人印象深刻,但在实际使用时仍然存在真正的缺陷。为了发现这些问题,评估员使用Playwright MCP工具模拟用户操作点击运行中的应用程序,测试用户界面功能、API端点和数据库状态。随后,它根据发现的错误以及一套基于前端实验建模的标准对每个冲刺阶段进行评分,这些标准在此调整为涵盖产品深度、功能性、视觉设计和代码质量。每个标准都有严格的阈值,若任何一项未达标,则该冲刺阶段判定失败,生成器会收到关于问题细节的反馈。

Before each sprint, the generator and evaluator negotiated a sprint contract: agreeing on what "done" looked like for that chunk of work before any code was written. This existed because the product spec was intentionally high-level, and I wanted a step to bridge the gap between user stories and testable implementation. The generator proposed what it would build and how success would be verified, and the evaluator reviewed that proposal to make sure the generator was building the right thing. The two iterated until they agreed.

在每个冲刺开始前,生成器和评估器会协商一份冲刺合约:在编写任何代码之前,就这部分工作"完成"的标准达成一致。之所以有这个环节,是因为产品规范有意保持高层次,而我需要一个步骤来弥合用户故事与可测试实现之间的差距。生成器提出它将构建什么内容以及如何验证成功,评估器则审查该提案以确保生成器正在构建正确的东西。双方反复讨论直至达成一致。

Communication was handled via files: one agent would write a file, another agent would read it and respond either within that file or with a new file that the previous agent would read in turn. The generator then built against the agreed-upon contract before handing the work off to QA. This kept the work faithful to the spec without over-specifying implementation too early.

通信通过文件进行:一个代理会写入文件,另一个代理读取该文件并在该文件内或通过新文件进行响应,而前一个代理又会依次读取新文件。生成器随后根据双方商定的契约进行构建,再将工作移交给质量保证部门。这种方式既确保工作严格遵循规范,又避免过早过度指定实现细节。

Running the harness

For the first version of this harness, I used Claude Opus 4.5, running user prompts against both the full harness and a single-agent system for comparison. I used Opus 4.5 since this was our best coding model when I began these experiments.

运行测试框架
对于这个测试框架的第一个版本,我使用了Claude Opus 4.5,同时针对完整框架和单代理系统运行用户提示以进行比较。我选择Opus 4.5是因为在我开始这些实验时,它是我们最好的编码模型。

I wrote the following prompt to generate a retro video game maker:

Create a 2D retro game maker with features including a level editor, sprite editor, entity behaviors, and a playable test mode.

The table below shows the harness type, length it ran for, and the total cost.

The harness was over 20x more expensive, but the difference in output quality was immediately apparent.

马具的价格贵了20多倍,但输出质量的差异立刻显而易见。

I was expecting an interface where I could construct a level and its component parts (sprites, entities, tile layout) then hit play to actually play the level. I started by opening the solo run’s output, and the initial application seemed in line with those expectations.

As I clicked through, however, issues started to emerge. The layout wasted space, with fixed-height panels leaving most of the viewport empty. The workflow was rigid. Trying to populate a level prompted me to create sprites and entities first, but nothing in the UI guided me toward that sequence. More to the point, the actual game was broken. My entities appeared on screen but nothing responded to input. Digging into the code revealed that the wiring between entity definitions and the game runtime was broken, with no surface indication of where.

我期待的是一个可以构建关卡及其组件(精灵、实体、砖块布局)的界面,然后点击播放就能实际游玩该关卡。刚开始打开单人模式的输出时,初始应用程序似乎符合这些预期。

但随着我继续点击操作,问题开始浮现。界面布局浪费了大量空间,固定高度的面板导致视窗大部分区域留白。工作流程非常死板。尝试填充关卡时,系统提示我需要先创建精灵和实体,但用户界面没有任何指引说明这一操作顺序。更重要的是,实际游戏功能是损坏的。我的实体虽然显示在屏幕上,但没有任何响应输入。深入代码后发现,实体定义与游戏运行时的关联机制已损坏,且界面上没有任何地方显示问题所在。

After evaluating the solo run, I turned my attention to the harness run. This run started from the same one-sentence prompt, but the planner step expanded that prompt into a 16-feature spec spread across ten sprints. It went well beyond what the solo run attempted. In addition to the core editors and play mode, the spec called for a sprite animation system, behavior templates, sound effects and music, an AI-assisted sprite generator and level designer, and game export with shareable links. I gave the planner access to our frontend design skill, which it read and used to create a visual design language for the app as part of the spec. For each sprint, the generator and evaluator negotiated a contract defining the specific implementation details for the sprint, and the testable behaviors that would be tested to verify completion.

评估了单人运行后,我将注意力转向了协作运行。这次运行从相同的单句提示开始,但规划步骤将该提示扩展为横跨十个冲刺阶段的16项功能规格,其范围远超单人运行的尝试。除了核心编辑器和游戏模式外,该规格还要求包含精灵动画系统、行为模板、音效与背景音乐、AI辅助精灵生成器和关卡设计器,以及可分享链接的游戏导出功能。我让规划器获取了我们的前端设计技能文档,它据此为应用程序创建了视觉设计语言作为规格的一部分。每个冲刺阶段中,生成器和评估器通过协商确定具体实现细节的合约条款,以及用于验证完成情况的可测试行为指标。

The app immediately showed more polish and smoothness than the solo run. The canvas used the full viewport, the panels were sized sensibly, and the interface had a consistent visual identity that tracked the design direction from the spec. Some of the clunkiness I'd seen in the solo run did remain—the workflow still didn't make it clear that you should build sprites and entities before trying to populate a level, and I had to figure that out by poking around. This read as a gap in the base model’s product intuition rather than something the harness was designed to address, though it did suggest a place where targeted iteration inside the harness could help to further improve output quality.

该应用立刻展现出比单独运行时更高的完善度和流畅性。画布充分利用了视口空间,面板尺寸设计合理,界面保持了与规范设计方向一致的视觉识别性。不过单独运行时出现的部分笨拙操作依然存在——工作流程仍未明确提示应在构建关卡前先创建精灵和实体,我不得不自行摸索发现这点。这反映出基础模型在产品直觉层面的不足(而非工具链设计缺陷),但也为针对性迭代提供了改进输出质量的潜在方向。

Working through the editors, the new run's advantages over solo became more apparent. The sprite editor was richer and more fully featured, with cleaner tool palettes, a better color picker, and more usable zoom controls.

Because I'd asked the planner to weave AI features into its specs, the app also came with a built-in Claude integration that let me generate different parts of the game through prompting. This significantly sped up the workflow.

The biggest difference was in play mode. I was actually able to move my entity and play the game. The physics had some rough edges—my character jumped onto a platform but ended up overlapping with it, which felt intuitively wrong—but the core thing worked, which the solo run did not manage. After moving around a bit, I did hit some limitations with the AI’s game level construction. There was a large wall that I wasn’t able to jump past, so I was stuck. This suggested there were some common sense improvements and edge cases that the harness could handle to further refine the app.

Reading through the logs, it was clear that the evaluator kept the implementation in line with the spec. Each sprint, it walked through the sprint contract's test criteria and exercised the running application through Playwright, filing bugs against anything that diverged from expected behavior. The contracts were granular—Sprint 3 alone had 27 criteria covering the level editor—and the evaluator's findings were specific enough to act on without extra investigation. The table below shows several examples of issues our evaluator identified:

最大的区别在于游戏模式。我实际上能够移动角色并玩游戏。物理引擎还有些粗糙——我的角色跳上平台时竟与平台重叠了,这显然不对劲——但核心功能是可用的,而单人测试版本连这点都做不到。移动一阵后,我确实遇到了AI构建游戏关卡的局限性。有堵高墙我怎么都跳不过去,结果卡关了。这说明程序还需要处理些常识性改进和边界情况来完善应用。

查看日志时能明显看出评估器严格遵循了规范。每个冲刺阶段,它都会遍历冲刺合约的测试标准,通过Playwright运行应用程序,并对所有不符合预期行为的情况提交缺陷报告。这些合约非常细致——仅冲刺3就包含27条针对关卡编辑器的标准——评估器的发现足够具体,无需额外调查就能直接处理。下表展示了评估器识别出的若干问题示例:

Getting the evaluator to perform at this level took work. Out of the box, Claude is a poor QA agent. In early runs, I watched it identify legitimate issues, then talk itself into deciding they weren't a big deal and approve the work anyway. It also tended to test superficially, rather than probing edge cases, so more subtle bugs often slipped through. The tuning loop was to read the evaluator's logs, find examples where its judgment diverged from mine, and update the QAs prompt to solve for those issues. It took several rounds of this development loop before the evaluator was grading in a way that I found reasonable. Even then, the harness output showed the limits of the model’s QAing capabilities: small layout issues, interactions that felt unintuitive in places, and undiscovered bugs in more deeply nested features that the evaluator hadn't exercised thoroughly. There was clearly more verification headroom to capture with further tuning. But compared to the solo run, where the central feature of the application simply didn't work, the lift was obvious.

让评估器达到这个水平需要下功夫。默认状态下,Claude是个糟糕的质检员。在早期测试中,我发现它能识别出合理问题,但总会自我说服这些问题无关紧要,最终批准通过。其测试也流于表面,不愿深挖边界情况,导致许多隐蔽漏洞成为漏网之鱼。我的调优方式是:研读评估日志,找出与本人判断存在分歧的案例,然后更新质检提示词来解决这些问题。经过数轮这样的开发循环后,评估器才终于能给出相对合理的评分。即便如此,测试框架的输出仍暴露出模型的质检局限:细微的布局问题、某些反直觉的交互设计,以及评估器未彻底检测的深层嵌套功能中的潜在漏洞。显然通过进一步调优还能提升验证效果。但与最初核心功能完全失效的原始版本相比,改进效果已十分显著。


Iterating on the harness

The first set of harness results was encouraging, but it was also bulky, slow, and expensive. The logical next step was to find ways to simplify the harness without degrading its performance. This was partly common sense and partly a function of a more general principle: every component in a harness encodes an assumption about what the model can't do on its own, and those assumptions are worth stress testing, both because they may be incorrect, and because they can quickly go stale as models improve. Our blog post Building Effective Agents frames the underlying idea as "find the simplest solution possible, and only increase complexity when needed," and it's a pattern that shows up consistently for anyone maintaining an agent harness.

改进约束装置

第一代约束装置的测试结果令人鼓舞,但其体积笨重、运行迟缓且成本高昂。合乎逻辑的下一步是寻找简化装置的方法,同时不降低其性能。这既源于常识,也遵循着更普遍的法则:约束装置中的每个组件都隐含着对模型自身局限性的预设,这些预设值得进行压力测试——既因为它们可能存在谬误,也因为随着模型进化它们会迅速过时。我们在博文《构建高效智能体》中将核心理念概括为"寻求最简解决方案,仅在必要时增加复杂性",这一原则在智能体约束系统的维护实践中反复得到印证。

In my first attempt to simplify, I cut the harness back radically and tried a few creative new ideas, but I wasn't able to replicate the performance of the original. It also became difficult to tell which pieces of the harness design were actually load-bearing, and in what ways. Based on that experience, I moved to a more methodical approach, removing one component at a time and reviewing what impact it had on the final result.

在我第一次尝试简化时,我大幅削减了线束结构,并尝试了一些创新的新想法,但无法复现原版的性能表现。同时,我也难以辨别线束设计中哪些部分真正起承重作用,以及具体以何种方式承重。基于这次经验,我转向了更为系统的方法,每次只移除一个组件,并观察其对最终结果的影响。

As I was going through these iteration cycles, we also released Opus 4.6, which provided further motivation to reduce harness complexity. There was good reason to expect 4.6 would need less scaffolding than 4.5 did. From our launch blog: "[Opus 4.6] plans more carefully, sustains agentic tasks for longer, can operate more reliably in larger codebases, and has better code review and debugging skills to catch its own mistakes." It also improved substantially on long-context retrieval. These were all capabilities the harness had been built to supplement.

在进行这些迭代周期时,我们还发布了Opus 4.6,这进一步促使我们减少工具链的复杂性。有充分理由预期4.6版本会比4.5版本需要更少的支撑结构。正如我们在发布博客中所说:"[Opus 4.6]能更谨慎地规划,能更持久地维持自主任务,能在更大的代码库中更可靠地运行,并具备更好的代码审查和调试能力来发现自己的错误。"该版本在长上下文检索方面也有显著提升。这些正是工具链原本设计用来补充的能力。

Removing the sprint construct

I started by removing the sprint construct entirely. The sprint structure had helped to decompose work into chunks for the model to work coherently. Given the improvements in Opus 4.6, there was good reason to believe that the model could natively handle the job without this sort of decomposition.

移除冲刺结构

我首先完全移除了冲刺结构。冲刺结构曾有助于将工作分解成小块,使模型能够连贯运作。鉴于Opus 4.6的改进,我们有充分理由相信模型无需这种分解就能原生处理任务。

I kept both the planner and evaluator, as each continued to add obvious value. Without the planner, the generator under-scoped: given the raw prompt, it would start building without first speccing its work, and end up creating a less feature-rich application than the planner did.

我保留了规划器和评估器,因为这两者都在持续提供显著价值。没有规划器时,生成器会出现规划不足的情况:面对原始提示时,它会直接开始构建而不先制定工作方案,最终开发出的应用程序功能丰富度不如经过规划器处理的版本。

With the sprint construct removed, I moved the evaluator to a single pass at the end of the run rather than grading per sprint. Since the model was much more capable, it changed how load-bearing the evaluator was for certain runs, with its usefulness depending on where the task sat relative to what the model could do reliably on its own. On 4.5, that boundary was close: our builds were at the edge of what the generator could do well solo, and the evaluator caught meaningful issues across the build. On 4.6, the model's raw capability increased, so the boundary moved outward. Tasks that used to need the evaluator's check to be implemented coherently were now often within what the generator handled well on its own, and for tasks within that boundary, the evaluator became unnecessary overhead. But for the parts of the build that were still at the edge of the generator’s capabilities, the evaluator continued to give real lift.

随着冲刺结构的取消,我将评估环节调整为仅在运行结束时进行单次评估,而非每个冲刺阶段都评分。由于模型能力大幅提升,评估环节在不同运行中的重要性也随之改变——其价值取决于具体任务与模型自主可靠完成能力之间的相对位置。在4.5版本中,这个边界非常接近:我们的构建任务刚好处于生成器能独立妥善完成的临界点,此时评估器能有效捕捉构建过程中的关键问题。而到了4.6版本,模型原生能力显著增强,边界随之外移。许多过去需要评估器核查才能保持连贯性的任务,现在已落入生成器独立处理的舒适区,对这些边界内的任务而言,评估器反而成了冗余开销。但对于那些仍处于生成器能力边缘的构建环节,评估器依然能提供实质性的质量提升。

The practical implication is that the evaluator is not a fixed yes-or-no decision. It is worth the cost when the task sits beyond what the current model does reliably solo.

Alongside the structural simplification, I also added prompting to improve how the harness built AI features into each app, specifically getting the generator to build a proper agent that could drive the app's own functionality through tools. That took real iteration, since the relevant knowledge is recent enough that Claude's training data covers it thinly. But with enough tuning, the generator was building agents correctly.

实际意义在于,评估者并非非此即彼的二元判断。当任务超出当前模型独立可靠完成的范围时,这种评估成本就是值得的。

在结构简化的同时,我还通过提示词优化了测试框架将AI功能集成到每个应用的方式,特别是让生成器构建出能通过工具驱动应用自身功能的智能体。这需要反复调试,因为相关技术知识较新,Claude的训练数据覆盖有限。但经过充分调优后,生成器已能正确构建智能体。

Results from the updated harness

To put the updated harness to the test, I used the following prompt to generate a Digital Audio Workstation (DAW), a music production program for composing, recording, and mixing songs:

Build a fully featured DAW in the browser using the Web Audio API.

The run was still lengthy and expensive, at about 4 hours and $124 in token costs.

Most of the time went to the builder, which ran coherently for over two hours without the sprint decomposition that Opus 4.5 had needed.

As with the previous harness, the planner expanded the one-line prompt into a full spec. From the logs, I could see the generator model did a good job planning the app and the agent design, wiring the agent up, and testing it before handing off to QA.

That being said, the QA agent still caught real gaps. In its first-round feedback, it noted:

This is a strong app with excellent design fidelity, solid AI agent, and good backend. The main failure point is Feature Completeness — while the app looks impressive and the AI integration works well, several core DAW features are display-only without interactive depth: clips can't be dragged/moved on the timeline, there are no instrument UI panels (synth knobs, drum pads), and no visual effect editors (EQ curves, compressor meters). These aren't edge cases — they're the core interactions that make a DAW usable, and the spec explicitly calls for them.

这是一款功能强大的应用,具有出色的设计还原度、可靠的AI代理和良好的后端支持。主要缺陷在于功能完整性——虽然应用看起来令人印象深刻且AI集成效果良好,但几个核心数字音频工作站(DAW)功能仅支持展示而缺乏交互深度:时间轴上的音频片段无法拖拽移动、缺少乐器控制面板(合成器旋钮、鼓垫)、也没有可视化效果编辑器(均衡器曲线、压缩器仪表)。这些并非边缘情况——它们正是使DAW具备可用性的核心交互功能,而技术规范已明确要求实现这些特性。

In its second round feedback, it again caught several functionality gaps:

Remaining gaps:
- Audio recording is still stub-only (button toggles but no mic capture)
- Clip resize by edge drag and clip split not implemented
- Effect visualizations are numeric sliders, not graphical (no EQ curve)

剩余差距:

  • 音频录制仍为桩模块(按钮可切换但未实现麦克风捕捉)
  • 片段边缘拖拽调整大小及片段拆分功能未实现
  • 效果可视化仅提供数值滑块,未实现图形化界面(如均衡器曲线)

The generator was still liable to miss details or stub features when left to its own devices, and the QA still added value in catching those last mile issues for the generator to fix.

Based on the prompt, I was expecting a program where I could create melodies, harmonies, and drum patterns, arrange them into a song, and get help from an integrated agent along the way. The video below shows the result.

The app is far from a professional music production program, and the agent's song composition skills could clearly use a lot of work. Additionally, Claude can’t actually hear, which made the QA feedback loop less effective with respect to musical taste.

But the final app had all the core pieces of a functional music production program: a working arrangement view, mixer, and transport running in the browser. Beyond that, I was able to put together a short song snippet entirely through prompting: the agent set the tempo and key, laid down a melody, built a drum track, adjusted mixer levels, and added reverb. The core primitives for song composition were present, and the agent could drive them autonomously, using tools to create a simple production from end to end. You might say it’s not pitch-perfect yet—but it’s getting there.

根据提示,我期待的是一个可以创作旋律、和声和鼓点节奏,将它们编排成歌曲,并在此过程中获得集成助手帮助的程序。下方视频展示了最终成果。

这款应用程序远非专业的音乐制作软件,助手的歌曲创作能力显然还需要大幅提升。此外,Claude实际上不具备听觉能力,这使得在音乐品味方面的质量反馈循环效果欠佳。

但最终成型的应用程序具备了功能性音乐制作软件的所有核心模块:在浏览器中运行的编曲视图、混音器和播放控制器。更重要的是,我完全通过提示词就完成了一段简短的音乐片段:助手设定了速度和调性,谱写了旋律线,制作了鼓点轨道,调整了混音器电平,还添加了混响效果。歌曲创作的核心基础元素都已具备,助手能够自主驱动这些功能,运用工具完成从开端到结尾的简单音乐制作。你可能会说它还不够完美——但确实在不断进步。

What comes next

As models continue to improve, we can roughly expect them to be capable of working for longer, and on more complex tasks. In some cases, that will mean the scaffold surrounding the model matters less over time, and developers can wait for the next model and see certain problems solve themselves. On the other hand, the better the models get, the more space there is to develop harnesses that can achieve complex tasks beyond what the model can do at baseline.

With this in mind, there are a few lessons from this work worth carrying forward. It is always good practice to experiment with the model you're building against, read its traces on realistic problems, and tune its performance to achieve your desired outcomes. When working on more complex tasks, there is sometimes headroom from decomposing the task and applying specialized agents to each aspect of the problem. And when a new model lands, it is generally good practice to re-examine a harness, stripping away pieces that are no longer load-bearing to performance and adding new pieces to achieve greater capability that may not have been possible before.

From this work, my conviction is that the space of interesting harness combinations doesn't shrink as models improve. Instead, it moves, and the interesting work for AI engineers is to keep finding the next novel combination.

未来展望
随着模型持续进步,我们大致可以预见它们将能处理更持久的任务和更复杂的挑战。在某些情况下,这意味着围绕模型搭建的辅助框架会逐渐变得不那么重要,开发者可以等待新一代模型问世,某些问题或将迎刃而解。另一方面,模型能力越强,就越需要开发能突破模型基础能力的控制体系来完成更复杂的任务。

基于此,本研究有几点经验值得延续:始终建议针对目标模型进行实验,观察其在真实场景中的运行轨迹,并通过调优实现预期效果;处理复杂任务时,将问题拆解并分配专业代理处理各环节往往能创造提升空间;当新模型发布时,通常需要重新评估控制框架,移除不再影响性能的冗余组件,并新增此前无法实现的增强模块。

这项研究让我确信:随着模型进化,有价值的控制组合设计空间不会萎缩,而是发生转移。人工智能工程师的精彩工作,就在于持续探索下一个创新组合的可能。

(译文严格遵循技术文本特征,采用"scaffold→辅助框架"、"harness→控制体系"等专业表述;谚语"problems solve themselves"译为"迎刃而解"符合中文习语习惯;通过"拆解/分配/移除/新增"等动词链清晰再现技术操作逻辑;末段"design space→设计空间"等术语处理保持学术文本准确性)


Acknowledgements

Special thanks to Mike Krieger, Michael Agaby, Justin Young, Jeremy Hadfield, David Hershey, Julius Tarng, Xiaoyi Zhang, Barry Zhang, Orowa Sidker, Michael Tingley, Ibrahim Madha, Martina Long, and Canyon Robbins for their contributions to this work.

Thanks also to Jake Eaton, Alyssa Leonard, and Stef Sequeira for their help shaping the post.


Appendix

Example plan generated by planner agent.

RetroForge - 2D Retro Game Maker

Overview
RetroForge is a web-based creative studio for designing and building 2D retro-style video games. It combines the nostalgic charm of classic 8-bit and 16-bit game aesthetics with modern, intuitive editing tools—enabling anyone from hobbyist creators to indie developers to bring their game ideas to life without writing traditional code.

The platform provides four integrated creative modules: a tile-based Level Editor for designing game worlds, a pixel-art Sprite Editor for crafting visual assets, a visual Entity Behavior system for defining game logic, and an instant Playable Test Mode for real-time gameplay testing. By weaving AI assistance throughout (powered by Claude), RetroForge accelerates the creative process—helping users generate sprites, design levels, and configure behaviors through natural language interaction.

RetroForge targets creators who love retro gaming aesthetics but want modern conveniences. Whether recreating the platformers, RPGs, or action games of their childhood, or inventing entirely new experiences within retro constraints, users can prototype rapidly, iterate visually, and share their creations with others.

Features
1. Project Dashboard & Management
The Project Dashboard is the home base for all creative work in RetroForge. Users need a clear, organized way to manage their game projects—creating new ones, returning to works-in-progress, and understanding what each project contains at a glance.

User Stories: As a user, I want to:

- Create a new game project with a name and description, so that I can begin designing my game
- See all my existing projects displayed as visual cards showing the project name, last modified date, and a thumbnail preview, so that I can quickly find and continue my work
- Open any project to enter the full game editor workspace, so that I can work on my game
- Delete projects I no longer need, with a confirmation dialog to prevent accidents, so that I can keep my workspace organized
- Duplicate an existing project as a starting point for a new game, so that I can reuse my previous work

Project Data Model: Each project contains:

Project metadata (name, description, created/modified timestamps)
Canvas settings (resolution: e.g., 256x224, 320x240, or 160x144)
Tile size configuration (8x8, 16x16, or 32x32 pixels)
Color palette selection 
All associated sprites, tilesets, levels, and entity definitions

...

CopyExpand

Get the developer newsletter

Product updates, how-tos, community spotlights, and more. Delivered monthly to your inbox.

Please provide your email address if you'd like to receive our monthly developer newsletter. You can unsubscribe at any time.

Harness design for long-running application development

https://www.anthropic.com/engineering/harness-design-long-running-apps

Logo

欢迎加入DeepSeek 技术社区。在这里,你可以找到志同道合的朋友,共同探索AI技术的奥秘。

更多推荐