Rosa Del Mar

Daily Brief

Issue 23 2026-01-23

Research Positioning And Evaluation Task Design

Issue 23 • 2026-01-23 • 6 min read
General
Sources: 1 • Confidence: High • Updated: 2026-02-06 16:59

Key takeaways

  • FastRender can render real websites without a working JavaScript engine, but pages are somewhat slow to display in that mode.
  • At peak stability, the system ran about 2,000 agents concurrently for roughly a week and produced thousands of commits per hour, totaling nearly 30,000 commits.
  • Most commits did not produce merge conflicts despite many agents working on the same repository because the harness decomposed tasks to minimize overlapping code changes.
  • An agent disabled JavaScript behind a feature flag to allow progress on other browser components while the JavaScript engine was still under development.
  • FastRender embeds web standards as git submodules and uses them as an ongoing feedback loop, with AI-written code frequently referencing those specifications.

Sections

Research Positioning And Evaluation Task Design

The project is positioned as an observability-oriented benchmark with visual feedback rather than a product. The delta is that ambitious, spec-heavy domains with visible evaluation signals are used to stress-test and improve multi-agent harness behavior over time.

  • FastRender can render real websites without a working JavaScript engine, but pages are somewhat slow to display in that mode.
  • FastRender started as Wilson Lin’s personal side project around November to test frontier models on an ambitious, well-specified task with visual feedback.
  • After single-agent progress improved, FastRender became an official Cursor research project to explore multi-agent scaling with more resources.
  • FastRender was intended as a research benchmark to observe and experiment with multi-agent behaviors at scale over time rather than as a production browser competitor.

Swarm Scale And Throughput

The corpus provides concrete scale and output metrics (agents, duration, commits, lines of code) and describes an infrastructure pattern for high concurrency. The implied delta is that multi-agent runs can persist for days and generate very high change volume, shifting perceived bottlenecks from individual coding to orchestration and infrastructure capacity.

  • At peak stability, the system ran about 2,000 agents concurrently for roughly a week and produced thousands of commits per hour, totaling nearly 30,000 commits.
  • Infrastructure scaling used large machines, each running one harness with about 300 concurrent agents; that density works because agents spend much of their time thinking rather than executing tools (see the concurrency sketch after this list).
  • In a few weeks, the agent swarm produced over a million lines of Rust that can already render real web pages, and the browser served primarily as a research objective for improving the multi-agent harness.
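
The per-machine pattern above (one harness hosting a few hundred mostly-waiting agents) lends itself to a simple concurrency cap. The sketch below is illustrative only and is not the FastRender harness: it assumes a tokio runtime, and the names MAX_AGENTS and run_agent are invented for exposition.

```rust
// Minimal sketch (not the actual harness): capping concurrent agents per
// machine with a semaphore, on the assumption that each "agent" is mostly
// waiting on model responses rather than burning CPU.
use std::sync::Arc;
use tokio::sync::Semaphore;
use tokio::time::{sleep, Duration};

const MAX_AGENTS: usize = 300; // per-machine concurrency figure cited in the brief

async fn run_agent(task_id: usize) {
    // Stand-in for a model call plus tool execution; real agents spend most
    // of this time waiting on the model, which is why one machine can host many.
    sleep(Duration::from_millis(50)).await;
    println!("agent finished task {task_id}");
}

#[tokio::main]
async fn main() {
    let permits = Arc::new(Semaphore::new(MAX_AGENTS));
    let mut handles = Vec::new();

    for task_id in 0..2_000 {
        let permits = Arc::clone(&permits);
        handles.push(tokio::spawn(async move {
            // Each agent holds a permit for its lifetime, so at most
            // MAX_AGENTS run on this machine at once.
            let _permit = permits.acquire_owned().await.expect("semaphore closed");
            run_agent(task_id).await;
        }));
    }

    for handle in handles {
        handle.await.expect("agent task panicked");
    }
}
```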

Coordination Architecture And Conflict Avoidance

A hierarchical planner/worker tree plus deliberate task decomposition is presented as the coordination mechanism that reduces code overlap and thus merge conflicts. The delta is that high parallelism is described as feasible when planning explicitly targets non-overlapping changes.

  • Most commits did not produce merge conflicts despite many agents working on the same repository because the harness decomposed tasks to minimize overlapping code changes.
  • Agents are organized in a tree in which planning agents spawn tasks and worker agents execute them, and multiple harnesses can run in parallel on separate machines for different work streams.
  • Effective decomposition into non-overlapping chunks by planning agents was described as the key to unlocking benefits from hundreds or thousands of parallel agents (a minimal decomposition sketch follows this list).
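
As a rough illustration of the decomposition idea, the sketch below models a planner/worker tree in which each worker task owns a disjoint set of file paths, so parallel workers rarely touch the same code. The types (Node, Task) and the overlap check are assumptions for exposition, not the project's actual harness structures.

```rust
// Illustrative only: a planner decomposes a goal into worker tasks whose
// file-ownership scopes should be disjoint; overlapping scopes signal a
// conflict-prone decomposition.
use std::collections::HashSet;

#[derive(Debug)]
struct Task {
    description: String,
    // Paths this task is allowed to modify; planners aim to keep these
    // disjoint across sibling tasks.
    owned_paths: HashSet<String>,
}

#[derive(Debug)]
enum Node {
    // A planning agent that decomposes work and spawns children.
    Planner { goal: String, children: Vec<Node> },
    // A worker agent that executes one scoped task.
    Worker(Task),
}

/// Returns true if any two worker tasks under this node claim the same path,
/// i.e. the decomposition risks overlapping (merge-conflict-prone) changes.
fn has_overlap(node: &Node) -> bool {
    fn collect<'a>(node: &'a Node, out: &mut Vec<&'a HashSet<String>>) {
        match node {
            Node::Planner { children, .. } => {
                for child in children {
                    collect(child, out);
                }
            }
            Node::Worker(task) => out.push(&task.owned_paths),
        }
    }
    let mut scopes = Vec::new();
    collect(node, &mut scopes);
    let mut seen: HashSet<&String> = HashSet::new();
    scopes
        .iter()
        .flat_map(|scope| scope.iter())
        .any(|path| !seen.insert(path))
}

fn main() {
    let plan = Node::Planner {
        goal: "implement CSS box model".into(),
        children: vec![
            Node::Worker(Task {
                description: "margin collapsing".into(),
                owned_paths: ["layout/margin.rs".to_string()].into_iter().collect(),
            }),
            Node::Worker(Task {
                description: "border rendering".into(),
                owned_paths: ["paint/border.rs".to_string()].into_iter().collect(),
            }),
        ],
    };
    println!("{plan:#?}");
    assert!(!has_overlap(&plan));
    println!("decomposition has no overlapping file ownership");
}
```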

Autonomy Constraints And Project Management Behaviors

The corpus states that runs cannot be interactively steered and highlights emergent tactics: feature-flagging unfinished subsystems and introducing temporary components to keep dependent work moving. The delta is that practical project-management patterns can appear in agent swarms, but they operate under strict control constraints that make upfront specification and guardrails more important.

  • An agent disabled JavaScript behind a feature flag to allow progress on other browser components while the JavaScript engine was still under development (sketched after this list).
  • One agent introduced QuickJS as a temporary JavaScript engine to unblock its own work while other agents built a longer-term engine intended to replace it.
  • During a run, the system could not be steered via prompting; the only human intervention available was stopping it, and the longest autonomous run lasted about a week.
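
The feature-flag tactic can be pictured with a minimal sketch, assuming an invented RenderConfig and stubbed pipeline stages; this is not FastRender code, only the shape of gating an unfinished subsystem so the rest of the pipeline keeps making progress.

```rust
// Hedged sketch of the pattern described above: gate script execution behind
// a runtime flag so parsing, layout, and paint can proceed while the
// JavaScript engine is unfinished. All names here are invented.
#[derive(Debug, Clone)]
struct RenderConfig {
    // Flipped off by the agent to unblock other browser components.
    javascript_enabled: bool,
}

fn render_page(html: &str, config: &RenderConfig) {
    let dom = parse_html(html);
    if config.javascript_enabled {
        execute_scripts(&dom);
    } else {
        // Degraded but functional path: real sites still render, just without
        // script-driven behavior.
        eprintln!("javascript disabled by feature flag; skipping script execution");
    }
    layout_and_paint(&dom);
}

// Stubs standing in for the real pipeline stages.
struct Dom;
fn parse_html(_html: &str) -> Dom { Dom }
fn execute_scripts(_dom: &Dom) {}
fn layout_and_paint(_dom: &Dom) {}

fn main() {
    let config = RenderConfig { javascript_enabled: false };
    render_page("<html><body>hello</body></html>", &config);
}
```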

Feedback Loops For Long-Horizon Correctness

The system uses three explicit automated feedback loops: standards embedded as repo artifacts, visual golden-sample comparisons, and compiler errors from Rust. The delta is that autonomous progress is tightly coupled to abundant, automatable signals that can be consumed without mid-run human prompting.

  • FastRender embeds web standards as git submodules and uses them as an ongoing feedback loop, with AI-written code frequently referencing those specifications.
  • The system used vision feedback by taking rendering screenshots and comparing them against golden samples to drive iterative improvements (a simple comparison sketch follows this list).
  • Using Rust provided verification via compilation errors that served as an important feedback loop for autonomous agents.
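
The golden-sample loop can be approximated as below. This is a naive per-pixel byte diff with an arbitrary threshold, offered only as a sketch of the comparison signal; the corpus does not specify the actual metric, and a real harness would likely use a perceptual comparison.

```rust
// Illustrative only: score a rendering screenshot against a stored golden
// sample by counting mismatched bytes. Buffers, dimensions, and threshold
// are stand-ins, not values from the source.
fn pixel_mismatch_ratio(rendered: &[u8], golden: &[u8]) -> f64 {
    assert_eq!(rendered.len(), golden.len(), "screenshots must share dimensions");
    if rendered.is_empty() {
        return 0.0;
    }
    let mismatched = rendered
        .iter()
        .zip(golden)
        .filter(|(a, b)| a != b)
        .count();
    mismatched as f64 / rendered.len() as f64
}

fn main() {
    // Stand-in RGBA buffers; a real run would load a fresh screenshot and the
    // checked-in golden sample for the same page.
    let golden = vec![255u8; 4 * 64 * 64];
    let mut rendered = golden.clone();
    rendered[0] = 0; // simulate a one-pixel regression

    let ratio = pixel_mismatch_ratio(&rendered, &golden);
    let acceptable = ratio < 0.001; // threshold is an arbitrary example value
    println!("mismatch ratio {ratio:.6}, within tolerance: {acceptable}");
}
```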

Unknowns

  • What objective correctness metrics were achieved (e.g., which sites work, rendering conformance measures, failure rates, time-to-first-paint distributions) for the no-JavaScript rendering capability?
  • What were the actual merge-conflict and integration-failure rates as agent count scaled, and how did those rates change over time?
  • What were the operational costs and resource utilization characteristics (CPU/GPU usage, tool-call rates, memory) per machine and per agent during peak runs?
  • How frequently did intermittent compilation/API errors occur, what was the mean time to repair, and what safeguards prevented prolonged breakage?
  • What was the evaluation protocol for the claim that general GPT-5.1/5.2 models outperformed GPT-5.1-Codex (task set, duration, success metrics, number of runs)?

Investor overlay

Read-throughs

  • Multi-agent software development harnesses may be approaching sustained, high-throughput operation, shifting value from model quality to orchestration, tooling, and evaluation loops in spec-heavy domains.
  • Benchmarking with abundant automated signals such as standards artifacts, visual comparisons, and compiler errors may become a practical way to drive long-horizon autonomy and measure progress without mid-run steering.
  • Feature-flagging and task decomposition appear effective for reducing integration friction at scale, implying that process and infrastructure differentiation may matter as much as raw coding capability.

What would confirm

  • Published objective correctness metrics for no-JavaScript rendering, including site coverage, conformance measures, failure rates, and time-to-first-paint distributions that improve over time.
  • Reported merge-conflict and integration-failure rates versus agent count, showing stable or improving outcomes during multi-day runs with thousands of commits per hour.
  • Transparent cost and utilization data per agent and per machine, plus mean time to repair for intermittent build or API errors, demonstrating operationally sustainable scaling.

What would kill

  • Correctness evidence shows poor site compatibility or unstable rendering outcomes, making visual feedback loops insufficient to prevent regressions at scale.
  • Merge conflicts or integration failures rise sharply with concurrency, forcing frequent human intervention and undermining the claimed feasibility of persistent high parallelism.
  • Operational costs per agent are prohibitive, or error recovery is slow and recurrent, preventing multi-day autonomous runs from being economically or technically repeatable.

Sources