Research Positioning And Evaluation Task Design
Key takeaways
- FastRender can render real websites without a working JavaScript engine, but pages are somewhat slow to display in that mode.
- At peak stability, the system ran about 2,000 agents concurrently for roughly a week and produced thousands of commits per hour, totaling nearly 30,000 commits.
- Most commits merged without conflicts even though many agents worked in the same repository, because the harness decomposed tasks to minimize overlapping code changes.
- An agent disabled JavaScript behind a feature flag to allow progress on other browser components while the JavaScript engine was still under development.
- FastRender embeds web standards as git submodules and uses them as an ongoing feedback loop, with AI-written code frequently referencing those specifications.
Sections
Research Positioning And Evaluation Task Design
The project is positioned as an observability-oriented benchmark with visual feedback rather than a product. The delta is that ambitious, spec-heavy domains with visible evaluation signals are used to stress-test and improve multi-agent harness behavior over time.
- FastRender can render real websites without a working JavaScript engine, but pages are somewhat slow to display in that mode.
- FastRender started as Wilson Lin’s personal side project around November to test frontier models on an ambitious, well-specified task with visual feedback.
- After single-agent progress improved, FastRender became an official Cursor research project, with more resources devoted to exploring multi-agent scaling.
- FastRender was intended as a research benchmark to observe and experiment with multi-agent behaviors at scale over time rather than as a production browser competitor.
Swarm Scale And Throughput
The corpus provides concrete scale and output metrics (agents, duration, commits, lines of code) and describes an infrastructure pattern for high concurrency. The implied delta is that multi-agent runs can persist for days and generate very high change volume, shifting perceived bottlenecks from individual coding to orchestration and infrastructure capacity.
- At peak stability, the system ran about 2,000 agents concurrently for roughly a week and produced thousands of commits per hour, totaling nearly 30,000 commits.
- Infrastructure scaling used large machines, each running a single harness with about 300 concurrent agents; this density works because agents spend much of their time thinking rather than executing tools (see the concurrency sketch after this list).
- In a few weeks, the agent swarm produced over a million lines of Rust that can already render real web pages, and the browser served primarily as a research objective for improving the multi-agent harness.
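The per-machine density claim lends itself to a small illustration. Below is a minimal Rust sketch, using tokio, of the assumed pattern: one harness process multiplexing roughly 300 agent loops as async tasks, which is viable because each agent spends most of its wall-clock time waiting on model inference. The constant, function names, and agent loop body are hypothetical stand-ins, not FastRender's actual harness code.

```rust
// Hypothetical sketch: one harness process driving ~300 agents concurrently.
// Because an agent mostly waits on model inference, concurrency can far
// exceed the machine's core count.
use tokio::task::JoinSet;

const AGENTS_PER_HARNESS: usize = 300; // assumed figure from the write-up

async fn run_agent(_agent_id: usize) {
    // Placeholder for the real agent loop: await a model response (which
    // dominates wall-clock time), execute tool calls, run checks, commit.
}

#[tokio::main]
async fn main() {
    let mut agents = JoinSet::new();
    for id in 0..AGENTS_PER_HARNESS {
        agents.spawn(run_agent(id));
    }
    // Wait for every agent task to finish (or for the run to be stopped).
    while agents.join_next().await.is_some() {}
}
```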
Coordination Architecture And Conflict Avoidance
A hierarchical planner/worker tree plus deliberate task decomposition is presented as the coordination mechanism that reduces code overlap and thus merge conflicts. The delta is that high parallelism is described as feasible when planning explicitly targets non-overlapping changes.
- Most commits merged without conflicts even though many agents worked in the same repository, because the harness decomposed tasks to minimize overlapping code changes.
- Agents are organized in a tree in which planning agents spawn tasks and worker agents execute them, and multiple harnesses can run in parallel on separate machines for different work streams.
- Effective decomposition into non-overlapping chunks by planning agents was described as the key to unlocking benefits from hundreds or thousands of parallel agents (a minimal sketch of disjoint task assignment follows this list).
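To make the decomposition claim concrete, here is a minimal Rust sketch of one way a planner could enforce non-overlapping work: each task declares the files it is allowed to touch, and the planner only spawns a task whose file set is disjoint from work already in flight. The types, fields, and file paths are illustrative assumptions, not FastRender's actual planner.

```rust
// Hypothetical planner-side guard that keeps parallel tasks from touching
// the same files, so their commits rarely conflict.
use std::collections::HashSet;

struct Task {
    description: String,
    files: HashSet<String>, // files this task is allowed to modify
}

#[derive(Default)]
struct Planner {
    claimed: HashSet<String>, // files already assigned to in-flight tasks
}

impl Planner {
    /// Accept a task only if its file set is disjoint from everything
    /// already claimed; otherwise the planner must re-slice the work.
    fn try_spawn(&mut self, task: Task) -> Result<Task, Task> {
        if task.files.is_disjoint(&self.claimed) {
            self.claimed.extend(task.files.iter().cloned());
            Ok(task)
        } else {
            Err(task) // overlapping: decompose differently before spawning
        }
    }
}

fn main() {
    let mut planner = Planner::default();
    let css = Task {
        description: "Implement the CSS cascade".into(),
        files: ["src/style/cascade.rs".to_string()].into_iter().collect(),
    };
    let layout = Task {
        description: "Block layout pass".into(),
        files: ["src/layout/block.rs".to_string()].into_iter().collect(),
    };
    // Disjoint file sets: both tasks can be handed to workers in parallel.
    for task in [css, layout] {
        match planner.try_spawn(task) {
            Ok(t) => println!("spawned: {}", t.description),
            Err(t) => println!("needs re-slicing: {}", t.description),
        }
    }
}
```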
Autonomy Constraints And Project Management Behaviors
The corpus states that runs cannot be interactively steered and highlights emergent tactics: feature-flagging to keep overall progress moving and introducing temporary components to unblock local work. The delta is that practical project-management patterns can appear in agent swarms, but they operate under strict control constraints that make upfront specification and guardrails more important.
- An agent disabled JavaScript behind a feature flag to allow progress on other browser components while the JavaScript engine was still under development (a minimal sketch of the flag pattern follows this list).
- One agent introduced QuickJS as a temporary JavaScript engine to unblock its own work while other agents built a longer-term engine intended to replace it.
- During a run, the system could not be steered via prompting; the only human intervention available was stopping it, and the longest autonomous run lasted about a week.
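The feature-flag tactic described above maps onto a familiar pattern. The Rust sketch below shows one plausible shape for it, with the flag name, stubs, and call sites invented for illustration rather than taken from FastRender's code: JavaScript stays off by default, so the rest of the pipeline can be built and exercised before the engine exists.

```rust
// Hypothetical feature flag that lets rendering proceed while the
// JavaScript engine is still unfinished.
#[derive(Debug, Clone)]
struct RenderFlags {
    enable_javascript: bool,
}

impl Default for RenderFlags {
    fn default() -> Self {
        // Off until the engine is ready; parsing, style, layout, and paint
        // can still be developed and tested in the meantime.
        Self { enable_javascript: false }
    }
}

// Stubs standing in for the real subsystems.
struct Dom;
fn parse_html(_html: &str) -> Dom { Dom }
fn run_scripts(_dom: &Dom) {}
fn layout_and_paint(_dom: &Dom) {}

fn render_page(html: &str, flags: &RenderFlags) {
    let dom = parse_html(html);
    if flags.enable_javascript {
        run_scripts(&dom); // only reached once the engine lands
    }
    layout_and_paint(&dom);
}

fn main() {
    render_page("<html><body>hello</body></html>", &RenderFlags::default());
}
```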
Feedback Loops For Long Horizon Correctness
The system uses three explicit automated feedback loops: standards embedded as repo artifacts, visual golden-sample comparisons, and compiler errors from Rust. The delta is that autonomous progress is tightly coupled to abundant, automatable signals that can be consumed without mid-run human prompting.
- FastRender embeds web standards as git submodules and uses them as an ongoing feedback loop, with AI-written code frequently referencing those specifications.
- The system used vision feedback by taking rendering screenshots and comparing them against golden samples to drive iterative improvements (a minimal pixel-comparison sketch appears after this list).
- Using Rust provided verification via compilation errors, which served as an important feedback loop for autonomous agents (a compiler-check sketch also appears after this list).
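As an illustration of the golden-sample loop, this minimal Rust sketch compares a rendered pixel buffer against a stored golden screenshot and passes when the mismatch ratio stays under a tolerance. The buffer format, threshold, and function names are assumptions made for the example, not details from the corpus.

```rust
// Hypothetical visual check: how far does a rendering deviate from the
// stored golden screenshot for the same page?
fn pixel_mismatch_ratio(rendered: &[u8], golden: &[u8]) -> f64 {
    assert_eq!(rendered.len(), golden.len(), "buffers must match in size");
    let mismatched = rendered.iter().zip(golden).filter(|(a, b)| a != b).count();
    mismatched as f64 / rendered.len() as f64
}

fn passes_golden_check(rendered: &[u8], golden: &[u8]) -> bool {
    // Small tolerance for anti-aliasing differences (assumed value).
    pixel_mismatch_ratio(rendered, golden) < 0.01
}

fn main() {
    let golden = vec![255u8; 16];
    let rendered = vec![255u8; 16];
    assert!(passes_golden_check(&rendered, &golden));
}
```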
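Similarly, the compiler feedback loop can be sketched as a harness-side helper that runs `cargo check` in an agent's working copy and returns the compiler's stderr when the build fails, so the errors can be fed into the agent's next turn. The helper name and control flow are assumptions, not the harness's real implementation.

```rust
// Hypothetical harness helper: type-check the repository after an agent's
// edit and surface compiler errors back to the agent.
use std::process::Command;

fn check_build(repo_dir: &str) -> Result<(), String> {
    let output = Command::new("cargo")
        .arg("check")
        .current_dir(repo_dir)
        .output()
        .map_err(|e| format!("failed to invoke cargo: {e}"))?;

    if output.status.success() {
        Ok(()) // build is clean; the agent can commit
    } else {
        Err(String::from_utf8_lossy(&output.stderr).into_owned())
    }
}

fn main() {
    match check_build(".") {
        Ok(()) => println!("build clean"),
        Err(errors) => println!("feed back to agent:\n{errors}"),
    }
}
```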
Unknowns
- What objective correctness metrics were achieved (e.g., which sites work, rendering conformance measures, failure rates, time-to-first-paint distributions) for the no-JavaScript rendering capability?
- What were the actual merge-conflict and integration-failure rates as agent count scaled, and how did those rates change over time?
- What were the operational costs and resource utilization characteristics (CPU/GPU usage, tool-call rates, memory) per machine and per agent during peak runs?
- How frequently did intermittent compilation/API errors occur, what was the mean time to repair, and what safeguards prevented prolonged breakage?
- What was the evaluation protocol for the claim that general GPT-5.1/5.2 models outperformed GPT-5.1-Codex (task set, duration, success metrics, number of runs)?