Rosa Del Mar

Issue 17 2026-01-17

Daily Brief

Platformization And Production Controls (Gateway, Evals, Grounding)

Issue 17 • 2026-01-17 • 8 min read
General
Sources: 1 • Confidence: High • Updated: 2026-02-06 16:59

Key takeaways

  • Brex reports that foundation models’ general world-knowledge about Brex can be outdated or incorrect, and Brex is building and curating a documentation corpus to ground multiple LLM applications.
  • Brex’s operational AI approach relies on decomposing work into granular, auditable SOPs that map cleanly to LLM prompting, often solvable with simple tool-using agents or single-turn completions.
  • Brex reports second-order effects of agentic coding adoption including increased code 'slop,' reduced review rigor, and drift in shared understanding that can impair incident response.
  • Brex is split between running some agent-layer applications on Mastra and others on an internal framework optimized for multi-agent orchestration.
  • Brex expects corporate AI tooling across business functions to deliver roughly 10x workflow improvements.

Sections

Platformization And Production Controls (Gateway, Evals, Grounding)

Brex built platform primitives (LLM gateway, prompt management, observability, and self-serve ops tooling) and is evolving eval practices toward regression prevention and production release gating. Separately, it identifies grounding as necessary because public model knowledge about Brex can be incorrect or outdated, and it is considering unifying fragmented knowledge sources to reduce inconsistency.

  • Brex reports that foundation models’ general world-knowledge about Brex can be outdated or incorrect, and Brex is building and curating a documentation corpus to ground multiple LLM applications.
  • Brex’s product/process knowledge is fragmented across multiple documentation channels, and leadership is considering a unified source-of-truth strategy.
  • Brex reports increasing rigor around avoiding regressions as it moves from an AI lab/incubator mode to shipping production AI, with evals as an area of ongoing change.
  • Brex built an internal LLM gateway and prompt-management infrastructure starting around January 2023 for prompt versioning, evals, data egress control, model routing, observability, and cost monitoring.
  • Brex built much of its internal agent platform UI in Retool with prompt, tool, and evaluation managers so ops domain experts can refine prompts and run evals on new models without engineers.
  • For operational AI agents, Brex bakes evals into the platform around each prompt/agent, co-develops initial eval sets with ops SMEs and engineers, and adds regression evals when QA finds mistakes.
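The regression-gating pattern described above can be sketched as follows. This is an illustrative sketch, not Brex's implementation: all names (`PromptRegistry`, `EvalCase`, `promote`) are hypothetical. It shows a prompt registry where QA-discovered mistakes become permanent regression evals, and a new prompt version is promoted only if every regression case passes.

```python
# Hedged sketch of eval-gated prompt promotion (hypothetical names, not
# Brex's gateway): a new prompt version ships only if it clears the
# regression eval set accumulated from past QA findings.
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

@dataclass
class EvalCase:
    input: str
    check: Callable[[str], bool]  # True if the model output is acceptable

@dataclass
class PromptRegistry:
    versions: Dict[str, str] = field(default_factory=dict)  # version -> template
    active: str = ""
    regression_set: List[EvalCase] = field(default_factory=list)

    def add_regression(self, case: EvalCase) -> None:
        # When QA finds a mistake, the failing example becomes a permanent eval.
        self.regression_set.append(case)

    def promote(self, version: str, template: str,
                run_model: Callable[[str, str], str]) -> Tuple[bool, List[int]]:
        # Release gate: run every regression case against the candidate prompt.
        failures = [i for i, c in enumerate(self.regression_set)
                    if not c.check(run_model(template, c.input))]
        if failures:
            return False, failures  # block promotion, report failing case indices
        self.versions[version] = template
        self.active = version
        return True, []
```

The same gate can back a self-serve UI (as the Retool managers described above), since promotion is a pure function of the candidate prompt and the stored eval set.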

Regulated-Ops Automation As Near-Term AI ROI Path

The source material emphasizes operational AI in underwriting/KYC-like workflows and adjacent processes, with an explicit automation SLA target and an explicit constraint to preserve CSAT. Prioritization is described as volume- and critical-path-driven (common workflows first, disputes later), and automation is tied to making previously ROI-negative commercial segments viable.

  • Brex’s operational AI approach relies on decomposing work into granular, auditable SOPs that map cleanly to LLM prompting, often solvable with simple tool-using agents or single-turn completions.
  • Brex uses agents to automate customer application evaluation for instant onboarding that previously required humans for underwriting or KYC.
  • Brex has a target of 80% fully automated acceptance for startup and commercial applicants with a touchless underwriting decision within 60 seconds.
  • Brex is pursuing support and operations automation under the constraint that customer experience and CSAT remain high.
  • Brex prioritizes automation by focusing first on tasks that are most common across the broadest number of customers.
  • Brex’s early automation targets included researching customers to assess business legitimacy and screen out businesses they legally cannot or choose not to serve.
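The SOP-decomposition idea above can be made concrete with a small sketch. This is hypothetical code, not Brex's system: the point is that each granular step is individually auditable and solvable by either a deterministic tool call or a single-turn LLM completion, with the trace serving as the audit log.

```python
# Illustrative sketch (hypothetical names): an onboarding SOP decomposed
# into granular steps, each a tool call or single-turn LLM stub, with an
# auditable trace of every state update.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class SOPStep:
    name: str
    run: Callable[[Dict], Dict]  # takes case state, returns state updates

def run_sop(steps: List[SOPStep], case: Dict) -> List[Dict]:
    trace: List[Dict] = []
    for step in steps:
        updates = step.run(case)
        case.update(updates)
        trace.append({"step": step.name, "updates": updates})  # audit entry
        if case.get("decision") is not None:
            break  # a step may reach a terminal decision early
    return trace

# Example SOP: sanctions screen (tool), terminal deny rule, then a
# legitimacy check that would be a single-turn completion in practice.
steps = [
    SOPStep("sanctions_check", lambda c: {"sanctioned": c["name"] in {"Blocked LLC"}}),
    SOPStep("deny_if_sanctioned", lambda c: {"decision": "deny"} if c["sanctioned"] else {}),
    SOPStep("legitimacy_llm", lambda c: {"decision": "approve"}),  # LLM stub
]
```

Because each step is a pure state transition, a touchless decision path is just a trace in which no step ever escalated to a human.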

Agentic Coding: Organization-Wide Adoption Plus Quality Risks

Brex describes broad agentic coding adoption and explicit process/tooling countermeasures (repo rules, linters, AI code review). It also flags quality and incident-response risks from AI-assisted development and frames productivity gains as non-automatic. The stated headcount stance is efficiency gains without engineering headcount growth, rather than headcount reduction.

  • Brex reports second-order effects of agentic coding adoption including increased code 'slop,' reduced review rigor, and drift in shared understanding that can impair incident response.
  • Brex states it has grown the business without growing engineering headcount, aiming to hold at roughly 300 engineers while becoming 30–100% more efficient through AI leverage rather than reducing headcount.
  • Brex believes agentic development amplifies bad engineering as much as good outcomes, so net capacity gains are nuanced rather than obviously large.
  • Brex redesigned its engineering interview loop into an agentic-coding-native project where candidates are expected to use agentic coding and are evaluated on how they understand generated code.
  • Brex had all engineers and engineering managers complete the new agentic-coding interview internally for familiarity, without pass/fail scoring.
  • Brex uses explicit repository rules, linters, and a standardized AI code review tool called Reptile as part of its engineering quality controls.
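The layered quality controls above (repo rules, linters, AI review) can be sketched as independent merge gates. This is a hedged illustration, not Brex's pipeline; the AI reviewer (the brief names Reptile) is stubbed here, and all check names are hypothetical.

```python
# Illustrative merge-gate sketch (hypothetical; the AI reviewer is a stub):
# repo rules, a linter, and an AI review each return violations, and a PR
# merges only when every gate returns none.
from typing import Callable, Dict, List

Check = Callable[[str], List[str]]  # diff text -> list of violation messages

def merge_gate(diff: str, checks: Dict[str, Check]) -> Dict[str, List[str]]:
    # Run every gate independently; return only gates that found violations.
    return {name: found for name, check in checks.items()
            if (found := check(diff))}

checks: Dict[str, Check] = {
    "repo_rules": lambda d: ["TODO left in diff"] if "TODO" in d else [],
    "linter": lambda d: [f"line too long: {l[:20]}" for l in d.splitlines()
                         if len(l) > 100],
    "ai_review": lambda d: ["possible slop: duplicated helper"]
                           if "def helper" in d else [],  # AI reviewer stub
}
```

Keeping gates independent means an agent-generated diff must clear deterministic rules and the AI reviewer, rather than relying on human review rigor alone.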

Agent Architecture: Specialization And Orchestration

Brex reports that a single generalist assistant performed poorly and moved to orchestrated specialized sub-agents, valuing multi-turn inter-agent clarification. The implementation reality includes a split between Mastra and an internal multi-agent orchestration framework, with polyglot integration constraints from existing backend stacks.

  • Brex is split between running some agent-layer applications on Mastra and others on an internal framework optimized for multi-agent orchestration.
  • Brex found that a single generalist assistant performed poorly across diverse product lines and decomposed the assistant into specialized sub-agents coordinated by an orchestrator.
  • Brex values multi-turn agent-to-agent conversations because sub-agents often must request clarifications, which is not well served by single-shot tool/RPC calls.
  • For the agentic layer announced in Brex’s fall release, Brex is building in TypeScript and using Mastra as a primary framework for part of the system.
  • Brex’s existing backend codebases are primarily Kotlin or Elixir, and its retrieval layer uses a mix including PGVector and Pinecone.
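The orchestration pattern described above, where sub-agents can ask clarifying questions rather than answering in one shot, can be sketched as a turn loop. This is a hypothetical illustration, not Brex's internal framework or Mastra: a sub-agent returns either an answer or a clarification request, and the orchestrator resolves the question and retries.

```python
# Illustrative multi-agent sketch (hypothetical names): an orchestrator
# routes to a domain specialist; unlike a single-shot tool/RPC call, the
# sub-agent may respond with a clarification question, and the loop
# continues until a final answer is produced.
from dataclasses import dataclass
from typing import Callable, Dict, Optional

@dataclass
class Turn:
    answer: Optional[str] = None
    clarify: Optional[str] = None  # question back to the orchestrator

SubAgent = Callable[[str, Dict[str, str]], Turn]

def orchestrate(route: Dict[str, SubAgent], domain: str, request: str,
                answer_clarification: Callable[[str], str],
                max_turns: int = 5) -> str:
    agent = route[domain]  # pick the specialist for this domain
    context: Dict[str, str] = {}
    for _ in range(max_turns):
        turn = agent(request, context)
        if turn.answer is not None:
            return turn.answer
        # Multi-turn: resolve the sub-agent's question, then retry.
        context[turn.clarify] = answer_clarification(turn.clarify)
    raise RuntimeError("no answer within turn budget")
```

The turn budget bounds the back-and-forth, which is the property a single-shot RPC interface cannot express.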

AI Strategy Framing And Ownership Split

Brex describes a formal three-pillar AI strategy and assigns corporate AI procurement/enablement outside the CTO org, while product and operational AI execution sits with the CTO org. This is a shift from treating LLM work as isolated experiments toward a communicable governance frame with clearer accountability.

  • Brex expects corporate AI tooling across business functions to deliver roughly 10x workflow improvements.
  • Brex organizes its AI strategy into three pillars: corporate AI adoption, operational AI cost reduction, and product AI features for customers.
  • Brex positions product AI as making Brex part of customers’ corporate AI strategies and board-level narratives.
  • Brex has an internal ownership split where IT and People lead corporate AI procurement and experimentation culture, while the CTO’s org emphasizes operational and product AI execution.

Watchlist

  • Brex is observing a slowdown in employees trying new AI tools because employees become attached to the ergonomics of their current workflow even when better models appear.
  • Brex reports second-order effects of agentic coding adoption including increased code 'slop,' reduced review rigor, and drift in shared understanding that can impair incident response.
  • Brex reports that foundation models’ general world-knowledge about Brex can be outdated or incorrect, and Brex is building and curating a documentation corpus to ground multiple LLM applications.
  • Brex’s product/process knowledge is fragmented across multiple documentation channels, and leadership is considering a unified source-of-truth strategy.
  • Brex reports increasing rigor around avoiding regressions as it moves from an AI lab/incubator mode to shipping production AI, with evals as an area of ongoing change.

Unknowns

  • What measured productivity deltas (cycle time, throughput, quality) correspond to Brex’s corporate AI “~10x workflow improvements” expectation across specific functions?
  • How close is Brex to its stated underwriting automation target (80% touchless decisions within 60 seconds), and what are the downstream risk outcomes (loss rates, fraud, compliance exceptions)?
  • What CSAT and customer-experience metrics are used to gate support/ops automation, and how have those metrics trended as automation coverage increased?
  • How effective are Brex’s grounding efforts (curated documentation corpus) in reducing incorrect answers about products and ICP, and how is accuracy monitored over time?
  • How broadly are the LLM gateway circuit breakers used in production, and what categories of failures do they prevent versus allowing through?

Investor overlay

Read-throughs

  • Enterprise AI is shifting from pilots to governed production via LLM gateways, evals, and grounding, creating demand for tooling that enables regression gating, observability, and safe release processes.
  • Grounding and knowledge unification are becoming core bottlenecks for internal and customer-facing AI, implying growing spend on documentation corpus curation and source-of-truth platforms.
  • Agentic coding is scaling but introduces quality and incident-response risks, implying rising demand for AI-aware code review, policy enforcement, and reliability tooling to counter code slop.

What would confirm

  • Brex reports concrete, measured productivity deltas by function and uses evals as release gates with tracked regression rates and incident reduction over time.
  • Brex shows improved answer accuracy from grounded documentation, with monitoring that demonstrates reduced product and ICP errors as knowledge sources are unified.
  • Brex demonstrates underwriting and ops automation progress toward the touchless target while maintaining customer experience gates and stable or improving risk outcomes.

What would kill

  • Evals and gateway controls remain experimental and fail to prevent regressions, leading to repeated production incidents or rollback-heavy releases.
  • Grounding and source-of-truth efforts do not measurably reduce incorrect answers, or fragmentation persists and limits scaling of AI across functions.
  • Agentic coding quality issues overwhelm mitigations, with worsening review rigor and impaired incident response that offsets productivity gains and slows shipping.

Sources