Rosa Del Mar

Daily Brief

Issue 31 2026-01-31

Training Cost And Time Compression Versus A Fixed Capability Metric

Issue 31 • 2026-01-31 • 3 min read
General
Sources: 1 • Confidence: Medium • Updated: 2026-02-06 16:59

Key takeaways

  • In 2019, GPT-2 training reportedly used 32 TPU v3 chips for 168 hours at about $8 per TPU v3-hour, totaling roughly $43,000.
  • With recent improvements merged into nanochat (many originating from modded-nanogpt), a higher CORE score than GPT-2 can be reached in about 3.04 hours for roughly $73 on a single 8xH100 node.
  • The cost to train a GPT-2-level model is claimed to have fallen by about 600× over seven years, corresponding to an approximate 2.5× reduction per year.

Sections

Training Cost And Time Compression Versus A Fixed Capability Metric

The deltas jointly assert a large reduction in resources required to exceed GPT-2 on a stated metric (CORE), anchored by a 2019 GPT-2 training cost/time estimate and a recent nanochat cost/time estimate on 8xH100 hardware. The derived 600× and ~2.5×/year figures depend on the comparability of the two endpoints and on consistent measurement and costing assumptions, which are not demonstrated here.
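
As a sanity check, the sketch below recomputes the headline ratios from the quoted figures alone (32 TPU v3 chips, 168 hours, $8/hour; $73 for the nanochat run); it does not address the comparability questions raised here and under Unknowns.

    # Recompute the headline figures from the quoted numbers only (Python).
    gpt2_chips = 32          # TPU v3 chips (reported)
    gpt2_hours = 168         # wall-clock hours (reported)
    gpt2_rate = 8.0          # USD per TPU v3-hour (reported)
    gpt2_cost = gpt2_chips * gpt2_hours * gpt2_rate   # ~43,008 USD

    nanochat_cost = 73.0     # USD, single 8xH100 node (reported)
    years = 7                # 2019 -> 2026

    ratio = gpt2_cost / nanochat_cost       # ~589, i.e. "about 600x"
    per_year = ratio ** (1 / years)         # ~2.49, i.e. "~2.5x per year"

    print(f"GPT-2 2019 cost:   ${gpt2_cost:,.0f}")
    print(f"Cost ratio:        {ratio:.0f}x")
    print(f"Implied per year:  {per_year:.2f}x")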


Unknowns

  • What exactly is the CORE metric (definition, tasks, aggregation), and is it the same evaluation setup for both the GPT-2 baseline and the nanochat result?
  • Are the GPT-2 compute/time/cost numbers directly sourced from a reproducible log or a consistent accounting framework, or are they rough retrospective estimates?
  • What are the exact cost assumptions behind the '$73 on a single 8xH100 node' figure (pricing basis, utilization, included/excluded overheads)? An illustrative decomposition follows this list.
  • What specific changes (e.g., training recipe, architecture, data pipeline) in modded-nanogpt/nanochat are responsible for the speed/cost improvement, and which are necessary versus incidental?
  • Are there intermediate datapoints across the seven-year span that support the implied ~2.5×/year decline rate, or is this a two-point extrapolation?
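
The decomposition below is illustrative only: the per-GPU-hour rate is inferred from the quoted $73 and 3.04 hours, and the alternative pricing/utilization rows are hypothetical, intended to show how sensitive the figure is to the unstated cost assumptions.

    # Infer the per-GPU-hour rate implied by the quoted figures (Python).
    run_cost = 73.0      # USD (reported)
    run_hours = 3.04     # wall-clock hours (reported)
    gpus = 8             # H100s per node (reported)

    implied_rate = run_cost / (run_hours * gpus)   # ~3.00 USD per GPU-hour
    print(f"Implied rate: ${implied_rate:.2f} per GPU-hour")

    # Hypothetical sensitivity: the same run costed under different assumptions.
    scenarios = [
        ("quoted basis (implied rate)",     3.00, 1.00),
        ("cheaper committed/spot pricing",  2.00, 1.00),
        ("typical on-demand cloud pricing", 4.50, 1.00),
        ("amortizing idle/failed-run time", 3.00, 0.70),
    ]
    for label, rate, utilization in scenarios:
        cost = rate * gpus * run_hours / utilization
        print(f"{label:34s} -> ${cost:7.2f}")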

Investor overlay

Read-throughs

  • If the metric and evaluation are comparable, rapid cost and time compression to reach GPT-2-level capability could accelerate model iteration cycles and lower barriers for teams with limited budgets.
  • A claimed multi-year decline rate in training cost for a fixed capability metric could indicate increasing returns from software and training recipe improvements, not only hardware progress.

What would confirm

  • A clear definition of the CORE metric plus identical evaluation setup for the GPT-2 baseline and the nanochat result, including tasks, aggregation, and dataset versions.
  • Reproducible logs or a consistent accounting framework validating GPT-2's training compute, time, and cost, and validating the nanochat figures of 3.04 hours and roughly $73, including the pricing basis and utilization.
  • Breakdown of which specific nanochat and modded-nanogpt changes drive the improvement, with ablation results showing necessity versus incidental optimizations.

What would kill

  • CORE metric or evaluation setup differs between the two endpoints, invalidating the comparison and the claimed ~600× reduction.
  • Cost assumptions behind the $73 figure exclude major items or are not repeatable, or the GPT-2 cost and time estimates prove to be rough and inconsistent.
  • Intermediate datapoints fail to support the implied ~2.5× per year trend, indicating a two-point extrapolation without a stable underlying trajectory.

Sources

  1. 2026-01-31 simonwillison.net