Training Cost And Time Compression Versus A Fixed Capability Metric
Sources: 1 • Confidence: Medium • Updated: 2026-02-06 16:59
Key takeaways
- In 2019, GPT-2 training reportedly used 32 TPU v3 chips for 168 hours at about $8 per TPU v3-hour, totaling roughly $43,000.
- With recent improvements merged into nanochat (many originating from modded-nanogpt), a CORE score higher than GPT-2's can reportedly be reached in about 3.04 hours, for roughly $73, on a single 8xH100 node.
- The cost to train a GPT-2-level model is claimed to have fallen by about 600× over seven years, i.e., roughly 2.5× per year (600^(1/7) ≈ 2.5).
Sections
Training Cost And Time Compression Versus A Fixed Capability Metric
The deltas jointly assert a large reduction in the resources required to exceed GPT-2 on a stated metric (CORE), anchored at two endpoints: a 2019 GPT-2 training cost/time estimate and a recent nanochat cost/time estimate on 8xH100 hardware. The derived 600× and ~2.5×/year figures depend on the comparability of these two endpoints and on consistent measurement and costing assumptions, neither of which is demonstrated here.
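The headline figures can be reproduced from the stated inputs alone. A minimal check, using only the numbers reported above (the implied node-hour rate is derived, not independently sourced):

```python
# Sketch verifying the cost-compression arithmetic from the stated figures.
# All inputs are as reported in this summary; nothing is independently sourced.

# 2019 GPT-2 baseline: 32 TPU v3 chips x 168 hours x ~$8 per chip-hour
gpt2_cost = 32 * 168 * 8              # $43,008, i.e. "roughly $43,000"

# Recent nanochat run: ~3.04 hours on one 8xH100 node for ~$73,
# which implies a node rate of about $24/hour (~$3 per H100-hour)
nanochat_cost = 73
implied_node_rate = nanochat_cost / 3.04

# Overall compression and the implied average annual rate over seven years
compression = gpt2_cost / nanochat_cost   # ~589x, i.e. "about 600x"
annual_factor = compression ** (1 / 7)    # ~2.49x per year

print(f"GPT-2 cost:    ${gpt2_cost:,}")
print(f"node rate:     ${implied_node_rate:.2f}/hr")
print(f"compression:   {compression:.0f}x")
print(f"annual factor: {annual_factor:.2f}x")
```

Note that the ~2.5×/year rate is a geometric average between two endpoints; as the Unknowns below flag, it does not by itself imply a steady year-over-year decline.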
Unknowns
- What exactly is the CORE metric (definition, tasks, aggregation), and is it the same evaluation setup for both the GPT-2 baseline and the nanochat result?
- Are the GPT-2 compute/time/cost numbers directly sourced from a reproducible log or a consistent accounting framework, or are they rough retrospective estimates?
- What are the exact cost assumptions behind the '$73 on a single 8xH100 node' figure (pricing basis, utilization, included/excluded overheads)?
- What specific changes (e.g., training recipe, architecture, data pipeline) in modded-nanogpt/nanochat are responsible for the speed/cost improvement, and which are necessary versus incidental?
- Are there intermediate datapoints across the seven-year span that support the implied ~2.5×/year decline rate, or is this a two-point extrapolation?