Training Cost And Time Compression Versus A Fixed Capability Metric
Sources: 1 • Confidence: Medium • Updated: 2026-02-06 16:59
Key takeaways
- In 2019, GPT-2 training reportedly used 32 TPU v3 chips for 168 hours at about $8 per TPU v3-hour, totaling roughly $43,000.
- With recent improvements merged into nanochat (many originating from modded-nanogpt), a CORE score higher than GPT-2's can reportedly be reached in about 3.04 hours, for roughly $73, on a single 8xH100 node.
- The cost to train a GPT-2-level model is claimed to have fallen by about 600× over seven years, i.e., roughly 2.5× per year (600^(1/7) ≈ 2.5).
Sections
Training Cost And Time Compression Versus A Fixed Capability Metric
The deltas jointly assert a large reduction in the resources required to exceed GPT-2 on a stated metric (CORE), anchored at two endpoints: a 2019 GPT-2 training cost/time estimate and a recent nanochat cost/time estimate on 8xH100 hardware. The derived 600× and ~2.5×/year figures depend on the comparability of these two endpoints and on consistent measurement and costing assumptions, neither of which is demonstrated here.
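The headline figures can be reproduced from the stated inputs alone. A minimal check, using only the numbers reported above (the implied node-hour rate is derived, not independently sourced):

```python
# Sketch verifying the cost-compression arithmetic from the stated figures.
# All inputs are as reported in this summary; nothing is independently sourced.

# 2019 GPT-2 baseline: 32 TPU v3 chips x 168 hours x ~$8 per chip-hour
gpt2_cost = 32 * 168 * 8              # $43,008, i.e. "roughly $43,000"

# Recent nanochat run: ~3.04 hours on one 8xH100 node for ~$73,
# which implies a node rate of about $24/hour (~$3 per H100-hour)
nanochat_cost = 73
implied_node_rate = nanochat_cost / 3.04

# Overall compression and the implied average annual rate over seven years
compression = gpt2_cost / nanochat_cost   # ~589x, i.e. "about 600x"
annual_factor = compression ** (1 / 7)    # ~2.49x per year

print(f"GPT-2 cost:    ${gpt2_cost:,}")
print(f"node rate:     ${implied_node_rate:.2f}/hr")
print(f"compression:   {compression:.0f}x")
print(f"annual factor: {annual_factor:.2f}x")
```

Note that the ~2.5×/year rate is a geometric average between two endpoints; as the Unknowns below flag, it does not by itself imply a steady year-over-year decline.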
Unknowns
- What exactly is the CORE metric (definition, tasks, aggregation), and is it the same evaluation setup for both the GPT-2 baseline and the nanochat result?
- Are the GPT-2 compute/time/cost numbers directly sourced from a reproducible log or a consistent accounting framework, or are they rough retrospective estimates?
- What are the exact cost assumptions behind the '$73 on a single 8xH100 node' figure (pricing basis, utilization, included/excluded overheads)?
- What specific changes (e.g., training recipe, architecture, data pipeline) in modded-nanogpt/nanochat are responsible for the speed/cost improvement, and which are necessary versus incidental?
- Are there intermediate datapoints across the seven-year span that support the implied ~2.5×/year decline rate, or is this a two-point extrapolation?