Rosa Del Mar

Daily Brief

Issue 1 • 2025-07-07 • 7 min read
General
Sources: 1 • Confidence: Medium • Updated: 2026-02-06 16:59

Key takeaways

  • Weka’s augmented-memory approach claims to extend DRAM-class memory to GPUs over the compute network, creating a network-accessible DRAM pool larger than local motherboard DRAM.
  • Repeated prefill to rebuild the KV cache is a major source of inference waste and slowness; the ideal is a single prefill followed by indefinite decode.
  • Disaggregated prefill-and-decode inference is mostly a 2025 production phenomenon despite earlier research papers.
  • Providers typically preserve prefix/context caches for about five to fifteen minutes before the cache is lost.
  • The inference-serving ecosystem has shifted significant momentum toward open-source vLLM and alternatives such as SGLang, moving beyond NVIDIA’s Triton and TensorRT-LLM.

Sections

Network as Expansion Plane and NVMe/DRAM/HBM Tiering

The corpus argues that in GPU clusters the compute network can function as the main expansion plane and that NVMe/DRAM/HBM tiering can extend KV-cache residency to reduce re-prefill. It also contains vendor-specific claims about using the compute network and local NVMe to expose a larger memory tier to GPUs, plus a stated third-party validation by a cloud provider. The operational benefits (continuous decode, better batching/latency) are asserted as outcomes but remain to be empirically established within this corpus.

  • Weka’s augmented-memory approach claims to extend DRAM-class memory to GPUs over the compute network, creating a network-accessible DRAM pool larger than local motherboard DRAM.
  • Oracle Cloud Infrastructure publicly benchmarked and validated Weka’s augmented-memory results according to Weka and OCI-published materials.
  • Weka claims that by leveraging the compute network rather than the storage network, it can deliver an effectively limitless KV cache, enabling very large context windows outside of Google-like infrastructures.
  • In Weka’s converged mode, installing Weka software on GPU servers aggregates their local NVMe drives and exposes them as a DRAM-class software-defined memory tier to inference stacks.
  • If inference can be shifted toward continuous decode, providers can increase aggregate tokens per second, run larger batch sizes, support more users, and reduce time-to-first and time-to-last token.
  • On modern GPU servers, transfers between HBM and DRAM take on the order of microseconds, narrowing the effective latency gap versus NVMe, which is itself microsecond-class in isolation (a minimal tiering sketch follows this list).
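
To make the tiering idea above concrete, here is a minimal sketch of a KV-cache manager that demotes least-recently-used blocks from HBM to DRAM to NVMe instead of dropping them, so a later turn can resume decode without re-prefill. All names and capacities are illustrative assumptions; this is not Weka’s or any vendor’s actual implementation.

```python
# Minimal sketch of HBM -> DRAM -> NVMe KV-cache tiering (hypothetical names
# and capacities; not Weka's or any vendor's actual implementation).
from collections import OrderedDict

TIERS = ["hbm", "dram", "nvme"]  # fastest to slowest

class TieredKVCache:
    """Demotes least-recently-used KV blocks down the tier list instead of
    evicting them, so a later request can reuse them without re-prefill."""

    def __init__(self, capacity_blocks):
        # capacity_blocks: e.g. {"hbm": 4, "dram": 16, "nvme": 256}
        self.capacity = capacity_blocks
        self.tiers = {t: OrderedDict() for t in TIERS}

    def put(self, session_id, kv_block):
        self._insert("hbm", session_id, kv_block)

    def get(self, session_id):
        # Look up the block in any tier; promote it back to HBM on a hit.
        for tier in TIERS:
            if session_id in self.tiers[tier]:
                block = self.tiers[tier].pop(session_id)
                self._insert("hbm", session_id, block)
                return block  # cache hit: decode continues, no re-prefill
        return None  # cache miss: the session must be re-prefilled

    def _insert(self, tier, session_id, block):
        store = self.tiers[tier]
        store[session_id] = block
        store.move_to_end(session_id)
        if len(store) > self.capacity[tier]:
            victim_id, victim = store.popitem(last=False)  # LRU victim
            nxt = TIERS.index(tier) + 1
            if nxt < len(TIERS):
                self._insert(TIERS[nxt], victim_id, victim)  # demote, don't drop
            # else: fell off the last tier; next access triggers re-prefill

cache = TieredKVCache({"hbm": 4, "dram": 16, "nvme": 256})
cache.put("session-1", kv_block=b"...")       # placeholder for real KV tensors
print(cache.get("session-1") is not None)     # True: served without re-prefill
```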

Prefill/Decode Bifurcation and KV Cache as Core Serving Constraint

The corpus decomposes transformer inference into a compute-bound prefill phase and a memory-bandwidth-bound decode phase and explains that prefill builds large KV working sets. It also states that multi-turn/agentic usage increases KV-cache pressure and that repeated re-prefill is a major efficiency loss. These deltas jointly reframe inference scaling as primarily a memory hierarchy and cache persistence problem, not only a FLOPs problem.

  • Repeated prefill to rebuild the KV cache is a major source of inference waste and slowness; the ideal is a single prefill followed by indefinite decode.
  • Agentic and multi-turn high-context workloads are becoming a dominant inference pattern in 2025 and they increase KV cache pressure due to repeated reuse of large contexts.
  • In transformer inference, prefill is compute-bound while decode is primarily memory-bandwidth-bound.
  • Prefill constructs the KV cache by adding high-dimensional context to each token, expanding token data into gigabytes of working state that must remain resident in fast memory (a sizing sketch follows this list).
  • Modern GPU inference software manages three memory tiers—SRAM, on-package HBM, and shared system DRAM—and uses KV cache managers to reduce KV eviction and re-prefill events.
  • In agentic multi-turn inference, the KV cache grows rapidly because the full session context repeatedly approaches the model’s context window.
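
A back-of-envelope calculation shows how prefill turns token data into gigabytes of resident state. The model dimensions below are illustrative assumptions (roughly a 70B-class model with grouped-query attention), not figures from the corpus.

```python
# Back-of-envelope KV-cache sizing for one session (illustrative model
# dimensions, not figures from the corpus).

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Size of the key/value tensors a decoder must keep resident:
    2 (K and V) * layers * kv_heads * head_dim * tokens * bytes per element."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

# Assumed 70B-class configuration with grouped-query attention (8 KV heads).
size = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=128_000)
print(f"{size / 2**30:.1f} GiB per 128k-token session")  # ~39.1 GiB at fp16
```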

Disaggregated Prefill/Decode as 2025 Operational Shift

The corpus claims disaggregated prefill/decode is becoming operational in 2025 for transformer workloads and provides the mechanism (separate scaling/optimization of compute-heavy and memory-heavy stages). It also states a potential operational consequence: mixing GPU generations across stages. The expected performance/cost outcomes are stated but not quantified in this corpus.

  • Disaggregated prefill-and-decode inference is mostly a 2025 production phenomenon despite earlier research papers.
  • Disaggregated prefill-and-decode applies primarily to transformer-based models rather than diffusion-based image/video models.
  • Disaggregated prefill-and-decode decouples the compute-intensive prefill phase from the memory-intensive decode phase so each can be scaled and optimized independently.
  • With disaggregated prefill-and-decode, operators can mix GPU generations by using newer GPUs for prefill and older GPUs for decode to extend the useful life of prior accelerators.
  • Disaggregated prefill-and-decode is expected to materially increase token throughput, improve memory utilization, and lower the cost per token for inference serving.
  • Disaggregated prefill-and-decode introduces an assembly-line workflow by streaming work through specialized prefill and decode stages instead of performing both on the same resources (a toy routing sketch follows this list).
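
The assembly-line idea can be sketched as two worker pools connected by a hand-off queue: a compute-bound prefill stage that builds KV state and a memory-bandwidth-bound decode stage that streams tokens. The structure and names below are hypothetical, not any production serving stack.

```python
# Toy sketch of disaggregated prefill/decode: requests flow through a prefill
# stage (compute-bound, builds the KV cache) into a decode stage (memory-
# bandwidth-bound, streams tokens). Hypothetical structure, not a real stack.
from queue import Queue
from threading import Thread

def prefill_worker(requests: Queue, handoff: Queue):
    # Newer, compute-dense GPUs would sit here in a mixed-generation deployment.
    while (req := requests.get()) is not None:
        kv_cache = f"kv({req['prompt']})"       # stand-in for the real KV tensors
        handoff.put({**req, "kv": kv_cache})    # ship KV state to the decode tier
    handoff.put(None)

def decode_worker(handoff: Queue, results: Queue):
    # Older, bandwidth-adequate GPUs could serve this stage.
    while (job := handoff.get()) is not None:
        results.put((job["id"], f"decoded with {job['kv']}"))
    results.put(None)

requests, handoff, results = Queue(), Queue(), Queue()
Thread(target=prefill_worker, args=(requests, handoff)).start()
Thread(target=decode_worker, args=(handoff, results)).start()

for i, prompt in enumerate(["summarize...", "translate..."]):
    requests.put({"id": i, "prompt": prompt})
requests.put(None)                              # signal end of stream

while (out := results.get()) is not None:
    print(out)
```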

Token Pricing and Cache Policies Reflect Bottlenecks

The corpus describes a three-part token pricing model and the recent commercial introduction of cached-token pricing with large discounts, alongside short cache retention windows. These deltas imply that prefill avoidance is valuable but operationally constrained, and that pricing is being used to steer workload patterns toward reuse where the provider can support it.

  • Providers typically preserve prefix/context caches for about five to fifteen minutes before the cache is lost.
  • Cached input token pricing commonly provides about a 75% discount versus standard input token pricing.
  • Major model providers price inference with separate rates for input tokens, cached input tokens, and output tokens (a worked cost example follows this list).
  • Pricing for context/prefix caching was introduced roughly six to nine months ago to avoid repeatedly reprocessing the same input.
  • DeepSeek publicly introduced cached-input pricing and later disclosed implementation details in an open-source disclosure series.
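
A worked example shows how the three-part pricing model and a 75% cached-input discount change the cost of a long-context request. The per-million-token rates are assumptions for illustration, not any provider’s actual prices; only the 75% discount mirrors the corpus.

```python
# Illustrative token-cost calculation with separate input, cached-input, and
# output rates. The 75% cached-input discount mirrors the corpus's description;
# the dollar rates are assumptions, not real prices.

def request_cost(input_tokens, output_tokens, cache_hit_fraction,
                 input_rate=3.00, output_rate=15.00, cache_discount=0.75):
    """Cost in dollars for one request, with rates in $ per million tokens."""
    cached_rate = input_rate * (1 - cache_discount)   # cached input is 75% cheaper
    cached = input_tokens * cache_hit_fraction
    fresh = input_tokens - cached
    return (fresh * input_rate + cached * cached_rate
            + output_tokens * output_rate) / 1_000_000

# A 100k-token agentic context with 1k output tokens, recomputed vs. reused:
cold = request_cost(100_000, 1_000, cache_hit_fraction=0.0)   # full re-prefill
warm = request_cost(100_000, 1_000, cache_hit_fraction=0.95)  # prefix mostly cached
print(f"cold ${cold:.3f}  warm ${warm:.3f}")  # cache reuse cuts the input bill ~71%
```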

Metrics and Open-Source Standardization of Serving

The corpus reports momentum toward open-source inference servers (especially vLLM) and adoption of an inefficiency metric (XPYD) originating with DeepSeek. These deltas suggest increasing standardization in how inference systems are discussed and compared, with potential downstream effects on what gets optimized (re-prefill frequency) and where differentiation can persist.

  • The inference-serving ecosystem has shifted significant momentum toward open-source vLLM and alternatives such as SGLang, moving beyond NVIDIA’s Triton and TensorRT-LLM.
  • DeepSeek proposed the XPYD metric to describe how many times a session must be re-prefilled, and vLLM community discussions have adopted this framing.
  • The open-source community, notably vLLM, has adopted DeepSeek’s XPYD framing for discussing inference scaling via the prefill-to-decode relationship (a minimal tracking sketch follows this list).
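
The corpus frames XPYD around how often a session must be re-prefilled; the hypothetical tracker below illustrates that re-prefill-frequency framing only. It is a sketch, not vLLM’s or DeepSeek’s actual metric or tooling.

```python
# Hypothetical tracker for re-prefill frequency per session, illustrating the
# re-prefill-counting framing the corpus attributes to XPYD. Purely a sketch;
# not vLLM's or DeepSeek's actual implementation.
from collections import defaultdict

class RePrefillTracker:
    def __init__(self):
        self.prefills = defaultdict(int)   # session_id -> total prefill events
        self.decodes = defaultdict(int)    # session_id -> decode turns served

    def record_turn(self, session_id, cache_hit):
        if not cache_hit:
            self.prefills[session_id] += 1  # cache lost: context rebuilt from scratch
        self.decodes[session_id] += 1

    def re_prefill_rate(self, session_id):
        """Prefills beyond the first, per decode turn; 0.0 is the single-prefill ideal."""
        extra = max(self.prefills[session_id] - 1, 0)
        return extra / max(self.decodes[session_id], 1)

tracker = RePrefillTracker()
for hit in [False, True, True, False, True]:   # two cache misses over five turns
    tracker.record_turn("session-1", cache_hit=hit)
print(tracker.re_prefill_rate("session-1"))     # 0.2: one avoidable re-prefill per 5 turns
```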

Watchlist

  • Which AI provider first defines and leads a new, lower pricing class is an open competitive question; the answer will determine who leads and who follows in the next pricing regime.

Unknowns

  • What empirical evidence supports the claimed 80/20 inference-versus-training spend split, and how does it vary by provider/workload?
  • How much token-throughput improvement and cost-per-token reduction does disaggregated prefill/decode deliver at fixed latency SLOs in production?
  • What are the dominant technical causes of short provider cache retention windows (five to fifteen minutes), and what changes would be required to extend them materially?
  • To what extent can NVMe act as a practical KV-cache tier without unacceptable tail-latency or throughput regressions for interactive inference?
  • What is the reproducible benchmark methodology and workload profile behind the claimed cloud validation of augmented-memory results?

Investor overlay

Read-throughs

  • Inference cost and latency may be increasingly constrained by memory hierarchy and KV cache persistence rather than FLOPs, shifting value toward vendors that extend effective memory and reduce repeated prefill waste.
  • Disaggregated prefill and decode could become a 2025 production architecture, enabling separate optimization and scaling for compute-heavy versus memory-bandwidth-heavy stages and potentially allowing mixed GPU generations per stage.
  • Open source inference serving software, especially vLLM, may be standardizing serving stacks and metrics, reducing differentiation from proprietary serving layers and shifting competition to infrastructure, cache policy, and pricing.

What would confirm

  • Production disclosures show measurable token throughput gains and lower cost per token at fixed latency targets after adopting disaggregated prefill and decode, including reduced re-prefill rates and longer effective cache reuse.
  • Providers extend prefix or context cache retention windows materially beyond five to fifteen minutes, or increase cached token discounting, indicating operational confidence in cache persistence and reuse economics.
  • Independent, reproducible benchmarks validate augmented-memory or tiered NVMe/DRAM/HBM approaches for KV cache with acceptable tail latency, showing sustained decode performance and reduced prefill repetition.

What would kill

  • Real world deployments show disaggregated prefill and decode fails to improve cost per token or throughput once latency SLOs and tail latency are enforced, or introduces operational complexity that outweighs benefits.
  • KV cache persistence remains constrained to short retention windows with no practical path to extension, keeping repeated prefill a dominant cost and preventing continuous decode style serving.
  • Benchmarks or cloud validations of augmented memory and NVMe tiering prove non-reproducible or show unacceptable tail-latency and throughput regressions, limiting usefulness for interactive inference.

Sources

  1. 2025-07-07 fabricatedknowledge.com