Product Packaging And Access Modes (Open-Weights Vs Hosted)

Issue 37 • Edition 2026-02-06 • 5 min read

General

Sources: 1 • Confidence: Medium • Updated: 2026-02-06 16:59

Key takeaways

Mistral released Voxtral Transcribe 2, a family of two new audio-to-text transcription models, as a sequel to the original Voxtral from July 2025.
The Mistral transcription API supports diarization, context biasing, and segment-level timestamp granularities via request parameters.
Voxtral transcription via the Mistral API is priced at $0.003 per minute ($0.18 per hour).
In a live demo, Voxtral Realtime transcribed fast speech containing technical jargon (e.g., Django and WebAssembly) within moments of each utterance.
The open-weights model in the release is Voxtral Realtime (Voxtral-Mini-4B-Realtime-2602) and it is available under an Apache-2.0 license.

Sections

Product Packaging And Access Modes (Open-Weights Vs Hosted)

The release is presented as a two-model family, split between an open-weights realtime model under a permissive license and a closed-weights model available only through a hosted API. The practical difference is that a developer can choose between self-deployable weights and a managed endpoint within the same product line.

Mistral released Voxtral Transcribe 2, a family of two new audio-to-text transcription models, as a sequel to the original Voxtral from July 2025.
The open-weights model in the release is Voxtral Realtime (Voxtral-Mini-4B-Realtime-2602) and it is available under an Apache-2.0 license.
The closed-weights model is called voxtral-mini-latest and it is accessed via the Mistral API audio transcription endpoint.

API-Level Controllability For Transcript Structure And Accuracy Shaping

The API exposes request-parameter controls for diarization, context biasing, and timestamp granularity, and the console playground returns multiple export formats (text, SRT, JSON). This indicates a focus on integration workflows where transcript structure and downstream usability matter, not just raw text output.

The Mistral transcription API supports diarization, context biasing, and segment-level timestamp granularities via request parameters.
The Mistral API console includes a speech-to-text playground that can upload audio and return diarized transcripts with downloads in text, SRT, or JSON.
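
As a rough illustration of what "via request parameters" could look like in practice, the sketch below assembles form fields for a transcription request without sending anything. The field names (diarize, context_bias, timestamp_granularities), the endpoint URL, and the helper name build_transcription_fields are assumptions for illustration; this digest does not list the exact parameter names, so check Mistral's API reference before relying on them.

```python
# Assumed endpoint for illustration only; verify against the official docs.
ASSUMED_ENDPOINT = "https://api.mistral.ai/v1/audio/transcriptions"


def build_transcription_fields(model="voxtral-mini-latest", diarize=True,
                               context_terms=(), timestamp_granularity="segment"):
    """Build form fields for a hypothetical transcription request.

    All field names below are assumed, not confirmed by the source digest.
    """
    fields = {"model": model}
    if diarize:
        # Ask for speaker-attributed output (assumed field name).
        fields["diarize"] = "true"
    if context_terms:
        # Bias recognition toward domain vocabulary (assumed field name
        # and comma-separated encoding).
        fields["context_bias"] = ",".join(context_terms)
    if timestamp_granularity:
        # Request segment-level timestamps (assumed field name).
        fields["timestamp_granularities"] = timestamp_granularity
    return fields


if __name__ == "__main__":
    # Jargon terms taken from the demo described in this digest.
    print(ASSUMED_ENDPOINT)
    print(build_transcription_fields(context_terms=["Django", "WebAssembly"]))
```

Under these assumptions, the fields would be sent as a multipart form alongside the audio file and an API key. Keeping request construction separate from the network call makes it easy to adjust names once the real parameters are confirmed.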

Pricing Disclosure For Hosted Transcription

A concrete per-minute price, with its per-hour equivalent, is provided for the hosted API transcription option. This matters because it enables immediate unit-cost modeling, even though the source does not specify additional fees, tiers, or limits.

Voxtral transcription via the Mistral API is priced at $0.003 per minute ($0.18 per hour).
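
The unit-cost modeling mentioned above can be sketched in a few lines. The per-minute rate comes from the source; the flat-billing assumption (no rounding, minimum charges, or feature surcharges) and the function name transcription_cost_usd are illustrative, since the source does not describe billing mechanics.

```python
from decimal import Decimal

# Published hosted rate from the source: $0.003 per minute of audio.
PRICE_PER_MINUTE_USD = Decimal("0.003")


def transcription_cost_usd(audio_minutes):
    """Estimate hosted-API cost in USD for a given audio duration in minutes.

    Assumes flat per-minute billing with no rounding, minimums, or
    surcharges; none of these details are confirmed by the source.
    """
    return PRICE_PER_MINUTE_USD * Decimal(str(audio_minutes))


if __name__ == "__main__":
    # One hour of audio, matching the per-hour figure quoted in the source.
    print(transcription_cost_usd(60))
    # A hypothetical 1,000-hour archive, for scale.
    print(transcription_cost_usd(60 * 1000))
```

Sixty minutes at $0.003 reproduces the quoted $0.18 per hour, and a hypothetical 1,000-hour workload comes to $180 under the same flat-rate assumption. Decimal is used so that the totals are not distorted by binary floating-point rounding.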

Performance Expectations Based On A Demo Observation

The only performance-related evidence is a qualitative report of a live demo, which indicates low apparent latency and adequate handling of technical jargon. Because it is not a benchmark and includes no quantitative metrics, it supports only a cautious update to expectations and does not establish measured performance.

In a live demo, Voxtral Realtime transcribed fast speech containing technical jargon (e.g., Django and WebAssembly) within moments of each utterance.

Unknowns

What are the measured transcription accuracy metrics (e.g., word error rate) and latency for Voxtral Realtime and voxtral-mini-latest across realistic audio conditions (noise, accents, overlapping speech) and hardware configurations?
What operational constraints apply to the Mistral transcription API (rate limits, maximum file length, streaming vs batch behavior, concurrency caps, and uptime/SLA terms)?
Does the $0.003/min price change with enabled features such as diarization, timestamps, or context biasing, and are there separate charges for output formats or storage in the console workflow?
What are the resource requirements and real-time performance characteristics for self-hosting Voxtral Realtime (e.g., memory footprint, compute needs, and achievable throughput)?
What is the functional or quality difference between the open-weights Voxtral Realtime model and the hosted voxtral-mini-latest model (accuracy, latency, language coverage, supported features)?
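
For readers unfamiliar with the accuracy metric named in the first unknown, word error rate is conventionally the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal version under simplifying assumptions: lowercasing and whitespace splitting stand in for the fuller text normalization that real evaluations apply, and the example sentences are invented, not results from any Voxtral model.

```python
def word_error_rate(reference, hypothesis):
    """Return word-level edit distance divided by the reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Levenshtein distance over words, keeping one previous row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Invented example: one misrecognized jargon word out of four.
    print(word_error_rate("deploy the django app", "deploy the jango app"))
```

A published benchmark would report this figure per model and per audio condition, which is the kind of evidence the unknowns above are asking for.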

Investor overlay

Read-throughs

Mistral is packaging speech-to-text as both open weights and a hosted API, suggesting a land-and-expand path from self-hosting to managed usage, with monetization via per-minute pricing plus ecosystem adoption via the Apache-2.0 release.
API controls for diarization, context biasing, and timestamps plus export formats suggest a push toward enterprise workflow integration, which could support higher retention and usage intensity if transcripts slot cleanly into downstream products.
A demo showing low apparent latency, together with an announced price point, may be aimed at competing on real-time capability and predictable unit economics, potentially pressuring or influencing pricing and feature expectations across transcription providers.

What would confirm

Published benchmarks for Voxtral Realtime and the hosted voxtral-mini-latest, including word error rate and latency across noisy audio, accents, and overlapping speech, plus hardware notes on self-hosting throughput.
Clear operational terms for the API such as rate limits, maximum file length, streaming behavior, concurrency caps, and SLA uptime, enabling realistic deployment and spend forecasting.
Pricing clarity on whether diarization, timestamps, context biasing, exports, or console storage change the $0.003 per minute rate, and evidence of meaningful adoption such as growing developer usage or integrations.

What would kill

Accuracy or latency metrics that are materially uncompetitive in realistic conditions, especially for overlapping speech or accents, undermining the real-time positioning implied by the demo.
Restrictive API constraints or unreliable uptime such as tight rate limits, low concurrency, short max durations, or lack of SLA that block production usage despite attractive pricing.
Meaningful price inflation or feature gating where common options like diarization or timestamps add significant cost, making unit economics less attractive than the headline $0.003 per minute.

Sources

Voxtral transcribes at the speed of sound

2026-02-04 simonwillison.net