llama.cpp b9180 adds MTP speculative decoding support
ggml-org's May 16 llama.cpp b9180 release adds MTP-based speculative decoding support and updates rollback handling for some recurrent-state paths. Performance impact remains deployment-specific and has not been independently reproduced in the cited sources.
llama.cpp release b9180, published by ggml-org on May 16, adds support for Multi-Token Prediction (MTP) in speculative decoding.
The ggml-org release notes center the release on PR #22673 and related changes. They say b9180 adds MTP support, standardizes model identification around mtp-, uses draft-mtp naming for the draft path, and updates conversion and compatibility handling.
The release also changes rollback and checkpoint behavior for some recurrent-state paths. According to the notes, speculative checkpointing previously restarted from a checkpoint whenever draft tokens were rejected. Release b9180 adds partial rollback support for GDN (Gated DeltaNet) models by storing intermediates, with Metal and Vulkan work referenced in the release package.
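The difference between the two rollback strategies can be illustrated with a toy model. The sketch below is not llama.cpp code: the step function, the integer "state," and both helper names are hypothetical, chosen only to show why keeping intermediate states removes the recomputation that a checkpoint restart requires.

```python
from typing import List, Tuple

# Toy illustration only. The "recurrent state" here is a single integer and
# step() is a made-up update rule; neither reflects llama.cpp internals.


def step(state: int, token: int) -> int:
    """Hypothetical recurrent update: fold one token into the state."""
    return (state * 31 + token) % 1_000_003


def checkpoint_restart(checkpoint: int, draft: List[int], n_accepted: int) -> Tuple[int, int]:
    """Restore the checkpoint, then recompute the accepted prefix.

    Returns (state, extra_steps), where extra_steps counts the recurrent
    updates that had to be redone after the rejection.
    """
    state = checkpoint
    extra_steps = 0
    for token in draft[:n_accepted]:
        state = step(state, token)
        extra_steps += 1
    return state, extra_steps


def partial_rollback(checkpoint: int, draft: List[int], n_accepted: int) -> Tuple[int, int]:
    """Keep every intermediate state from the verification pass, then index.

    Returns (state, extra_steps). No updates are redone after the rejection,
    at the cost of holding len(draft) + 1 states in memory.
    """
    intermediates = [checkpoint]
    for token in draft:
        intermediates.append(step(intermediates[-1], token))
    return intermediates[n_accepted], 0


if __name__ == "__main__":
    draft_tokens = [11, 22, 33, 44]
    # Suppose verification accepts the first 3 draft tokens and rejects the 4th.
    print(checkpoint_restart(7, draft_tokens, 3))
    print(partial_rollback(7, draft_tokens, 3))
```

Both helpers arrive at the same state. In this toy, the restart path repeats one update per accepted token, while the rollback path trades that work for extra memory, which is one reason memory budget appears among the deployment variables.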
For local and edge inference operators using llama.cpp, the practical change is narrow. MTP is now a supported speculative decoding path for models with MTP heads. In some supported recurrent-state scenarios, a rejected draft may no longer force recomputation from the last checkpoint.
The caveat remains material. Performance claims around the release come from maintainer or contributor benchmarks and discussion, not independent reproduction in the cited sources. Real impact will vary by model architecture, backend, hardware, memory budget, quantization, draft-token count, and speculative decoding acceptance behavior.
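To see why draft-token count and acceptance behavior matter so much, consider a back-of-envelope estimate. The sketch below uses the standard geometric-sum approximation from the speculative decoding literature and assumes every draft token is accepted independently with the same probability. That assumption is a simplification, the function name is hypothetical, and none of the numbers come from the release notes or the PR.

```python
def expected_tokens_per_pass(acceptance_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per verification pass.

    Assumes each draft token is accepted independently with probability
    acceptance_rate (a simplifying assumption). The pass yields the accepted
    prefix plus one token from the target model, which gives the sum
    1 + a + a^2 + ... + a^draft_len.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be between 0 and 1")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if acceptance_rate == 1.0:
        # Every draft token is accepted, plus the one extra target token.
        return float(draft_len + 1)
    return (1.0 - acceptance_rate ** (draft_len + 1)) / (1.0 - acceptance_rate)


if __name__ == "__main__":
    for rate in (0.5, 0.8, 0.95):
        for k in (1, 2, 4):
            tokens = expected_tokens_per_pass(rate, k)
            print(f"acceptance={rate:.2f} draft_len={k} tokens_per_pass={tokens:.2f}")
```

This counts tokens per verification pass, not wall-clock speedup. It ignores the cost of drafting, verification overhead, and any rollback work, all of which depend on the backend and hardware.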
PR #22673 also lists remaining work after merge, including MTP compatibility with ngram drafting, recurrent-state CI tests, --spec-draft-p-min behavior for MTP, larger batch and recurrent-state sequence cases, multi-sequence recurrent-memory performance, embedding transfer overhead, and continuing Metal drafting improvements.
Region: Global / open-source ecosystem.
Primary desk: Open Source & Local AI.
Confidence: Good for the release event, date, and technical change list based on primary GitHub sources. Performance and operator-impact framing should remain caveated.
Sources
Primary sources:
- ggml-org / llama.cpp release b9180: https://github.com/ggml-org/llama.cpp/releases/tag/b9180
- PR #22673, MTP support: https://github.com/ggml-org/llama.cpp/pull/22673
- PR #22400: https://github.com/ggml-org/llama.cpp/pull/22400
- Referenced commit on partial rollback / Vulkan path: https://github.com/ggml-org/llama.cpp/commit/8c05923630110223669f069af2000e9cf10c02bc
Known uncertainties / caveats
- Performance claims are maintainer or contributor-provided and have not been independently reproduced in the cited sources.
- Deployment impact depends on model architecture, backend, hardware, memory budget, quantization, draft-token count, and acceptance behavior.
- The PR lists remaining TODOs after merge; the release should not be framed as universal speedup or production readiness.
- No France/EU-specific angle is supported by the source record.