fix(server): prompt-cache bleed fixes — MambaCache gate + ndim guard + spec-decode ordering #85
ericjlake wants to merge 3 commits into SharpAI:main
Conversation
Three fixes, now riding on upstream 116ee91:

1. `save()`: slice `KVCacheSimple` state T-dim down to P = tokens.count so the cached state's T matches `cached.tokens.count`. Prevents the over-allocated prefill buffer from carrying uninitialized tokens past the valid prefix.
2. `restore()`: gate out recurrent-state layers (MambaCache and friends) up front. Their state is 2-D with no T dimension, so the `dim(2)` read in the pre-flight check would crash; there is also no `trim(excess)` operator for a recurrent hidden state — we can't partial-restore one safely. Guard with `ndim >= 3` inside the min-length scan too, for belt-and-suspenders.
3. `handleChatCompletion()`: reorder the decision branch so speculative decoding is checked BEFORE the prompt-cache restore. A cache-hit rollback corrupts the draft model's KV state (draft and main cycle tokens in lock-step), so when `draftModelRef` is set we bypass the cache entirely and pay the full prefill. Partial-match restores stay available on the non-spec path, where they still pay off.
Adds a new Performance subsection covering full-RAM Qwen3.6-35B-A3B-UD-MLX-4bit inference on M1 Ultra 64 GB:

- Vanilla full-GPU (62 tok/s) — post `needsMoeFlush` gate (SwiftLM SharpAI#84)
- DFlash spec decode with z-lab/Qwen3.6-35B-A3B-DFlash (+13% medium/long, -15% short due to block overhead; `finish_reason` behavior changes)

Includes a 19 → 62 tok/s before/after reference for the gate fix.
Hey Eric — this has merge conflicts against current `main`. We're going to merge it once the conflicts are resolved.

Also — the companion PR (mlx-swift-lm#34) is merged! ✅
Merges ericjlake's prompt-cache fixes from PR SharpAI#85, resolving conflicts with the DFlash integration (PR SharpAI#78).

Changes from ericjlake:
- MambaCache safety gate + KVCacheSimple T-dim slice in `save()`
- `ndim >= 3` guard in the `minCachedSeqLen` scan
- Spec-decode short-circuit ordering (check before cache restore)
- README: Qwen3-A3B full-RAM perf table (M1 Ultra 64 GB)

Conflict resolution:
- README.md: kept both the Qwen3-A3B and DeepSeek-V4 perf tables
- `Server.swift` `save()`: kept the existing MambaCache early return + new T-dim slice
- `Server.swift` decision branch: combined spec-decode-first + `skipPromptCache` (kvBits)

Closes SharpAI#84.

Co-authored-by: Eric Lake <ericjlake@users.noreply.github.com>
Update: we tried to push the conflict resolution directly to your branch, but couldn't. Instead, we sent the resolution as a PR to your fork: ericjlake#1

Once you merge that, this PR will be conflict-free and we can land it here. All your original commits are preserved — just one additional merge commit on top.
Closes (part of) #84.
This PR is one of two paired patches landing the work from #84. The companion PR in `mlx-swift-lm` adds the `needsMoeFlush` gate that produces the headline 18 → 63 tok/s speedup on full-RAM Qwen3-A3B; this PR fixes three correctness bugs in SwiftLM's prompt-cache path that were silently regressing throughput and quality on chat-template replays. Bonus: a small README perf table addition (per @solderzzc's request in #84).

## What changed
### `Sources/SwiftLM/Server.swift` — three prompt-cache bleed fixes

#### 1. MambaCache safety gate in `save()`

Adds an early return when any layer in the cache is a `MambaCache`. Mamba's recurrent state can't be partially trimmed (unlike attention's offset-decrement), so the cache cannot be safely saved/restored for hybrid Attention+Mamba models. Without this guard, `cache.trim(N)` on a MambaCache layer hits an unrelated assertion path. Same direction as upstream `5553bf5` (disable prompt cache for MambaCache hybrid models) — this is the explicit guard at the save site.
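Roughly, the shape of the guard, as a simplified sketch with stand-in types rather than SwiftLM's actual cache classes:

```swift
// Illustrative sketch only: stand-in types, not SwiftLM's real cache classes.
protocol LayerCache {}
final class AttentionLayerCache: LayerCache {}   // trimmable: has a T (sequence) axis
final class RecurrentLayerCache: LayerCache {}   // Mamba-style: single hidden state, no T axis

struct PromptCacheEntry {
    var tokens: [Int]
    var layerStates: [LayerCache]
}

/// Refuse to save a prompt cache for hybrid Attention+Mamba models:
/// a recurrent hidden state cannot be partially trimmed on restore.
func savePromptCache(tokens: [Int], layers: [LayerCache]) -> PromptCacheEntry? {
    guard !layers.contains(where: { $0 is RecurrentLayerCache }) else {
        return nil   // early return: skip caching entirely
    }
    return PromptCacheEntry(tokens: tokens, layerStates: layers)
}
```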
#### 2. KVCacheSimple T-dim slice in `save()`

For attention `KVCacheSimple` layers, the state tensor is `[B, H, T, D]` with a pre-allocated `T` that can exceed the actual prompt length `P`. If we store the full over-sized buffer, `restore()`'s `trim(cached.tokens.count - matchLen)` still leaves `T - P` slots of garbage beyond the valid prefix. We now slice `T` down to `P` at save time so `cached.tokens.count` == the cached state's `T`.
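A simplified sketch of the slice, with one explicit T axis standing in for the full `[B, H, T, D]` tensor (the real change slices the MLX buffer itself):

```swift
// Illustrative sketch only: `KVState` stands in for a [B, H, T, D] key/value
// buffer with just the T axis made explicit.
struct KVState {
    var rows: [[Float]]              // one row per cached token position (the T axis)
    var seqLen: Int { rows.count }
}

/// Slice the over-allocated prefill buffer down to the valid prompt length P,
/// so the cached state's T always equals cached.tokens.count.
func sliceForSave(_ state: KVState, promptLength p: Int) -> KVState {
    precondition(p <= state.seqLen, "prompt length cannot exceed the buffer's T")
    return KVState(rows: Array(state.rows.prefix(p)))
}
```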
#### 3. `ndim >= 3` guard inside the `minCachedSeqLen` scan
4. Spec-decode short-circuit ordering in
process()The
hasDraftModelshort-circuit now runs before the prompt-cache restore decision. Reason: a partial-match cache restore corrupts the draft model's KV state, since draft and main cycle tokens in lock-step. Better to pay the full prefill than emit garbage. This is independent of #72's auto-cap (which addresses I/O fan-out for--stream-experts + --draft-model); this change is about correctness on the in-RAM spec-decode path.README.md— Performance subsection for full-RAM Qwen3-A3BAdds a new subsection to the Performance section with reproducible numbers across Vanilla and DFlash-spec-decode configurations on M1 Ultra 64 GB, mirroring the existing M5 Pro Gemma-4 table style. Per @solderzzc's suggestion in #84 — happy to iterate on layout/content.
## Hardware / repro
- `Qwen3.6-35B-A3B-UD-MLX-4bit` (full-GPU strategy, no `--stream-experts`)
- See `needsMoeFlush` gate + prompt-cache bleed fixes #84 for the full diagnosis and benchmark methodology.

## Test plan
- `Qwen3.6-35B-A3B-UD-MLX-4bit` on M1 Ultra 64 GB — 3.4× steady-state generation speedup (18 → 63 tok/s) with the companion `mlx-swift-lm` PR applied.
- Verified the prompt cache is bypassed when `--draft-model` is set.

## Companion PR
`SharpAI/mlx-swift-lm` PR with the `needsMoeFlush` gate: SharpAI/mlx-swift-lm#34