feat: integrate dynamic speculative decoding profiling mode and Qwen 3.6 35B matrix by solderzzc · Pull Request #55 · SharpAI/SwiftLM

solderzzc · 2026-04-16T18:11:06Z

Adds the Qwen 3.6 35B deep-dive benchmark matrix to the SwiftLM README and wires the speculative decoding validation mode into the automated profiling suite so that draft-model performance is captured across models.


Copilot

Pull request overview

Integrates a new “dynamic speculative decoding” profiling configuration into the existing automated profiling runner, and documents benchmark results for the newly added mlx-community/Qwen3.6-35B-A3B-4bit matrix in repo docs/README.

Changes:

Add draft-model auto-selection + auto-injected speculative decoding config to scripts/profiling/profile_runner.py.
Add mlx-community/Qwen3.6-35B-A3B-4bit to the interactive benchmark model list in run_benchmark.sh.
Publish Qwen3.6-35B benchmark tables in README.md and the profiling results markdown.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

| File | Description |
|---|---|
| scripts/profiling/profile_runner.py | Auto-injects a speculative decoding config by selecting a draft model based on the main model string. |
| run_benchmark.sh | Adds Qwen3.6-35B to the selectable benchmark models list and removes a duplicate entry. |
| docs/profiling/profiling_results_simbas-MacBook-Pro.md | Adds a new Qwen3.6-35B context/memory profiling section. |
| README.md | Adds a new Qwen3.6-35B performance section and headline benchmark table. |


+    # Speculative Decoding Mode Auto-Injection
+    draft_model = get_draft_model(args.model)
+    if draft_model:
+        draft_name = draft_model.split("/")[-1].split("-")[1] # extract roughly '0.8B' or '1b'
+        CONFIGS.append({"name": f"TurboQuant + Speculative ({draft_name})", "flags": ["--turbo-kv", "--draft-model", draft_model]})
+
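One thing worth checking in the hunk above: for the Gemma draft model ID that appears elsewhere in this diff, `mlx-community/gemma-3-1b-it-4bit`, the expression `split("/")[-1].split("-")[1]` evaluates to `"3"`, not `"1b"`, because the size token is not at a fixed position. A minimal sketch of a position-independent alternative follows. The helper name `draft_label` and the pattern `SIZE_TOKEN` are hypothetical and not part of the PR; the sketch assumes the parameter count appears as its own hyphen-separated token such as `1b` or `0.8B`.

```python
import re

# Hypothetical helper, not part of the PR: matches a standalone parameter-count
# token such as "1b", "0.8B" or "35B", but not quantization tags like "4bit".
SIZE_TOKEN = re.compile(r"\d+(?:\.\d+)?[bm]", re.IGNORECASE)


def draft_label(draft_model: str) -> str:
    """Return the first size-like token of the model basename.

    Falls back to the full basename when no token looks like a size,
    so the config name is never empty or misleading.
    """
    basename = draft_model.split("/")[-1]
    for token in basename.split("-"):
        if SIZE_TOKEN.fullmatch(token):
            return token
    return basename


if __name__ == "__main__":
    model_id = "mlx-community/gemma-3-1b-it-4bit"
    print("fixed index :", model_id.split("/")[-1].split("-")[1])
    print("token search:", draft_label(model_id))
```

The fallback to the basename is a design choice for this sketch: a longer but accurate config name seems preferable to a short label taken from the wrong token.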




+| Configuration | Context Size | TTFT | Generation Speed | Model Size | Active RAM (Physical) | GPU Memory Allocated |
+|---|---|---|---|---|---|---|
+| Dense/Vanilla | 512 | 4.01s | 32.10 tok/s | N/A | 18.9 GB | 33.6 GB |
+| Dense/Vanilla | 40000 | 26.41s | 23.99 tok/s | N/A | 49.4 GB | 64.2 GB |
+| Dense/Vanilla | 100000 | 151.76s | 18.64 tok/s | N/A | 49.3 GB | 63.9 GB |
+| SSD Stream | 512 | 1.81s | 15.01 tok/s | N/A | 4.5 GB | 18.8 GB |
+| SSD Stream | 40000 | 28.89s | 5.13 tok/s | N/A | 37.4 GB | 51.7 GB |
+| SSD Stream | 100000 | 100.72s | 4.08 tok/s | N/A | 49.4 GB | 63.9 GB |
+| TurboQuant | 512 | 0.44s | 33.14 tok/s | N/A | 18.9 GB | 33.3 GB |
+| TurboQuant | 40000 | 20.90s | 2.54 tok/s | N/A | 22.7 GB | 37.0 GB |
+| TurboQuant | 100000 | 60.30s | 4.73 tok/s | N/A | 27.7 GB | 42.0 GB |
+| SSD + TurboQuant | 512 | 1.64s | 14.51 tok/s | N/A | 4.5 GB | 19.3 GB |
+| SSD + TurboQuant | 40000 | 27.56s | 5.39 tok/s | N/A | 8.5 GB | 23.2 GB |
+| SSD + TurboQuant | 100000 | 75.59s | 3.86 tok/s | N/A | 13.6 GB | 28.3 GB |
+| SSD + 16-Worker Prefetch | 512 | 0.94s | 16.70 tok/s | N/A | 4.5 GB | 19.4 GB |
+| SSD + 16-Worker Prefetch | 40000 | 28.88s | 5.17 tok/s | N/A | 37.4 GB | 51.9 GB |
+| SSD + 16-Worker Prefetch | 100000 | 101.96s | 3.79 tok/s | N/A | 49.4 GB | 63.9 GB |
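The trade-off in the table above is easier to read as ratios against the Dense/Vanilla baseline. The sketch below copies three of the 100000-token rows from the table and computes GPU memory saved and relative generation speed. The dictionary layout and the function name `compare` are illustrative assumptions, not part of the profiling scripts.

```python
# (generation speed in tok/s, GPU memory allocated in GB) at context size 100000,
# copied from the table above.
ROWS = {
    "Dense/Vanilla": (18.64, 63.9),
    "TurboQuant": (4.73, 42.0),
    "SSD + TurboQuant": (3.86, 28.3),
}


def compare(config: str, baseline: str = "Dense/Vanilla") -> dict:
    """Summarize memory savings and speed of `config` relative to `baseline`."""
    speed, mem = ROWS[config]
    base_speed, base_mem = ROWS[baseline]
    return {
        "memory_saved_gb": round(base_mem - mem, 1),
        "memory_saved_pct": round(100 * (1 - mem / base_mem), 1),
        "speed_ratio": round(speed / base_speed, 2),
    }


if __name__ == "__main__":
    for name in ("TurboQuant", "SSD + TurboQuant"):
        print(name, compare(name))
```

On these rows, SSD + TurboQuant allocates roughly 56% less GPU memory than the dense baseline at 100000 tokens while generating at about a fifth of its speed.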


+    if "gemma" in m:
+        return "mlx-community/gemma-3-1b-it-4bit"
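The branch above suggests that `get_draft_model` matches a family substring against the main model string. A table-driven variant is sketched below under stated assumptions. Only the Gemma entry comes from the diff. The lowercasing step, the `DRAFT_MODELS` table, and the `None` fallback are assumptions about how the real function behaves, inferred from the `if draft_model:` check in the runner.

```python
from typing import Dict, Optional

# Family substring -> draft model ID. Only the Gemma entry is taken from the
# diff; further families would be added here.
DRAFT_MODELS: Dict[str, str] = {
    "gemma": "mlx-community/gemma-3-1b-it-4bit",
}


def get_draft_model(model: str, table: Optional[Dict[str, str]] = None) -> Optional[str]:
    """Return a draft model ID for `model`, or None when no family matches.

    Assumption: matching is case-insensitive on the full model string.
    """
    table = DRAFT_MODELS if table is None else table
    m = model.lower()
    for family, draft in table.items():
        if family in m:
            return draft
    return None


if __name__ == "__main__":
    # Hypothetical main-model ID, used only to exercise the lookup.
    print(get_draft_model("example-org/Gemma-Large-4bit"))
    print(get_draft_model("example-org/unrelated-model"))
```

Returning `None` for unknown families keeps the runner's existing `if draft_model:` guard meaningful, so the speculative config is simply skipped when no draft model is known.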




solderzzc and others added 2 commits April 16, 2026 11:09

feat: integrate dynamic speculative decoding profiling mode and Qwen …

eb2180f

…3.6 35B matrix

Merge branch 'main' into feature/qwen-profiling-metrics

1fba3ae

Copilot AI review requested due to automatic review settings April 26, 2026 07:06

Copilot started reviewing on behalf of solderzzc April 26, 2026 07:06

Copilot AI reviewed Apr 26, 2026

solderzzc wants to merge 2 commits into main from feature/qwen-profiling-metrics
