Shareable visual notes for the running Qwen3.6-35B-A3B homelab setup.
Captured 2026-04-26 22:34 CDT. Screencast framing: Qwen3.6-35B-A3B serving roughly 25 tok/s over a working set of more than 151k tokens while drawing under 100 W.
    llama-server -ngl 999 -fa on --no-mmap -t 16 -tb 32 \
        -m Qwen3.6-35B-A3B-UD-Q8_K_XL.gguf \
        -c 524288 -np 3 --kv-unified \
        --cache-type-k q8_0 --cache-type-v q8_0 \
        -b 4096 -ub 4096 --cache-ram 16384 \
        --cache-idle-slots --slot-prompt-similarity 0.8 \
        --mlock --reasoning on --reasoning-budget 65536
The recording shows the live llama-swap activity view, GPU power draw, and terminal output while the Strix Halo node serves the long-context run.
Model loading pipeline: SSD → RAM → GPU. Watch the difference between loading all weights up front (what --no-mmap selects in the command above) and paging them in on demand via memory mapping.
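The two loading strategies can be sketched with Python's standard library. This is only an illustration of eager reads versus memory mapping on a dummy file; the function names and the 1 MiB stand-in file are assumptions, and it does not model how llama.cpp itself stages weights to the GPU.

```python
import mmap
import os
import tempfile

def load_all_at_once(path):
    """Eager load: read the whole file into process memory up front."""
    with open(path, "rb") as f:
        return f.read()

def read_slice_via_mmap(path, offset, length):
    """Lazy load: map the file and touch only the requested byte range."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # Only the pages backing this slice need to be read from disk.
            return mapped[offset:offset + length]

def demo():
    # Hypothetical stand-in for a weights file; a real GGUF is far larger.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(bytes(range(256)) * 4096)  # 1 MiB of dummy data
        path = tmp.name
    try:
        eager = load_all_at_once(path)
        lazy = read_slice_via_mmap(path, 4096, 16)
        return len(eager), lazy == eager[4096:4112]
    finally:
        os.remove(path)
```

Both paths return the same bytes. The difference is when the I/O happens: the eager read pays for the whole file before anything runs, while the mapped read defers each page until it is first touched.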
How BF16 weights are compressed to int4. Compare symmetric quantization with a single scale against per-group asymmetric quantization, where each group derives its own scale and offset from its min and max.
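A minimal sketch of the two schemes, assuming plain round-to-nearest and using Python floats as a stand-in for BF16 values. The function names and the group size of 32 are illustrative choices, not taken from any particular library.

```python
def quantize_symmetric(weights, bits=4):
    """One scale for the whole tensor; levels are centered on zero."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for int4
    scale = max(abs(w) for w in weights) / qmax or 1.0
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return [c * scale for c in codes]               # dequantized values

def quantize_asymmetric(weights, bits=4, group_size=32):
    """Each group stores its own min and scale, so all levels span [min, max]."""
    steps = 2 ** bits - 1                           # 15 steps between 16 levels
    out = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / steps or 1.0
        codes = [round((w - lo) / scale) for w in group]
        out.extend(lo + c * scale for c in codes)
    return out

def max_abs_error(original, restored):
    return max(abs(x - y) for x, y in zip(original, restored))
```

The gap shows up on skewed data. For weights that all sit between 1.00 and 1.31, the symmetric scheme spends most of its 16 levels on values that never occur, while the asymmetric scheme spreads all 16 levels across the range actually used.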
256 weights organized into 8 groups of 32. Some groups use 4-bit (16 levels), others use 6-bit (64 levels). Hierarchical scaling preserves local outliers.
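One way to read the hierarchical scaling described above, as a rough sketch: every group gets its own step size, but that step is stored as a small 6-bit integer multiplied by a single float shared by the whole 256-weight super-block. The per-group bit widths, the symmetric rounding, and all names here are assumptions for illustration; this is not the exact storage layout of any llama.cpp quantization type.

```python
GROUPS, GROUP_SIZE = 8, 32          # 8 x 32 = 256 weights per super-block

def quantize_superblock(weights, bits_per_group):
    """Two-level scaling: integer codes, 6-bit group scales, one float super-scale."""
    assert len(weights) == GROUPS * GROUP_SIZE
    assert len(bits_per_group) == GROUPS
    groups = [weights[i:i + GROUP_SIZE] for i in range(0, len(weights), GROUP_SIZE)]
    qmax = [2 ** (b - 1) - 1 for b in bits_per_group]   # 7 for 4-bit, 31 for 6-bit
    # Level 1: the float step each group would ideally use.
    ideal = [max(abs(w) for w in g) / q for g, q in zip(groups, qmax)]
    # Level 2: store each step as a 6-bit integer (1..63) times one shared float.
    super_scale = max(ideal) / 63 or 1.0
    scale_codes = [min(63, max(1, round(s / super_scale))) for s in ideal]
    codes = []
    for g, q, sc in zip(groups, qmax, scale_codes):
        step = sc * super_scale
        codes.append([max(-q - 1, min(q, round(w / step))) for w in g])
    return super_scale, scale_codes, codes

def dequantize_superblock(super_scale, scale_codes, codes):
    """Rebuild approximate weights from the two-level representation."""
    return [c * sc * super_scale for sc, row in zip(scale_codes, codes) for c in row]
```

Because a large outlier only inflates the step of its own group, the other seven groups keep fine-grained steps. That is the sense in which local outliers are preserved without degrading their neighbors.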
Find the most important weights by activation magnitude. These salient weights are protected with higher precision during quantization to preserve model quality.
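The selection step can be sketched as follows, assuming a small calibration set of activation vectors and a weight matrix stored as rows of input-channel columns. Keeping the salient columns at their original precision is just one simple form of protection; activation-aware methods such as AWQ rescale channels instead. All names and the 1% default are illustrative assumptions.

```python
def find_salient_channels(activations, keep_fraction=0.01):
    """Rank input channels by mean absolute activation; return the top fraction."""
    n_channels = len(activations[0])
    importance = [
        sum(abs(sample[ch]) for sample in activations) / len(activations)
        for ch in range(n_channels)
    ]
    n_keep = max(1, round(n_channels * keep_fraction))
    ranked = sorted(range(n_channels), key=lambda ch: importance[ch], reverse=True)
    return sorted(ranked[:n_keep])

def quantize_with_protection(weight_rows, salient, bits=4):
    """Round-to-nearest per row, leaving salient input columns at full precision."""
    protected = set(salient)
    qmax = 2 ** (bits - 1) - 1
    out = []
    for row in weight_rows:
        scale = max(abs(w) for w in row) / qmax or 1.0
        out.append([
            w if col in protected
            else max(-qmax - 1, min(qmax, round(w / scale))) * scale
            for col, w in enumerate(row)
        ])
    return out
```

Note that importance is measured on the activations, not on the weights themselves: a small weight that multiplies a consistently large activation can matter more to the output than a large weight on a quiet channel.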