Trillim's Tokens
Benchmarking DarkNet against bitnet.cpp
March 9, 2026
A first engineering write-up on how DarkNet, the inference engine inside Trillim, compares with bitnet.cpp on consumer CPUs.
DarkNet is the inference engine, built from the ground up, that powers Trillim. Today, Trillim’s CPU inference path is centered on ternary BitNet checkpoints, including Llama-derived LlamaForCausalLM variants distilled into ternary form, so Microsoft’s bitnet.cpp is the natural baseline for comparison. This post answers a practical question: where does DarkNet actually move the needle on a real local setup?
What we wanted to measure
- Is DarkNet materially better for prefill on consumer CPUs?
- Is DarkNet’s decode actually better or is it just noise?
- How do the two engines compare across common runtime-quantization formats and thread counts?
Benchmark setup
The runs behind this comparison used a repeatable process instead of one-off screenshots:
- fresh system restart before benchmark sessions
- five warmup runs for both engines before recording results
- interleaved execution between engines to reduce time-drift bias
- cooldowns between runs until CPU temperature returned to 45°C
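To make the process concrete, here is a minimal sketch of the scheduling and cooldown logic. The helper names (`interleaved_schedule`, `wait_for_cooldown`, the `read_temp_c` callback) are ours for illustration, not part of either engine; the real harness additionally shells out to each engine's CLI for every scheduled run.

```python
import time
from typing import Callable, List, Tuple


def interleaved_schedule(engines: List[str], configs: List[str],
                         repeats: int) -> List[Tuple[str, str]]:
    """Alternate engines inside every (repeat, config) pair so slow thermal
    or clock drift affects both engines roughly equally, instead of biasing
    whichever engine happens to run last."""
    order: List[Tuple[str, str]] = []
    for _ in range(repeats):
        for cfg in configs:
            for eng in engines:
                order.append((eng, cfg))
    return order


def wait_for_cooldown(read_temp_c: Callable[[], float], target_c: float = 45.0,
                      poll_s: float = 5.0, timeout_s: float = 600.0) -> bool:
    """Poll CPU temperature until it drops back to the target, or give up
    after the timeout. `read_temp_c` would wrap a sensor read, e.g. from
    /sys/class/thermal on Linux."""
    deadline = time.monotonic() + timeout_s
    while True:
        if read_temp_c() <= target_c:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

Interleaving is the important part: running all of engine A's passes before engine B's would let slow ambient or boost-state drift masquerade as an engine difference.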
That does not make the results universal; it just makes them less sloppy. Even so, we still observed significant run-to-run variance, which we attribute to the following:
- Starting each run at 45°C does not guarantee the CPU is in the same performance state (system-level throttling can still vary)
- Background tasks run even after a fresh restart, and scheduler decisions can affect engine performance
- The memory system (bandwidth, caches) behaves differently across runs and is much harder to control consistently
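Given that variance, a simple spread statistic over the repeated runs helps decide whether a gap between engines is real or noise. A sketch (the function name is ours, not part of either engine):

```python
from statistics import mean, stdev
from typing import Dict, List


def run_spread(tokens_per_s: List[float]) -> Dict[str, float]:
    """Summarize repeated throughput measurements: mean, sample standard
    deviation, and coefficient of variation (stdev / mean). A CV of even a
    few percent on short decode runs is enough to flip which engine 'wins'
    in a single pass."""
    m = mean(tokens_per_s)
    s = stdev(tokens_per_s)
    return {"mean": m, "stdev": s, "cv": s / m}
```

If the inter-engine difference is smaller than a couple of CVs, we treat the comparison as a tie rather than a win.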
Benchmark results
These are the raw benchmark snapshots from two repeated passes on the same 12th Gen Intel i7-1255U setup. The decode charts isolate decode throughput at 10 threads, while the runtime-quantization tables show the broader comparison across thread counts for prefill throughput, decode throughput, energy per token, and watt draw.
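The derived metrics in those tables reduce to simple ratios over raw per-run counters. A sketch under assumed counter names (token count, wall-clock time, and package energy, e.g. from a RAPL-style counter; the field names are illustrative):

```python
from dataclasses import dataclass


@dataclass
class RunCounters:
    """Raw counters from one benchmark phase (field names are illustrative)."""
    n_tokens: int     # tokens ingested (prefill) or generated (decode)
    elapsed_s: float  # wall-clock time for that phase
    energy_j: float   # package energy over the phase, in joules


def throughput_tok_s(r: RunCounters) -> float:
    """Prefill or decode throughput in tokens per second."""
    return r.n_tokens / r.elapsed_s


def energy_per_token_j(r: RunCounters) -> float:
    """Energy cost per token in joules."""
    return r.energy_j / r.n_tokens


def avg_watts(r: RunCounters) -> float:
    """Average power draw over the phase, in watts."""
    return r.energy_j / r.elapsed_s
```

Note that energy per token and watt draw can move in opposite directions: a faster engine can draw more watts yet still spend fewer joules per token.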
Decode snapshots
In the first pass, DarkNet leads at q4_0 and q8_0, is effectively tied at q5_0, and still trails at q6_k. It is a good snapshot of why the decode story is encouraging, but not uniform.
In the second pass, DarkNet leads at q4_0, but q5_0 and q6_k tilt back toward bitnet.cpp, while q8_0 is basically even. That spread is the variance we kept seeing in shorter decode runs.
Runtime quantization snapshots
From 4 threads onward, prefill opens up, and by 10 threads DarkNet is ahead on both prefill and decode.
Prefill again improves from 4 threads, while decode becomes much less dramatic and only barely finishes ahead at 10 threads.
We found q5_0 useful as a sanity check that the runtime is not winning everywhere just because one chart looked good.
bitnet.cpp keeps the decode edge throughout. This is why the overall conclusion stays measured instead of pretending decode is solved.
That makes q6_k one of the more useful counterweights in the set.
Prefill improves from 4 threads upward, and decode also turns positive at the higher thread counts. It is one of the best cases for the claim that DarkNet can separate when the workload gets more realistic.
That leaves q8_0 representative of the broader pattern: the upside is real, but short-run decode still moves around enough that we should keep describing it carefully.
What stood out
Two patterns were consistent enough to matter:
- Prefill improvements became much clearer once num_threads >= 4 and were consistently over 10%.
- Decode throughput stayed broadly comparable on average, while DarkNet reached higher peaks.
That is the useful framing for now. The claim is not that every chart is a blowout. The claim is that DarkNet is meaningfully competitive already and starts to separate more clearly in the thread counts people actually care about.
The quick takeaway is that prefill is where DarkNet looks strongest right now. Decode is closer, but still encouraging because the average stays competitive and the ceiling is higher. The full chart dump above is there on purpose so people can inspect the less flattering cases instead of only the best screenshots.
Caveats
These numbers should still be treated as directional:
- consumer CPUs behave differently under thermal and boost pressure
- background scheduling noise can distort short runs
- memory bandwidth and cache behavior matter more as thread count climbs
- compiler flags, microcode, and kernel differences can move results
That is exactly why the methodology matters. If we are going to make performance claims, they need to survive repeat runs and not just a lucky chart.
Where this goes next
There is still headroom left in the runtime. We have already identified another round of improvements that should move prefill by roughly 10% and decode by roughly 5%, and we have also found ARM-specific bottlenecks that are worth addressing separately instead of treating x86 results as the whole story.
We have also added AVX-VNNI support internally, which improves prefill performance by roughly 30% on supported CPUs. That will matter more once it lands in a public release, because it changes the ceiling again for the hardware that can use it.
If you have ideas for improving the benchmark process, send them over on X or by email.