Trillim's Tokens

Benchmarking DarkNet against bitnet.cpp

March 9, 2026

A first engineering write-up on how DarkNet, the inference engine inside Trillim, compares with bitnet.cpp on consumer CPUs.

DarkNet is the inference engine, built from the ground up, that powers Trillim. Today, Trillim’s CPU inference path centers on ternary BitNet checkpoints, including Llama-derived LlamaForCausalLM variants distilled into ternary form, so Microsoft’s bitnet.cpp is the natural baseline for comparison. This post is meant to answer a practical question: where does DarkNet actually move the needle on a real local setup?

What we wanted to measure

  • Is DarkNet materially better for prefill on consumer CPUs?
  • Is DarkNet’s decode actually better or is it just noise?
  • How does runtime quantization compare across common settings?

Benchmark setup

The runs behind this comparison used a repeatable process instead of one-off screenshots:

  • fresh system restart before benchmark sessions
  • five warmup runs for both engines before recording results
  • interleaved execution between engines to reduce time-drift bias
  • cooldowns between runs until the CPU temperature returned to 45 °C
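The interleaving and cooldown steps above can be sketched as a small harness policy. This is a minimal illustration, not our actual benchmark scripts: the function names are ours, and the temperature reader is injected so the sketch stays independent of any particular sensor interface.

```python
import random
import time

def interleaved_schedule(engines, passes):
    """Build an interleaved run order: each pass runs every engine once,
    in a shuffled order, so slow drift (thermals, clocks, background
    load) affects both engines roughly equally instead of biasing
    whichever engine always runs last."""
    order = []
    for _ in range(passes):
        batch = list(engines)
        random.shuffle(batch)  # randomize order within each pass
        order.extend(batch)
    return order

def wait_for_cooldown(read_temp_c, target_c=45.0, poll_s=5.0, sleep=time.sleep):
    """Block until the CPU temperature reported by read_temp_c() drops
    to target_c. Both the reader and the sleep function are injected,
    which also makes the policy testable without hardware."""
    while read_temp_c() > target_c:
        sleep(poll_s)
```

With two engines and five passes this yields ten runs, alternating between engines in a randomized order each pass, with a cooldown gate before each recorded run.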

That does not make the results universal; it just makes them less sloppy. Even so, we still saw significant run-to-run variance, which we attribute to the following:

  • Starting each run at 45 °C does not mean the CPU is in the same performance state; boost and throttling behavior can still differ between runs
  • Background tasks run even after a fresh restart, and scheduler decisions can affect engine performance
  • Memory bandwidth and cache state vary between runs and are much harder to control consistently

Benchmark results

These are the raw benchmark snapshots from two repeated passes on the same 12th Gen Intel Core i7-1255U setup. The decode charts isolate decode throughput at 10 threads, while the runtime-quantization tables show the broader comparison across thread counts for prefill throughput, decode throughput, energy per token, and watt draw.
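As a quick aid for reading the tables: watt draw and energy per token are not independent quantities. Watts are joules per second, so dividing average power by throughput yields joules per token. The helper name below is ours, a sketch rather than the measurement code:

```python
def energy_per_token_j(avg_power_w, throughput_tok_s):
    """Convert average power draw and throughput into energy per token.
    Watts are J/s, so (J/s) / (tokens/s) = J/token."""
    return avg_power_w / throughput_tok_s
```

For example, a 20 W average draw at 10 tokens/s works out to 2 J per token, which is why an engine can win on watt draw yet lose on energy per token if its throughput is lower.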

Decode snapshots

Run A decode throughput comparison across q4_0, q5_0, q6_k, and q8_0 at 10 threads.
Decode run A. This is the more favorable decode pass for DarkNet: it leads clearly at q4_0 and q8_0, is effectively tied at q5_0, and still trails at q6_k. It is a good snapshot of why the decode story is encouraging, but not uniform.
Run B decode throughput comparison across q4_0, q5_0, q6_k, and q8_0 at 10 threads.
Decode run B. The second pass is tighter and less flattering: DarkNet still holds q4_0, but q5_0 and q6_k tilt back toward bitnet.cpp, while q8_0 is basically even. That spread is the variance we kept seeing in shorter decode runs.

Runtime quantization snapshots

Run A q4_0 benchmark table comparing DarkNet and bitnet.cpp across threads, throughput, energy, and watt draw.
q4_0, run A. This is the cleanest example of DarkNet's prefill advantage showing up once thread counts rise. From 4 threads onward, prefill opens up, and by 10 threads DarkNet is ahead on both prefill and decode.
Run B q4_0 benchmark table comparing DarkNet and bitnet.cpp across threads, throughput, energy, and watt draw.
q4_0, run B. The same overall pattern holds on the repeat pass: prefill is still materially ahead for DarkNet above 4 threads, while decode becomes much less dramatic and only barely finishes ahead at 10 threads.
Run A q5_0 benchmark table comparing DarkNet and bitnet.cpp across threads, throughput, energy, and watt draw.
q5_0, run A. Prefill still trends strongly in DarkNet's favor once the CPU is busy, but decode mostly compresses toward parity. That makes q5_0 useful as a sanity check that the runtime is not winning everywhere just because one chart looked good.
Run B q5_0 benchmark table comparing DarkNet and bitnet.cpp across threads, throughput, energy, and watt draw.
q5_0, run B. The repeat pass is slightly harsher on decode and still consistent on prefill. This is one of the clearer examples that the prefill improvement is more stable than the decode uplift.
Run A q6_k benchmark table comparing DarkNet and bitnet.cpp across threads, throughput, energy, and watt draw.
q6_k, run A. Here the split is obvious: DarkNet owns prefill from mid-to-high thread counts, but bitnet.cpp keeps the decode edge throughout. This is why the overall conclusion stays measured instead of pretending decode is solved.
Run B q6_k benchmark table comparing DarkNet and bitnet.cpp across threads, throughput, energy, and watt draw.
q6_k, run B. The repeat pass tells the same story again. Prefill remains strong for DarkNet at higher thread counts, but decode never flips. That consistency makes q6_k one of the more useful counterweights in the set.
Run A q8_0 benchmark table comparing DarkNet and bitnet.cpp across threads, throughput, energy, and watt draw.
q8_0, run A. This is the strongest balanced chart for DarkNet: prefill is ahead from 4 threads upward and decode also turns positive at the higher thread counts. It is one of the best cases for the claim that DarkNet can separate when the workload gets more realistic.
Run B q8_0 benchmark table comparing DarkNet and bitnet.cpp across threads, throughput, energy, and watt draw.
q8_0, run B. The second pass keeps the prefill lead but pulls decode back toward even or slightly negative. That makes q8_0 representative of the broader pattern: the upside is real, but short-run decode still moves around enough that we should keep describing it carefully.

What stood out

Two patterns were consistent enough to matter:

  • Prefill improvements became much clearer once num_threads >= 4 and were consistently over 10%.
  • Decode throughput stayed broadly comparable on average, while DarkNet reached higher peaks.
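For clarity, the "over 10%" and "higher peaks" phrasing above corresponds to two simple statistics. The helpers below are illustrative (our names, not the harness code), and the numbers in the usage note are made up, not benchmark data:

```python
import statistics

def relative_speedup_pct(candidate_tok_s, baseline_tok_s):
    """Percent throughput improvement of the candidate engine over the
    baseline at the same setting; positive means the candidate is faster."""
    return (candidate_tok_s / baseline_tok_s - 1.0) * 100.0

def decode_summary(samples_tok_s):
    """Summarize repeated decode runs: the median is what 'broadly
    comparable on average' refers to, the max is the peak."""
    return {
        "median": statistics.median(samples_tok_s),
        "peak": max(samples_tok_s),
    }
```

So prefill at 110 tok/s against a 100 tok/s baseline is a 10% improvement, and decode samples of [8, 9, 12] tok/s summarize to a median of 9 with a peak of 12: comparable on average, higher at the top.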

That is the useful framing for now. The claim is not that every chart is a blowout. The claim is that DarkNet is meaningfully competitive already and starts to separate more clearly in the thread counts people actually care about.

The quick takeaway is that prefill is where DarkNet looks strongest right now. Decode is closer, but still encouraging because the average stays competitive and the ceiling is higher. The full chart dump above is there on purpose so people can inspect the less flattering cases instead of only the best screenshots.

Caveats

These numbers should still be treated as directional:

  • consumer CPUs behave differently under thermal and boost pressure
  • background scheduling noise can distort short runs
  • memory bandwidth and cache behavior matter more as thread count climbs
  • compiler flags, microcode, and kernel differences can move results

That is exactly why the methodology matters. If we are going to make performance claims, they need to survive repeat runs and not just a lucky chart.

Where this goes next

There is still headroom left in the runtime. We have already identified another round of improvements that should move prefill by roughly 10% and decode by roughly 5%, and we have also found ARM-specific bottlenecks that are worth addressing separately instead of treating x86 results as the whole story.

We have also added AVX-VNNI support internally, which improves prefill performance by roughly 30% on supported CPUs. That will matter more once it lands in a public release, because it changes the ceiling again for the hardware that can use it.
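Because AVX-VNNI only helps on CPUs that expose it, a runtime needs a feature check before taking that path. One way to do this on Linux is to look for the `avx_vnni` bit in `/proc/cpuinfo`; the sketch below is an assumption about how such a check could look, not DarkNet's actual feature detection:

```python
def has_avx_vnni(cpuinfo_text):
    """Return True if any 'flags' line in /proc/cpuinfo output lists the
    avx_vnni feature bit (the name the Linux kernel uses for AVX-VNNI)."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            # 'flags : fpu vme ... avx_vnni ...' -> collect the tokens
            flags.update(line.split(":", 1)[1].split())
    return "avx_vnni" in flags
```

On a Linux box this would be called as `has_avx_vnni(open("/proc/cpuinfo").read())`; other platforms need a different probe (e.g. a cpuid query).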

If you have ideas for improving the benchmark process, send them over on X or by email.