
Interactive Chat

Run local multi-turn chat sessions, enable search harnesses, and benefit from prompt caching.

Use trillim chat when you want a local multi-turn terminal conversation with a model.

If you installed with uv, prefix each command on this page with uv run.

Before You Start

  • Pull a model first with trillim pull Trillim/BitNet-TRNQ
  • If you plan to use Brave search, set SEARCH_API_KEY

Start a Session

trillim chat <model_dir>

model_dir can be a local path or a HuggingFace model ID. A typical first run looks like:

trillim pull Trillim/BitNet-TRNQ
trillim chat Trillim/BitNet-TRNQ

Session Behavior

Once the model loads, the prompt looks like:

Talk to BitNet-TRNQ (Ctrl+D or 'q' to quit, '/new' for new conversation)
>

Responses stream token by token:

> What is the capital of France?
The capital of France is Paris.
> Tell me more about it.
Paris is the largest city in France and serves as ...

Trillim keeps the full conversation history by default. When the conversation approaches the context limit, it automatically resets to the most recent message and continues.
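The reset behavior can be sketched as follows. This is an illustration of the documented rule, not Trillim's actual internals; trim_history, count_tokens, and the whitespace tokenizer stand in for the real implementation.

```python
def count_tokens(message: str) -> int:
    # Stand-in tokenizer: the real code would use the model's tokenizer.
    return len(message.split())

def trim_history(history: list[str], max_tokens: int) -> list[str]:
    """Keep the full history until it nears the context limit, then
    reset to only the most recent message and continue from there."""
    total = sum(count_tokens(m) for m in history)
    if total >= max_tokens:
        return history[-1:]  # drop everything but the latest message
    return history

# A long conversation trips the limit and resets to the latest message.
history = ["first turn " * 50, "second turn " * 50, "latest question"]
history = trim_history(history, max_tokens=100)
```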

Prompt caching is enabled for normal multi-turn usage, so follow-up turns are faster than the first turn.
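The reason follow-up turns are faster: the cached prefix of the conversation does not need to be re-processed. A rough sketch of the accounting (illustrative only; the real cache stores transformer key/value states):

```python
def tokens_to_recompute(prev_prompt: list[str], new_prompt: list[str]) -> int:
    """With prompt caching, only tokens past the longest shared prefix
    are re-processed on a follow-up turn."""
    shared = 0
    for a, b in zip(prev_prompt, new_prompt):
        if a != b:
            break
        shared += 1
    return len(new_prompt) - shared

# Turn 2 extends turn 1, so only the new tokens are processed.
turn1 = ["system", "user: capital of France?"]
turn2 = turn1 + ["assistant: Paris.", "user: tell me more"]
fresh_work = tokens_to_recompute(turn1, turn2)  # 2, not 4
```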

Session Controls

Command    Description
/new       Clear the conversation history and start fresh
q          Quit the chat
Ctrl+D     Quit the chat
Ctrl+C     Quit the chat

Use a LoRA Adapter

Quantize the adapter first, then pass it to --lora:

trillim quantize <path-to-base-model> --adapter <path-to-adapter>
trillim chat Trillim/BitNet-TRNQ --lora <adapter-dir>

If the adapter tokenizer differs from the base model tokenizer, Trillim automatically uses the adapter tokenizer.

Use the Search Harness

The search harness is intended for models that emit <search>...</search> tags during generation.

Enable it with:

trillim chat <model_dir> --harness search

If the model is not search-tuned, stay on the default harness:

trillim chat <model_dir> --harness default

Providers for --search-provider:

  • ddgs for DuckDuckGo, via the ddgs package
  • brave for Brave Search, which requires SEARCH_API_KEY:

export SEARCH_API_KEY=<your_api_key>
trillim chat <model_dir> --harness search --search-provider brave

Typical status markers during search:

  • [Spin-Jump-Spinning...]
  • [Searching: <query>]
  • [Synthesizing...]
  • [Search unavailable]

The harness allows up to 2 search rounds before the final streamed answer.
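Conceptually, the harness loop looks like the sketch below. Only the <search>...</search> tag format and the two-round cap come from this page; run_search_harness and the helper names are illustrative.

```python
import re

SEARCH_TAG = re.compile(r"<search>(.*?)</search>", re.DOTALL)

def run_search_harness(generate, web_search, prompt, max_rounds=2):
    """Illustrative loop: let the model emit <search> tags, feed the
    results back into the context, and return the final answer after
    at most max_rounds of searching."""
    context = prompt
    for _ in range(max_rounds):
        output = generate(context)
        match = SEARCH_TAG.search(output)
        if match is None:
            return output  # model answered without searching
        query = match.group(1).strip()
        results = web_search(query)  # corresponds to [Searching: <query>]
        context += f"\n<results>{results}</results>"
    return generate(context)  # final answer after the round cap
```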

To inspect intermediate generations and fetched search context, set SearchHarness.DEBUG = True in src/trillim/harnesses/_search.py.

Sampling and Performance

trillim chat uses the model’s default sampling parameters. The relevant controls are:

Parameter             Description
temperature           Lower values make output more deterministic
top_k                 Limits sampling to the top K tokens
top_p                 Nucleus sampling threshold
repetition_penalty    Reduces repeated tokens

If you need to override those values per request, start trillim serve and send them to POST /v1/chat/completions.
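A per-request override might look like the following. The endpoint path comes from this page; the field names follow the common OpenAI-style chat completions schema, which is an assumption about what trillim serve accepts.

```python
import json

# Hypothetical request body for POST /v1/chat/completions with
# per-request sampling overrides; field names assume an
# OpenAI-compatible schema.
payload = {
    "model": "Trillim/BitNet-TRNQ",
    "messages": [
        {"role": "user", "content": "What is the capital of France?"},
    ],
    "temperature": 0.2,          # more deterministic output
    "top_k": 40,                 # sample from the top 40 tokens
    "top_p": 0.9,                # nucleus sampling threshold
    "repetition_penalty": 1.1,   # discourage repeated tokens
}
body = json.dumps(payload)
```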

Performance tips:

  • Thread count defaults to num_cores - 2; override it with --threads N
  • --lora-quant and --unembed-quant trade quality for lower memory use
  • Use /new when long conversations start to slow down
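The default thread-count rule from the first tip can be expressed as below; the max(1, ...) floor is an assumption for machines with one or two cores, not documented behavior.

```python
import os

def default_threads() -> int:
    # Mirrors the documented default of num_cores - 2; the floor of 1
    # is an assumption so small machines still get at least one thread.
    cores = os.cpu_count() or 1
    return max(1, cores - 2)
```

Passing --threads N overrides this default.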