API Server

OpenAI-compatible server endpoints, streaming, and voice setup.

Use trillim serve when you want local OpenAI-compatible HTTP endpoints.

If you installed with uv, prefix each command on this page with uv run.

Before You Start

  • Pull a model first with trillim pull Trillim/BitNet-TRNQ
  • If you want voice endpoints, install the extras with uv add "trillim[voice]" or pip install "trillim[voice]"
  • If you want Brave search, set SEARCH_API_KEY
  • If you want custom voice registration through POST /v1/voices, accept the terms for kyutai/pocket-tts, create a HuggingFace token with Read access, and run hf auth login once

Start the Server

Start the default server:

trillim serve Trillim/BitNet-TRNQ

By default the server binds to 127.0.0.1:8000. Override host and port with:

trillim serve Trillim/BitNet-TRNQ --host 0.0.0.0 --port 3000

Enable speech-to-text and text-to-speech with:

trillim serve Trillim/BitNet-TRNQ --voice

Change the Loaded Model or Harness

trillim serve starts with the default harness. Use POST /v1/models/load to swap models, load a LoRA adapter, or switch to the search harness without restarting the server.

Switch models:

curl http://localhost:8000/v1/models/load \
  -H "Content-Type: application/json" \
  -d '{
    "model_dir": "Trillim/BitNet-TRNQ"
  }'

Enable the search harness on a running server:

curl http://localhost:8000/v1/models/load \
  -H "Content-Type: application/json" \
  -d '{
    "model_dir": "Trillim/BitNet-Search-TRNQ",
    "harness": "search",
    "search_provider": "ddgs"
  }'

If you use "search_provider": "brave", set:

export SEARCH_API_KEY=<your_api_key>

Endpoint Summary

| Endpoint | Purpose | Notes |
| --- | --- | --- |
| POST /v1/chat/completions | Chat completions | Streaming supported |
| POST /v1/completions | Raw text completions | Streaming supported |
| GET /v1/models | Show the loaded model | Always available |
| POST /v1/models/load | Swap model, adapter, or harness | Always available |
| POST /v1/audio/transcriptions | Speech-to-text | Requires --voice |
| POST /v1/audio/speech | Text-to-speech | Requires --voice |
| GET /v1/voices | List voices | Requires --voice |
| POST /v1/voices | Register a custom voice | Requires --voice |
| DELETE /v1/voices/{voice_id} | Delete a custom voice | Requires --voice |

POST /v1/chat/completions

Send a conversation and get a model response. This endpoint is compatible with the OpenAI chat completions API.

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'

Request body:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| messages | array | required | List of {"role": "...", "content": "..."} objects |
| model | string | "" | Model identifier for client compatibility |
| temperature | float | model default | Sampling temperature, >= 0 |
| top_k | int | model default | Top-K sampling, >= 1 |
| top_p | float | model default | Nucleus sampling threshold, (0, 1] |
| max_tokens | int | null | Maximum tokens to generate |
| repetition_penalty | float | model default | Repetition penalty, >= 0 |
| stream | bool | false | Enable server-sent event streaming |
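The fields above map directly onto the JSON request body. As a sketch using only the Python standard library (the helper name build_chat_payload is mine, not part of trillim; the endpoint and field names come from this page), assuming the server is running on the default 127.0.0.1:8000:

```python
import json
import urllib.request

def build_chat_payload(messages, **sampling):
    """Assemble a /v1/chat/completions request body.

    Sampling fields left unset are omitted, so the server falls back
    to the model defaults listed in the table above."""
    payload = {"messages": messages}
    payload.update({k: v for k, v in sampling.items() if v is not None})
    return payload

payload = build_chat_payload(
    [{"role": "user", "content": "What is the capital of France?"}],
    temperature=0.7,
    top_p=0.9,
    max_tokens=128,
)

if __name__ == "__main__":
    # Requires a running `trillim serve` instance.
    req = urllib.request.Request(
        "http://localhost:8000/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["choices"][0]["message"]["content"])
```

Omitting unset fields, rather than sending null, keeps the model defaults in control.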

When the active harness is search, this endpoint can run multi-step search-augmented generation for search-tuned models.

Response:

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1700000000,
  "model": "BitNet-TRNQ",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "The capital of France is Paris."},
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 12,
    "completion_tokens": 8,
    "total_tokens": 20,
    "cached_tokens": 0
  }
}

Streaming

Set "stream": true to receive server-sent events:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": true
  }'

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1700000000,"model":"BitNet-TRNQ","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1700000000,"model":"BitNet-TRNQ","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}

...

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1700000000,"model":"BitNet-TRNQ","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]
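A client consumes this stream by reading one event line at a time: skip blanks, stop at the [DONE] sentinel, and pull any text out of delta.content. A minimal standard-library sketch (parse_sse_line is my helper name; the event format is the one shown above):

```python
import json
import urllib.request

def parse_sse_line(line):
    """Extract the text delta from one server-sent-event line.

    Returns None for blank lines, the [DONE] sentinel, and chunks
    whose delta carries no content (e.g. the initial role-only delta)."""
    line = line.strip()
    if not line.startswith("data: "):
        return None
    data = line[len("data: "):]
    if data == "[DONE]":
        return None
    chunk = json.loads(data)
    return chunk["choices"][0]["delta"].get("content")

if __name__ == "__main__":
    # Requires a running `trillim serve` instance.
    req = urllib.request.Request(
        "http://localhost:8000/v1/chat/completions",
        data=json.dumps({
            "messages": [{"role": "user", "content": "Hello"}],
            "stream": True,
        }).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        for raw in resp:
            piece = parse_sse_line(raw.decode("utf-8"))
            if piece:
                print(piece, end="", flush=True)
    print()
```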

With the search harness, streamed deltas may include markers such as [Searching: ...] and [Synthesizing...] before the final answer text.

POST /v1/completions

Send a raw prompt without a chat template.

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "The quick brown fox"
  }'

This endpoint uses the same sampling fields as /v1/chat/completions, but replaces messages with prompt and returns text instead of message.

Streaming is supported with "stream": true.

/v1/completions does not use chat harness orchestration.
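The only structural differences from the chat endpoint, then, are the prompt field going in and the text field coming out. A hedged standard-library sketch (extract_text is my helper name, not part of trillim):

```python
import json
import urllib.request

def extract_text(response):
    """Pull the generated text out of a /v1/completions response,
    which uses `text` where chat completions uses `message`."""
    return response["choices"][0]["text"]

if __name__ == "__main__":
    # Requires a running `trillim serve` instance.
    req = urllib.request.Request(
        "http://localhost:8000/v1/completions",
        data=json.dumps({"prompt": "The quick brown fox"}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(extract_text(json.loads(resp.read())))
```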

GET /v1/models

Return the currently loaded model:

curl http://localhost:8000/v1/models

{
  "object": "list",
  "data": [
    {"id": "BitNet-TRNQ", "object": "model", "created": 0, "owned_by": "local"}
  ]
}

POST /v1/models/load

Swap to a different model, LoRA adapter, or harness configuration at runtime. Models must already exist under ~/.trillim/models/.

curl http://localhost:8000/v1/models/load \
  -H "Content-Type: application/json" \
  -d '{
    "model_dir": "Trillim/BitNet-TRNQ"
  }'

Request body:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| model_dir | string | required | HuggingFace model ID or path under ~/.trillim/models/ |
| adapter_dir | string | null | LoRA adapter directory |
| harness | string | null | default or search |
| search_provider | string | null | ddgs or brave when harness is search |
| threads | int | null | null keeps the current setting, 0 auto-detects |
| lora_quant | string | null | LoRA quantization level |
| unembed_quant | string | null | Unembed quantization level |

Response:

{
  "status": "success",
  "model": "BitNet-TRNQ",
  "recompiled": false,
  "message": ""
}
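In a script it is convenient to wrap this call so a non-success status raises instead of failing silently. A sketch under the same assumptions as above (build_load_payload and the error handling are mine; the fields and response shape are from this page):

```python
import json
import urllib.request

def build_load_payload(model_dir, **options):
    """Assemble a /v1/models/load request body.

    `options` may carry any optional field from the table above,
    e.g. harness="search", search_provider="ddgs"."""
    body = {"model_dir": model_dir}
    body.update({k: v for k, v in options.items() if v is not None})
    return body

payload = build_load_payload(
    "Trillim/BitNet-Search-TRNQ",
    harness="search",
    search_provider="ddgs",
)

if __name__ == "__main__":
    # Requires a running `trillim serve` instance.
    req = urllib.request.Request(
        "http://localhost:8000/v1/models/load",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        result = json.loads(resp.read())
    if result.get("status") != "success":
        raise RuntimeError(result.get("message") or "model load failed")
    print(result["model"], "recompiled:", result["recompiled"])
```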

Voice Endpoints

These routes are only available when the server starts with --voice.

POST /v1/audio/transcriptions

Speech-to-text using Whisper. Upload an audio file as multipart form data. The maximum file size is 8 MB.

curl http://localhost:8000/v1/audio/transcriptions \
  -F file=@recording.wav \
  -F model=whisper-1

Form fields:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| file | file | required | Audio file to transcribe |
| model | string | "whisper-1" | Model identifier for client compatibility |
| language | string | null | Optional language hint such as "en" |
| response_format | string | "json" | "json" or "text" |

Response:

{
  "text": "Hello, how are you?"
}
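Uploading multipart form data without an extra dependency means encoding the body by hand: each field becomes a boundary-delimited part, and the file part adds a filename and content type. A standard-library sketch (build_multipart and MAX_BYTES are my names; the 8 MB limit and form fields are from this page):

```python
import json
import urllib.request
import uuid

MAX_BYTES = 8 * 1024 * 1024  # the server rejects files over 8 MB

def build_multipart(fields, file_field, filename, file_bytes):
    """Encode plain form fields plus one file as multipart/form-data.

    Returns the body bytes and the Content-Type header value
    (which must carry the boundary)."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; '
            f'name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    parts.append(
        f'--{boundary}\r\nContent-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"\r\n'
        f'Content-Type: application/octet-stream\r\n\r\n'.encode()
        + file_bytes + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"

if __name__ == "__main__":
    # Requires a server started with --voice and a local recording.wav.
    with open("recording.wav", "rb") as f:
        audio = f.read()
    if len(audio) > MAX_BYTES:
        raise SystemExit("file exceeds the 8 MB limit")
    body, content_type = build_multipart(
        {"model": "whisper-1"}, "file", "recording.wav", audio
    )
    req = urllib.request.Request(
        "http://localhost:8000/v1/audio/transcriptions",
        data=body,
        headers={"Content-Type": content_type},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["text"])
```

In practice curl (as shown above) or the requests library handles this encoding for you; the sketch just makes the wire format explicit.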

POST /v1/audio/speech

Convert text to speech. The response is a WAV or PCM audio stream.

curl http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Hello, world!", "voice": "alba"}' \
  --output speech.wav

Request body:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| input | string | required | Text to synthesize |
| voice | string | "alba" | Voice ID |
| response_format | string | "wav" | "wav" or "pcm" |
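Since the endpoint accepts exactly two output formats, it is worth validating the request before sending it and streaming the binary response straight to disk. A standard-library sketch (build_speech_payload is my helper name; fields and accepted values are from the table above):

```python
import json
import urllib.request

def build_speech_payload(text, voice="alba", response_format="wav"):
    """Assemble a /v1/audio/speech request body, rejecting formats
    the endpoint does not accept."""
    if response_format not in ("wav", "pcm"):
        raise ValueError("response_format must be 'wav' or 'pcm'")
    return {"input": text, "voice": voice, "response_format": response_format}

if __name__ == "__main__":
    # Requires a server started with --voice.
    req = urllib.request.Request(
        "http://localhost:8000/v1/audio/speech",
        data=json.dumps(build_speech_payload("Hello, world!")).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp, open("speech.wav", "wb") as out:
        out.write(resp.read())
```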

GET /v1/voices

List all available voices, including the built-in set and any saved custom voices.

curl http://localhost:8000/v1/voices

Built-in voices: alba, marius, javert, jean, fantine, cosette, eponine, azelma

POST /v1/voices

Upload an audio sample to register a custom voice. The sample is saved to the configured voices directory and persists across server restarts.

curl http://localhost:8000/v1/voices \
  -F voice_id=my-voice \
  -F file=@sample.wav

Form fields:

| Field | Type | Description |
| --- | --- | --- |
| voice_id | string | Identifier for the new voice |
| file | file | Audio sample, max 8 MB |

If custom voice registration fails, verify that you completed the HuggingFace setup in the prerequisites at the top of this page.

DELETE /v1/voices/{voice_id}

Delete a previously registered custom voice. Built-in voices cannot be deleted.

curl -X DELETE http://localhost:8000/v1/voices/my-voice

OpenAI Python Client

You can use the official OpenAI client library against the local server:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.chat.completions.create(
    model="BitNet-TRNQ",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)

Streaming works too:

stream = client.chat.completions.create(
    model="BitNet-TRNQ",
    messages=[{"role": "user", "content": "Tell me a story."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)