Docs
API Server
OpenAI-compatible server endpoints, streaming, and voice setup.
Use trillim serve when you want local OpenAI-compatible HTTP endpoints.
If you installed with uv, prefix each command on this page with uv run.
Before You Start
- Pull a model first with
trillim pull Trillim/BitNet-TRNQ - If you want voice endpoints, install
uv add "trillim[voice]"orpip install "trillim[voice]" - If you want Brave search, set
SEARCH_API_KEY - If you want custom voice registration through
POST /v1/voices, accept the terms for kyutai/pocket-tts, create a HuggingFace token withReadaccess, and runhf auth loginonce
Start the Server
Start the default server:
trillim serve Trillim/BitNet-TRNQ
By default the server binds to 127.0.0.1:8000. Override host and port with:
trillim serve Trillim/BitNet-TRNQ --host 0.0.0.0 --port 3000
Enable speech-to-text and text-to-speech with:
trillim serve Trillim/BitNet-TRNQ --voice
Change the Loaded Model or Harness
trillim serve starts with the default harness. Use POST /v1/models/load to swap models, load a LoRA adapter, or switch to the search harness without restarting the server.
Switch models:
curl http://localhost:8000/v1/models/load \
-H "Content-Type: application/json" \
-d '{
"model_dir": "Trillim/BitNet-TRNQ"
}'
Enable the search harness on a running server:
curl http://localhost:8000/v1/models/load \
-H "Content-Type: application/json" \
-d '{
"model_dir": "Trillim/BitNet-Search-TRNQ",
"harness": "search",
"search_provider": "ddgs"
}'
If you use "search_provider": "brave", set:
export SEARCH_API_KEY=<your_api_key>
Endpoint Summary
| Endpoint | Purpose | Notes |
|---|---|---|
POST /v1/chat/completions | Chat completions | Streaming supported |
POST /v1/completions | Raw text completions | Streaming supported |
GET /v1/models | Show the loaded model | Always available |
POST /v1/models/load | Swap model, adapter, or harness | Always available |
POST /v1/audio/transcriptions | Speech-to-text | Requires --voice |
POST /v1/audio/speech | Text-to-speech | Requires --voice |
GET /v1/voices | List voices | Requires --voice |
POST /v1/voices | Register a custom voice | Requires --voice |
DELETE /v1/voices/{voice_id} | Delete a custom voice | Requires --voice |
POST /v1/chat/completions
Send a conversation and get a model response. This endpoint is compatible with the OpenAI chat completions API.
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "What is the capital of France?"}
]
}'
Request body:
| Field | Type | Default | Description |
|---|---|---|---|
messages | array | required | List of {"role": "...", "content": "..."} objects |
model | string | "" | Model identifier for client compatibility |
temperature | float | model default | Sampling temperature, >= 0 |
top_k | int | model default | Top-K sampling, >= 1 |
top_p | float | model default | Nucleus sampling threshold, (0, 1] |
max_tokens | int | null | Maximum tokens to generate |
repetition_penalty | float | model default | Repetition penalty, >= 0 |
stream | bool | false | Enable server-sent event streaming |
When the active harness is search, this endpoint can run multi-step search-augmented generation for search-tuned models.
Response:
{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"created": 1700000000,
"model": "BitNet-TRNQ",
"choices": [
{
"index": 0,
"message": {"role": "assistant", "content": "The capital of France is Paris."},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 12,
"completion_tokens": 8,
"total_tokens": 20,
"cached_tokens": 0
}
}
Streaming
Set "stream": true to receive server-sent events:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [{"role": "user", "content": "Hello"}],
"stream": true
}'
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1700000000,"model":"BitNet-TRNQ","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1700000000,"model":"BitNet-TRNQ","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}
...
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1700000000,"model":"BitNet-TRNQ","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}
data: [DONE]
With the search harness, streamed deltas may include markers such as [Searching: ...] and [Synthesizing...] before the final answer text.
POST /v1/completions
Send a raw prompt without a chat template.
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"prompt": "The quick brown fox"
}'
This endpoint uses the same sampling fields as /v1/chat/completions, but replaces messages with prompt and returns text instead of message.
Streaming is supported with "stream": true.
/v1/completions does not use chat harness orchestration.
GET /v1/models
Return the currently loaded model:
curl http://localhost:8000/v1/models
{
"object": "list",
"data": [
{"id": "BitNet-TRNQ", "object": "model", "created": 0, "owned_by": "local"}
]
}
POST /v1/models/load
Swap to a different model, LoRA adapter, or harness configuration at runtime. Models must already exist under ~/.trillim/models/.
curl http://localhost:8000/v1/models/load \
-H "Content-Type: application/json" \
-d '{
"model_dir": "Trillim/BitNet-TRNQ"
}'
Request body:
| Field | Type | Default | Description |
|---|---|---|---|
model_dir | string | required | HuggingFace model ID or path under ~/.trillim/models/ |
adapter_dir | string | null | LoRA adapter directory |
harness | string | null | default or search |
search_provider | string | null | ddgs or brave when harness is search |
threads | int | null | null keeps the current setting, 0 auto-detects |
lora_quant | string | null | LoRA quantization level |
unembed_quant | string | null | Unembed quantization level |
Response:
{
"status": "success",
"model": "BitNet-TRNQ",
"recompiled": false,
"message": ""
}
Voice Endpoints
These routes are only available when the server starts with --voice.
POST /v1/audio/transcriptions
Speech-to-text using Whisper. Upload an audio file as multipart form data. The maximum file size is 8 MB.
curl http://localhost:8000/v1/audio/transcriptions \
-F file=@recording.wav \
-F model=whisper-1
Form fields:
| Field | Type | Default | Description |
|---|---|---|---|
file | file | required | Audio file to transcribe |
model | string | "whisper-1" | Model identifier for client compatibility |
language | string | null | Optional language hint such as "en" |
response_format | string | "json" | "json" or "text" |
Response:
{
"text": "Hello, how are you?"
}
POST /v1/audio/speech
Convert text to speech. The response is a WAV or PCM audio stream.
curl http://localhost:8000/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{"input": "Hello, world!", "voice": "alba"}' \
--output speech.wav
Request body:
| Field | Type | Default | Description |
|---|---|---|---|
input | string | required | Text to synthesize |
voice | string | "alba" | Voice ID |
response_format | string | "wav" | "wav" or "pcm" |
GET /v1/voices
List all available voices, including the built-in set and any saved custom voices.
curl http://localhost:8000/v1/voices
Built-in voices: alba, marius, javert, jean, fantine, cosette, eponine, azelma
POST /v1/voices
Upload an audio sample to register a custom voice. The sample is saved to the configured voices directory and persists across server restarts.
curl http://localhost:8000/v1/voices \
-F voice_id=my-voice \
-F file=@sample.wav
Form fields:
| Field | Type | Description |
|---|---|---|
voice_id | string | Identifier for the new voice |
file | file | Audio sample, max 8 MB |
If custom voice registration fails, verify that you completed the HuggingFace setup in the prerequisites at the top of this page.
DELETE /v1/voices/{voice_id}
Delete a previously registered custom voice. Built-in voices cannot be deleted.
curl -X DELETE http://localhost:8000/v1/voices/my-voice
OpenAI Python Client
You can use the official OpenAI client library against the local server:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
response = client.chat.completions.create(
model="BitNet-TRNQ",
messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
Streaming works too:
stream = client.chat.completions.create(
model="BitNet-TRNQ",
messages=[{"role": "user", "content": "Tell me a story."}],
stream=True,
)
for chunk in stream:
if chunk.choices[0].delta.content:
print(chunk.choices[0].delta.content, end="", flush=True)