A technical note on local inference plumbing
inferd is a single host-wide local-inference daemon. One generate model, one embed model, held warm in memory and shared by every tool on the machine that wants them — over a small NDJSON-over-IPC protocol. It is plumbing, not a product.
$ inferdctl status
daemon   ready
backend  llamacpp · gemma-4-e4b · cpu
embed    llamacpp · embeddinggemma-300m · 768→256 MRL
queue    0 active · 0 queued · cap 1+10
uptime   4h 12m
Most local-AI tools today bundle their own copy of llama.cpp, ship their own model fetcher, and load weights into their own process. Open three such tools and you are paying for three copies of the same thirteen-billion-parameter model in RAM, three different model stores on disk, and three competing fetch paths.
inferd inverts that. One daemon holds the warm models — one for generation, one for embeddings — and exposes a small, frozen wire protocol. Every other tool — CLI assistants, IDE plugins, agent runtimes, web apps, middleware — connects as a client. No engine bundling. No second copy. No fight over the GPU.
A 7B model in 8-bit quantization is ~7 GB resident. Two consumers running their own copy means two copies in RAM. Five means a swap-thrashed laptop. One warm copy, shared, means the math stops getting worse the more tools you adopt.
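The arithmetic behind that claim is easy to check. The sketch below is a back-of-the-envelope estimate, not a measurement: it counts weight bytes only and ignores KV-cache and runtime overhead. The function name `resident_weight_gb` is made up for illustration.

```python
def resident_weight_gb(params_billion: float, bits_per_weight: int, copies: int = 1) -> float:
    """Approximate weight memory in decimal GB (weights only, no KV-cache or overhead)."""
    bytes_per_weight = bits_per_weight / 8
    # billions of parameters * bytes per parameter = gigabytes (decimal)
    return params_billion * bytes_per_weight * copies


# One shared copy versus each tool loading its own copy of a 7B model at 8 bits.
for copies in (1, 2, 5):
    print(f"{copies} copies: ~{resident_weight_gb(7, 8, copies):.0f} GB")
```

With one shared daemon the `copies` term stays at 1 no matter how many clients connect.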
Loading a model and allocating its KV-cache takes seconds. Doing it at every CLI invocation makes local inference feel slower than a hosted API. inferd loads once, at boot. After that, requests are warm.
Three vendored copies of llama.cpp means three different versions, three different GGUF compatibility matrices, three different fetch-and-verify implementations. inferd centralises all of it behind a contract that does not care which engine is underneath.
NDJSON over a Unix socket on Linux/macOS, a named pipe on Windows. 64 MiB per-frame cap. Frozen at v1. Anything on the host that can open a socket can drive it.
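A minimal sketch of what framing under these rules might look like on the client side. The helper names (`encode_frame`, `decode_frame`, `FrameTooLarge`) are hypothetical, and whether the 64 MiB cap counts the trailing newline is an assumption here, not something the protocol description above states.

```python
import json

MAX_FRAME_BYTES = 64 * 1024 * 1024  # 64 MiB per-frame cap, taken from the text above


class FrameTooLarge(ValueError):
    """Raised when a single NDJSON frame exceeds the per-frame cap."""


def encode_frame(obj: dict) -> bytes:
    """Serialize one object as a single newline-terminated JSON line."""
    # json.dumps escapes newlines inside strings, so the payload stays on one line.
    line = json.dumps(obj, separators=(",", ":"), ensure_ascii=False).encode("utf-8") + b"\n"
    # Assumption: the cap applies to the whole line, newline included.
    if len(line) > MAX_FRAME_BYTES:
        raise FrameTooLarge(f"frame is {len(line)} bytes, cap is {MAX_FRAME_BYTES}")
    return line


def decode_frame(line: bytes) -> dict:
    """Parse one received line, refusing anything over the cap."""
    if len(line) > MAX_FRAME_BYTES:
        raise FrameTooLarge(f"frame is {len(line)} bytes, cap is {MAX_FRAME_BYTES}")
    return json.loads(line)
```

Checking the size before parsing keeps a misbehaving peer from forcing the client to parse an arbitrarily large line.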
┌──────────────────────────────────────┐
│           user-space tools           │
│  cli · ide plugin · agent · web app  │
└──────────────────────────────────────┘
     │         │         │         │
     │   ndjson over uds / named pipe
     │   frame cap 64 MiB · v1 frozen
     ▼         ▼         ▼         ▼
┌──────────────────────────────────────┐
│                inferd                │
│   admission queue · 1 active + 10    │
│  backend trait · router · activity   │
└──────────────────────────────────────┘
                   │
                   ▼
┌──────────────────────────────────────┐
│  warm models in memory               │
│  ├─ generate · gemma-4-e4b           │
│  └─ embed · embeddinggemma-300m      │
│  llama.cpp via FFI (default)         │
└──────────────────────────────────────┘
                   │
                   ▼
┌──────────────────────────────────────┐
│  shared content-addressable store    │
│  $MODELS_HOME/blobs/sha256/aa/...    │
└──────────────────────────────────────┘
Send a messages array, get tokens back as they're produced, terminated by one done frame. No retry, no fallback, no rewriting prompts — the daemon stays out of your way.
Send text, get a 768-dim vector, or a 256-dim one via Matryoshka truncation. Inputs are bounded: a request that exceeds the bound returns a structured error instead of crashing the daemon (issue #20, fixed in v0.2.4).
A separate socket exposes ready-state, queue depth, model load progress, and activity events. Progress UIs connect during boot, before the inference socket even opens.
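The Matryoshka truncation mentioned for the embed path (768 down to 256 dimensions) is commonly done by keeping the leading dimensions and re-normalizing to unit length. The sketch below shows that common recipe; whether inferd re-normalizes internally is not stated here, so treat the L2 step as an assumption. `truncate_mrl` is a hypothetical helper name.

```python
import math
from typing import List, Sequence


def truncate_mrl(vec: Sequence[float], dim: int = 256) -> List[float]:
    """Keep the first `dim` components and rescale them to unit L2 norm."""
    if dim <= 0 or dim > len(vec):
        raise ValueError(f"dim must be in 1..{len(vec)}, got {dim}")
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        # An all-zero prefix cannot be normalized; return it unchanged.
        return head
    return [x / norm for x in head]
```

Re-normalizing matters when downstream code compares vectors by dot product and expects unit-length inputs.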
// client → daemon (one NDJSON line)
{"id":"r1","messages":[{"role":"user","content":"hi"}],"max_tokens":32}

// daemon → client (streamed)
{"type":"token","id":"r1","text":"Hello"}
{"type":"token","id":"r1","text":" there"}
{"type":"token","id":"r1","text":"!"}
{"type":"done","id":"r1","content":"Hello there!","usage":{"prompt_tokens":12,"completion_tokens":3},"backend":"llamacpp"}
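A client consuming that stream could look roughly like the following. This is a sketch built only from the frame shapes in the example above (`token` frames with `text`, one `done` frame); the function names are hypothetical, and frame types not shown in the example are simply skipped here.

```python
import json
from typing import Iterable, Iterator, Tuple


def iter_frames(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def collect_generation(lines: Iterable[str], request_id: str) -> Tuple[str, dict]:
    """Accumulate token text for one request until its done frame arrives."""
    parts = []
    for frame in iter_frames(lines):
        if frame.get("id") != request_id:
            continue  # frame belongs to a different request
        if frame.get("type") == "token":
            parts.append(frame["text"])
        elif frame.get("type") == "done":
            return "".join(parts), frame
        # Assumption: any other frame type is ignored in this sketch.
    raise ConnectionError("stream ended before a done frame arrived")
```

Because the function accepts any iterable of lines, it can be fed from a list in tests or from the file object returned by `socket.makefile("r")` on a connected socket. The socket path itself is not given in this note, so it is left out.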
Pre-built archives for Linux x86_64, macOS aarch64, and Windows x86_64 are attached to every release. Each platform ships a per-user installer that drops the daemon in your local prefix and registers it with the platform's startup mechanism.
systemd --user unit. No sudo, no system-wide service.
curl -L \
  https://github.com/3rg0n/inferd/releases/download/v0.2.4/inferd-v0.2.4-x86_64-unknown-linux-gnu.tar.gz \
  -o inferd.tar.gz
tar xzf inferd.tar.gz
./packaging/systemd/install-user.sh
launchd LaunchAgent, per-user, Apple Silicon.
curl -L \
  https://github.com/3rg0n/inferd/releases/download/v0.2.4/inferd-v0.2.4-aarch64-apple-darwin.tar.gz \
  -o inferd.tar.gz
tar xzf inferd.tar.gz
./packaging/launchd/install-launchagent.sh
Per-user Startup-folder shortcut. No SCM service.
Invoke-WebRequest `
  https://github.com/3rg0n/inferd/releases/download/v0.2.4/inferd-v0.2.4-x86_64-pc-windows-msvc.zip `
  -OutFile inferd.zip
Expand-Archive inferd.zip -DestinationPath .
.\packaging\windows\install-user.ps1
Always verify the SHA-256 of the downloaded archive against the value on the release page before extracting.
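One portable way to do that check, sketched with Python's standard library. The function names are illustrative, and the expected digest is whatever the release page lists; nothing here is specific to inferd's tooling.

```python
import hashlib
import hmac


def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large archives never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path: str, expected_hex: str) -> bool:
    """Return True only if the file's SHA-256 matches the published value."""
    # Normalize case and whitespace in the pasted value before comparing.
    return hmac.compare_digest(sha256_file(path), expected_hex.strip().lower())
```

A caller would run `verify_archive("inferd.tar.gz", "<digest from the release page>")` and extract only when it returns `True`.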
inferd has a deliberately small surface. The list of things it will never do is part of the design. These are not feature gaps; they are scope decisions.