A technical note on local inference plumbing

One warm copy.
Many local consumers.

inferd is a single host-wide local-inference daemon. One generate model, one embed model, held warm in memory and shared by every tool on the machine that wants them — over a small IPC protocol (length-prefixed frames for generation, NDJSON for embeddings). It is plumbing, not a product.


$ inferdctl doctor
[ ok ] config:    loaded ~/.inferd/config.json (auto_pull=true)
[ ok ] manifest:  gemma-4-e4b · embeddinggemma-300m
[ ok ] admin:     ready
[ ok ] backend:   llamacpp accelerator=metal gpu_layers=99 embed=true
[ ok ] device:    Apple M2 Pro vram=16.0 GiB

§1

What inferd is

Most local-AI tools today bundle their own copy of llama.cpp, ship their own model fetcher, and load weights into their own process. Open three such tools and you are paying for three copies of the same thirteen-billion-parameter model in RAM, three different model stores on disk, and three competing fetch paths.

inferd inverts that. One daemon holds the warm models — one for generation, one for embeddings — and exposes a small, frozen wire protocol. Every other tool — CLI assistants, IDE plugins, agent runtimes, web apps, middleware — connects as a client. No engine bundling. No second copy. No fight over the GPU.

§2

Why this exists

RAM is finite, models are large.

A 7B model in 8-bit quantization is ~7 GB resident. Two consumers running their own copy means two copies in RAM. Five means a swap-thrashed laptop. One warm copy, shared, means the math stops getting worse the more tools you adopt.

Cold starts ruin local UX.

Loading a model and allocating its KV-cache takes seconds. Doing it at every CLI invocation makes local inference feel slower than a hosted API. inferd loads once, at boot. After that, requests are warm.

Per-app engines drift.

Three vendored copies of llama.cpp means three different versions, three different GGUF compatibility matrices, three different fetch-and-verify implementations. inferd centralises all of it behind a contract that does not care which engine is underneath.

The wire protocol is the product.

Length-prefixed, type-tagged frames over a Unix socket on Linux/macOS, a named pipe on Windows (embeddings use NDJSON). 64 MiB per-frame cap. Frozen, with an in-band wire_version. Anything on the host that can open a socket can drive it.
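
The framing described above can be sketched in a few lines of Python. This is an illustration, not the daemon's codec: it assumes "uvarint" means the base-128 varint used by Protocol Buffers and Go's encoding/binary, and that the length counts payload bytes only (not the tag byte). The function names are hypothetical.

```python
import io
import json

MAX_FRAME = 64 * 1024 * 1024  # 64 MiB per-frame cap, as stated in the text
TAG_JSON = 0x01               # type tag for a JSON payload


def encode_uvarint(n):
    """Base-128 varint: 7 bits per byte, high bit set when more bytes follow."""
    if n < 0:
        raise ValueError("uvarint cannot encode a negative number")
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def decode_uvarint(stream):
    result, shift = 0, 0
    while True:
        b = stream.read(1)
        if not b:
            raise EOFError("stream ended inside a length prefix")
        result |= (b[0] & 0x7F) << shift
        if not b[0] & 0x80:
            return result
        shift += 7
        if shift > 63:
            raise ValueError("length prefix is too long")


def encode_frame(obj):
    payload = json.dumps(obj, separators=(",", ":")).encode("utf-8")
    if len(payload) > MAX_FRAME:
        raise ValueError("payload exceeds the 64 MiB frame cap")
    # ASSUMPTION: the length prefix covers the payload only, not the tag.
    return encode_uvarint(len(payload)) + bytes([TAG_JSON]) + payload


def decode_frame(stream):
    length = decode_uvarint(stream)
    if length > MAX_FRAME:
        raise ValueError("frame exceeds the 64 MiB frame cap")
    tag = stream.read(1)
    if tag != bytes([TAG_JSON]):
        raise ValueError(f"unsupported frame tag: {tag!r}")
    payload = stream.read(length)
    if len(payload) != length:
        raise EOFError("stream ended inside a payload")
    return json.loads(payload)


# Round trip through an in-memory stream standing in for the socket.
request = {"wire_version": 1, "id": "r1", "max_tokens": 32}
assert decode_frame(io.BytesIO(encode_frame(request))) == request
```

Rejecting an over-cap length before reading the payload keeps a malformed or hostile prefix from forcing a large allocation.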

§3

The shape of it

                    ┌──────────────────────────────────────┐
                    │            user-space tools          │
                    │  cli · ide plugin · agent · web app  │
                    └──────────────────────────────────────┘
                         │       │       │       │
                         │ length-prefixed frames over uds / pipe
                         │ frame cap 64 MiB · wire frozen (v2)
                         ▼       ▼       ▼       ▼
                    ┌──────────────────────────────────────┐
                    │                inferd                │
                    │   admission queue · 1 active + 10    │
                    │   backend trait · router · activity  │
                    └──────────────────────────────────────┘
                                      │
                                      ▼
                    ┌──────────────────────────────────────┐
                    │   warm models in memory              │
                    │   ├─ generate · gemma-4-e4b          │
                    │   └─ embed    · embeddinggemma-300m  │
                    │   llama.cpp via FFI (default)        │
                    └──────────────────────────────────────┘
                                      │
                                      ▼
                    ┌──────────────────────────────────────┐
                    │   runtime accelerator probe (v0.3)   │
                    │   metal · cuda · rocm · vulkan · cpu │
                    │  strongest available, picked at boot │
                    └──────────────────────────────────────┘
                                      │
                                      ▼
                    ┌──────────────────────────────────────┐
                    │  shared content-addressable store    │
                    │  $MODELS_HOME/blobs/sha256/aa/...    │
                    └──────────────────────────────────────┘
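
As a rough illustration of the "1 active + 10" admission queue in the diagram, here is a minimal sketch. It reads the figure as one running request plus at most ten waiting. What the daemon does with an eleventh waiter is not stated in this note, so the "rejected" outcome is an assumption, as are the class and method names.

```python
from collections import deque


class AdmissionQueue:
    """One active request plus a bounded waiting line (hypothetical sketch)."""

    def __init__(self, max_waiting=10):
        self.max_waiting = max_waiting
        self.active = None
        self.waiting = deque()

    def admit(self, request_id):
        if self.active is None:
            self.active = request_id
            return "active"
        if len(self.waiting) < self.max_waiting:
            self.waiting.append(request_id)
            return "queued"
        # ASSUMPTION: overflow is refused rather than blocked or dropped.
        return "rejected"

    def finish(self):
        """Retire the active request and promote the oldest waiter, if any."""
        self.active = self.waiting.popleft() if self.waiting else None
        return self.active


queue = AdmissionQueue()
outcomes = [queue.admit(f"r{i}") for i in range(12)]
assert outcomes[0] == "active"
assert outcomes[1:11] == ["queued"] * 10
assert outcomes[11] == "rejected"
```

A bounded line keeps worst-case wait predictable: a caller learns immediately that the daemon is saturated and can apply its own retry policy.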

Generate

Send a messages array, get tokens back as they're produced, terminated by one done frame. No retry, no fallback, no rewriting prompts — the daemon stays out of your way.

Embed

Send text, get a 768-dim vector, or a 256-dim one via Matryoshka truncation. Input size is bounded: an over-limit input returns a structured error instead of crashing the daemon (issue #20, fixed in v0.3.0).
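
Matryoshka truncation itself is simple enough to show. The sketch below keeps the leading components of a vector and rescales the result to unit length, which is the usual practice when the vectors feed cosine similarity. Whether inferd renormalizes, and whether truncation happens in the daemon or the client, is not stated here; the function name is hypothetical.

```python
import math


def truncate_embedding(vec, dim=256):
    """Keep the first `dim` components and L2-normalize what remains."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # an all-zero prefix cannot be normalized
    return [x / norm for x in head]


full = [0.5] * 768            # stand-in for a real 768-dim embedding
short = truncate_embedding(full)
assert len(short) == 256
assert abs(sum(x * x for x in short) - 1.0) < 1e-9
```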

Admin

A separate socket exposes ready-state, queue depth, model load progress, and activity events. Progress UIs connect during boot, before the inference socket even opens.

One generate request, on the wire

// client → daemon — one length-prefixed JSON frame
// (each frame: [uvarint len][0x01 = JSON][payload], shown as payload)
{"wire_version":1,"id":"r1","messages":[{"role":"user","content":[{"type":"text","text":"hi"}]}],"max_tokens":32}

// daemon → client (streamed, one length-prefixed JSON frame each)
{"type":"frame","id":"r1","block":{"type":"text","delta":"Hello"}}
{"type":"frame","id":"r1","block":{"type":"text","delta":" there"}}
{"type":"frame","id":"r1","block":{"type":"text","delta":"!"}}
{"type":"done","id":"r1","usage":{"input_tokens":12,"output_tokens":3},"stop_reason":"end_turn","backend":"llamacpp"}
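
On the client side, consuming that stream amounts to concatenating text deltas until the done frame arrives. A minimal sketch, assuming frames have already been decoded into dicts shaped like the ones above; the function name is hypothetical, and skipping frames that carry another id is an assumption about how a client would want to behave.

```python
def collect_text(frames, request_id):
    """Join text deltas for one request; return (text, done_frame)."""
    parts = []
    for frame in frames:
        if frame.get("id") != request_id:
            continue  # ASSUMPTION: ignore frames that belong to other requests
        if frame["type"] == "frame":
            block = frame["block"]
            if block.get("type") == "text":
                parts.append(block["delta"])
        elif frame["type"] == "done":
            return "".join(parts), frame
    raise RuntimeError("stream ended without a done frame")


stream = [
    {"type": "frame", "id": "r1", "block": {"type": "text", "delta": "Hello"}},
    {"type": "frame", "id": "r1", "block": {"type": "text", "delta": " there"}},
    {"type": "frame", "id": "r1", "block": {"type": "text", "delta": "!"}},
    {"type": "done", "id": "r1",
     "usage": {"input_tokens": 12, "output_tokens": 3},
     "stop_reason": "end_turn", "backend": "llamacpp"},
]
text, done = collect_text(stream, "r1")
assert text == "Hello there!"
assert done["stop_reason"] == "end_turn"
```

Treating a missing done frame as an error matters because the daemon does no retry: the caller has to notice a truncated stream and decide what to do.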

§4

Install

Pre-built archives (tarballs for Linux and macOS, a zip for Windows) covering Linux (x86_64 + arm64), macOS arm64, and Windows (x86_64 + arm64) are attached to every release. Each platform ships a per-user installer that drops the daemon in your local prefix and registers it with the appropriate init system.

Linux

systemd --user unit. No sudo, no system-wide service.

tar xzf inferd-v0.6.0-x86_64-unknown-linux-gnu.tar.gz
cd inferd-v0.6.0-x86_64-unknown-linux-gnu
mkdir -p ~/.local/bin
cp inferd-daemon inferdctl backends/* ~/.local/bin/
mkdir -p ~/.config/systemd/user
cp packaging/inferd.service ~/.config/systemd/user/
systemctl --user daemon-reload
systemctl --user enable --now inferd

macOS

launchd LaunchAgent, per-user, Apple Silicon.

tar xzf inferd-v0.6.0-aarch64-apple-darwin.tar.gz
cd inferd-v0.6.0-aarch64-apple-darwin
./packaging/install-launchagent.sh ./inferd-daemon

Windows

Per-user Startup-folder shortcut. No SCM service.

Expand-Archive inferd-v0.6.0-x86_64-pc-windows-msvc.zip
cd inferd-v0.6.0-x86_64-pc-windows-msvc
powershell -ExecutionPolicy Bypass `
  -File .\packaging\install.ps1 `
  -SourceBinary .\inferd-daemon.exe

Always verify the SHA-256 of the archive against the value on the release page before extracting.
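
The same check can be scripted on any platform. A minimal sketch using only the Python standard library; the helper names are hypothetical, and the expected value is whatever the release page publishes for the archive you downloaded.

```python
import hashlib
import hmac


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large archives never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, expected_hex):
    """Compare against the published value, ignoring case and stray whitespace."""
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())
```

Call `matches_published(archive_path, published_value)` and extract only when it returns True.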

§5

What it isn't

inferd has a deliberately small surface. The list of things it will never do is part of the design. These are not feature gaps; they are scope decisions.

never A network listener of any kind. No inbound TCP, even on loopback — the daemon binds only a Unix socket or named pipe. Reaching inferd over the network is a separate bridge process's job.
never An HTTP server. IPC only.
never An OpenAI-compatible surface in the daemon. Build it as a separate process if you want one.
never Per-request backend override on the wire. Operators choose backends, callers don't.
never In-daemon retry, fallback, or mid-stream failover. The caller owns retry.
never A multi-generate-model warm pool. One generate model per process (an embed model alongside is fine — different role). Need two generate models? Run two daemons.
never Silent breaking wire-protocol changes. The wire is frozen; a breaking change bumps the in-band wire_version so a mismatch fails loudly.
