inferd
v0.2.4 · MIT · Rust · Linux · macOS · Windows

A technical note on local inference plumbing

One warm copy.
Many local consumers.

inferd is a single host-wide local-inference daemon. One generate model, one embed model, held warm in memory and shared by every tool on the machine that wants them — over a small NDJSON-over-IPC protocol. It is plumbing, not a product.

$ inferdctl status
daemon      ready
backend     llamacpp · gemma-4-e4b · cpu
embed       llamacpp · embeddinggemma-300m · 768→256 MRL
queue       0 active · 0 queued · cap 1+10
uptime      4h 12m

§1

What inferd is

Most local-AI tools today bundle their own copy of llama.cpp, ship their own model fetcher, and load weights into their own process. Open three such tools and you are paying for three copies of the same thirteen-billion-parameter model in RAM, three different model stores on disk, and three competing fetch paths.

inferd inverts that. One daemon holds the warm models — one for generation, one for embeddings — and exposes a small, frozen wire protocol. Every other tool — CLI assistants, IDE plugins, agent runtimes, web apps, middleware — connects as a client. No engine bundling. No second copy. No fight over the GPU.

§2

Why this exists

  1. RAM is finite, models are large.

    A 7B model in 8-bit quantization is ~7 GB resident. Two consumers running their own copy means two copies in RAM. Five means a swap-thrashed laptop. One warm copy, shared, means the math stops getting worse the more tools you adopt.

  2. Cold starts ruin local UX.

    Loading a model and allocating its KV-cache takes seconds. Doing it at every CLI invocation makes local inference feel slower than a hosted API. inferd loads once, at boot. After that, requests are warm.

  3. Per-app engines drift.

    Three vendored copies of llama.cpp means three different versions, three different GGUF compatibility matrices, three different fetch-and-verify implementations. inferd centralises all of it behind a contract that does not care which engine is underneath.

  4. The wire protocol is the product.

    NDJSON over a Unix socket on Linux/macOS, a named pipe on Windows. 64 MiB per-frame cap. Frozen at v1. Anything on the host that can open a socket can drive it.
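
A minimal sketch of what that framing asks of a client, in Python. Only the one-JSON-object-per-line shape and the 64 MiB cap come from the text above. The function names, the choice to raise ValueError, and the decision to count the cap without the trailing newline are assumptions for illustration, not inferd's API.

```python
import json

MAX_FRAME_BYTES = 64 * 1024 * 1024  # 64 MiB per-frame cap, as stated above


def encode_frame(obj: dict) -> bytes:
    """Serialise one message as a single NDJSON line (compact JSON plus newline)."""
    line = json.dumps(obj, separators=(",", ":"), ensure_ascii=False).encode("utf-8")
    # Assumption: the cap counts the JSON bytes, excluding the newline.
    if len(line) > MAX_FRAME_BYTES:
        raise ValueError(f"frame is {len(line)} bytes, over the {MAX_FRAME_BYTES}-byte cap")
    return line + b"\n"


def decode_frames(buffer: bytes) -> tuple[list[dict], bytes]:
    """Split a receive buffer into complete frames plus the unconsumed tail."""
    *complete, tail = buffer.split(b"\n")
    if len(tail) > MAX_FRAME_BYTES:
        raise ValueError("peer sent more than one frame cap without a newline")
    frames = [json.loads(line) for line in complete if line]
    return frames, tail
```

A reader loop would append each received chunk to the tail returned by the previous call and pass the result back into decode_frames, so a frame split across two reads is parsed only once it is complete.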

§3

The shape of it

                    ┌──────────────────────────────────────┐
                    │            user-space tools          │
                    │  cli · ide plugin · agent · web app  │
                    └──────────────────────────────────────┘
                         │       │       │       │
                         │ ndjson over uds / named pipe
                         │ frame cap 64 MiB · v1 frozen
                         ▼       ▼       ▼       ▼
                    ┌──────────────────────────────────────┐
                    │                inferd                │
                    │   admission queue · 1 active + 10    │
                    │   backend trait · router · activity  │
                    └──────────────────────────────────────┘
                                      │
                                      ▼
                    ┌──────────────────────────────────────┐
                    │   warm models in memory              │
                    │   ├─ generate · gemma-4-e4b          │
                    │   └─ embed    · embeddinggemma-300m  │
                    │   llama.cpp via FFI (default)        │
                    └──────────────────────────────────────┘
                                      │
                                      ▼
                    ┌──────────────────────────────────────┐
                    │  shared content-addressable store    │
                    │  $MODELS_HOME/blobs/sha256/aa/...    │
                    └──────────────────────────────────────┘
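
The "1 active + 10" admission queue in the diagram can be pictured with a small model like the one below. This is a sketch of the idea only: the class name, the method names, and in particular the behaviour on overflow (immediate rejection) are assumptions, since the text above states the capacities and nothing more.

```python
from collections import deque
from typing import Optional


class AdmissionQueue:
    """Bounded admission: at most `max_active` running and `max_queued` waiting."""

    def __init__(self, max_active: int = 1, max_queued: int = 10) -> None:
        self.max_active = max_active
        self.max_queued = max_queued
        self.active = set()    # ids of requests currently running
        self.queued = deque()  # ids of waiting requests, oldest first

    def admit(self, request_id: str) -> str:
        """Return 'active', 'queued', or 'rejected' for a new request."""
        if len(self.active) < self.max_active:
            self.active.add(request_id)
            return "active"
        if len(self.queued) < self.max_queued:
            self.queued.append(request_id)
            return "queued"
        # Assumption: overflow is refused immediately rather than blocking.
        return "rejected"

    def release(self, request_id: str) -> Optional[str]:
        """Finish a request; promote and return the oldest waiter, if any."""
        self.active.discard(request_id)
        if self.queued and len(self.active) < self.max_active:
            promoted = self.queued.popleft()
            self.active.add(promoted)
            return promoted
        return None
```

With the default capacities, the first request runs, the next ten wait in arrival order, and a twelfth is turned away until a slot frees up.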

Generate

Send a messages array, get tokens back as they're produced, terminated by one done frame. No retry, no fallback, no rewriting prompts — the daemon stays out of your way.

Embed

Send text, get a 768-dim vector, or a 256-dim one via Matryoshka truncation. Inputs are bounded: one that exceeds the limit returns a structured error instead of crashing the daemon (issue #20, fixed in v0.2.4).
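
For readers unfamiliar with Matryoshka truncation, the sketch below shows the usual client-side form of it: keep the leading components and rescale to unit length. The function name is hypothetical, and the renormalisation step is an assumption based on common practice; the text above does not say whether inferd truncates or renormalises on the daemon side.

```python
import math


def truncate_mrl(vector: list[float], dims: int = 256) -> list[float]:
    """Keep the first `dims` components and rescale the result to unit length."""
    if not 0 < dims <= len(vector):
        raise ValueError(f"dims must be in 1..{len(vector)}, got {dims}")
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # all-zero prefix: nothing to normalise
    return [x / norm for x in head]
```

The shorter vector trades some retrieval quality for a third of the storage, which is the point of training the embedding model with a Matryoshka objective in the first place.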

Admin

A separate socket exposes ready-state, queue depth, model load progress, and activity events. Progress UIs connect during boot, before the inference socket even opens.

One generate request, on the wire
// client → daemon (one NDJSON line)
{"id":"r1","messages":[{"role":"user","content":"hi"}],"max_tokens":32}

// daemon → client (streamed)
{"type":"token","id":"r1","text":"Hello"}
{"type":"token","id":"r1","text":" there"}
{"type":"token","id":"r1","text":"!"}
{"type":"done","id":"r1","content":"Hello there!","usage":{"prompt_tokens":12,"completion_tokens":3},"backend":"llamacpp"}
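
A client consuming that stream might look roughly like the sketch below. It relies only on the frame shapes shown in the example (token frames carrying text, one done frame carrying content); the function name, the decision to skip frames with a different id, and the error raised when the stream ends early are assumptions.

```python
import json
from typing import Iterable


def collect_stream(lines: Iterable[str], request_id: str) -> dict:
    """Accumulate token frames for one request until its done frame arrives."""
    parts = []
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines between frames
        frame = json.loads(line)
        if frame.get("id") != request_id:
            continue  # assumption: frames for other requests can be skipped
        if frame.get("type") == "token":
            parts.append(frame["text"])
        elif frame.get("type") == "done":
            return {"streamed": "".join(parts), "done": frame}
    raise ConnectionError("stream ended before a done frame")
```

In the example above, the concatenated token text and the content field of the done frame are the same string, so a caller that does not need incremental output can ignore the token frames and read content alone.
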

§4

Install

Pre-built archives (tarballs for Linux x86_64 and macOS aarch64, a zip for Windows x86_64) are attached to every release. Each platform ships a per-user installer that drops the daemon in your local prefix and registers it with that platform's per-user startup mechanism.

Linux

systemd --user unit. No sudo, no system-wide service.

curl -L \
  https://github.com/3rg0n/inferd/releases/download/v0.2.4/inferd-v0.2.4-x86_64-unknown-linux-gnu.tar.gz \
  -o inferd.tar.gz
tar xzf inferd.tar.gz
./packaging/systemd/install-user.sh

macOS

launchd LaunchAgent, per-user, Apple Silicon.

curl -L \
  https://github.com/3rg0n/inferd/releases/download/v0.2.4/inferd-v0.2.4-aarch64-apple-darwin.tar.gz \
  -o inferd.tar.gz
tar xzf inferd.tar.gz
./packaging/launchd/install-launchagent.sh

Windows

Per-user Startup-folder shortcut. No SCM service.

Invoke-WebRequest `
  https://github.com/3rg0n/inferd/releases/download/v0.2.4/inferd-v0.2.4-x86_64-pc-windows-msvc.zip `
  -OutFile inferd.zip
Expand-Archive inferd.zip -DestinationPath .
.\packaging\windows\install-user.ps1

Always verify the SHA-256 of the downloaded archive against the value on the release page before extracting.
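
If no checksum tool is at hand, the check can be done with a few lines of Python. This is a generic sketch using the standard hashlib module; the function names are hypothetical, and the published digest has to be copied from the release page by hand.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so a large archive never sits fully in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path: str, published_hex: str) -> bool:
    """Compare the local digest with the hex value copied from the release page."""
    return sha256_of(path) == published_hex.strip().lower()
```

Calling matches_published("inferd.tar.gz", "<digest from the release page>") should return True before the archive is extracted; a False result means the download is corrupt or is not the file the release describes.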

§5

What it isn't

inferd has a deliberately small surface. The list of things it will never do is part of the design. These are not feature gaps; they are scope decisions.

  • Never: an HTTP server. IPC only.
  • Never: an OpenAI-compatible surface in the daemon. Build it as a separate process if you want one.
  • Never: per-request backend override on the wire. Operators choose backends, callers don't.
  • Never: in-daemon retry, fallback, or mid-stream failover. The caller owns retry.
  • Never: a multi-generate-model warm pool. One generate model per process (an embed model alongside is fine: different role). Need two generate models? Run two daemons.
  • Never: breaking wire-protocol changes. v1 is frozen. v2 will live on a separate socket.
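
Because the caller owns retry, a client usually wraps its request in something like the sketch below. Everything here is an assumption made for illustration: the function name, the exception types treated as retryable, and the backoff schedule. None of it is part of inferd or its protocol.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    attempt: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `attempt`, retrying connection-level failures with exponential backoff."""
    for n in range(max_attempts):
        try:
            return attempt()
        except (ConnectionError, TimeoutError):
            if n == max_attempts - 1:
                raise  # out of attempts: surface the last failure to the caller
            sleep(base_delay * 2 ** n)  # 0.5 s, 1 s, 2 s, ...
    raise ValueError("max_attempts must be at least 1")
```

Since the daemon does no mid-stream failover, a retried generate request starts again from the first token, so a caller that has already shown partial output needs to discard it before retrying.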