<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>Tprrt's Blog - Self-Hosting</title><link href="https://tprrt.tupi.fr/" rel="alternate"/><link href="https://tprrt.tupi.fr/feeds/self-hosting.atom.xml" rel="self"/><id>https://tprrt.tupi.fr/</id><updated>2026-06-23T21:00:00+02:00</updated><subtitle>Embedded Linux, Zephyr RTOS, open-source hardware, Linux gaming, retro gaming, and competitive fitness</subtitle><entry><title>llm-companion: A Self-Hosted, Privacy-First AI Coding Assistant</title><link href="https://tprrt.tupi.fr/llm-companion-self-hosted-ai-coding-assistant.html" rel="alternate"/><published>2026-06-23T21:00:00+02:00</published><updated>2026-06-23T21:00:00+02:00</updated><author><name>tperrot</name></author><id>tag:tprrt.tupi.fr,2026-06-23:/llm-companion-self-hosted-ai-coding-assistant.html</id><summary type="html">&lt;p class="first last"&gt;A look at llm-companion, a rootless Podman Kubernetes Pod stack
that turns a spare Fedora or Debian box into a private, hardware-aware
Ollama server for OpenCode — with a practical guide to deploying it.&lt;/p&gt;
</summary><content type="html">&lt;p&gt;&lt;strong&gt;I did not want my code to leave my network.&lt;/strong&gt; Every agentic coding session
sends a stream of file contents, project structure, and half-finished thoughts
to whatever model answers the prompts. Routing all of that through a third-party
API felt like the wrong default, even when the provider is trustworthy: it is
a recurring cost for routine work, it stops working the moment the LAN or VPN
does not reach the internet, and it teaches me nothing about how the serving
side of an LLM stack actually behaves under constrained hardware. So I built
&lt;a class="reference external" href="https://github.com/tprrt/llm-companion"&gt;llm-companion&lt;/a&gt;, a rootless Ollama stack for Fedora Server and Debian that I
can run on a spare machine at home and point &lt;a class="reference external" href="https://opencode.ai"&gt;OpenCode&lt;/a&gt; at, with cloud
providers wired in only as an explicit fallback rather than the default path.&lt;/p&gt;
&lt;p&gt;This article walks through what the stack looks like, why it is built the way
it is, and how to deploy it yourself.&lt;/p&gt;
&lt;div class="section" id="what-llm-companion-is"&gt;
&lt;h2&gt;What llm-companion Is&lt;/h2&gt;
&lt;p&gt;At its core, &lt;a class="reference external" href="https://github.com/tprrt/llm-companion"&gt;llm-companion&lt;/a&gt; is a single Kubernetes Pod manifest
(&lt;tt class="docutils literal"&gt;kube/stack.yml&lt;/tt&gt;) deployed by Ansible, running five containers that share
one network namespace:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;Internet / LAN / VPN
        │
     :8080  ← firewalld / ufw opens only this port
        │
┌────────────────────────────────────────────────────────┐
│  llm-companion Pod  (shared network namespace)         │
│                                                        │
│  ┌──────────────────────────────────────────────────┐  │
│  │  caddy  :8080 (hostPort)                         │  │
│  │  Bearer token auth on /ollama/api/* /ollama/v1/* │  │
│  │  Passes /searxng/* to SearXNG (Bearer token)     │  │
│  │  Passes / to Open WebUI                          │  │
│  └───────────────────┬──────────────────────────────┘  │
│                      │ localhost                       │
│  ┌───────────────────▼──┐  ┌───────────────────────┐   │
│  │  ollama :11434       │  │  open-webui :3000     │   │
│  │  (internal)          │  └────────┬──────────┬───┘   │
│  └──────────────────────┘           │          │       │
│                              ┌──────▼──┐  ┌────▼──────┐│
│                              │ searxng │  │   open-   ││
│                              │  :8888  │  │ terminal  ││
│                              │         │  │  :8000    ││
│                              └─────────┘  └───────────┘│
└────────────────────────────────────────────────────────┘
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;a class="reference external" href="https://ollama.com"&gt;Ollama&lt;/a&gt; serves the models, &lt;a class="reference external" href="https://github.com/open-webui/open-webui"&gt;Open WebUI&lt;/a&gt; provides a chat interface with
document/RAG support, &lt;a class="reference external" href="https://github.com/searxng/searxng"&gt;SearXNG&lt;/a&gt; gives the chat agent web search without
tying queries to a profile at any single search provider, and Open Terminal gives the agent a sandboxed
shell. &lt;a class="reference external" href="https://caddyserver.com"&gt;Caddy&lt;/a&gt; is the only container exposed to the host network, and it
enforces a Bearer token on every API route.&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="https://github.com/open-webui/open-webui"&gt;Open WebUI&lt;/a&gt; is the browser-facing piece: besides the chat interface, it
keeps its own user accounts and conversation history, and lets you upload
documents for retrieval-augmented generation without standing up a separate
vector store just for that.&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="https://github.com/searxng/searxng"&gt;SearXNG&lt;/a&gt; is a self-hosted metasearch engine: it queries several
upstream search engines on the agent's behalf and aggregates the results, so no
single provider receives a cookie- or account-linked search history. That keeps the
agent's web-search tool consistent with the rest of the stack's
no-third-party-by-default stance.&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="https://caddyserver.com"&gt;Caddy&lt;/a&gt; is the reverse proxy and, as noted above, the only place the API Bearer
token is enforced. It is also the only container that would need to know about TLS,
so adding HTTPS later (if this ever leaves the LAN) is a Caddyfile change,
not a new container.&lt;/p&gt;
&lt;p&gt;Open Terminal gives a sandboxed shell on the pod, reachable from the
browser — useful for checking logs or restarting a service without opening
a separate SSH session.&lt;/p&gt;
&lt;p&gt;The whole thing targets two use cases: chatting through Open WebUI from a
browser, and routing &lt;a class="reference external" href="https://opencode.ai"&gt;OpenCode&lt;/a&gt;'s agentic coding sessions through Ollama's
OpenAI-compatible API — the same workflow you would normally point at Claude
or GPT-4o, but served from hardware you control.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="why-it-is-built-this-way"&gt;
&lt;h2&gt;Why It Is Built This Way&lt;/h2&gt;
&lt;p&gt;A few decisions in the stack are not obvious from the README's quick-start, but
they are the parts I actually learned something from.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;One Pod, one exposed port.&lt;/strong&gt; All five containers share a single network
namespace and talk to each other over &lt;tt class="docutils literal"&gt;localhost&lt;/tt&gt;, not DNS names. Only Caddy
publishes a &lt;tt class="docutils literal"&gt;hostPort&lt;/tt&gt;. This means the firewall rule is one line
(&lt;tt class="docutils literal"&gt;8080/tcp&lt;/tt&gt;), and there is exactly one place — the Caddyfile — where
authentication is enforced. Open WebUI and Open Terminal are never reachable
directly, even from the LAN.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Bearer auth at the proxy, not in each service.&lt;/strong&gt; Ollama and SearXNG have no
authentication of their own. Caddy receives every request and checks the
Bearer token before forwarding anything under &lt;tt class="docutils literal"&gt;/ollama/api/*&lt;/tt&gt;, &lt;tt class="docutils literal"&gt;/ollama/v1/*&lt;/tt&gt;, or
&lt;tt class="docutils literal"&gt;/searxng/*&lt;/tt&gt;. Open WebUI keeps its own login, since it already has user
accounts. Centralizing auth in the proxy means rotating the key
(&lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;generate-api-key.sh&lt;/span&gt;&lt;/tt&gt;) only touches one Kubernetes Secret, not three
services' configs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;A hardware-aware model picker instead of a fixed model list.&lt;/strong&gt; Self-hosted
LLM advice tends to assume either a beefy GPU or hand-picking quantizations
yourself. &lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;pull-models.sh&lt;/span&gt;&lt;/tt&gt; detects architecture (x86_64/aarch64), accelerator
(CPU, AMD ROCm, NVIDIA CUDA), and available RAM/VRAM, then selects the best
model per category (coding, vision, general, embedding) that actually fits —
down to a 1.5B coding model and 1.7B reasoning model on a 2 GB ARM64 board, up
to Devstral Small 2 24B on a 16 GB+ GPU. &lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;--list&lt;/span&gt;&lt;/tt&gt; shows the plan before
pulling anything.&lt;/p&gt;
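&lt;p&gt;To make the gating idea concrete, here is a minimal sketch of a memory-to-tier lookup. It is not the logic of &lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;pull-models.sh&lt;/span&gt;&lt;/tt&gt;: the function names, tier labels, and thresholds are assumptions chosen for illustration, and the real script also takes architecture and accelerator into account.&lt;/p&gt;

```shell
# Illustrative only: tier labels and thresholds are assumptions,
# not the values used by pull-models.sh.

# Total system RAM in whole GiB, read from /proc/meminfo (Linux only).
total_ram_gib() {
  awk '/^MemTotal:/ { print int($2 / 1024 / 1024) }' /proc/meminfo
}

# Map an amount of memory (whole GiB) to a hypothetical model tier.
pick_tier() {
  local mem_gib="$1"
  if [ "$mem_gib" -ge 16 ]; then
    echo "large"
  elif [ "$mem_gib" -ge 8 ]; then
    echo "medium"
  elif [ "$mem_gib" -ge 4 ]; then
    echo "small"
  else
    echo "tiny"
  fi
}
```

&lt;p&gt;Calling &lt;tt class="docutils literal"&gt;pick_tier &amp;quot;$(total_ram_gib)&amp;quot;&lt;/tt&gt; prints the tier for the current machine. Keeping the thresholds in one function is what lets the rest of the stack stay unaware of the hardware it runs on.&lt;/p&gt;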
&lt;p&gt;&lt;strong&gt;Quadlet over a bare &lt;tt class="docutils literal"&gt;podman run&lt;/tt&gt;.&lt;/strong&gt; The pod is managed by a Quadlet
&lt;tt class="docutils literal"&gt;.kube&lt;/tt&gt; unit, which gives it normal systemd semantics — &lt;tt class="docutils literal"&gt;systemctl &lt;span class="pre"&gt;--user&lt;/span&gt;
restart &lt;span class="pre"&gt;llm-companion&lt;/span&gt;&lt;/tt&gt;, automatic restart on failure, and
&lt;tt class="docutils literal"&gt;AutoUpdate=registry&lt;/tt&gt; so a &lt;tt class="docutils literal"&gt;podman &lt;span class="pre"&gt;auto-update&lt;/span&gt;&lt;/tt&gt; timer can pull newer
builds of the pinned image tags without manual intervention. Rootless throughout, with
&lt;tt class="docutils literal"&gt;loginctl &lt;span class="pre"&gt;enable-linger&lt;/span&gt;&lt;/tt&gt; so the user service survives without an active login
session — important for a box that is meant to just sit there and serve
requests.&lt;/p&gt;
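&lt;p&gt;In day-to-day use, the systemd integration comes down to a handful of commands. The wrappers below are only a sketch: the function names are hypothetical, the unit name is taken from the restart command above, and the log helper assumes the user journal is available on the host.&lt;/p&gt;

```shell
# Unit name as used in the restart command above.
UNIT="llm-companion"

# Show the current state of the pod's systemd user unit.
stack_status() {
  systemctl --user status "$UNIT" --no-pager
}

# Follow the unit's logs from the user journal.
stack_logs() {
  journalctl --user --unit "$UNIT" --follow
}

# Preview which images an auto-update would refresh, without changing anything.
stack_updates() {
  podman auto-update --dry-run
}
```

&lt;p&gt;Run these as the same unprivileged user that owns the pod, since both the unit and the containers live in that user's session.&lt;/p&gt;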
&lt;/div&gt;
&lt;div class="section" id="deploying-it"&gt;
&lt;h2&gt;Deploying It&lt;/h2&gt;
&lt;p&gt;The fastest way to see the stack end-to-end is &lt;a class="reference external" href="https://github.com/tprrt/llm-companion/blob/master/scripts/vm.sh"&gt;vm.sh&lt;/a&gt;, which provisions a
QEMU/KVM VM running the exact same Ansible playbook and &lt;tt class="docutils literal"&gt;kube/stack.yml&lt;/tt&gt; used
on real hardware:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;sudo&lt;span class="w"&gt; &lt;/span&gt;dnf&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;qemu-kvm&lt;span class="w"&gt; &lt;/span&gt;qemu-img&lt;span class="w"&gt; &lt;/span&gt;wget&lt;span class="w"&gt; &lt;/span&gt;curl&lt;span class="w"&gt; &lt;/span&gt;genisoimage
sudo&lt;span class="w"&gt; &lt;/span&gt;usermod&lt;span class="w"&gt; &lt;/span&gt;-aG&lt;span class="w"&gt; &lt;/span&gt;kvm&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$USER&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;newgrp&lt;span class="w"&gt; &lt;/span&gt;kvm

git&lt;span class="w"&gt; &lt;/span&gt;clone&lt;span class="w"&gt; &lt;/span&gt;https://github.com/tprrt/llm-companion
&lt;span class="nb"&gt;cd&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;llm-companion

./scripts/vm.sh&lt;span class="w"&gt; &lt;/span&gt;build&lt;span class="w"&gt;           &lt;/span&gt;&lt;span class="c1"&gt;# one-time provisioning (~golden image)&lt;/span&gt;
./scripts/vm.sh&lt;span class="w"&gt; &lt;/span&gt;start&lt;span class="w"&gt;           &lt;/span&gt;&lt;span class="c1"&gt;# boots in ~2 minutes from there on&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This is how I iterate on the stack itself — rebuild the golden image after a
change, boot, check the services, tear down — without touching real hardware.&lt;/p&gt;
&lt;p&gt;For an actual deployment, copy the example inventory and point it at your
server:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;cp&lt;span class="w"&gt; &lt;/span&gt;ansible/inventory/hosts.yml.example&lt;span class="w"&gt; &lt;/span&gt;ansible/inventory/hosts.yml
&lt;span class="nv"&gt;$EDITOR&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;ansible/inventory/hosts.yml
&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="nt"&gt;all&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;children&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;llm_companion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;hosts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;my-server&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="nt"&gt;ansible_host&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;192.168.1.100&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="nt"&gt;ansible_user&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;fedora&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="nt"&gt;ansible_ssh_private_key_file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;~/.ssh/id_ed25519&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then run the playbook:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;ansible-playbook&lt;span class="w"&gt; &lt;/span&gt;-i&lt;span class="w"&gt; &lt;/span&gt;ansible/inventory/hosts.yml&lt;span class="w"&gt; &lt;/span&gt;ansible/site.yml
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;It handles, in order: creating the required directories and enabling linger
(&lt;tt class="docutils literal"&gt;common&lt;/tt&gt;); opening port 8080 via firewalld or ufw (&lt;tt class="docutils literal"&gt;firewall&lt;/tt&gt;);
installing Podman and building the Ollama image (&lt;tt class="docutils literal"&gt;podman&lt;/tt&gt;); and generating the
API key, installing &lt;tt class="docutils literal"&gt;stack.yml&lt;/tt&gt;, and starting the systemd service
(&lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;llm-stack&lt;/span&gt;&lt;/tt&gt;). The playbook is idempotent, so you can re-run it any time you
change the inventory or pull new code.&lt;/p&gt;
&lt;p&gt;Pull models sized to your hardware:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;./scripts/pull-models.sh&lt;span class="w"&gt; &lt;/span&gt;--list&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="c1"&gt;# dry run — see what would be pulled&lt;/span&gt;
./scripts/pull-models.sh&lt;span class="w"&gt;           &lt;/span&gt;&lt;span class="c1"&gt;# pull the best model per category&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;On an AMD GPU host, re-run Ansible with &lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;-e&lt;/span&gt; &amp;quot;ollama_build_target=rocm&amp;quot;&lt;/tt&gt; first
to build the ROCm image and deploy &lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;stack-rocm.yml&lt;/span&gt;&lt;/tt&gt; instead, which grants the
container access to &lt;tt class="docutils literal"&gt;/dev/kfd&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;/dev/dri&lt;/tt&gt;.&lt;/p&gt;
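&lt;p&gt;Before switching to the ROCm manifest, it can help to confirm that both device nodes actually exist on the host. A minimal sketch follows; &lt;tt class="docutils literal"&gt;check_devices&lt;/tt&gt; is a hypothetical helper and not part of the repository.&lt;/p&gt;

```shell
# Hypothetical helper: report whether each given path exists on this host.
check_devices() {
  local dev
  for dev in "$@"; do
    if [ -e "$dev" ]; then
      echo "ok $dev"
    else
      echo "missing $dev"
    fi
  done
}
```

&lt;p&gt;On the target host, &lt;tt class="docutils literal"&gt;check_devices /dev/kfd /dev/dri&lt;/tt&gt; should report both paths as present. If either is missing, the GPU driver is probably not loaded and the ROCm image will have nothing to talk to.&lt;/p&gt;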
&lt;/div&gt;
&lt;div class="section" id="wiring-up-opencode"&gt;
&lt;h2&gt;Wiring Up OpenCode&lt;/h2&gt;
&lt;p&gt;On the client machine, point &lt;a class="reference external" href="https://opencode.ai"&gt;OpenCode&lt;/a&gt; at the server through its
OpenAI-compatible provider config (&lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;~/.config/opencode/opencode.json&lt;/span&gt;&lt;/tt&gt;):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;$schema&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;https://opencode.ai/config.json&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;model&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;ollama/qwen3-8b-16k&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;provider&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;ollama&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;npm&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;@ai-sdk/openai-compatible&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;name&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Ollama&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;options&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;baseURL&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;http://&amp;lt;server-ip&amp;gt;:8080/ollama/v1&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;headers&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;Authorization&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Bearer sk-ollama-&amp;lt;your-key&amp;gt;&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;models&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;qwen3-8b-16k&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;name&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Qwen3 8B — coding/vision/general (16k)&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;tools&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The key is printed at the end of the Ansible run and stored in
&lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;~/.config/ollama/api-key.env&lt;/span&gt;&lt;/tt&gt; on the server. Switch models at any time with
&lt;tt class="docutils literal"&gt;/models&lt;/tt&gt; inside OpenCode — no restart needed.&lt;/p&gt;
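&lt;p&gt;It is worth checking the key from the client before involving OpenCode at all. The sketch below assumes the model-listing route of Ollama's OpenAI-compatible API is reachable under &lt;tt class="docutils literal"&gt;/ollama/v1/&lt;/tt&gt;; the server address and key are placeholders, and the function names are hypothetical.&lt;/p&gt;

```shell
# Placeholders (assumptions): override with your server address and real key.
SERVER="${SERVER:-192.168.1.100}"
API_KEY="${API_KEY:-sk-ollama-your-key}"

# Base URL of the OpenAI-compatible API behind Caddy, as used in opencode.json.
base_url() {
  printf 'http://%s:8080/ollama/v1' "$1"
}

# Ask the proxy for the model list; curl exits non-zero on an HTTP error status.
list_models() {
  curl --fail --silent --show-error \
    --header "Authorization: Bearer ${API_KEY}" \
    "$(base_url "$SERVER")/models"
}
```

&lt;p&gt;If the key is accepted, &lt;tt class="docutils literal"&gt;list_models&lt;/tt&gt; should print a JSON list of the models pulled on the server; if it is rejected, curl reports the error and exits non-zero. The model identifiers in that list are the ones to use under &lt;tt class="docutils literal"&gt;models&lt;/tt&gt; in the OpenCode config.&lt;/p&gt;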
&lt;p&gt;Cloud providers (Anthropic, GitHub Copilot) can sit alongside the &lt;tt class="docutils literal"&gt;ollama&lt;/tt&gt;
provider in the same config, switched to with the same &lt;tt class="docutils literal"&gt;/models&lt;/tt&gt; command.
That is the fallback path I mentioned earlier: the local stack is the default,
and the cloud is one keystroke away when the network or the hardware cannot
keep up — travelling, a model too large for the box, or the service simply
being down.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="lessons-learned"&gt;
&lt;h2&gt;Lessons Learned&lt;/h2&gt;
&lt;p&gt;Rootless GPU access was the part that fought back the most. ROCm needs
&lt;tt class="docutils literal"&gt;/dev/kfd&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;/dev/dri&lt;/tt&gt; inside the container, which in turn needs
&lt;tt class="docutils literal"&gt;securityContext.privileged: true&lt;/tt&gt; — there is no narrower rootless path to
those device nodes today, so the ROCm variant trades some of the isolation
the CPU variant gets for free. That trade-off is explicit in the stack
(&lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;stack-rocm.yml&lt;/span&gt;&lt;/tt&gt; is a separate manifest, not a flag on the default one),
and the documentation recommends a dedicated host for it rather than a shared one.&lt;/p&gt;
&lt;p&gt;The hardware-aware model picker turned out to matter more than I expected.
Hand-picking a quantization for &amp;quot;your&amp;quot; machine works fine for one machine; it
falls apart the moment the same playbook needs to run unchanged on a 2 GB
ARM64 board, an 8 GB CPU-only Fedora box, and a 16 GB GPU desktop. Encoding the
RAM/VRAM gates once, in one script, meant the rest of the stack — Ansible role,
Quadlet unit, Caddy config — never needed to know which tier it was running
on.&lt;/p&gt;
&lt;p&gt;The other recurring theme: most of the actual engineering here is not in
Ollama at all, it is in the boring infrastructure around it — one auth
boundary, one exposed port, one systemd unit, one script that adapts to
whatever box it lands on. That boring part is also what makes me comfortable
leaving it running unattended.&lt;/p&gt;
&lt;/div&gt;
</content><category term="Self-Hosting"/><category term="self-hosting"/><category term="llm"/><category term="ollama"/><category term="podman"/><category term="kubernetes"/><category term="ansible"/><category term="opencode"/><category term="privacy"/><category term="article"/></entry></feed>