Bleeding Llama: When Local Ollama Leaks Memory Over the Network

CVE-2026-7482: crafted GGUF leaks heap via quantize—prompts, keys, neighbor chats. I explain Bleeding Llama and how I lock down Ollama. 🔐


If you’ve pointed relatives at “run Llama on your laptop” guides, you’ve probably sold local inference as privacy. That story isn’t wrong—but network-exposed Ollama isn’t “just on my machine.” It’s a daemon strangers can talk to unless you cage it.

Researchers named it Bleeding Llama (CVE-2026-7482, CVSS 9.1, CWE‑125 out-of-bounds read) because a malicious GGUF upload plus model creation can pull adjacent heap bytes (conversation text, prompts, sometimes environment variables) into an artifact an attacker can push off-box. The upstream REST surface behind /api/create and /api/push ships without authentication; combine that with 0.0.0.0 binds (common in Docker/README patterns), and the blast radius gets ugly fast.

I’m writing this like I’d explain it over tea to someone I care about: what breaks, how attackers chain it, and what I’d actually do Monday morning.

If you upload models or call /api/create against anything older than Ollama 0.17.1, stop reading and upgrade first—the fix landed in v0.17.1 (PR #14406). Everything below assumes you’re patched or air-gapped.

Two sentences on GGUF (because the bug lives there)

GGUF stores weights as tensors: named arrays with a shape, a type, and an offset into the file blob. When Ollama quantizes or converts a model (say F16→F32), it walks the tensors and processes as many elements as each shape implies.

GGUF is attacker-controlled binary. If the header claims ten million elements but the file only holds a postage stamp of bytes, an unchecked loop will read past the buffer—classic heap out-of-bounds read.
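
To put numbers on that mismatch, here is a minimal Python sketch of the arithmetic. The helper names and the 2-byte F16 element size are my own illustration, not Ollama's code.

```python
from math import prod

F16_BYTES = 2  # size of one half-precision element


def declared_bytes(shape, elem_size=F16_BYTES):
    """Bytes a reader will touch if it trusts the header's shape."""
    return prod(shape) * elem_size


def overshoot(shape, payload_len, elem_size=F16_BYTES):
    """How many bytes past the real payload an unchecked loop would read."""
    return max(0, declared_bytes(shape, elem_size) - payload_len)


# Header claims ten million F16 elements; the file actually carries 64 bytes.
print(overshoot((10_000_000,), payload_len=64))  # 19999936
```

Every one of those extra bytes belongs to whatever else the process had on the heap.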

Why “memory-safe Go” still bleeds here

Go is safe until it isn’t. Ollama’s hot path uses unsafe for speed during WriteTo() conversion (see NVD’s description: fs/ggml/gguf.go, server/quantization.go). Cyera’s analysis spells out the trap: conversion calls like ConvertToF32 trust Elements()—the product of tensor dimensions—to size the read. Inflate the dimensions, you inflate the read.

Researchers kept leaked bytes losslessly by picking conversions that preserve bits—F16→F32—so secrets survive quantization instead of turning into numeric mush (Cyera Research).
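
Why F16→F32 matters is easier to see with a toy round trip. This is a sketch using Python's struct module, not the researchers' tooling; the sample string is a made-up stand-in for stray heap bytes.

```python
import struct


def f16_bytes_to_f32(raw):
    """Reinterpret raw bytes as little-endian F16 values and widen them to F32."""
    count = len(raw) // 2
    halves = struct.unpack(f"<{count}e", raw[: count * 2])
    return struct.pack(f"<{count}f", *halves)


def f32_bytes_to_f16(raw):
    """Narrow F32 values back to F16 to recover the original byte pattern."""
    count = len(raw) // 4
    floats = struct.unpack(f"<{count}f", raw[: count * 4])
    return struct.pack(f"<{count}e", *floats)


leaked = b"API_KEY=sk-test1"         # hypothetical bytes sitting next to the tensor
widened = f16_bytes_to_f32(leaked)    # what an F16->F32 conversion would write out
assert f32_bytes_to_f16(widened) == leaked
```

Every finite F16 value is exactly representable in F32, so narrowing brings the original bytes back. One caveat about this sketch: byte pairs that happen to decode as NaN may not survive Python's float handling, so treat it as an illustration rather than a faithful decoder.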

The chain I picture when someone says “three API calls”

Unauthenticated client talks to your daemon:

  1. Stage a crafted GGUF via the blob upload endpoint (POST /api/blobs/sha256:[digest] pattern; the digest must match the body hash).
  2. Trigger /api/create referencing that blob with quantize set so WriteTo() executes—this is where the OOB read fires and foreign heap bytes mix into the output tensor buffers.
  3. Exfil with /api/push. Cyera noted Ollama accepts a model name that looks like a URL and will ship the artifact to an attacker-controlled registry endpoint—so the leaked material leaves your network as a “model” upload.

Bleeding Llama — conceptual chain

Attacker stages a blob, triggers quantized create (OOB read), then pushes the resulting artifact outward.

sequenceDiagram
participant A as "Attacker"
participant O as "Ollama"
participant R as "Exfil host"
A->>O: POST blob (crafted GGUF digest)
Note over O: Stores blob by SHA-256 key
A->>O: POST /api/create (quantize path)
O->>O: Quantization reads past buffer (heap leak into tensors)
A->>O: POST /api/push (model name as URL)
O->>R: Upload artifact with leaked bytes

That isn’t magic—it’s trust boundaries dissolving: anyone who can reach the API becomes someone who can stage model bytes and request quantization.
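
If you already log requests at a proxy in front of Ollama, the chain has a recognizable shape you can look for. The sketch below is a rough heuristic of my own, assuming access logs parsed into (client_ip, method, path) tuples in time order; that record shape is hypothetical, and legitimate users who build and publish models will trip it too.

```python
STAGES = ("blob", "create", "push")


def classify(method, path):
    """Map one request to a chain stage, or None if it is unrelated."""
    if method in ("POST", "PUT") and path.startswith("/api/blobs/"):
        return "blob"
    if method == "POST" and path == "/api/create":
        return "create"
    if method == "POST" and path == "/api/push":
        return "push"
    return None


def clients_completing_chain(records):
    """Return client IPs that hit blob upload, create, and push in that order."""
    progress = {}  # client_ip -> index of the next stage we expect from it
    flagged = []
    for ip, method, path in records:
        stage = classify(method, path)
        if stage is None:
            continue
        expected = progress.get(ip, 0)
        if expected < len(STAGES) and stage == STAGES[expected]:
            progress[ip] = expected + 1
            if progress[ip] == len(STAGES):
                flagged.append(ip)
    return flagged
```

I'd use a hit as a reason to look closer at that client, not as proof of compromise.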

What actually leaks

Cyera demoed recovered user prompts, system prompts, and environment variables from the server process—exactly what you’d fear on a shared inference box: neighbor tenants’ chat fragments, integration secrets in env vars, maybe API keys your IDE injected into the environment. Secondary reporting (SecurityWeek, The Hacker News) cites large internet-exposed populations—your risk hinges on exposure and patch state, not headline counts.

If you bridge tools like Claude Code into local inference, tool transcripts can land in process memory too—same heap, same leak class.

Am I affected? A 60-second check

Two questions decide your blast radius: old version? and open bind? Either alone is bad; both together is the worst-case combo.

# 1. Version — vulnerable below 0.17.1
ollama --version

# 2. Is the API answering beyond loopback?
ss -lntp | grep 11434
# Bad: 0.0.0.0:11434 or [::]:11434
# Good: 127.0.0.1:11434

# 3. From another host on the LAN, can you list models?
curl -sS --max-time 3 http://<lan-ip>:11434/api/tags | head

# 4. Docker users — confirm the published port isn't world-bound
docker inspect ollama --format '{{json .NetworkSettings.Ports}}'

If step 3 returns JSON instead of timing out, you have the unauthenticated upstream surface that Bleeding Llama rides on.
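
If you have more than one box to check, the first two steps are easy to script. Here is a small Python sketch under a couple of assumptions: the version string contains a plain major.minor.patch number, and the address is the local-address column as `ss` prints it. The function names are mine.

```python
import ipaddress
import re

FIXED = (0, 17, 1)  # first patched release, per the advisory discussed in this post


def is_vulnerable(version_output):
    """True if the version string from `ollama --version` predates the fix."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError(f"no version found in {version_output!r}")
    return tuple(int(part) for part in match.groups()) < FIXED


def is_loopback_bind(local_addr):
    """True if an `ss` local address such as '127.0.0.1:11434' is loopback-only."""
    host = local_addr.rsplit(":", 1)[0].strip("[]")
    if host == "*":
        return False  # ss prints '*' for a wildcard (all-interfaces) bind
    return ipaddress.ip_address(host).is_loopback
```

For example, is_vulnerable("ollama version is 0.17.0") is True and is_loopback_bind("[::]:11434") is False. Pre-release suffixes are ignored here, so double-check release candidates by hand.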

What I’d do about it (defense-in-depth, not vibes)

Patch like rent is due. Echo (CVE record) lists v0.17.1 plus fix commit 88d57d0.

Stop advertising Ollama to the world. Bind to 127.0.0.1, isolate Docker hosts on internal networks, and require a VPN for remote users.

Put auth in front. Reverse proxy or API gateway with mTLS, OIDC, or at least network ACLs—assume the upstream daemon is naked.

Shrink blast radius. Separate workloads per team/project; never park prod secrets in the same env as experimental bots.

Audit exposure. ss -lntp, cloud SGs, home-router port forwards—if TCP 11434 answers from the coffee shop Wi‑Fi, you already made the choice attackers love.

I treat network-reachable Ollama like an anonymous write-capable datastore: if I wouldn’t leave Postgres open without auth, I won’t leave model ingestion endpoints open either—especially ones that accept blobs and spawn quantization work.
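
For the audit step, a plain TCP connect is enough to tell whether the port answers from wherever you are standing. This is a minimal sketch with a function name I made up; run it from a machine that should not be able to reach the daemon, and only against hosts you own.

```python
import socket


def port_answers(host, port=11434, timeout=2.0):
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, or unroutable all mean "not reachable from here".
        return False
```

A True from outside your trusted network is the finding; a False only tells you about the path you tested, not every path.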

Concrete configs I actually paste

systemd drop-in — pin loopback, even if a future package flips the default:

# /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_HOST=127.0.0.1:11434"
Environment="OLLAMA_ORIGINS=http://127.0.0.1,http://localhost"

Then reload systemd and restart the service:

sudo systemctl daemon-reload && sudo systemctl restart ollama

Docker — never publish to 0.0.0.0; bind explicitly:

# Bad:  -p 11434:11434       (binds 0.0.0.0)
# Good: -p 127.0.0.1:11434:11434
docker run -d --name ollama \
  -p 127.0.0.1:11434:11434 \
  -v ollama:/root/.ollama \
  ollama/ollama:0.17.1
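
To audit containers that are already running, you can feed the JSON from `docker inspect --format '{{json .NetworkSettings.Ports}}'` into a short check. A sketch, assuming the usual shape of that output (container port mapped to a list of HostIp/HostPort bindings); the function name is hypothetical.

```python
import json

WILDCARDS = {"", "0.0.0.0", "::"}  # host IPs that mean "every interface"


def world_bound_ports(ports_json):
    """List published ports whose host side is bound to all interfaces."""
    ports = json.loads(ports_json) or {}
    exposed = []
    for container_port, bindings in ports.items():
        for binding in bindings or []:  # unpublished ports map to null
            host_ip = binding.get("HostIp", "")
            if host_ip in WILDCARDS:
                exposed.append(
                    f"{container_port} -> {host_ip or '*'}:{binding.get('HostPort')}"
                )
    return exposed
```

An empty list is what you want to see for the Ollama container.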

Caddy in front — bearer token gate so even local users prove they meant it:

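ollama.lan {
  # Only requests carrying the expected bearer token match @auth
  @auth header Authorization "Bearer {env.OLLAMA_TOKEN}"
  handle @auth {
    reverse_proxy 127.0.0.1:11434
  }
  # Everything else is rejected before it reaches Ollama
  respond 401
}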
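
If you end up writing the gate yourself instead of leaning on a proxy, the check is tiny. This Python sketch shows the idea only; it is not how Caddy evaluates its matcher, and the function name is my own.

```python
import hmac


def is_authorized(authorization_header, expected_token):
    """True only for an exact 'Bearer <token>' match, compared in constant time."""
    if not authorization_header or not expected_token:
        return False  # a missing header or an unset token never passes
    expected = f"Bearer {expected_token}".encode()
    return hmac.compare_digest(authorization_header.encode(), expected)
```

Refusing an empty expected token matters: an unset environment variable should lock everyone out, not let "Bearer " through.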

What v0.17.1 actually changed

The fix (commit 88d57d0, PR #14406) is unsexy in the best way: bound the read by the file, not the header. Conversion paths now cross-check Elements() against the on-disk tensor size before allocating or copying, so a lying GGUF header can’t coax WriteTo() into walking off the buffer. No new auth, no new ACLs—just the missing length check that should have been there since the first unsafe block. Lesson for the rest of us: any time you hand unsafe a number that came from a file, the file is the attacker until proven otherwise.
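
The shape of that check fits in a few lines. This is my own Python illustration of "bound the read by the file, not the header", not the actual patch; the function and parameter names are hypothetical.

```python
from math import prod


def checked_tensor_view(buf, offset, shape, elem_size):
    """Return the tensor's bytes, refusing shapes the buffer cannot back."""
    if offset < 0 or elem_size <= 0 or any(dim < 0 for dim in shape):
        raise ValueError("negative offset, dimension, or element size")
    needed = prod(shape) * elem_size   # what the header claims
    available = len(buf) - offset      # what the file can actually supply
    if needed > available:
        raise ValueError(
            f"header claims {needed} bytes, file holds {max(available, 0)}"
        )
    return memoryview(buf)[offset : offset + needed]
```

Python slices would quietly clamp instead of over-reading, so the explicit comparison is the part that carries over to unsafe code, where nothing clamps for you.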

Separate issue: Windows auto-update flaws

Linux/macOS heap leak is not the same story as the Windows updater. Coordinated disclosure from Striga/CERT Polska tracks different CVEs (CVE-2026-42248, CVE-2026-42249) around unsigned update payloads and path traversal in the Windows update path—worth reading if your family runs Ollama desktops there (CERT Polska, Striga research). Patch timelines differ from Bleeding Llama; treat them as their own incident checklist.

Sources I’d send my folks

Local AI still buys you control—but control includes patching, binding interfaces, and proving your daemon isn’t shouting across the internet. I’d rather own those knobs than own a headline.
