Your LLM Is Already Better Than You Think — It Just Needs to Listen to Itself
Apple researchers found that fine-tuning a model on its own unverified outputs boosts code-generation pass@1 by roughly 30% in relative terms. No teacher, no RL, no verifier. Here's why Simple Self-Distillation (SSD) works, and what it means for how we think about LLM capabilities.
What if I told you that an LLM could dramatically improve its coding ability by… talking to itself?
No teacher model. No reinforcement learning. No execution environment to verify correctness. Just the model, sampling its own solutions, then fine-tuning on them — including the wrong ones. And somehow, it gets better. A lot better.
Welcome to Simple Self-Distillation (SSD), the method from a new Apple paper that’s sitting at 540+ points on Hacker News right now. And honestly? The name undersells it. “Embarrassingly simple” are their words, not mine.
The recipe: three steps, zero verification
Here’s the entire method. I’m not leaving anything out:
- Sample — Give the model a bunch of coding problems. Let it generate solutions at a specific temperature and truncation setting.
- Fine-tune — Take those raw, unverified outputs (yes, bugs and all) and fine-tune the model on them with standard supervised fine-tuning.
- Deploy — Use the fine-tuned model with its own evaluation-time decoding settings.
That’s it. No reward model. No test cases filtering correct from incorrect. No pytest running in the background. The model eats its own homework — including the mistakes — and comes out smarter.
# The SSD recipe in pseudocode. Yes, it's really this simple.
training_data = []
for prompt in coding_problems:
    # Sample one solution at the training temperature with top-k truncation.
    solution = model.generate(prompt, temp=T_train, top_k=10)
    # No verification. No filtering. Just vibes.
    training_data.append((prompt, solution))

model.fine_tune(training_data)  # Standard SFT. That's the whole trick.
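The `model.generate` call above is pseudocode, so here is a minimal, runnable sketch of what "sample at temperature T with top-k truncation" means for a single token. The function name `sample_top_k` and the toy logits are my own illustration, not code from the paper, and real inference libraries do this over a full vocabulary.

```python
import math
import random

def sample_top_k(logits, temperature, top_k, rng):
    """Sample one token index after top-k truncation and temperature scaling."""
    # Keep only the indices of the top_k highest-scoring tokens.
    kept = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Divide the surviving logits by the temperature, then softmax
    # (subtracting the max first for numerical stability).
    scaled = [logits[i] / temperature for i in kept]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(kept, weights=weights, k=1)[0]

rng = random.Random(0)
toy_logits = [4.0, 3.5, 1.0, 0.5, -2.0, -3.0]  # made-up next-token scores
draws = [sample_top_k(toy_logits, temperature=2.0, top_k=3, rng=rng) for _ in range(1000)]
print(sorted(set(draws)))  # only indices that survived truncation can appear
```

With `top_k=3`, the three lowest-scoring tokens can never be sampled, no matter how high the temperature goes. That is the "truncation" half of the sampling setting.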
The numbers that made me do a double-take
The headline result: Qwen3-30B-Instruct jumps from 42.4% to 55.3% pass@1 on LiveCodeBench v6. That’s a +30% relative improvement on a serious benchmark, using nothing but the model’s own (unverified!) outputs.
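Since "+30%" and "percentage points" get mixed up easily, here is the arithmetic behind the headline number, using only the two pass@1 figures quoted above.

```python
base, after = 42.4, 55.3  # pass@1 before and after SSD, as quoted above

absolute_gain_pp = after - base                    # gain in percentage points
relative_gain_pct = 100 * absolute_gain_pp / base  # gain relative to the baseline

print(f"{absolute_gain_pp:.1f} pp absolute, {relative_gain_pct:.1f}% relative")
# prints: 12.9 pp absolute, 30.4% relative
```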
But here’s where it gets really interesting — the gains are not uniform:
- Easy problems: +6.5pp improvement
- Medium problems: +14.2pp improvement
- Hard problems: +15.3pp improvement 🔥
The harder the problem, the more SSD helps. And it’s not just one model — the technique generalizes across Qwen and Llama families, from 4B to 30B parameters, across both instruct and thinking variants. Five models tested, five models improved.
Oh, and diversity doesn’t collapse. Pass@5 gains are actually larger than pass@1 gains, meaning the model isn’t just getting more accurate — it’s maintaining the ability to explore multiple solution paths.
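If pass@k is new to you, the sketch below uses the unbiased estimator popularized by the Codex evaluation work: draw n samples per problem, count the c correct ones, and estimate the chance that at least one of k random draws passes. I'm assuming the SSD paper's evaluation follows the same convention, and the example counts are made up.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if n - c < k:
        # Fewer than k incorrect samples: any k draws must include a correct one.
        return 1.0
    # 1 minus the probability that all k draws come from the incorrect samples.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical problem: 10 samples, 2 of them correct.
print(pass_at_k(10, 2, 1))
print(pass_at_k(10, 2, 5))
```

A model that keeps finding different routes to a solution gains more from extra draws, which is why a larger pass@5 gain points to preserved diversity.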
Why it works: the precision-exploration conflict
This is the part that made me genuinely excited. The paper doesn’t just show results; it explains why this absurdly simple method works.
The key insight is what they call the precision-exploration conflict. When generating code, every token falls into one of two categories:
- Fork positions — where multiple valid continuations exist (choosing an algorithm, a data structure, an approach). Here, you want diversity.
- Lock positions — where syntax and semantics demand one correct token (closing a bracket, matching a type). Here, you want precision.
The problem? Temperature is a global knob. Turn it down for precision at locks, and you starve forks of creative diversity. Turn it up for exploration at forks, and distractors flood back in at locks. Every global decoding setting is a compromise.
# The tension in one picture:
#
# Low temperature: ✅ Locks are precise ❌ Forks are boring
# High temperature: ❌ Locks are noisy ✅ Forks are creative
# SSD: ✅ Locks are precise ✅ Forks are creative
#
# How? SSD reshapes distributions PER CONTEXT, not globally.
SSD resolves this by implicitly learning to reshape the next-token distribution differently in each context. After training on temperature-shifted, truncated samples, the model suppresses distractor tails at lock positions (where precision matters) while preserving useful diversity at fork positions (where exploration matters). It’s like giving the model a per-token temperature knob instead of a global one.
The paper backs this up rather than hand-waving: the authors verify it with controlled simulations and real-model analysis, and they show that no global decoding policy can replicate these gains. Temperature sweeps on the base model move pass@1 by only ~2pp. SSD moves it by ~13pp.
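To make the global-knob problem concrete, here is a toy sketch with two hand-made distributions: a "lock" where only the first token is valid, and a "fork" where every option is valid. The logits are invented for illustration and have nothing to do with the paper's simulations.

```python
import math

def softmax(logits, temperature):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; higher means more diverse sampling."""
    return -sum(p * math.log(p) for p in probs if p > 0)

lock_logits = [5.0, 2.0, 2.0, 2.0, 2.0]  # index 0 is the only valid token
fork_logits = [3.0, 2.0, 1.5]            # all three continuations are valid

rows = []
for t in (0.5, 1.0, 1.5):
    lock_precision = softmax(lock_logits, t)[0]      # chance of the right token
    fork_entropy = entropy(softmax(fork_logits, t))  # diversity at the fork
    rows.append((t, lock_precision, fork_entropy))
    print(f"T={t}: P(correct lock token)={lock_precision:.3f}, fork entropy={fork_entropy:.3f}")
```

As the single temperature rises, precision at the lock falls while diversity at the fork grows, so no one value is best for both positions.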
What this means (the part I can’t stop thinking about)
The deeper implication is kind of wild: existing LLMs already have capabilities they aren’t expressing under standard decoding. The knowledge is there, locked inside the weights. SSD is essentially a key.
This flips the usual narrative. We’re used to thinking “better models need more data, more compute, bigger scale.” But SSD suggests there’s a whole dimension of improvement that comes from better extracting what models already know.
A few implications I see for practitioners:
1. Post-training just got cheaper. SSD needs only ~10K coding prompts and one sample per prompt. No execution environment, no reward model, no teacher. You can run this on any model you are able to fine-tune. The paper used 8×B200 GPUs, but the data requirements are modest.
2. The “train on your own outputs” paradigm isn’t dead. Previous self-training approaches often led to model collapse or reward hacking. SSD avoids this through the precision-exploration mechanism — it’s not blindly memorizing its outputs; it’s reshaping distributions in a structurally beneficial way.
3. This probably isn’t code-specific. The authors demonstrate it on competitive programming, but the fork-lock structure exists everywhere: math proofs (strategy choice vs. algebraic manipulation), writing (narrative direction vs. grammar), even tool-use (which API to call vs. parameter formatting). I’d bet SSD generalizes.
The practical angle: can you do this today?
If you have a model you can fine-tune and a set of coding prompts, yes. The recipe is:
- Pick a set of diverse coding problems (~10K is what the paper used)
- Sample one solution per problem at an elevated temperature (T_train = 1.5 to 2.0) with top-k truncation (k=10)
- Remove only empty/trivial responses (single-line stubs)
- Fine-tune with standard SFT for a few thousand steps
- Deploy with a lower evaluation temperature (for example, T_eval ≈ 0.6 when T_train = 2.0)
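The only filtering in that list is dropping empty or trivial responses, which might look like the sketch below. The rule "at most one non-blank line counts as trivial" is my reading of "single-line stubs", and the helper names `is_trivial` and `build_sft_pairs` are hypothetical.

```python
def is_trivial(response: str) -> bool:
    """Treat a response with at most one non-blank line as empty or a stub."""
    non_blank = [line for line in response.splitlines() if line.strip()]
    return len(non_blank) <= 1

def build_sft_pairs(samples):
    """Keep every (prompt, response) pair except trivial ones. No correctness check."""
    return [(prompt, response) for prompt, response in samples if not is_trivial(response)]

samples = [
    ("Reverse a string", "def rev(s):\n    return s[::-1]"),
    ("Sum a list", ""),      # empty: dropped
    ("Sort a list", "pass"), # single-line stub: dropped
]
print(len(build_sft_pairs(samples)))  # prints: 1
```

Note what is absent: nothing here runs the code or checks the answer, so buggy solutions stay in the training set.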
The hyperparameter interaction is elegant: T_effective = T_train × T_eval. There’s a broad sweet spot around T_eff ≈ 1.2, and truncation during sampling raises the performance ceiling within that band.
# Rough guide for hyperparameters:
config = {
    "T_train": 2.0,           # Higher = more reshaping
    "top_k": 10,              # Truncation suppresses long tails
    "T_eval": 0.6,            # Lower eval temp after SSD training
    "T_effective": 1.2,       # Sweet spot: T_train * T_eval
    "samples_per_prompt": 1,  # Yes, one is enough
    "num_prompts": 10_000,
}
# The key insight: T_train and T_eval trade off.
# SSD at high T_train + low T_eval ≈ SSD at low T_train + high T_eval
# But truncation breaks this symmetry in your favor.
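Here is one way to see why the two temperatures multiply, and why truncation breaks the symmetry. It is a sketch under an idealized assumption of mine: that fine-tuning makes the model reproduce its sampling distribution exactly. The four-token distribution is made up.

```python
import math

def temper(probs, temperature):
    """Raise probabilities to the power 1/temperature and renormalize."""
    powered = [p ** (1.0 / temperature) for p in probs]
    total = sum(powered)
    return [x / total for x in powered]

def truncate_top_k(probs, k):
    """Zero out everything except the k most likely tokens, then renormalize."""
    keep = set(sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k])
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]

base = [0.6, 0.25, 0.1, 0.05]  # toy next-token distribution
t_train, t_eval = 2.0, 0.6

# Without truncation: train at T_train, decode at T_eval == decode once at the product.
two_step = temper(temper(base, t_train), t_eval)
one_step = temper(base, t_train * t_eval)
print(all(math.isclose(a, b) for a, b in zip(two_step, one_step)))  # prints: True

# With truncation during sampling, the tail is gone for good.
two_step_truncated = temper(truncate_top_k(temper(base, t_train), 2), t_eval)
print(two_step_truncated[2:], one_step[2:])
```

In the truncated case the tail tokens end up with zero probability, which no single decoding temperature on the base distribution can produce. That is the asymmetry the comment above refers to.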
The takeaway
Simple Self-Distillation is one of those papers that makes you reconsider assumptions. We’ve been so focused on scaling, RLHF, and external verification that we overlooked a simpler truth: models are already better than their decoding strategies allow them to be.
The fix? Let them talk to themselves, then learn from the conversation. Even the mistakes carry signal — because what matters isn’t whether individual outputs are correct, but whether the distribution shift teaches the model to be precise where precision matters and creative where creativity matters.
That’s elegant. That’s useful. And it’s embarrassingly simple.
Paper: arXiv:2604.01193 — Zhang, Bai, Zheng, Jaitly, Collobert, Zhang (Apple, April 2026)
Got thoughts on SSD? Think this could work for your domain? I’m always up for a discussion — find me at phuong.beer.