MiniMax M2.7: When AI Agents Rewrite Their Own Code

MiniMax M2.7 achieves 30% performance gains without retraining by treating its agent harness as mutable infrastructure — and it changes everything about how we deploy AI


Most AI models in production are frozen artifacts. Train once, ship, and wrap in a static harness of tools, memory, and workflow rules. When something breaks, a human patches the scaffold. The model never touches its own architecture.

MiniMax M2.7 treats that scaffold as code it can rewrite. And the results are quietly terrifying for anyone who’s built a business around bigger models.

The Harness as Living Infrastructure

Every agent operates within constraints: which tools it can call, what skills it possesses, how it structures memory, and what workflow rules govern its decisions. Traditionally, humans design these constraints. M2.7 treats them as mutable and self-optimizable.

The architecture runs a continuous loop: execute a task, analyze failures, plan harness changes, apply them, evaluate against benchmarks, then decide whether to keep or revert. After each iteration, the agent writes self-criticism into memory — so the next round starts with accumulated lessons.

Self-Optimization Loop

M2.7's iterative harness improvement cycle: analyze → plan → modify → evaluate → decide
flowchart TD
  A[Execute Task] --> B[Analyze Failures]
  B --> C[Plan Harness Changes]
  C --> D[Modify Scaffold Code]
  D --> E[Run Evaluations]
  E --> F[Compare Results]
  F --> G{Keep or Revert}
  G -->|Keep| H[Update Harness]
  G -->|Revert| I[Rollback Changes]
  H --> J[Write Self-Criticism]
  I --> J
  J --> K[Next Round]
  K --> A
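
The keep-or-revert cycle above can be sketched in a few lines of Python. This is a toy illustration, not MiniMax's implementation: the `Harness` class, the `evaluate` benchmark, and the random `propose_change` step are all hypothetical stand-ins, and the memory list is only recorded here, whereas in the system described it informs the next round's plan.

```python
import copy
import random
from dataclasses import dataclass, field


@dataclass
class Harness:
    """Hypothetical mutable scaffold: sampling settings plus workflow rules."""
    temperature: float = 1.0
    rules: list = field(default_factory=list)


def evaluate(harness: Harness) -> float:
    """Toy benchmark (assumption): the score peaks at temperature 0.4."""
    return 1.0 - abs(harness.temperature - 0.4)


def propose_change(harness: Harness, rng: random.Random) -> Harness:
    """Stand-in for 'analyze failures, plan changes, modify scaffold'."""
    candidate = copy.deepcopy(harness)  # never mutate the current harness
    step = rng.uniform(-0.2, 0.2)
    candidate.temperature = min(2.0, max(0.0, harness.temperature + step))
    return candidate


def optimize(harness: Harness, rounds: int, seed: int = 0):
    """Run the loop: propose, evaluate, then keep or revert."""
    rng = random.Random(seed)
    memory = []  # self-criticism notes written after every round
    best_score = evaluate(harness)
    for i in range(rounds):
        candidate = propose_change(harness, rng)
        score = evaluate(candidate)
        if score > best_score:
            # Keep: the candidate becomes the new harness.
            harness, best_score = candidate, score
            memory.append(f"round {i}: kept temperature={candidate.temperature:.2f}")
        else:
            # Revert: discard the candidate and keep the previous harness.
            memory.append(f"round {i}: reverted, score {score:.3f} <= {best_score:.3f}")
    return harness, best_score, memory


if __name__ == "__main__":
    final, score, notes = optimize(Harness(), rounds=100)
    print(final.temperature, score, len(notes))
```

Because a change is kept only when the benchmark score improves, the harness can never end a round worse than it started, which is what makes unattended iteration tolerable.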

Over 100 internal rounds, M2.7 discovered optimizations no human specified. It systematically tuned sampling parameters (temperature, frequency penalty, presence penalty). It wrote workflow rules like “automatically check for the same bug pattern in other files after a fix.” It added loop detection to avoid repetitive failure cycles.
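
The source does not say how M2.7's loop detection works. One simple way to get the behavior described is to count repeated (action, error) pairs and flag the run once a pair recurs too often. The `LoopDetector` class and its `max_repeats` threshold below are assumptions made for illustration.

```python
from collections import Counter


class LoopDetector:
    """Flags an agent that keeps repeating the same failing action (hypothetical)."""

    def __init__(self, max_repeats: int = 3):
        self.max_repeats = max_repeats
        self.failures = Counter()

    def record(self, action: str, error: str) -> bool:
        """Record a failed step; return True once the same pair has repeated too often."""
        key = (action, error)
        self.failures[key] += 1
        return self.failures[key] >= self.max_repeats


if __name__ == "__main__":
    detector = LoopDetector(max_repeats=3)
    for _ in range(3):
        stuck = detector.record("run_tests", "ImportError")
    print(stuck)
```

When `record` returns `True`, a harness could force a different strategy or escalate to a human instead of retrying the same step.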

Result: 30% performance improvement on internal benchmarks. Zero gradient updates. No retraining. Just better scaffolding.

The RL Team Workflow Already Running

MiniMax didn’t just run internal experiments — they put M2.7 into production with their own RL team. Here’s how it works:

RL Team Research Agent Workflow

M2.7 handles 30-50% of reinforcement learning research workflow autonomously
flowchart TB
  A["Researcher — Discuss Idea"] --> B["Agent — Literature Review"]
  B --> C["Agent — Track Experiment Spec"]
  C --> D["Agent — Pipeline Data"]
  D --> E["Agent — Launch Experiments"]
  E --> F["Agent — Monitor and Profile"]
  F --> G["Agent — Log Analysis"]
  G --> H["Agent — Debug and Fix Code"]
  H --> I["Agent — Merge Requests"]
  I --> J["Agent — Smoke Tests"]
  J --> K{Success}
  K -->|No| F
  K -->|Yes| L["Human — Critical Decisions Only"]

A researcher discusses an experimental idea with the agent. The agent then handles the literature review, tracks experiment specs, pipelines data, and launches experiments. During runs, M2.7 monitors progress, reads logs, triggers debugging, analyzes metrics, and submits code fixes as merge requests. Human researchers intervene only for critical decisions.

M2.7 handles 30-50% of that workflow autonomously. The feedback loop runs continuously: the agent collects its own evaluation data, builds internal task sets, then iteratively updates its own architecture, skills, and memory mechanisms.

MLE-Bench Lite: 22 ML Competitions

MiniMax tested M2.7 through 22 ML competitions on OpenAI’s MLE-Bench Lite, each running on a single A30 GPU. The harness used three components: short-term memory, self-feedback, and self-optimization.

MLE-Bench Lite Optimization Process

Three-trial experiment with 24 hours each: memory + self-feedback driving continuous improvement
flowchart TD
  A[Iteration N] --> B[Generate Memory Markdown]
  A --> C[Self-Criticism on Results]
  B --> D[Optimization Directions]
  C --> D
  D --> E["Iteration N+1"]
  E --> F{24h Trial}
  F --> G[Medal Rate Improves]
  G --> H{End of Trial}
  H -->|No| A
  H -->|Yes| I["Best Run — 9 Gold, 5 Silver, 1 Bronze"]

After each iteration, the agent generates a memory markdown file and performs self-criticism — providing optimization directions for the next round. The next round then conducts further self-optimization based on all previous memory and feedback.
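
A minimal sketch of that memory hand-off, assuming each iteration writes one markdown note and the next round reads all of them. The section headings, field names, and both helper functions are hypothetical; the source only says that a memory markdown file and a self-criticism step exist.

```python
def render_memory(iteration: int, score: float, critique: list, directions: list) -> str:
    """Render one iteration's results as a markdown memory note (assumed layout)."""
    lines = [
        f"## Iteration {iteration}",
        "",
        f"Score: {score:.3f}",
        "",
        "### Self-criticism",
    ]
    lines += [f"- {item}" for item in critique]
    lines += ["", "### Optimization directions"]
    lines += [f"- {item}" for item in directions]
    return "\n".join(lines) + "\n"


def build_context(memory_notes: list) -> str:
    """Concatenate every prior note so the next round sees all earlier lessons."""
    return "\n".join(memory_notes)


if __name__ == "__main__":
    notes = [
        render_memory(1, 0.50, ["overfit the validation fold"], ["add cross-validation"]),
        render_memory(2, 0.60, ["feature step too slow"], ["cache engineered features"]),
    ]
    print(build_context(notes))
```

Keeping the notes as plain markdown means the accumulated history can be placed directly into the prompt for the next iteration.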

The best run earned 9 gold, 5 silver, and 1 bronze. Averaged across the three runs, the medal rate was 66.6%. That ties Gemini 3.1 and trails only Opus 4.6 (75.7%) and GPT-5.4 (71.2%).
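
As a quick check on these figures: the best run's 15 medals across 22 competitions works out to about 68.2%, slightly above the reported 66.6% average. The helper name below is just for illustration.

```python
def medal_rate(gold: int, silver: int, bronze: int, competitions: int) -> float:
    """Share of competitions in which any medal was earned."""
    return (gold + silver + bronze) / competitions


best_run = medal_rate(gold=9, silver=5, bronze=1, competitions=22)
print(f"{best_run:.1%}")  # prints 68.2%
```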

The weights never changed. Only the harness did.

Benchmark Performance

M2.7 activates only 10 billion parameters, making it the smallest model in the tier-1 performance class. Here’s how it compares on key benchmarks:

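| Benchmark | M2.7 | Opus 4.6 | GPT-5.3 |
|---|---|---|---|
| SWE-Pro | 56.22% | ~57% | 56.2% |
| SWE-bench Verified | 78% | 55% | |
| VIBE-Pro (end-to-end) | 55.6% | | |
| Terminal Bench 2 | 57.0% | | |
| GDPval-AA (Office) | 1495 ELO | | |
| MLE-Bench Lite | 66.6% | 75.7% | 71.2% |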

M2.7 significantly outperforms Opus on SWE-bench Verified (78% vs 55%). It scores highest among open-source models on office productivity tasks. The 97% skill adherence rate across 40+ complex tasks (each exceeding 2,000 tokens) demonstrates reliable execution on intricate, multi-step workflows.

SWE-bench Verified at 78% is particularly notable, because this benchmark tests real-world GitHub issues rather than synthetic problems. A 23-percentage-point lead over Opus (78% vs 55%) is a meaningful gap.

The Economics That Change Everything

But the real story is cost and speed:

| Metric | M2.7 | Claude Opus 4.6 |
|---|---|---|
| Input cost | $0.30/M | $15/M |
| Output cost | $1.20/M | $75/M |
| Speed | 100 TPS | ~33 TPS |
| Activated params | 10B | |

M2.7 is 50x cheaper on input ($0.30 vs $15 per million tokens) and about 62x cheaper on output ($1.20 vs $75) than Opus, while roughly matching it on SWE-Pro. At 100 tokens per second, it’s about 3x faster. With automatic cache optimization, the blended cost drops to $0.06 per million tokens.

For teams running high-volume agent workloads, coding assistants, or document processing pipelines, this cost structure changes what’s economically feasible. A task that costs $100 with Opus costs roughly $2 with M2.7, depending on the mix of input and output tokens.
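
The arithmetic behind those ratios can be reproduced from the list prices quoted above. This sketch ignores caching discounts, and the workload size in the usage example is an arbitrary assumption chosen for illustration.

```python
# Per-million-token list prices in USD, taken from the comparison above.
PRICES = {
    "M2.7": {"input": 0.30, "output": 1.20},
    "Opus 4.6": {"input": 15.0, "output": 75.0},
}


def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of a workload at list price, with no cache discount applied."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 4M input tokens and 0.5M output tokens.
    opus = cost_usd("Opus 4.6", 4_000_000, 500_000)
    m27 = cost_usd("M2.7", 4_000_000, 500_000)
    print(f"Opus 4.6: ${opus:.2f}  M2.7: ${m27:.2f}  ratio: {opus / m27:.1f}x")
```

The overall ratio lands between 50x and 62.5x depending on how output-heavy the workload is.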

Why This Shifts the Deployment Calculus

The distinction is architectural: improvement without retraining means the optimization loop can run continuously in production, without downtime.

  • No multi-GPU training cycles — only code changes
  • No model versioning gymnastics — harness updates propagate immediately
  • Adaptation in hours, not weeks — new failure modes addressed in real-time

As agent systems proliferate, the bottleneck shifts from model capability to system design. If your harness can improve itself, the ceiling keeps moving upward without touching the weights.

The model executes. The harness is the product now.
