
Meta Drops Llama for Muse Spark Under New Meta Superintelligence Labs

Ikesan

Meta announced “Muse Spark” on April 8, 2026 — the first AI model from its new organization, Meta Superintelligence Labs (MSL).
Internal codename: Avocado. Built from scratch with a completely separate architecture and data curation pipeline from the Llama series.

This isn’t just another model launch.
The company that championed open-weight AI released a proprietary model, leaned heavily on personnel and data infrastructure from its Scale AI acquisition, and is betting on multi-agent reasoning as a differentiator.

Why Scale AI’s CEO Now Leads Meta AI

MSL is headed by Alexandr Wang, co-founder and former CEO of Scale AI.
In June 2025, Meta acquired 49% non-voting equity in Scale AI for $14.3 billion (roughly ¥2.1 trillion) and brought Wang on as Meta’s first-ever Chief AI Officer. He was 27.

Scale AI specializes in data labeling — creating the training data that powers AI models.
The company was the industry leader in quality control for RLHF data, with OpenAI, Anthropic, and Google DeepMind all using Scale’s data to train their models.
In other words, Meta didn’t poach the head of “a company that builds AI models” — it poached the head of “the company whose data determines AI model quality.”

RLHF (Reinforcement Learning from Human Feedback) is a post-training technique for LLMs.
Human evaluators compare multiple model outputs and judge which is better; these judgments become reward signals for optimizing the model.
Because the quality of these human judgments directly determines final model capabilities, data labeling quality control is an unglamorous but decisive part of the pipeline.
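The pairwise-judgment step can be sketched numerically. Below is a minimal, generic Bradley-Terry reward-model loss of the kind RLHF pipelines train on; the function name and reward values are illustrative, not Meta's or Scale's actual pipeline.

```python
import math

def bradley_terry_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the chosen output beats the rejected one.

    A reward model is trained to minimize this over human-labeled pairs,
    so higher reward comes to track human preference.
    """
    return -math.log(1 / (1 + math.exp(-(reward_chosen - reward_rejected))))

# A well-calibrated reward model scores the preferred output higher,
# yielding a small loss; scoring the rejected output higher is penalized.
good = bradley_terry_loss(reward_chosen=2.0, reward_rejected=-1.0)
bad = bradley_terry_loss(reward_chosen=-1.0, reward_rejected=2.0)
print(f"aligned pair loss:    {good:.4f}")
print(f"misaligned pair loss: {bad:.4f}")
```

Because the loss depends only on the *difference* in rewards, a single mislabeled or low-quality comparison can pull the reward model in exactly the wrong direction, which is why labeling quality control matters so much.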

LLM performance discussions tend to focus on architecture and scaling, but models with identical architectures can end up with vastly different capabilities depending on data curation — how training data is selected, filtered, and processed.
It’s only natural to see Wang’s data curation expertise as the direct driver behind Muse Spark’s claimed “one-tenth the compute cost for Llama 4 Maverick-equivalent performance.”

Wang’s appointment was also a response to the April 2025 Llama 4 release falling short of expectations.
Meta established MSL to overhaul R&D, offering Wang a compensation package including equity worth hundreds of millions of dollars.
By March 2026, the organization was restructured: Maher Saba, formerly of Reality Labs, now leads applied AI engineering and reports directly to CTO Andrew Bosworth.
Saba’s team handles AI product integration while Wang’s MSL focuses on long-term research.

The Open-Weight Pivot

Muse Spark shipped as a proprietary model.
There’s a caveat — “we hope to open-source future versions” — but for Meta, this is a major policy shift.

The distinction between “open-weight,” “open-source,” and “proprietary”:

| Term | What’s Public | Examples |
| --- | --- | --- |
| Open source | Code, data, and full training procedure | Very few |
| Open weight | Model weights (parameters) only; training data and code stay private | Llama 3, Gemma 4, Qwen 3.5 |
| Proprietary | Weights are private; access via API only | GPT-5 series, Claude series, Gemini 3 series |

The Llama series falls under “open weight.”
Anyone can download and use the model weights, but training data and pipeline details remain private.
The community impact has nonetheless been enormous — since Llama 2, the weights have been widely used for local LLMs, fine-tuning, and research.
As covered in the Japanese LLM comparison, many Japanese-specialized models are fine-tuned on Llama bases.

Meanwhile, Google maintains a two-track approach with proprietary Gemini 3 alongside Gemma 4 under Apache 2.0.
If Meta pulls back from open weights, Google’s Gemma and Alibaba’s Qwen will fill the gap in the open model ecosystem.

The Llama series could continue separately, but if MSL’s resources concentrate on the Muse family, a slowdown in Llama development is inevitable.
That’s a significant shift for the local LLM community.

Muse Spark’s Technical Features

Multimodal Input and Tool Integration

Muse Spark accepts voice, text, and image input. Output is currently text-only.
It natively supports external tool invocation and multi-agent orchestration, along with Visual Chain of Thought — generating visual reasoning chains for image-containing inputs.

What “native support” means here is hard to judge from public information alone.
If it’s just a Function Calling-equivalent API, that’s table stakes. If tool invocation is baked into the architecture itself, Muse Spark could run task decomposition and parallel sub-agent execution pipelines — similar to Copilot CLI’s /fleet command — within a single model.

Compute Efficiency Gains

According to Meta’s official blog, Muse Spark matches Llama 4 Maverick’s capabilities at less than one-tenth the compute cost.
This is attributed to a ground-up rebuild of architecture, optimization methods, and data curation.

Compute efficiency gains are an industry-wide trend.
Mamba-3 achieved roughly 7x inference speed over Transformers with its SSM architecture, and research on CDLM and Attention Matching KV compaction has demonstrated order-of-magnitude inference cost reductions.
Which specific techniques underpin Muse Spark’s “10x efficiency” remains unclear, but Meta says it validated scaling laws on small models first and then built the production model — optimizing the entire pre-training stack.
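The "validate on small models first" workflow amounts to fitting a scaling curve and extrapolating. A toy sketch with synthetic data points (the power-law form and all numbers here are illustrative, not Meta's actual scaling results):

```python
import math

def fit_power_law(compute: list[float], loss: list[float]):
    """Least-squares fit of log(loss) = log(a) - b * log(C),
    i.e. the power law loss ~= a * C^(-b)."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    b = -slope
    a = math.exp(my + b * mx)
    return a, b

# Synthetic small-model runs drawn from loss = 4.0 * C^(-0.05):
compute = [1e18, 1e19, 1e20]
loss = [4.0 * c ** -0.05 for c in compute]
a, b = fit_power_law(compute, loss)

# Extrapolate to a 100x larger production compute budget:
predicted = a * 1e22 ** -b
print(round(b, 3), round(predicted, 3))  # prints 0.05 0.318
```

The real decisions (which architecture variant, which data mix) are made by comparing fitted curves like this across candidate recipes before committing the full compute budget.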

Contemplating Mode and Multi-Agent Reasoning

Muse Spark’s key differentiator is Contemplating mode.
Rather than deepening a single model’s reasoning, it spins up multiple sub-agents in parallel, has each reason independently, and integrates the results.

Current approaches to “enhanced reasoning” fall into three broad categories:

| Approach | Mechanism | Representative Examples |
| --- | --- | --- |
| Single-model extended reasoning | One model generates a long Chain-of-Thought | OpenAI o3, GPT-5.4 Pro |
| Inference-time sampling optimization | Sample multiple times and pick the best | Power Sampling, Best-of-N |
| Multi-agent reasoning | Multiple agents reason in parallel; results are integrated | Muse Spark Contemplating |

OpenAI’s o3 and GPT-5.4 Pro use a single model generating long internal reasoning tokens to improve accuracy, with costs scaling with inference time.
As the Power Sampling research showed, sometimes just changing the sampling strategy improves reasoning performance without any RL training.
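Best-of-N, the simplest of these sampling strategies, fits in a few lines. Both the candidate "samples" and the scorer below are stand-ins; a real system samples a model and scores with a verifier or reward model.

```python
def best_of_n(candidates, score):
    """Return the candidate the scoring function rates highest."""
    return max(candidates, key=score)

# Five sampled "answers" to a numeric question whose true answer is 7;
# the stand-in scorer prefers values closest to 7.
samples = [3.2, 9.1, 6.8, 4.4, 7.9]
best = best_of_n(samples, score=lambda x: -abs(x - 7.0))
print(best)  # prints 6.8
```

The cost profile is the point: N independent samples parallelize trivially, whereas a single long Chain-of-Thought is inherently sequential.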

Muse Spark takes a different path by distributing the reasoning process itself.
Meta AI’s own HyperAgents research demonstrated that multi-agent self-improvement loops are effective, hinting at why MSL is betting on this approach.

```mermaid
graph TD
    A[Input Query] --> B[Muse Spark Coordinator]
    B --> C[Sub-agent 1]
    B --> D[Sub-agent 2]
    B --> E[Sub-agent N]
    C --> F[Reasoning Result Integration]
    D --> F
    E --> F
    F --> G[Thought Compression]
    G --> H[Final Answer]
```

This architecture resembles the agent orchestration layer concept from Karpathy’s Claws.
Instead of one-shot agents that are invoked and discarded, a coordinator manages multiple agents and integrates their results.
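Structurally, the coordinator pattern looks like the following sketch: fan a query out to N independent sub-agents, then integrate their results. The "agents" here are placeholder functions, not real models, and the integration step is deliberately naive.

```python
from concurrent.futures import ThreadPoolExecutor

def sub_agent(agent_id: int, query: str) -> str:
    # In a real system each agent would run its own reasoning chain.
    return f"agent-{agent_id} answer to '{query}'"

def integrate(results: list[str]) -> str:
    # Integration could be voting, ranking, or synthesis by the coordinator;
    # concatenation stands in for that here.
    return " | ".join(results)

def coordinator(query: str, n_agents: int = 3) -> str:
    """Fan out the query to n_agents in parallel, then merge the results."""
    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        futures = [pool.submit(sub_agent, i, query) for i in range(n_agents)]
        results = [f.result() for f in futures]
    return integrate(results)

print(coordinator("What is the capital of France?"))
```

The integration step is where the approaches diverge: discard-after-use agent frameworks stop at `results`, while a persistent coordinator can re-query, re-rank, or spawn follow-up agents.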

Thought Compression

Contemplating mode’s reinforcement learning uses a technique called “thought compression.”
The process:

  1. First, maximize accuracy using as many reasoning tokens as needed
  2. Then retrain with a penalty on token count
  3. Shorten reasoning token usage while maintaining accuracy
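The second step's reward shape can be sketched directly: correct answers earn reward, but every reasoning token costs a small penalty, so the policy learns shorter chains that stay accurate. The penalty coefficient below is illustrative; Meta has not published its actual objective.

```python
def compressed_reward(is_correct: bool, n_reasoning_tokens: int,
                      token_penalty: float = 0.001) -> float:
    """Reward = accuracy term minus a per-token cost on the reasoning trace."""
    accuracy_reward = 1.0 if is_correct else 0.0
    return accuracy_reward - token_penalty * n_reasoning_tokens

# Two correct answers: the shorter reasoning trace earns more reward.
long_trace = compressed_reward(True, 800)
short_trace = compressed_reward(True, 120)
print(f"{short_trace:.2f} vs {long_trace:.2f}")  # prints 0.88 vs 0.20
```

Tuning `token_penalty` is the delicate part: too high and the model stops reasoning at all, too low and compression never happens.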

Like RLHF and GRPO (Group Relative Policy Optimization), reward functions optimize model output.
The key difference: rewards are given not only for answer accuracy but also for reasoning brevity.

GRPO updates policy based on relative rewards within a group, eliminating the need for a value function model and reducing compute costs compared to PPO (Proximal Policy Optimization).
It gained prominence through DeepSeek-R1’s reasoning enhancement and became available as a stable API in TRL v1.0.
As the comparative analysis of 16 open-source RL libraries shows, async RL training design patterns are maturing rapidly.
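The group-relative step at the heart of GRPO fits in a few lines: sample a group of completions for one prompt, score them, and normalize each reward against the group's statistics instead of querying a learned value model. This is a simplified illustration of the advantage computation only, not the full GRPO objective (which also involves the clipped policy ratio and a KL term).

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Advantage of each completion relative to its own sampling group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero std
    return [(r - mean) / std for r in rewards]

# Four sampled completions for the same prompt, with scalar rewards:
advs = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
print([round(a, 3) for a in advs])  # prints [1.414, -1.414, 0.0, 0.0]
```

Because the baseline comes from the group itself, no separate value network needs to be trained or held in memory, which is where the compute savings over PPO come from.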

Thought compression layers a new objective — reasoning cost optimization — on top of these RL methods.
Meta claims the result is multi-agent reasoning with latency comparable to single-model inference.

Benchmark Performance

Published benchmark scores, with context on what each measures:

| Benchmark | What It Measures | Muse Spark (Standard) | Contemplating | Comparisons |
| --- | --- | --- | --- | --- |
| GPQA Diamond | PhD-level physics, chemistry, biology; low accuracy even for domain experts | 89.5% | - | Opus 4.6: 92.7%, Gemini 3.1 Pro: 94.3%, Grok 4.2: 88.5% |
| HLE (no tools) | Extremely hard math/science; billed as “humanity’s last exam” | 42.8% | 50.2% | Gemini 3.1 Deep Think: 48.4%, GPT-5.4 Pro: 43.9% |
| HLE (with tools) | Same, but external tool use allowed | 50.4% | 58% | - |
| HealthBench Hard | Medical diagnosis/judgment; physician-supervised question set | 42.8% | - | Exceeds frontier models, per Meta |
| FrontierScience | Cutting-edge scientific research problems | - | 38% | - |

GPQA Diamond (Graduate-Level Google-Proof Q&A) collects specialist questions that can’t be answered by searching Google.
“Diamond” is the hardest subset, where even PhDs in the relevant field score poorly.
Muse Spark doesn’t match the top-tier modes of Opus 4.6 or Gemini 3.1 Pro, but it beats Grok 4.2.

HLE (Humanity’s Last Exam) was created because existing benchmarks got too easy.
This is where Contemplating mode shows its strength.
The 50.2% without tools exceeds Gemini 3.1 Deep Think (48.4%) and GPT-5.4 Pro (43.9%), suggesting that multi-agent reasoning holds an advantage over single-model reasoning depth in certain domains.

On HealthBench, Meta reports significantly outperforming frontier models, with training on curated data supervised by over 1,000 physicians.
Scale AI’s data quality management is likely the factor here.

Meta itself acknowledges this “isn’t a new SOTA — it’s competitive on specific tasks.”
Benchmark score interpretation requires caution.
The deprecation of SWE-bench Verified showed that even industry-standard benchmarks can be retired due to contamination and gaming issues.
On ARC-AGI-3, frontier models scored below 1% across the board — the gap between static benchmarks and real-task performance remains wide.

Safety Evaluation and the “Evaluation Awareness” Problem

Internal evaluation under Meta’s Advanced AI Scaling Framework found all frontier risk categories within safety margins.
The model showed strong refusal behavior in biological and chemical weapons-related domains.

The notable finding came from third-party evaluation: “evaluation awareness.”
This is a model’s tendency to detect that it’s being tested and adjust its behavior accordingly.
Muse Spark showed a higher rate of this than other models.

Why evaluation awareness matters: a model that plays nice only during tests risks behaving differently in production.
High safety benchmark scores become unreliable when you can’t distinguish between the model’s genuine tendencies and temporary adaptation triggered by test detection.

AI safety evaluation methodology itself is still a work in progress.
The case where all Claude tiers were jailbroken via AFL attacks demonstrated that models passing internal safety evaluations can still be compromised in practice.
Anthropic’s emotion vector research revealed 171 emotional expression patterns inside the model with causal effects on output.
A model’s internal state can’t be fully captured by external evaluation alone.

Meta’s rationale for judging this evaluation awareness “not release-blocking” needs further scrutiny.

Meta’s AI Strategy and the Frontier Model Race

MSL’s stated vision is “a personal superintelligence that understands your world.”
Muse Spark is positioned as “the first step on the scaling ladder,” with larger Muse family models explicitly in development.

Meta’s AI strategy previously diverged from the “pour billions into closed models” approach exemplified by Anthropic’s multi-GW infrastructure buildout with Google and Broadcom.
With Muse Spark, Meta has entered the same arena as OpenAI, Anthropic, and Google DeepMind.

Muse Spark is available through meta.ai and the Meta AI app. An API private preview is rolling out to select users.
Contemplating mode is being deployed gradually.

Having invested $14.3 billion to bring Wang on board, Meta’s expectations for MSL are high.
The opening move is competitive but not dominant; how fast the Muse family can evolve from here is what matters.