AFM 3: 20B sparse on-device, Cloud Pro on Google Cloud, five model tiers

Apple released the third generation of Apple Foundation Models (AFM) on June 8, 2026.
This is not just “a slightly better on-device small model.” The AFM 3 lineup includes a 20B-parameter sparse model that runs on-device, an image model on Private Cloud Compute (PCC), and a top-tier model that runs on NVIDIA GPUs on Google Cloud.

The naming gets a bit tangled.
There are five variants (AFM 3 Core, AFM 3 Core Advanced, AFM 3 Cloud, ADM 3 Cloud (Image), and AFM 3 Cloud Pro), each with a different execution target and role under Apple Intelligence.

Five models, split by execution target

According to Apple, AFM 3 is a model family built in collaboration with Google.
Apple Security Research’s post says they used “the technology behind Google’s Gemini family” to build the next generation of Apple Foundation Models.

Here is the published breakdown:

Model	Runs on	Role
AFM 3 Core	On-device	Next-gen 3B dense model
AFM 3 Core Advanced	On-device	20B sparse model. Voice, transcription, multimodal
AFM 3 Cloud	PCC	Standard server-side model
ADM 3 Cloud (Image)	PCC	Image generation, editing, Genmoji, Image Playground
AFM 3 Cloud Pro	PCC on Google Cloud	Agentic tool use, complex reasoning

Core, Core Advanced, Cloud, and ADM 3 Cloud (Image) are optimized for Apple silicon.
Cloud Pro alone targets NVIDIA GPUs, running on Google Cloud with PCC extended to that infrastructure.
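
The breakdown above can be written down as a small lookup structure, which makes the "which hardware, and does the request leave the device" question easy to check. This is only a sketch of the published table: the `ModelTier` type and the helper functions are hypothetical names for illustration, not part of any Apple API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str
    runs_on: str    # execution target, as listed in the published table
    hardware: str   # "Apple silicon" or "NVIDIA GPU"
    role: str


# Rows copied from the table; the hardware column follows the paragraph after it.
TIERS = [
    ModelTier("AFM 3 Core", "On-device", "Apple silicon", "Next-gen 3B dense model"),
    ModelTier("AFM 3 Core Advanced", "On-device", "Apple silicon", "20B sparse model"),
    ModelTier("AFM 3 Cloud", "PCC", "Apple silicon", "Standard server-side model"),
    ModelTier("ADM 3 Cloud (Image)", "PCC", "Apple silicon", "Image generation and editing"),
    ModelTier("AFM 3 Cloud Pro", "PCC on Google Cloud", "NVIDIA GPU",
              "Agentic tool use, complex reasoning"),
]


def tiers_on(hardware: str) -> list[str]:
    """Names of all tiers that target the given hardware."""
    return [t.name for t in TIERS if t.hardware == hardware]


def leaves_device(name: str) -> bool:
    """True if a request served by this tier is processed off-device."""
    tier = next(t for t in TIERS if t.name == name)
    return tier.runs_on != "On-device"


print(tiers_on("NVIDIA GPU"))
print(leaves_device("AFM 3 Core Advanced"))
```

Only one tier comes back for NVIDIA GPUs, and both on-device tiers report that nothing leaves the device.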

The 2026-generation public names confirmed from Apple’s primary sources are AFM 3 Core, AFM 3 Core Advanced, AFM 3 Cloud, ADM 3 Cloud (Image), and AFM 3 Cloud Pro.
“Server” appears when comparing with the 2025 generation or describing server-side model roles, but the 2026 generation uses “Cloud.”
There is no standalone public model name “AFM 3 Safety” in the published materials as of this writing. Safety is described through Responsible AI, safety taxonomy, language-specific guardrail models, and human red teaming with native speakers.

A diagram of the execution targets:

flowchart TD
    U["User request"] --> R{"OS / API routing"}
    R --> D["On-device<br/>AFM 3 Core / Core Advanced"]
    R --> A["PCC: Apple silicon<br/>AFM 3 Cloud / ADM 3 Cloud (Image)"]
    R --> G["PCC on Google Cloud<br/>AFM 3 Cloud Pro"]
    D --> D1["3B dense or<br/>1B-4B activated from 20B sparse"]
    A --> A1["Standard server inference<br/>Image generation / editing"]
    G --> G1["NVIDIA GPU<br/>agentic tool use / complex reasoning"]
    A -.-> P["PCC guarantees<br/>stateless / no privileged runtime access<br/>verifiable transparency"]
    G -.-> P

Until this year, Apple Intelligence centered on on-device processing and PCC on Apple silicon servers.
This announcement adds Google Cloud as a PCC execution target.
Apple says it will retain software control, cryptographic signing, binary publication, and researcher verification frameworks.

The 20B on-device sparse model is not offloaded to Google Cloud.
AFM 3 Core Advanced is a device-side model that stores all weights on NAND and loads selected subsets into DRAM per request.
Google Cloud NVIDIA GPUs serve as the execution substrate for AFM 3 Cloud Pro.
In other words, this is not “the device can’t fit 20B so it goes to Google Cloud.” Two separate tracks exist: raising the on-device ceiling, and extending the heaviest server inference through PCC.

20B without putting it all in DRAM

The distinctive part of AFM 3 Core Advanced is that it handles 20B parameters on-device without keeping all of them resident in DRAM.

A standard dense LLM loads all weights into memory during inference.
Even with MoE, because the selected experts change per token, the referenced weights need to sit in a fast memory tier.
Apple’s design stores the full model on NAND flash, selects a fixed set of experts at the prompt processing stage, and loads them into DRAM.
The selection is periodically refreshed during generation, but this is not a per-token weight-swap design.

Apple describes this as a sparse-activation architecture based on Instruction-Following Pruning (IFP).
Core Advanced holds all 20B parameters but activates only 1B to 4B per request.
Always-active shared experts and input-dependent routed experts are combined to keep the DRAM footprint down.
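
To get a feel for why activating 1B to 4B out of 20B matters, here is a rough footprint calculation. Apple has not published the weight precision, so the 4 bits per weight below is purely an assumption for illustration, and the figures cover weights only (no KV cache or activations).

```python
def weight_gb(params: float, bits_per_weight: int) -> float:
    """Storage needed for `params` weights, in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9


ASSUMED_BITS = 4            # assumption: the actual precision is not published
TOTAL_PARAMS = 20e9         # full Core Advanced model, kept on NAND
ACTIVE_RANGE = (1e9, 4e9)   # parameters activated per request, per Apple

full_model_gb = weight_gb(TOTAL_PARAMS, ASSUMED_BITS)
active_gb = tuple(weight_gb(p, ASSUMED_BITS) for p in ACTIVE_RANGE)
active_fraction = tuple(p / TOTAL_PARAMS for p in ACTIVE_RANGE)

print(f"full model on NAND: {full_model_gb:.1f} GB")
print(f"resident in DRAM:   {active_gb[0]:.1f} to {active_gb[1]:.1f} GB")
print(f"active share:       {active_fraction[0]:.0%} to {active_fraction[1]:.0%}")
```

Under that assumed precision, the full model would take about 10 GB of storage while the DRAM-resident part stays between 0.5 GB and 2 GB, that is, 5% to 20% of the weights.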

For context on running large models on a Mac, I previously tested LFM2.5 1.2B JP on M1 Max 64GB, where the model was small enough that memory was not the bottleneck.
AFM 3 Core Advanced is the opposite: to put a 20B-class model on-device, Apple designed where the weights live and how they are loaded.
This is quite different from the usual local LLM approach of “quantize and fit everything in memory.”

graph TD
    A["User input"] --> B["Lightweight dense block"]
    B --> C["Per-prompt<br/>expert selection"]
    C --> D["20B weights on NAND"]
    D --> E["Selected weights<br/>loaded into DRAM"]
    E --> F["Combined with shared experts"]
    F --> G["Inference at 1B-4B equivalent"]
    G --> H["Re-select experts<br/>during generation if needed"]

In effect, this approach treats phone and Mac storage bandwidth as an inference resource.
Apple explicitly states that NAND bandwidth is too slow for per-token weight swapping.
Latency is cut by selecting per-prompt and limiting swap frequency during generation.
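
The per-prompt selection idea can be illustrated with a toy simulation. This is not Apple's implementation: the top-k scoring rule, the refresh interval, and every class and function name here are assumptions made up for the sketch. It only shows the access pattern, where slow storage is read when the expert set changes rather than on every token.

```python
class NandStore:
    """Stand-in for flash storage: holds every routed expert, counts slow reads."""

    def __init__(self, num_experts: int):
        self.weights = {i: f"expert-{i}-weights" for i in range(num_experts)}
        self.reads = 0

    def read(self, idx: int) -> str:
        self.reads += 1
        return self.weights[idx]


def top_k(scores: list[float], k: int) -> set[int]:
    """Indices of the k highest-scoring experts (hypothetical selection rule)."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])


class DramCache:
    """Keeps only the currently selected experts resident."""

    def __init__(self, store: NandStore, k: int):
        self.store = store
        self.k = k
        self.resident: dict[int, str] = {}

    def select(self, scores: list[float]) -> None:
        wanted = top_k(scores, self.k)
        for idx in list(self.resident):
            if idx not in wanted:
                del self.resident[idx]                      # evict unselected experts
        for idx in wanted:
            if idx not in self.resident:
                self.resident[idx] = self.store.read(idx)   # slow NAND read


def generate(cache, prompt_scores, num_tokens, refresh_every, rescore) -> int:
    """Select once per prompt, re-select every `refresh_every` tokens."""
    cache.select(prompt_scores)
    for step in range(1, num_tokens + 1):
        if step % refresh_every == 0:
            cache.select(rescore(step))
        # A real model would produce one token here using cache.resident
        # together with the always-active shared experts.
    return cache.store.reads


store = NandStore(num_experts=8)
cache = DramCache(store, k=2)
scores = [0.1, 0.9, 0.3, 0.8, 0.0, 0.0, 0.0, 0.0]
reads = generate(cache, scores, num_tokens=100, refresh_every=25,
                 rescore=lambda step: scores)
print(reads, sorted(cache.resident))
```

In this run the selection never changes, so 100 generated tokens cost only the two initial reads. A refresh that picks a different expert triggers reads only for the experts not already resident.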

Cloud Pro alone goes to Google Cloud and NVIDIA GPUs

AFM 3 Cloud Pro is positioned as the top-tier server model for agentic tool use and complex reasoning.
This is the only model that runs on NVIDIA GPUs on Google Cloud rather than Apple silicon.

Apple Security Research’s post lists NVIDIA Confidential Computing, Intel TDX, and Google Titan chip as implementation components for PCC on Google Cloud.
It also lists five PCC requirements: stateless computation, enforceable guarantees, no privileged runtime access, non-targetability, and verifiable transparency.

Apple is not just “using Google Cloud.” They are bringing PCC’s verifiability guarantees into Google Cloud.
Apple explains that Google Cloud hardware entering the PCC fleet is managed through a cryptographically verifiable append-only ledger.
User devices are configured to trust only PCC software cryptographically approved by Apple.
Binary publication, researcher tools, and research-mode nodes via the Security Bounty Program are planned.
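
Apple has not published the ledger format, so the following is only a generic illustration of what "cryptographically verifiable append-only" means: each entry's hash covers the previous entry's hash, so rewriting history invalidates everything after it. The class name and the payload fields are hypothetical. Production transparency logs commonly use Merkle trees and signatures rather than a bare hash chain.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash an entry together with the hash of the entry before it."""
    body = json.dumps({"prev": prev_hash, "payload": payload}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


class AppendOnlyLedger:
    def __init__(self):
        self.entries: list[tuple[dict, str]] = []  # (payload, chained hash)

    def append(self, payload: dict) -> str:
        prev = self.entries[-1][1] if self.entries else GENESIS
        digest = entry_hash(prev, payload)
        self.entries.append((payload, digest))
        return digest

    def verify(self) -> bool:
        """Recompute the chain; any edited or reordered entry breaks it."""
        prev = GENESIS
        for payload, digest in self.entries:
            if entry_hash(prev, payload) != digest:
                return False
            prev = digest
        return True


ledger = AppendOnlyLedger()
ledger.append({"node": "example-node-1", "event": "admitted"})  # hypothetical fields
ledger.append({"node": "example-node-2", "event": "admitted"})
print(ledger.verify())
```

Swapping the first payload for a different one while keeping its recorded hash makes `verify()` return `False`, which is the property an auditor relies on.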

Apple also writes that PCC on Google Cloud is not fully implemented yet, and protections will be added incrementally during the summer preview period.
Both Cloud Pro’s inference quality results and the PCC verification tooling are still to come during the preview period.

The name “Google” appears in three contexts

Google’s name comes up repeatedly in this announcement, and it is easy to conflate the different roles it plays.
Reading the primary sources, at least three distinct contexts emerge.

Context	Google’s part	What can be said now
AFM 3 development collaboration	Technology behind the Gemini family	Apple says AFM 3 was built jointly with Google. It does not say that user requests are served by Gemini itself.
PCC execution infrastructure	NVIDIA GPUs on Google Cloud	PCC is extended to Google Cloud for AFM 3 Cloud Pro. Apple says it maintains PCC software control and device-side cryptographic approval.
Developer API	Gemini via Firebase Apple SDK	Google connects Gemini to the Foundation Models framework’s LanguageModel protocol. This is a separate cloud model delivery path from Apple Foundation Models.

Mixing these three leads to misreadings like “all of Apple Intelligence calls Gemini,” “the 20B model runs on Google Cloud,” or “PCC means Apple silicon only.”
In the published information, Cloud Pro’s Google Cloud execution, Core Advanced’s on-device execution, and Gemini’s developer-facing connection are separate topics.

The image model is ADM 3 Cloud, not AFM

Image generation and editing ship under a different name: ADM 3 Cloud (Image), not AFM 3 Cloud.
It powers Image Playground, Genmoji, Photos Spatial Reframing, touch-based editing, and personalization.

Apple says ADM 3 Cloud (Image) natively handles image generation, editing, and Genmoji, using dedicated adapters for downstream editing experiences.
It also adapts to different aspect ratios and resolutions.
The 2025 Image Playground had limited use cases, but the current description centers on photo editing and practical generation.

Google has also announced that Apple developers will be able to call Gemini from the Foundation Models framework.
In a recent post on Gemma 4 12B Unified, I wrote about Google pushing small, unified multimodal designs.
ADM 3 Cloud (Image) sits on a different path from both: it is not a locally distributed model, and under an OS-level image generation API, the developer’s code cannot tell which image model is being used.

Foundation Models framework approaches a single API

WWDC26’s Developer Guide explains that the Foundation Models framework now handles not just Apple Foundation Models but also Claude, Gemini, and other providers that conform to the LanguageModel protocol.
On-device Apple models, Apple models on Private Cloud Compute, and cloud-based external models are brought under the same API surface.

Google’s announcement says iOS 27, macOS 27, iPadOS 27, visionOS 27, and watchOS 27 will support connecting Gemini to the Foundation Models framework via the Firebase Apple SDK.
Using Firebase AI Logic and Firebase App Check, apps can call Gemini without a dedicated backend.

Apple’s Developer Guide states that apps in the App Store Small Business Program with fewer than 2 million cumulative first-time downloads can use next-generation Apple Foundation Models on PCC at no cloud API cost.
Whether this condition survives into production unchanged is uncertain, but if small-scale apps can try Apple’s PCC models without server costs, initial API adoption costs drop.
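
Read literally, the stated condition is a conjunction of two checks. The helper below is a hypothetical sketch for planning purposes only: it is not an Apple API, and the exact terms may change before release.

```python
DOWNLOAD_THRESHOLD = 2_000_000  # "fewer than 2 million cumulative first-time downloads"


def qualifies_for_free_pcc(in_small_business_program: bool,
                           first_time_downloads: int) -> bool:
    """Literal reading of the published condition; both parts must hold."""
    return in_small_business_program and first_time_downloads < DOWNLOAD_THRESHOLD


print(qualifies_for_free_pcc(True, 150_000))
print(qualifies_for_free_pcc(True, 2_000_000))
```

Note that "fewer than" makes the threshold exclusive, so an app at exactly 2 million downloads no longer qualifies under this reading.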

Apple’s PCC documentation describes the developer-facing type as PrivateCloudComputeLanguageModel.
The on-device SystemLanguageModel works offline with a 4K context window.
PCC requires a network connection, has a daily usage cap, a 32K context, and three reasoning levels: light, moderate, and deep.
The API framing is less “pick AFM 3 Cloud Pro directly” and more “choose PCC and raise the reasoning level if needed.”
The internal routing among Cloud, Cloud Pro, and ADM 3 Cloud (Image), and which features escalate to Cloud Pro, cannot be fully determined from the published documentation alone.
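
The on-device versus PCC constraints above can be turned into a small decision helper. This is a sketch of app-side planning logic, not the framework's API: the function, the enum, and the token constants are hypothetical, and "4K" and "32K" are assumed here to mean 4,096 and 32,768 tokens.

```python
from enum import Enum
from typing import Optional


class Reasoning(Enum):
    LIGHT = "light"
    MODERATE = "moderate"
    DEEP = "deep"


ON_DEVICE_CONTEXT_TOKENS = 4 * 1024   # assumption: "4K" means 4,096 tokens
PCC_CONTEXT_TOKENS = 32 * 1024        # assumption: "32K" means 32,768 tokens


def choose_target(prompt_tokens: int, online: bool, quota_left: bool,
                  reasoning: Optional[Reasoning] = None):
    """Pick an execution target and reasoning level for one request."""
    if prompt_tokens > PCC_CONTEXT_TOKENS:
        raise ValueError("prompt exceeds both context windows")
    fits_on_device = prompt_tokens <= ON_DEVICE_CONTEXT_TOKENS
    pcc_available = online and quota_left
    wants_pcc = reasoning is not None or not fits_on_device
    if wants_pcc and pcc_available:
        return "pcc", reasoning or Reasoning.LIGHT
    if fits_on_device:
        # Either PCC was not needed, or it is unavailable and we degrade gracefully.
        return "on-device", None
    raise RuntimeError("prompt needs PCC, but PCC is unavailable")


print(choose_target(1_000, online=True, quota_left=True))
print(choose_target(10_000, online=True, quota_left=True))
print(choose_target(1_000, online=False, quota_left=True, reasoning=Reasoning.DEEP))
```

The design choice here is to prefer the on-device model whenever it suffices and to treat the reasoning level as the only knob on the PCC side, which mirrors the "choose PCC and raise the reasoning level" framing rather than selecting a specific cloud model.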

App Intents is expanding in parallel.
Siri AI, Spotlight’s semantic index, and the View Annotations API for referencing on-screen elements share the same API surface.
What determines implementation effort is not the model itself, but how much of the app’s data and actions are exposed to Siri and Apple Intelligence.

Safety is an operational story, not a model name

No standalone AFM 3 Safety model appears in the published materials.
Safety is structured as classification, alignment, and guardrails applied across the model family.
Apple’s Responsible AI description lists safety taxonomy, multilingual post-training alignment, language-specific guardrail models, and human red teaming with native speakers for supported locales.

On the Foundation Models framework side, how apps handle generation failures and guardrail violations is a real implementation concern.
PCC in particular stacks additional API errors: network failure, daily cap, unsupported device, Apple Intelligence disabled.
Implementation work therefore covers not just model performance but also how the app recovers when a guardrail stops generation.
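
One way to think about that recovery path is as an explicit mapping from failure type to app behavior. The exception classes and the `respond` function below are hypothetical stand-ins named after the failure cases listed above, not the framework's real error types, and the fallback policy is just one reasonable choice.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class GenerationError(Exception): pass
class GuardrailViolation(GenerationError): pass
class NetworkFailure(GenerationError): pass
class DailyCapReached(GenerationError): pass
class UnsupportedDevice(GenerationError): pass
class AppleIntelligenceDisabled(GenerationError): pass


@dataclass
class Outcome:
    text: Optional[str]
    source: str  # "pcc", "on-device", "blocked", or "unavailable"


def respond(prompt: str,
            pcc_generate: Callable[[str], str],
            local_generate: Callable[[str], str]) -> Outcome:
    """Try PCC first, then fall back on-device where that makes sense."""
    try:
        return Outcome(pcc_generate(prompt), "pcc")
    except (NetworkFailure, DailyCapReached):
        pass  # transient or quota problem: try the on-device model below
    except GuardrailViolation:
        return Outcome(None, "blocked")        # do not retry blocked content elsewhere
    except (UnsupportedDevice, AppleIntelligenceDisabled):
        return Outcome(None, "unavailable")    # hide or disable the feature in the UI
    try:
        return Outcome(local_generate(prompt), "on-device")
    except GuardrailViolation:
        return Outcome(None, "blocked")


def offline(prompt: str) -> str:
    raise NetworkFailure()


print(respond("Summarize my notes", offline, lambda p: "local summary"))
```

The point of the sketch is that a guardrail stop and a network outage call for different UI: the first should surface a "can't help with that" state, the second can often be hidden behind a fallback.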

Still waiting on the technical report

Apple says a technical report and updated evaluations will come later in summer 2026.
Published evaluations so far center on Apple’s internal human evaluation.

In preference comparisons on general text, AFM 3 Core was preferred 45.6% of the time versus 23.3% for the 2025 Core model.
AFM 3 Cloud was preferred 64.7% of the time versus 8.7% for the 2025 Server model.
Cloud Pro shows roughly 10% relative improvement on text and 14% on image understanding over Cloud in overall response satisfaction.
For TTS, Core Advanced achieved MOS 4.15 vs 3.87 for the existing production TTS, and 4.24 vs 3.82 for conversational style.
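
Because each preference pair does not add up to 100%, it helps to derive a few comparable numbers from the published figures. The sketch below only does arithmetic on the values quoted above. Treating the remainder as ties or "no preference" is an assumption, since Apple's methodology has not been published.

```python
def preference_summary(new_pct: float, old_pct: float) -> dict:
    """Derived views of a pairwise preference result (percentages in, rounded out)."""
    return {
        # Assumption: whatever is left over was rated a tie or no preference.
        "remainder_pct": round(100 - new_pct - old_pct, 1),
        "win_loss_ratio": round(new_pct / old_pct, 2),
        "share_of_decided_pct": round(100 * new_pct / (new_pct + old_pct), 1),
    }


core = preference_summary(45.6, 23.3)    # AFM 3 Core vs 2025 Core
cloud = preference_summary(64.7, 8.7)    # AFM 3 Cloud vs 2025 Server

# MOS gaps for Core Advanced TTS vs the existing production TTS
mos_gain_default = round(4.15 - 3.87, 2)
mos_gain_conversational = round(4.24 - 3.82, 2)

print(core)
print(cloud)
print(mos_gain_default, mos_gain_conversational)
```

Seen this way, the Core result is roughly a 2:1 preference with about a third of comparisons undecided, while the Cloud result is far more lopsided. The MOS gains are 0.28 and 0.42 points on the usual five-point scale.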

External benchmarks, model card-level details, per-model context lengths, API constraints, and device-level support remain thin.
Core Advanced in particular is described only as “unlocked on the most powerful Apple silicon systems,” without specifying which iPhones, iPads, or Macs qualify.
For developers, the real differentiator is not model names but which execution target gets routed, which API is available, and what data leaves the device.