Linux Kernel 7.0 Drops PREEMPT_NONE, PostgreSQL Throughput Halved, Scheduler Maintainer Refuses to Revert

TL;DR

Affected: PostgreSQL 14–17 running on Linux kernel 7.0 (Ubuntu 26.04 LTS, Fedora 44, Arch Linux). Most severe on high-core-count ARM64 machines.

Workaround: Set huge_pages = on. Page faults drop sharply and the regression mostly disappears.

Decision aid: RHEL and Amazon Linux ship their own backported kernels and are unaffected for now. Hold off on Ubuntu 26.04 LTS upgrades until the dust settles.

On April 3, AWS engineer Salvatore Dipietro reported on LKML (Linux Kernel Mailing List) that PostgreSQL throughput dropped to roughly 51% of baseline on Linux kernel 7.0-rc. The cause: kernel 7.0 removed PREEMPT_NONE and PREEMPT_VOLUNTARY from major architectures, leaving only PREEMPT_LAZY and PREEMPT_FULL. Scheduler maintainer Peter Zijlstra (Intel) refused to revert the change, telling the PostgreSQL side to use the new “RSEQ timeslice extension” in kernel 7.0.

“Linux kernel 7.0” refers to the kernel version, not the OS itself (distributions like Ubuntu, Fedora, etc.). The stable release of kernel 7.0 is expected in mid-April, and it will ship with Ubuntu 26.04 LTS (scheduled for April 23). Fedora and rolling-release distributions like Arch Linux will follow shortly after. PostgreSQL servers upgrading to these distributions could therefore be directly impacted. Enterprise distributions like CentOS, RHEL, and Amazon Linux maintain their own backported kernels, so they won’t be affected immediately.
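Because the regression is tied to the kernel rather than to the distribution release, the first thing to check on a given host is the running kernel version. A minimal sketch follows; the helper name `kernel_major` is illustrative and does not come from the original report.

```shell
# Print the major version from a kernel release string such as the output of `uname -r`.
kernel_major() {
    local release="$1"
    echo "${release%%.*}"    # strip everything from the first dot onward
}

# Report whether this host runs a 7.x or later kernel.
if [ "$(kernel_major "$(uname -r)")" -ge 7 ]; then
    echo "kernel $(uname -r): 7.0 or later, check the preemption model"
else
    echo "kernel $(uname -r): older than 7.0"
fi
```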

The Linux kernel has a longstanding rule: “don’t break userspace.” Linus Torvalds has stated this repeatedly. By that standard, the current response is highly unusual.

What Are Preemption Models?

Linux kernel preemption is the mechanism by which the kernel forcibly interrupts a running process to hand the CPU to another. Multiple models exist depending on when interruption is permitted.

Model	Behavior	Use Case
PREEMPT_NONE	No preemption during kernel code execution	Servers. Maximum throughput
PREEMPT_VOLUNTARY	Preempt only at explicit preemption points	Desktop compromise
PREEMPT_LAZY	Preempt only at timeslice boundaries (not immediately)	New model introduced in Linux 6.13
PREEMPT_FULL	Preempt anytime	Low-latency desktops (hard real-time is the separate PREEMPT_RT)

PREEMPT_NONE was widely used for server workloads. Because a task is never involuntarily preempted while it executes kernel code (for example, while servicing a page fault), the risk of stalling a process in the middle of a userspace critical section (a region under mutual exclusion) was kept low.

In Linux 7.0, as part of Zijlstra’s long-term effort to simplify the preemption model, PREEMPT_NONE and PREEMPT_VOLUNTARY were removed from major architectures including x86, ARM64, LoongArch, PowerPC, RISC-V, and s390. PREEMPT_LAZY (introduced in Linux 6.13) is positioned as the successor to PREEMPT_NONE, but as the name implies, it only “delays” preemption — preemption itself still occurs.
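To see which preemption options a given kernel was built with, one option is to read its build configuration. The sketch below assumes the distribution installs the config at /boot/config-$(uname -r), which is common but not universal; the function name `preempt_options` is illustrative. Note that the kernel's own config symbol for the full model is CONFIG_PREEMPT, and the output also includes helper options such as PREEMPT_BUILD or PREEMPT_DYNAMIC when they are set.

```shell
# List the preemption-related options that are enabled (=y) in a kernel config file.
preempt_options() {
    local config_file="$1"
    # Matches e.g. CONFIG_PREEMPT=y and CONFIG_PREEMPT_LAZY=y; skips "is not set" lines.
    grep -E '^CONFIG_PREEMPT[A-Z_]*=y$' "$config_file" \
        | sed -e 's/^CONFIG_//' -e 's/=y$//'
}

# Assumed location of the running kernel's config; skipped if it is not readable.
config="/boot/config-$(uname -r)"
if [ -r "$config" ]; then
    preempt_options "$config"
fi
```

On kernels built with PREEMPT_DYNAMIC, the model actually in effect can also be chosen at boot, so the build config shows what is available rather than what is necessarily active.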

Why PostgreSQL Is Hit So Hard

PostgreSQL forks a new process for each connection. When these processes access the shared buffer pool (shared_buffers), they use a userspace spinlock (s_lock) for mutual exclusion.

graph TD
    A[Client connections<br/>1024 concurrent] --> B[PostgreSQL processes<br/>fork per connection]
    B --> C[Access shared buffer pool]
    C --> D[Acquire s_lock spinlock]
    D --> E{PREEMPT_NONE}
    D --> F{PREEMPT_LAZY}
    E --> G[No preemption while<br/>holding lock]
    G --> H[Critical section<br/>completes immediately]
    F --> I[May be preempted while<br/>holding lock]
    I --> J[Other 95 processes<br/>spin-wait]
    J --> K[55% of CPU time spent<br/>on spinlock contention]

Under PREEMPT_NONE, a process that entered the kernel while holding a spinlock, for example on a page fault, was not preempted there. Critical sections completed quickly, and the time other processes spent spinning on lock acquisition was minimal.

Under PREEMPT_LAZY, the kernel may preempt a process even while it holds a spinlock. On a 96-core server, when the lock holder is preempted, processes on the other 95 cores can end up busy-waiting for the same lock, and the contention cascades. Profiling showed 55% of CPU time was consumed by PostgreSQL’s userspace spinlock (s_lock).
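A rough back-of-the-envelope model gives a feel for the amplification: if the lock holder is off-CPU for some time while every other core spins, the CPU time burned is the stall multiplied by the number of spinning cores. The 100-microsecond stall below is a made-up illustrative figure, not a measurement from the report, and the model ignores the backoff that a real spinlock implementation may apply.

```shell
# CPU time (in microseconds) burned by spinning waiters while the lock holder is off-CPU.
# Simplification: every other core spins for the entire stall.
wasted_cpu_us() {
    local cores="$1" stall_us="$2"
    echo $(( (cores - 1) * stall_us ))
}

# 96 cores, hypothetical 100 us stall: 95 waiters x 100 us
wasted_cpu_us 96 100    # prints 9500
```

In this simplified picture a single 100 us stall costs 9.5 ms of aggregate CPU time, which is why a small change in when preemption happens can show up as a large throughput difference.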

Page Faults Make It Worse

Minor page faults pour fuel on the fire. When a new connection forks and first touches shared memory, minor page faults occur. With the default 4KB pages, a flood of page faults happens while holding the spinlock, extending lock hold time far beyond expectations.

Using Huge Pages (1GB or 2MB pages) drastically reduces page fault frequency, significantly mitigating the regression. However, enabling Huge Pages often requires elevated privileges in container environments, so it’s not a universal workaround.

Benchmark Results

Dipietro’s reported benchmark environment and results:

Parameter	Detail
Instance	AWS EC2 m8g.24xlarge
CPU	Graviton4 (ARM64) 96 vCPU
PostgreSQL	Version 17
Benchmark	pgbench simple-update
Clients	1024
Threads	96
Duration	1200 seconds
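For readers who want to try a comparable run, the parameters above map onto standard pgbench options roughly as follows. The database name `bench` is a placeholder, and the scale factor used for initialization is not stated here, so none is shown.

```shell
# Build the pgbench command line matching the table: simple-update script,
# 1024 clients, 96 worker threads, 1200 seconds.
pgbench_cmd() {
    local db="$1"
    echo "pgbench --builtin=simple-update --client=1024 --jobs=96 --time=1200 $db"
}

pgbench_cmd bench
```

The target database must first be initialized with pgbench -i, and max_connections has to be raised well above its default of 100 to accept 1024 clients.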

Kernel Configuration	TPS (Transactions/sec)	Ratio
Linux 7.0 (PREEMPT_LAZY)	~50,751	0.51x
PREEMPT_NONE restore patch	~98,565	1.0x (baseline)

A ~49% throughput drop. Dipietro submitted a revert patch, but Zijlstra rejected it.
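The ratio and the percentage drop follow directly from the two TPS figures in the table. A small sketch of the arithmetic, with illustrative function names:

```shell
# Ratio of measured TPS to baseline TPS, to two decimals.
tps_ratio() {
    awk -v cur="$1" -v base="$2" 'BEGIN { printf "%.2f\n", cur / base }'
}

# Percentage drop relative to baseline, rounded to a whole number.
tps_drop_pct() {
    awk -v cur="$1" -v base="$2" 'BEGIN { printf "%.0f\n", (1 - cur / base) * 100 }'
}

tps_ratio 50751 98565      # prints 0.51
tps_drop_pct 50751 98565   # prints 49
```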

The Developer Conflict

The kernel developers’ response is what makes this contentious.

Zijlstra’s position is clear: “PostgreSQL should use RSEQ timeslice extension.” RSEQ (Restartable Sequences) was originally a kernel facility for per-CPU data access, but Linux 7.0 added a “timeslice extension” feature. Before entering a critical section, an application can request a preemption grace period from the scheduler via RSEQ.

graph TD
    A[Application] --> B["rseq::slice_ctrl::request = 1<br/>(request grace period)"]
    B --> C[Execute critical section]
    C --> D{Kernel attempts<br/>preemption}
    D --> E["Grace period granted<br/>(default 5us extension)"]
    E --> F[Critical section completes]
    F --> G["rseq::slice_ctrl::request = 0"]

PostgreSQL committer Andres Freund pushed back on LKML:

The workaround for a bug introduced in 7.0 is a new feature from 7.0 — how does that make sense?
PostgreSQL supports a wide range of kernel versions and operating systems, but the RSEQ timeslice extension exists only on Linux 7.0+.
Integrating a kernel-specific low-level API into the spinlock implementation is a major design change, and backporting it to existing major versions (PostgreSQL 14-17) is unrealistic.

Freund’s remark — “requiring the use of a new low level facility that was introduced in the 7.0 kernel, to address a regression that exists only in 7.0+, seems not great” — captures the structural problem succinctly.

As of April 5, Linus Torvalds has not commented on the thread.

This Has Happened Before

This isn’t the first time the Linux scheduler has broken PostgreSQL.

Year	Kernel	Issue	Outcome
2008	Linux 2.6.23+	CFS (Completely Fair Scheduler) introduction	PostgreSQL performance regression
2012	Linux 3.6-rc	Mike Galbraith’s scheduling optimization	pgbench dropped 20%. Reverted before final release
2026	Linux 7.0-rc	PREEMPT_NONE removal	49% throughput drop. Revert refused

In the Linux 3.6 case in 2012, a similar pgbench performance regression was reverted before the final release. LWN.net covered it in detail. That time, the revert followed Linus’s userspace protection policy. This time, the maintainer is refusing to revert — a key difference.

The structural issue remains unchanged: PostgreSQL’s architecture (process model, shared memory, userspace spinlocks) is exceptionally sensitive to scheduler behavior.

Impact Scope

Upcoming Release Schedule

As noted above, the stable release of kernel 7.0 is expected in mid-April (estimated around April 13). The concern is that Ubuntu 26.04 LTS follows on April 23, only days later, and many enterprise PostgreSQL deployments run on Ubuntu LTS. If the regression ships in the LTS kernel, server administrators will need to apply workarounds themselves until either the distribution patches the kernel or PostgreSQL adds RSEQ support.

Impact on Other Databases

MySQL/InnoDB uses futex-based mutexes and depends less on userspace spinlocks than PostgreSQL. However, other applications that use userspace spinlocks certainly exist, and no systematic testing has been done. On machines that combine high core counts (96+ vCPUs), ARM64, and no Huge Pages, databases other than PostgreSQL could be affected as well.

Current Workarounds

The confirmed workaround is enabling Huge Pages.

# Reserve huge pages (run as root); 4096 pages x 2MB = 8GB, assuming a 2MB huge page size
echo 4096 > /proc/sys/vm/nr_hugepages

# postgresql.conf
huge_pages = on

Larger pages (2MB in the example above, or 1GB where the platform is configured for them) drastically reduce page fault frequency, largely eliminating the page-fault-during-spinlock problem. As mentioned, however, Huge Pages may not be configurable in container environments or managed services.
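The number to write into nr_hugepages depends on the huge page size of the host. A minimal sketch of the sizing arithmetic follows; the function name is illustrative, and the 8GB figure is reused from the example above rather than being a recommendation. PostgreSQL's total shared memory is somewhat larger than shared_buffers alone, so some headroom is advisable.

```shell
# Number of huge pages needed to cover a region of size_mb megabytes,
# given the huge page size in kB (rounded up).
hugepages_needed() {
    local size_mb="$1" page_kb="$2"
    echo $(( (size_mb * 1024 + page_kb - 1) / page_kb ))
}

hugepages_needed 8192 2048       # 2MB pages: prints 4096
hugepages_needed 8192 1048576    # 1GB pages: prints 8

# Huge page size and pool counters on the current host, if the kernel exposes them.
if [ -r /proc/meminfo ]; then
    grep -E '^(Hugepagesize|HugePages_Total|HugePages_Free):' /proc/meminfo || true
fi
```

On PostgreSQL 15 and later, the server can report the exact requirement itself through the shared_memory_size_in_huge_pages parameter, which is likely a better source than a hand calculation when it is available.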

PostgreSQL’s development trunk has already completed a refactoring of BufFreeListLock (the spinlock protecting the buffer free list) that addresses the root cause. But this is only for future major versions — it does not apply to the current PostgreSQL 14-17.