
Linux Kernel 7.0 Drops PREEMPT_NONE, PostgreSQL Throughput Halved, Scheduler Maintainer Refuses to Revert

Ikesan

On April 3, AWS engineer Salvatore Dipietro reported on LKML (Linux Kernel Mailing List) that PostgreSQL throughput dropped to roughly 51% of baseline on Linux kernel 7.0-rc. The cause: kernel 7.0 removed PREEMPT_NONE and PREEMPT_VOLUNTARY from major architectures, leaving only PREEMPT_LAZY and PREEMPT_FULL. Scheduler maintainer Peter Zijlstra (Intel) refused to revert the change, telling the PostgreSQL side to use the new “RSEQ timeslice extension” in kernel 7.0.

“Linux kernel 7.0” refers to the kernel itself, not to a complete operating system; distributions such as Ubuntu and Fedora package the kernel. The stable release of kernel 7.0 is expected in mid-April, and Ubuntu 26.04 LTS (scheduled for April 23) will ship with it. Distributions that track upstream kernels closely, such as Fedora and Arch Linux, will follow shortly after. PostgreSQL servers upgraded to these distributions could therefore be directly affected. Enterprise distributions such as CentOS, RHEL, and Amazon Linux maintain their own kernel branches with backports, so they won’t be affected immediately.

The Linux kernel has a longstanding rule: “don’t break userspace.” Linus Torvalds has stated this repeatedly. By that standard, the current response is highly unusual.

What Are Preemption Models?

Linux kernel preemption is the mechanism by which the kernel forcibly interrupts a running process to hand the CPU to another. Multiple models exist depending on when interruption is permitted.

| Model | Behavior | Use Case |
| --- | --- | --- |
| PREEMPT_NONE | No preemption during kernel code execution | Servers. Maximum throughput |
| PREEMPT_VOLUNTARY | Preempt only at explicit preemption points | Desktop compromise |
| PREEMPT_LAZY | Preempt only at timeslice boundaries (not immediately) | New model introduced in Linux 6.13 |
| PREEMPT_FULL | Preempt anytime | Real-time priority |

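PREEMPT_NONE was widely used for server workloads. Because the kernel does not preempt a task while it is executing kernel code, a process that enters the kernel inside a userspace critical section (a region under mutual exclusion), for example on a page fault, is far less likely to be descheduled while it holds the lock.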

In Linux 7.0, as part of Zijlstra’s long-term effort to simplify the preemption model, PREEMPT_NONE and PREEMPT_VOLUNTARY were removed from major architectures including x86, ARM64, LoongArch, PowerPC, RISC-V, and s390. PREEMPT_LAZY (introduced in Linux 6.13) is positioned as the successor to PREEMPT_NONE, but as the name implies, it only “delays” preemption — preemption itself still occurs.
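To see which of these models a given kernel was built with, the kernel config can be inspected. The following is a minimal sketch, not an official tool: `preempt_model` is a hypothetical helper name, and it assumes the config text is available on stdin. Note that the kernel's own symbol for what this article calls PREEMPT_FULL is `CONFIG_PREEMPT`, and that kernels built with `CONFIG_PREEMPT_DYNAMIC` can switch models at boot, so the build-time value is only the default.

```shell
# preempt_model: read a kernel config on stdin and print the preemption
# model it selects. Hypothetical helper, written for illustration only.
preempt_model() {
    config=$(cat)            # slurp the whole config from stdin
    model=unknown
    for sym in NONE VOLUNTARY LAZY; do
        # "=y" directly after the symbol avoids matching e.g. PREEMPT_NONE_BUILD
        if printf '%s\n' "$config" | grep -q "^CONFIG_PREEMPT_${sym}=y"; then
            model="PREEMPT_${sym}"
        fi
    done
    # CONFIG_PREEMPT (no suffix) is the fully preemptible model
    if printf '%s\n' "$config" | grep -q '^CONFIG_PREEMPT=y'; then
        model=PREEMPT_FULL
    fi
    # With PREEMPT_DYNAMIC the model above is only the boot-time default
    if printf '%s\n' "$config" | grep -q '^CONFIG_PREEMPT_DYNAMIC=y'; then
        model="$model (dynamic)"
    fi
    printf '%s\n' "$model"
}

# Usage on a live system (the config path is a common distro convention,
# not guaranteed to exist everywhere):
#   preempt_model < "/boot/config-$(uname -r)"
```

On systems that expose `/proc/config.gz` instead, piping `zcat /proc/config.gz` into the same function should work the same way.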

Why PostgreSQL Is Hit So Hard

PostgreSQL forks a new process for each connection. When these processes access the shared buffer pool (shared_buffers), they use a userspace spinlock (s_lock) for mutual exclusion.

graph TD
    A[Client connections<br/>1024 concurrent] --> B[PostgreSQL processes<br/>fork per connection]
    B --> C[Access shared buffer pool]
    C --> D[Acquire s_lock spinlock]
    D --> E{PREEMPT_NONE}
    D --> F{PREEMPT_LAZY}
    E --> G[No preemption while<br/>holding lock]
    G --> H[Critical section<br/>completes immediately]
    F --> I[May be preempted while<br/>holding lock]
    I --> J[Other 95 processes<br/>spin-wait]
    J --> K[55% of CPU time spent<br/>on spinlock contention]

Under PREEMPT_NONE, a process holding a spinlock was not preempted while it was executing kernel code (for example, while servicing a page fault). Critical sections completed quickly, and the time other processes spent spinning on lock acquisition was minimal.

Under PREEMPT_LAZY, the kernel may preempt a process even while it holds a spinlock. On a 96-core server, when the lock holder is preempted, the remaining 95 processes fall into busy-wait, causing cascading contention. Profiling showed 55% of CPU time was consumed by PostgreSQL’s userspace spinlock (s_lock).
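A back-of-the-envelope calculation shows why a single preempted lock holder is so expensive. The sketch below is an upper bound under a deliberately pessimistic assumption: every other process wants the same lock and spins for the entire time the holder is off-CPU. PostgreSQL's s_lock actually backs off to short sleeps after spinning, so the real waste is lower. The function name and the 1000 µs figure are illustrative, not taken from the report.

```shell
# wasted_cpu_us NPROCS HOLDER_OFF_CPU_US
# Upper bound on CPU time burned by waiters while the lock holder is
# descheduled, assuming all other processes spin the whole time.
wasted_cpu_us() {
    nprocs=$1
    off_cpu_us=$2
    echo $(( (nprocs - 1) * off_cpu_us ))
}

wasted_cpu_us 96 1000   # 95 waiters x 1000 us each; prints 95000
```

In this model, one millisecond of lost time for the holder turns into up to 95 milliseconds of CPU time spent spinning across the machine.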

Page Faults Make It Worse

Minor page faults pour fuel on the fire. When a new connection forks and first touches shared memory, minor page faults occur. With the default 4KB pages, a flood of page faults happens while holding the spinlock, extending lock hold time far beyond expectations.

Using Huge Pages (1GB or 2MB pages) drastically reduces page fault frequency, significantly mitigating the regression. However, enabling Huge Pages often requires elevated privileges in container environments, so it’s not a universal workaround.
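The difference in scale is easy to compute. The sketch below counts how many pages are needed to map a shared region at each page size; each page is a potential minor fault the first time a process touches it. The 8GB region size and the `pages_for` helper name are assumptions chosen for illustration.

```shell
# pages_for SIZE_MB PAGE_KB
# Number of pages needed to map SIZE_MB of memory with PAGE_KB pages,
# rounded up to a whole page.
pages_for() {
    size_kb=$(( $1 * 1024 ))
    page_kb=$2
    echo $(( (size_kb + page_kb - 1) / page_kb ))
}

pages_for 8192 4          # 4KB pages: prints 2097152
pages_for 8192 2048       # 2MB pages: prints 4096
pages_for 8192 1048576    # 1GB pages: prints 8
```

Going from 4KB to 2MB pages cuts the number of pages, and therefore the number of possible first-touch faults, by a factor of 512.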

Benchmark Results

Dipietro’s reported benchmark environment and results:

| Parameter | Detail |
| --- | --- |
| Instance | AWS EC2 m8g.24xlarge |
| CPU | Graviton4 (ARM64), 96 vCPU |
| PostgreSQL | Version 17 |
| Benchmark | pgbench simple-update |
| Clients | 1024 |
| Threads | 96 |
| Duration | 1200 seconds |

| Kernel Configuration | TPS (Transactions/sec) | Ratio |
| --- | --- | --- |
| Linux 7.0 (PREEMPT_LAZY) | ~50,751 | 0.51x |
| PREEMPT_NONE restore patch | ~98,565 | 1.0x (baseline) |

A ~49% throughput drop. Dipietro submitted a revert patch, but Zijlstra rejected it.
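For readers who want to check the arithmetic, the ratio and the percentage drop follow directly from the two TPS figures in the table. This is a small sketch using awk for the floating-point math; the helper names are made up for this example.

```shell
# tps_ratio NEW_TPS BASELINE_TPS: ratio rounded to two decimals
tps_ratio() {
    awk -v new="$1" -v base="$2" 'BEGIN { printf "%.2f\n", new / base }'
}

# tps_drop_pct NEW_TPS BASELINE_TPS: throughput lost, in percent
tps_drop_pct() {
    awk -v new="$1" -v base="$2" 'BEGIN { printf "%.1f\n", (1 - new / base) * 100 }'
}

tps_ratio 50751 98565      # prints 0.51
tps_drop_pct 50751 98565   # prints 48.5
```

The 48.5% result is what the article rounds to roughly 49%.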

The Developer Conflict

The kernel developers’ response is what makes this contentious.

Zijlstra’s position is clear: PostgreSQL should use the RSEQ timeslice extension. RSEQ (Restartable Sequences) was originally a kernel facility for per-CPU data access, but Linux 7.0 added a “timeslice extension” feature. Before entering a critical section, an application can use RSEQ to ask the scheduler for a short grace period before it is preempted.

graph TD
    A[Application] --> B["rseq::slice_ctrl::request = 1<br/>(request grace period)"]
    B --> C[Execute critical section]
    C --> D{Kernel attempts<br/>preemption}
    D --> E["Grace period granted<br/>(default 5us extension)"]
    E --> F[Critical section completes]
    F --> G["rseq::slice_ctrl::request = 0"]

PostgreSQL committer Andres Freund pushed back on LKML:

  • The workaround for a bug introduced in 7.0 is a new feature from 7.0 — how does that make sense?
  • PostgreSQL supports a wide range of kernel versions and operating systems. RSEQ support only works on Linux 7.0+.
  • Integrating a kernel-specific low-level API into the spinlock implementation is a major design change, and backporting it to existing major versions (PostgreSQL 14-17) is unrealistic.

Freund’s remark — “requiring the use of a new low level facility that was introduced in the 7.0 kernel, to address a regression that exists only in 7.0+, seems not great” — captures the structural problem succinctly.

As of April 5, Linus Torvalds has not commented on the thread.

This Has Happened Before

This isn’t the first time the Linux scheduler has broken PostgreSQL.

| Year | Kernel | Issue | Outcome |
| --- | --- | --- | --- |
| 2008 | Linux 2.6.23+ | CFS (Completely Fair Scheduler) introduction | PostgreSQL performance regression |
| 2012 | Linux 3.6-rc | Mike Galbraith’s scheduling optimization | pgbench dropped 20%. Reverted before final release |
| 2026 | Linux 7.0-rc | PREEMPT_NONE removal | 49% throughput drop. Revert refused |

In the 2012 Linux 3.6 case, the change behind a similar pgbench regression was reverted before the final release, as LWN.net covered in detail. That revert followed Linus’s userspace protection policy. This time, the maintainer is refusing to revert, which is the key difference.

The structural issue remains unchanged: PostgreSQL’s architecture (process model, shared memory, userspace spinlocks) is exceptionally sensitive to scheduler behavior.

Impact Scope

Upcoming Release Schedule

As noted above, the stable release of kernel 7.0 is expected in mid-April (estimated around April 13). The concern is that Ubuntu 26.04 LTS follows shortly after, on April 23, and many enterprise PostgreSQL deployments run on Ubuntu LTS. If the regression ships in the LTS kernel, server administrators will need to apply workarounds themselves until either the distribution patches the kernel or PostgreSQL adds RSEQ support.

Impact on Other Databases

MySQL/InnoDB uses futex-based mutexes and has lower dependency on userspace spinlocks than PostgreSQL. However, other applications using userspace spinlocks certainly exist, and no systematic testing has been done. In environments with high core counts (96+ vCPUs), ARM64 architecture, and no Huge Pages, databases beyond PostgreSQL could be affected.

Current Workarounds

The confirmed workaround is enabling Huge Pages.

# Reserve 4096 huge pages (8GB at a 2MB huge page size); run as root
echo 4096 > /proc/sys/vm/nr_hugepages
# postgresql.conf
huge_pages = on

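Huge Pages drastically reduce page fault frequency, largely eliminating the page-fault-during-spinlock problem. The commands above reserve pages of the default huge page size (typically 2MB); 1GB pages must be reserved separately. As mentioned, however, Huge Pages may not be configurable in container environments or managed services.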
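Before restarting PostgreSQL it is worth confirming that the kernel actually reserved the pages, since a fragmented system may grant fewer than requested. The following sketch parses `/proc/meminfo`-style text from stdin and reports the reserved amount in MB; `hugepage_reserved_mb` is a hypothetical helper name, and only the standard `HugePages_Total` and `Hugepagesize` fields are read.

```shell
# hugepage_reserved_mb: read /proc/meminfo-style text on stdin and print
# how many MB of huge pages are reserved at the default huge page size.
hugepage_reserved_mb() {
    awk '/^HugePages_Total:/ { n = $2 }     # number of reserved pages
         /^Hugepagesize:/    { kb = $2 }    # size of one page in kB
         END { print n * kb / 1024 }'
}

# Usage on a live system:
#   hugepage_reserved_mb < /proc/meminfo
```

With `huge_pages = on`, PostgreSQL refuses to start if it cannot obtain huge pages, so a reserved amount smaller than the shared memory the server needs will show up as a startup failure rather than a silent fallback to 4KB pages.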

PostgreSQL’s development trunk has already completed a refactoring of BufFreeListLock (the spinlock protecting the buffer free list) that addresses the root cause. But this is only for future major versions — it does not apply to the current PostgreSQL 14-17.