Splitting Correctness from Throughput in XLSX Streaming Writes
Contents
I came across “Splitting Correctness from Throughput: A Hybrid Approach to XLSX Streaming Writes” on DEV Community and it’s a good read.
The idea is to split XLSX writing into two concerns: not breaking as OOXML, and emitting many cells fast.
XLSX is a ZIP full of XML, so writing rows and cells by hand isn’t hard.
But once you start hand-rolling relationships, content types, namespaces, themes, and drawing rels, you end up with files that trip Excel’s repair dialog.
The original post draws the line right there.
Let Apache POI build a correct XLSX scaffold, and stream only <sheetData> — the part where iteration count blows up — with a StringBuilder.
XLSX is slow because of per-cell object creation
Apache POI’s XSSFWorkbook exposes cells, strings, and style references as objects.
POI handles correctness, but the entire workbook lives in memory, so large tables get heavy fast.
SXSSFWorkbook is POI’s streaming write API.
The official docs describe how it keeps only a window of rows in memory and flushes older ones, allowing huge spreadsheets.
But the per-cell API is still on POI’s model, so Cell and RichTextString allocation overhead remains.
The original benchmark uses 100K rows, mixed types, shared string table on, JMH-measured.
The numbers alone look like this.
| Approach | Time | Memory |
|---|---|---|
XSSFWorkbook via Jackson layer | 334 ms | 258 MB |
SXSSFWorkbook direct | 283 ms | 207 MB |
| scaffold-based hybrid | 150 ms | 191 MB |
47% write-time reduction vs. SXSSFWorkbook.
The memory delta is less dramatic, but it drops because per-cell wrapper objects aren’t allocated.
What’s getting faster here is not ZIP or XML processing as a whole.
It’s pulling the high-iteration cell creation off POI’s object model.
Reproducing on M1 Max
Before trusting the numbers, I checked if they reproduce locally.
Ran the WriteBenchmark shipped in scndry/jackson-dataformat-spreadsheet v1.4.1, straight through JMH.
4 columns (name/quantity/price/description), mixed-type, 100K rows, warmup 2 iter / measurement 3 iter / fork 1, JDK 21 (Homebrew openjdk 21.0.11), M1 Max.
| Benchmark | Time (ms/op) | What it is |
|---|---|---|
jacksonSpreadsheet | 253 | scaffold-hybrid (default SpreadsheetMapper) |
fastExcel | 268 | A separate fast XLSX library |
poiSXSSF | 437 | SXSSFWorkbook direct |
jacksonSpreadsheetPOI | 501 | scndry’s library on POI User Model (no hybrid) |
fesod | 532 | Reference value |
Absolute numbers don’t match the original (hybrid 150ms / SXSSF 283ms).
JMH iteration count was reduced, and the JVM and hardware differ.
The relative ratio holds though: original hybrid/SXSSF = 53%, M1 Max = 58%.
The claim “stream <sheetData> yourself and you halve the time of SXSSF” survived locally.
Inverting that: 100K rows in the 283ms ballpark on a typical JVM, halved with this optimization — that order of magnitude doesn’t shift much across machines either.
The same repository ships MemoryBenchmark and SharedStringsBenchmark if you want to dig into memory.
Let POI build the scaffold, swap only sheetData
The flow is simple.
flowchart TD
A["Generate empty XLSX with POI"] --> B["Get sheet1.xml"]
B --> C["Split around sheetData"]
C --> D["Copy head"]
D --> E["Stream rows and cells by hand"]
E --> F["Copy tail"]
F --> G["Copy other zip entries from POI output"]
In sheet1.xml, the bulk of row data lives inside <sheetData>.
Per ECMA-376, <sheetData> appears exactly once at a fixed position as a child of <worksheet>, so it’s a clean split point. That’s the original article’s claim.
The POI-generated head and tail XML pass through unchanged.
Only the contents of <sheetData> are replaced.
Cells get appended directly: for a string cell, look up the shared string table index and append something close to <c r="A1" t="s"><v>0</v></c>.
The split is well-chosen.
| Region | Owner | Why |
|---|---|---|
| workbook rels | POI | Compatibility minefield |
| styles.xml | POI | Style ID and reference integrity required |
| theme / drawings | POI | Excel-specific peripheral XML, easy to break |
| sheetData | hand-rolled | Per-row, per-cell iteration dominates with scale |
| sharedStrings.xml | hand-rolled streaming | Independent table that streams on its own |
You’re not writing all of OOXML by hand.
And you’re not staying on POI’s User Model all the way through.
The fragile fixed parts go to POI, the volume-dominant iteration goes hand-rolled, and the boundary follows the XML structure.
sharedStrings and the style table form the boundary
What I most want to verify with this approach is whether dependencies leak past <sheetData>.
String cells reference sharedStrings.xml.
The original implementation caches strings, hands out indices, and streams sharedStrings.xml independently.
Cells get lighter, but data with too many unique strings makes shared string table memory and I/O start to matter.
Styles are even more constrained.
The original article notes that the style table is fixed at scaffold creation time.
Adding new styles after row writing has started would require rewriting styles.xml, which the current implementation doesn’t support.
This constraint is easy to swallow for outputs like these.
| Good fit | Why |
|---|---|
| Standard reports | Styles can be decided up front |
| High-row exports | Cell creation dominates |
| Jackson POJO output | Schema is known up front |
| BI / admin “CSV-replacement XLSX” | Layout matters less than row count |
The opposite case — a report builder that adds styles, merge cells, shapes, and comments mid-stream based on user actions — breaks the boundary.
You end up reconciling references between POI-managed fixed XML and hand-emitted cell XML, which adds work.
Not for small reports
There’s a fixed cost to building the scaffold, so small XLSX files don’t benefit.
For a few hundred rows, POI’s User Model or SXSSFWorkbook straight up is easier to implement and verify.
This approach pays off when there are many cells and many same-shaped rows in a row.
For example: dropping 100K+ row export from a web app admin panel, generating daily batch reports, emitting spreadsheets from Jackson data binding in a Java service.
You also can’t just measure average response time.
The original 150ms-for-100K-rows is strong, but real data shifts with unique string count, column count, style count, ZIP compression, output storage, and how you stream into the HTTP response.
Especially with web downloads, even if XLSX generation gets fast, you’ll bottleneck elsewhere if you’re buffering the response in memory before returning.
The flip side of the read article
I wrote previously about converting Excel to Markdown with MarkItDown.
That was the read direction: turning Excel and PDF into LLM-ready text.
This XLSX streaming write is the opposite: emitting structured data as Excel-compatible files.
It’s also at a different point in the pipeline than Trimming human review of document extraction with confidence scores.
Document extraction asks “should we adopt this extracted value as business data?”
XLSX writing asks “can we emit already-adopted business data in volume without breaking when Excel opens it?”
Both reading and writing get painful if you treat Excel as just a table.
On input, cell appearance, reading order, and extraction provenance are involved.
On output, OOXML peripheral files, references, shared strings, and the style table are involved.
The original hybrid approach isn’t “fully understand the Excel file and optimize all of it”; it’s “leave the fragile parts to POI, only pull out the parts dominated by volume.”
That’s the usable bit.
Before reaching for full hand-written XML when XLSX output feels slow, it’s worth measuring whether <sheetData> is actually the hot path.
POI runs entirely in Java
Apache POI is a pure Java implementation.
No native code, no COM calls.
Since XLSX is ZIP-of-XML OOXML, you can build the whole file with just Java’s ZIP I/O and an XML parser.
It’s not orchestrating a hidden Excel process behind the scenes.
When people imagine “something like ActiveX,” they’re picturing Windows COM/OLE automation.
That’s where you instantiate Excel.Application from VBA or C# and pump cell values into a running Excel process.
Each cell write updates Excel’s internal state; saving makes Excel write the file.
Accurate but slow. Requires Excel on Windows, and standing up an Excel process on a server brings licensing problems.
POI avoids this entirely.
It builds XML DOM and streams directly in Java, following the OOXML spec defined in ECMA-376.
Files can be created without Excel, and it runs on Linux too.
The hybrid approach in this article also stays fully on the Java side: POI builds the scaffold, StringBuilder emits the XML strings.
JavaScript libraries for XLSX exist too — SheetJS and ExcelJS — but they’re separate projects, not POI.
The original scndry/jackson-dataformat-spreadsheet runs on Java 8+ with Apache POI 4.1.1+, on a pure-Java stack.
Things that may or may not break until you try
Since only the inside of <sheetData> is swapped, anything depending on what’s outside <sheetData> straddles the scaffold/stream boundary.
Whether Excel silently fixes it, shows the repair dialog, or fails to open at all varies by feature, and behavior differs across Excel, Google Sheets, and LibreOffice.
Of the items below, merge cells, XML escape, and date cells were verified locally by running scndry’s implementation and inspecting the resulting XLSX XML.
The rest (formulas, hyperlinks, autoFilter, data validation, tables, named ranges, comments) are inferred from the OOXML spec and source reading — not opened in Excel/Sheets/LibreOffice for visual confirmation.
Formula cells
Writing <f>A1+B1</f> in a cell is inside sheetData, so the stream can emit it directly.
But calcChain.xml is a POI-managed file, and the empty-sheet scaffold has no calc chain.
Excel recalculates on open, so results often display, but whether Google Sheets behaves the same is a separate question.
Beyond that, there’s no spec guarantee that formulas containing INDIRECT, OFFSET, or VOLATILE functions evaluate correctly without calcChain.
Hyperlinks
The displayed cell text goes into sheetData, but link target info appears in <hyperlinks> after </sheetData>.
The scaffold is an empty sheet, so this section is empty.
An implementation that copies the tail as-is either drops the URL info, leaving cells with just blue underlines, or does nothing visible at all.
Pre-seeding hyperlinks at scaffold time is possible, but row counts and link targets must be known up front.
AutoFilter
<autoFilter ref="A1:E50000"> comes after </sheetData>.
The final row count isn’t known when the scaffold is created.
You can rewrite ref after all rows are written — either by string-replacing the tail, or by inserting a dummy row count at scaffold time and overwriting later.
Plug in “around 50000 rows” and end up with 60000 rows, and the rows past the filter range overflow.
Excel sometimes auto-expands the range, sometimes doesn’t.
Merge cells and conditional formatting
Both <mergeCells> and <conditionalFormatting> are placed after sheetData.
Both carry reference ranges (ref or sqref), so they need to be aligned with the row positions emitted by the stream.
”Merge only the header row” works fine since it’s decided at scaffold time.
If merging happens dynamically inside data rows, you accumulate merge ranges in the stream and inject the XML into the tail after </sheetData> is written.
ECMA-376 fixes the order of <worksheet> child elements, so the insertion point for <mergeCells> follows that order.
scndry’s implementation already does this with SSMLSheetWriter#mergeScopedColumns accumulating ranges and _appendMergeCellsIntoSuffix writing them into the tail. Tail rewriting isn’t a special trick here; it’s the standard technique.
Data validation (dropdowns)
Dropdown lists and input restrictions specified by <dataValidation> are also outside sheetData.
You can define them in the scaffold, but if the sqref range and the actual row count from streaming don’t match, validation may apply only to some rows or reference cells that don’t exist.
Excel rarely throws errors, but whether the result matches intent isn’t knowable without opening the file.
Table definitions
Structured tables (the “Format as Table” feature) live as separate XML files outside sheetData.
xl/tables/table1.xml holds range, column names, and style info, referenced from sheet-side <tableParts> via relationships.
The empty scaffold has no table definitions, so users tabling the streamed data manually in Excel after the fact is fine.
The problem is when you want to define the table at scaffold time.
Set the table ref attribute to A1:E100 and end up with 200 actual rows.
Excel sometimes auto-expands the table range, sometimes doesn’t.
LibreOffice may ignore the table definition entirely and show plain cell data.
If the table column names disagree with the sheetData header row text, the repair dialog may or may not appear.
Named ranges and print areas
Named ranges declared via <definedNames> in workbook.xml are defined in a separate file from the sheet.
Print areas (_xlnm.Print_Area) and filter ranges (_xlnm._FilterDatabase) are also a kind of named range.
These are fixed at scaffold generation time.
Even when streaming adds more data rows, range definitions in workbook.xml aren’t updated.
With print area A1:E100 and 500 actual rows, only the first 100 rows print.
Users don’t notice until print preview, so it shows up as a “missing data” bug.
No repair dialog, no actual file corruption.
Just “incomplete.”
Comments and notes
Since Excel 365, cell-attached text comes in two flavors: “Comments” (threaded, for shared review) and “Notes” (the classic yellow sticky note).
Legacy comments live in xl/comments1.xml, referenced from sheet-side <legacyDrawing> via a VML file.
Threaded comments are managed in yet another XML.
Both are outside sheetData, and the empty scaffold has none of these files.
Adding comments or notes from the stream means appending XML manually and rewriting relationships.
The VML format is a relic among OOXML specs, with the boundary between “VML Excel accepts” and “VML that breaks Excel” effectively undocumented.
Even POI has recurring issues filed in the VML area.
Cell value XML escape misses
When you build cell XML by hand with StringBuilder, the quietest source of breakage is XML escaping.
Once a string cell value contains & or < and gets written without escaping, the XML itself is broken.
I checked what scndry’s implementation actually does by running various characters through it and inspecting xl/sharedStrings.xml.
| Input | Output |
|---|---|
Acme & Sons | Acme & Sons |
<script>alert(1)</script> | <script>alert(1)</script> |
"Quoted" | "Quoted" |
He said 'hi' | He said 'hi' |
All five are escaped (technically " and ' don’t need escaping inside text nodes, but it errs on the safe side).
I also tried control characters. 0x09 (TAB) / 0x0A (LF) / 0x0D (CR) — the ones XML 1.0 permits — pass through, while the rest are silently dropped.
| Input | Output |
|---|---|
0x09 (TAB) | preserved |
0x0A (LF) | preserved |
0x0D (CR) | preserved |
0x01, 0x07, 0x08, 0x0B, 0x0C, 0x0E, 0x1F | all dropped |
0x7F (DEL) | preserved (XML 1.0 allows it; invisible though) |
The _x0001_ form (OOXML’s _xHHHH_ escape) used by Excel itself is not produced.
If a control character sneaks in via a customer name or comment field read from CSV, no error is raised but the data quietly loses one character.
If you need the same string visible verbatim when the file opens in Excel, you have to add a layer that converts to the _xHHHH_ form before stream writing.
POI’s User Model handles this automatically, but writing directly with StringBuilder puts it on you.
Date cells depend on style
Excel dates are number cells holding days-since-1900-01-01 as a floating point internally.
Whether the cell shows as 2026/05/01 depends on the numFmtId of the applied style.
I ran a Date field through scndry’s implementation. The output looked like this.
<c r="C2" s="0" t="n"><v>25569.375</v></c>
<!-- 1970-01-01 09:00 JST -->
<c r="C4" s="0" t="n"><v>46143.75386275463</v></c>
<!-- 2026-05-01 18:05 JST -->
The cell is written as t="n" (number).
Style s="0" points to cellXfs index 0 in xl/styles.xml, which means numFmtId="0" (General).
Open it as-is and 46143.75... shows up unchanged.
To present it as a date, define a date format (e.g., yyyy/mm/dd) in styles.xml at scaffold creation, and assign that style ID to the cell elements on the stream side.
But the scaffold’s style table is fixed — you can’t add formats after the fact.
If both yyyy/mm/dd and yyyy-mm-dd hh:mm:ss are required, define both at scaffold time.
What’s even more annoying is the 1900 date system bug.
Excel inherited a bug from Lotus 1-2-3 for compatibility, treating 1900-02-29 (which doesn’t exist) as a valid date.
As a result, serial values for 1900-03-01 onward are off by 1 from what they should be.
POI absorbs this internally, but if you compute serial values yourself, you have to match.
The error is only one day off, so tests rarely catch it.