Memory-Efficient Chunk Processing

Decompressed from PBF, regional OpenStreetMap extracts routinely reach 10–50 GB, and the naive approach (read every node, way, and relation into a single DataFrame or in-memory graph before transforming anything) turns a routine ingest into an out-of-memory (OOM) crash. The failure is rarely graceful: a dense urban extract inflates from a few gigabytes on disk to tens of gigabytes of Python objects, the garbage collector starts thrashing over tens of millions of tiny tag dictionaries, the kernel OOM killer reaps the worker mid-write, and you are left with a half-written output file and no checkpoint to resume from.

This page shows how to make memory a fixed budget rather than a function of dataset size: stream the extract through a bounded buffer, normalize each window in place, and flush validated records to columnar storage before the next window is read. Resident memory then stays flat whether you are processing a city or a continent.

Memory-efficient chunk processing relies on a strict decoupling of I/O ingestion from transformation logic. Rather than loading complete node, way, and relation collections into RAM, the pipeline iterates over fixed-size feature windows, applies vectorized operations, and flushes validated records before advancing. The foundational pattern is a generator-style handler that maintains a bounded in-memory buffer, triggers normalization at a predefined threshold, and serializes outputs to a columnar format such as Apache Parquet so downstream stages can scan only the columns and row groups they need.
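The bounded-buffer pattern described above can be sketched without any OSM dependency. This is a minimal illustration rather than the handler used later on this page; `iter_windows` and `window_size` are hypothetical names chosen for the example.

```python
from collections.abc import Iterable, Iterator
from typing import TypeVar

T = TypeVar("T")


def iter_windows(records: Iterable[T], window_size: int) -> Iterator[list[T]]:
    """Yield fixed-size windows from any iterable, holding one window at a time."""
    if window_size < 1:
        raise ValueError("window_size must be at least 1")
    window: list[T] = []
    for record in records:
        window.append(record)
        if len(window) >= window_size:
            yield window
            window = []  # start a fresh list so the yielded one can be freed
    if window:
        yield window  # final partial window


# Seven records in windows of three: two full windows and a partial tail.
sizes = [len(w) for w in iter_windows(range(7), window_size=3)]
print(sizes)  # [3, 3, 1]
```

Because the generator never holds more than `window_size` records, the consumer's peak memory depends on the window size and not on the length of the input.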

Prerequisite concepts

This workflow is the memory-budget stage of the Parsing & Tag Normalization Workflows pipeline, and it assumes two foundations are already understood. First, because the buffer fills with one primitive at a time, the structural rules in the Node-Way-Relation Data Model determine what each buffered row must carry: a node holds a coordinate, while a way holds only the integer IDs of its member nodes, so geometry is a deferred join that the chunk schema must plan for. Second, the streaming unit is dictated by the file format: the PBF File Structure Deep Dive explains that a Blob fileblock is the smallest independently decodable span, which is why libosmium can hand you features incrementally instead of forcing a whole-file read. When throughput rather than memory is your binding constraint, the process-level concurrency in Async PBF Parsing with Pyrosm is the complementary pattern: this page keeps a single worker’s footprint flat, whereas that page fans bounded work across cores.

Specification & format reference

The chunk schema and the buffer ceiling are the two design knobs that decide whether memory stays bounded. The table below lists the format-level facts that constrain both.

| Surface | Value / rule | Why it bounds memory |
| --- | --- | --- |
| PBF `Blob` (uncompressed) | 32 MiB spec ceiling, 16 MiB recommended | libosmium decodes one block at a time, so the reader never holds the whole file resident. |
| `PrimitiveBlock` `granularity` | 100 nanodegrees (default) | Coordinates arrive as integer multiples of the granularity, not floats. At the default granularity each value fits in `int32`, so storing it as an integer halves the column width relative to `float64`. |
| `DenseNodes` encoding | delta-encoded `int64` arrays | Decoded inside libosmium; your buffer receives resolved values, so accumulator state never leaks into Python. |
| Tag dictionary | free-form `string → string` | The dominant memory cost; serialize to one JSON UTF-8 column, not a per-chunk Arrow `Struct`, to avoid schema divergence across windows. |
| Parquet row group | 128 MiB typical target | Sets the natural lower bound on `chunk_size`; smaller chunks waste row-group overhead, larger ones defeat the memory cap. |
| ZSTD compression | levels 1–22 (3 is the library default) | Trades flush CPU for disk footprint; level 3 is the usual streaming sweet spot. |
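To make the coordinate row of the table concrete, here is a small sketch of the fixed-point arithmetic, assuming the default granularity of 100 nanodegrees and zero offsets. The helper names are illustrative and are not part of libosmium or pyosmium.

```python
INT32_MAX = 2**31 - 1
SCALE = 10_000_000  # 1 degree = 10^7 units of 100 nanodegrees


def to_fixed(degrees: float) -> int:
    """Convert decimal degrees to integer units of 100 nanodegrees."""
    return round(degrees * SCALE)


def to_degrees(fixed: int) -> float:
    """Convert integer units of 100 nanodegrees back to decimal degrees."""
    return fixed / SCALE


# The extreme valid longitude still fits in a signed 32-bit integer,
# so an int32 column is half the width of a float64 column.
assert to_fixed(180.0) == 1_800_000_000
assert to_fixed(180.0) <= INT32_MAX
assert to_fixed(-180.0) >= -INT32_MAX - 1

lon = 13.3777041
print(to_fixed(lon), to_degrees(to_fixed(lon)))
```

Integer storage also makes equality comparisons between coordinates exact, which matters when node positions are later joined against way references.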

The decisive schema choice is how tags are stored. OSM tags are an unbounded, contributor-defined key space, so inferring an Arrow Struct per chunk produces a schema that drifts between windows and breaks any later pl.concat/pl.scan_parquet merge. Serializing each tag dictionary to a single JSON string column keeps every chunk schema identical and defers the expensive structural decode to the consumer that actually needs it. Applying the controlled vocabularies from Tag Taxonomy & Key-Value Standards before the JSON is written keeps that deferred decode cheap and predictable.
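The effect of the JSON-column choice can be seen with plain dictionaries. The sketch below is illustrative only: `serialize_tags` is a hypothetical helper, and the `sort_keys` and `separators` arguments are additions of this example (they make output deterministic and compact) rather than something the handler on this page requires.

```python
import json


def serialize_tags(tags: dict[str, str]) -> str:
    """Serialize one tag dictionary to a compact, deterministic JSON string."""
    # sort_keys makes equal dictionaries produce byte-identical strings;
    # ensure_ascii=False keeps non-Latin values (e.g. local names) readable.
    return json.dumps(tags, sort_keys=True, ensure_ascii=False, separators=(",", ":"))


rows = [
    {"id": 1, "tags": serialize_tags({"highway": "primary", "name": "Hauptstraße"})},
    {"id": 2, "tags": serialize_tags({"amenity": "cafe"})},
    {"id": 3, "tags": serialize_tags({})},
]

# Every row has the same column names and value types regardless of its tag keys.
schemas = {tuple((key, type(value).__name__) for key, value in row.items()) for row in rows}
print(schemas)
```

All three rows collapse to a single `(id: int, tags: str)` shape even though their tag keys are disjoint, which is the property that keeps chunk schemas mergeable.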

Step-by-step implementation

The handler caps the buffer at chunk_size records; when the cap is hit it materializes a Polars DataFrame, writes it compressed, and clears for the next window. The bound on the buffer — not the size of the extract — is what fixes peak memory.

1. Subclass the streaming handler. Extend osmium.SimpleHandler so node() and way() callbacks append a flat row to a bounded list rather than a growing global collection.
2. Flatten tags at append time. Serialize each tag dict to JSON immediately so the buffer holds compact strings, not nested Python objects the GC must walk.
3. Flush at the threshold. When the buffer length reaches chunk_size, write a Polars DataFrame to a .parquet.tmp file with ZSTD, then atomically rename it to its final name.
4. Drain the tail. After apply_file returns, call finalize() once to flush the final partial buffer.
5. Record a manifest. Log each chunk’s index and row count so an interrupted run can resume from the last committed chunk instead of restarting.

python

from __future__ import annotations

import json
import logging
from pathlib import Path

import osmium
import polars as pl

logger = logging.getLogger(__name__)


class ChunkedOSMHandler(osmium.SimpleHandler):
    """Stream an OSM extract into bounded Parquet chunks.

    Tags are serialised to JSON strings so Polars writes them as a flat
    UTF-8 column instead of inferring a (potentially divergent) Struct
    schema across chunks. Way node refs are reduced to their integer IDs.

    Usage:
        handler = ChunkedOSMHandler(chunk_size=250_000)
        handler.apply_file("extract.osm.pbf", locations=True, idx="flex_mem")
        handler.finalize()
    """

    def __init__(self, chunk_size: int = 250_000, output_dir: Path = Path("./chunks")):
        super().__init__()  # required by the pyosmium C++ binding
        self.chunk_size = chunk_size
        self.output_dir = output_dir
        self.output_dir.mkdir(parents=True, exist_ok=True)
        self._buffer: list[dict] = []
        self._chunk_idx = 0

    def _flush_buffer(self) -> None:
        if not self._buffer:
            return
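        # Pin the schema explicitly. Inference from the rows alone would type
        # node_refs as Null in an all-node chunk (and lat/lon as Null in an
        # all-way chunk), so chunk schemas would diverge at merge time.
        df = pl.DataFrame(
            self._buffer,
            schema={
                "type": pl.Utf8,
                "id": pl.Int64,
                "lat": pl.Float64,
                "lon": pl.Float64,
                "node_refs": pl.List(pl.Int64),
                "tags": pl.Utf8,
            },
        )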
        tmp_path = self.output_dir / f"osm_chunk_{self._chunk_idx:04d}.parquet.tmp"
        final_path = self.output_dir / f"osm_chunk_{self._chunk_idx:04d}.parquet"
        df.write_parquet(str(tmp_path), compression="zstd")
        tmp_path.rename(final_path)  # atomic move on POSIX file systems
        logger.info("flushed chunk %04d (%d rows)", self._chunk_idx, df.height)
        self._buffer.clear()
        self._chunk_idx += 1

    def _maybe_flush(self) -> None:
        if len(self._buffer) >= self.chunk_size:
            self._flush_buffer()

    def node(self, n: osmium.osm.Node) -> None:
        self._buffer.append({
            "type": "node",
            "id": n.id,
            "lat": n.location.lat if n.location.valid() else None,
            "lon": n.location.lon if n.location.valid() else None,
            "node_refs": None,
            "tags": json.dumps({t.k: t.v for t in n.tags}),
        })
        self._maybe_flush()

    def way(self, w: osmium.osm.Way) -> None:
        self._buffer.append({
            "type": "way",
            "id": w.id,
            "lat": None,
            "lon": None,
            "node_refs": [nr.ref for nr in w.nodes],
            "tags": json.dumps({t.k: t.v for t in w.tags}),
        })
        self._maybe_flush()

    def finalize(self) -> None:
        """Call once after ``apply_file`` returns to drain the final chunk."""
        self._flush_buffer()
        logger.info("finished: %d chunks written", self._chunk_idx)

The atomic .tmp → final rename is not cosmetic: it guarantees that a downstream pl.scan_parquet glob never observes a partially written file, so a crash mid-flush leaves the chunk directory in a consistent state where every visible osm_chunk_NNNN.parquet is complete. This idempotent-write discipline is what makes the whole stage resumable, and it is the same contract the broader Error Handling in Large OSM Extracts workflow depends on when it triages defective records.
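The manifest step can be sketched with the standard library alone. This is one possible shape, not the page's own implementation: the JSON Lines file, the function names, and the idea of resuming by skipping already-committed rows are all assumptions of this example.

```python
import json
from pathlib import Path


def record_chunk(manifest_path: Path, chunk_idx: int, row_count: int) -> None:
    """Append one committed chunk to a JSON Lines manifest."""
    entry = {"chunk_idx": chunk_idx, "row_count": row_count}
    with manifest_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")


def committed_rows(manifest_path: Path, chunk_dir: Path) -> tuple[int, int]:
    """Return (next_chunk_idx, rows_to_skip) for the contiguous committed prefix.

    An entry counts only if its Parquet file is present, so a manifest line
    without a matching file cannot mark a missing chunk as done.
    """
    if not manifest_path.exists():
        return 0, 0
    next_idx, rows = 0, 0
    for line in manifest_path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            break  # a torn final line from a crash ends the trusted prefix
        chunk_file = chunk_dir / f"osm_chunk_{entry['chunk_idx']:04d}.parquet"
        if entry["chunk_idx"] != next_idx or not chunk_file.exists():
            break  # stop at the first gap; everything after it is redone
        next_idx += 1
        rows += entry["row_count"]
    return next_idx, rows
```

A resuming run would call `committed_rows` first, start numbering at the returned chunk index, and skip the returned number of rows from the stream before buffering again. That relies on the source file being read in the same order on every run.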

Tag normalization at the chunk boundary

Raw OSM tags are heterogeneous across contributor communities and regional conventions — highway=primary, highway=Primary, and a stray highway=trunk_link can all coexist in one extract. Normalize each chunk in place, while it is still a small bounded DataFrame, rather than over the whole extract. Precompiled regexes and static lookup dictionaries keep the per-row cost flat, and because the operation is vectorized in Polars it runs once per column instead of once per feature:

python

import polars as pl

_HIGHWAY_CANON = {
    "primary": "arterial", "primary_link": "arterial",
    "secondary": "arterial", "tertiary": "collector",
    "residential": "local", "unclassified": "local",
    "trunk": "trunk", "trunk_link": "trunk", "motorway": "motorway",
}


def normalize_chunk(df: pl.DataFrame) -> pl.DataFrame:
    """Canonicalise the highway class inside a single bounded chunk."""
    return df.with_columns(
        pl.col("tags")
        .str.json_path_match("$.highway")
        .str.to_lowercase()
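        # replace_strict needs Polars >= 1.0, where replace(..., default=...) is
        # deprecated; values missing from the mapping become null.
        .replace_strict(_HIGHWAY_CANON, default=None)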
        .alias("highway_class")
    )

Normalization routines should be stateless and driven by explicit configuration rather than runtime inference, so identical inputs yield identical outputs across execution environments. The full controlled-vocabulary registries live in Batch Attribute Mapping Strategies; applying them per chunk means the regex cleaning pass operates on a bounded memory slice and never materializes the whole extract to harmonize a single key.

Validation & error-handling matrix

| Error condition | Root cause | Detection | Remediation |
| --- | --- | --- | --- |
| `MemoryError` during buffer growth | `chunk_size` too large for available RAM × row width | RSS climbs to the ceiling before first flush | Lower `chunk_size`; estimate `chunk_size × bytes_per_row` against the budget |
| GC thrashing / stalled throughput | Tags held as nested dicts, not JSON strings | CPU high, rows/sec falling, frequent gen-2 collections | Serialize tags at append time; keep the buffer flat |
| Partial / corrupt Parquet read downstream | Consumer read a file mid-write | `scan_parquet` raises on a truncated footer | Use the atomic `.tmp` → rename pattern; never write to the final path directly |
| Schema mismatch on merge | Per-chunk Arrow `Struct` inferred from divergent tags | `pl.concat` raises on column-type mismatch | Store tags as one JSON UTF-8 column so every chunk schema is identical |
| Run aborts, no resume point | No manifest of committed chunks | Re-run reprocesses from chunk 0 | Persist a checkpoint manifest of `(chunk_idx, row_count, source_seq)` |
| Disk fills mid-flush | ZSTD level too low / no spill headroom | `OSError: No space left on device` | Raise ZSTD level, route spill to a dedicated NVMe volume, monitor free space |
| `osmium.InvalidLocationError` on unresolved node ref | Way clipped at extract boundary, or file parsed without `locations=True` | Node ID absent from location store | Pass `locations=True` with a suitable `idx`; quarantine boundary ways |
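The quarantine remediation in the last row can be sketched as a plain partition step. The function below is a hypothetical illustration that assumes the set of known node IDs for the slice being checked fits in memory; at planet scale the lookup would go through the location index instead.

```python
def split_boundary_ways(
    ways: list[dict], known_node_ids: set[int]
) -> tuple[list[dict], list[dict]]:
    """Separate ways whose node refs all resolve from ways that reference missing nodes."""
    resolved, quarantined = [], []
    for way in ways:
        missing = [ref for ref in way["node_refs"] if ref not in known_node_ids]
        if missing:
            # Keep the evidence so the triage stage can inspect or repair the way.
            quarantined.append({**way, "missing_refs": missing})
        else:
            resolved.append(way)
    return resolved, quarantined


ways = [
    {"id": 10, "node_refs": [1, 2, 3]},
    {"id": 11, "node_refs": [3, 4, 99]},  # node 99 lies outside the extract
]
ok, held = split_boundary_ways(ways, known_node_ids={1, 2, 3, 4})
print([w["id"] for w in ok], [(w["id"], w["missing_refs"]) for w in held])
```

Quarantined rows keep their original fields plus the list of unresolved references, so they can be written to a separate file without aborting the stream.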

Performance & scale considerations

Peak resident memory in this design is dominated by exactly one quantity, the in-flight buffer, and is therefore predictable in advance. With a buffer admitting up to chunk_size rows of average serialized width $\bar{w}_{\text{row}}$, plus a transient copy held while Polars materializes the DataFrame for the flush, the working set is bounded by:

$$M_{\text{peak}} \approx 2 \times \text{chunk\_size} \times \bar{w}_{\text{row}}$$

The factor of two accounts for the buffer and its DataFrame copy coexisting briefly during _flush_buffer. This is why chunk_size is the primary memory dial: it is linear and knowable, so you size it from your budget rather than discovering the ceiling by crashing. For typical OSM rows (an int64 id, two coordinate columns, and a JSON tag string averaging a few hundred bytes), a chunk_size of 250,000 keeps the working set well under a gigabyte, leaving headroom for the libosmium location cache.
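The estimate can be turned into a small sizing helper. The sketch below applies the formula as stated; the 400-byte row width and the 512 MiB budget are illustrative figures, not measurements, and the function names are hypothetical. Python's per-object overhead in the list-of-dicts buffer sits on top of the serialized width, so a measured row width should replace the assumed one.

```python
def peak_bytes(chunk_size: int, avg_row_bytes: float) -> float:
    """Estimate the peak working set: the buffer plus its transient DataFrame copy."""
    return 2 * chunk_size * avg_row_bytes


def chunk_size_for_budget(budget_bytes: int, widest_row_bytes: float) -> int:
    """Largest chunk_size whose estimated peak stays within the budget."""
    if widest_row_bytes <= 0:
        raise ValueError("widest_row_bytes must be positive")
    return int(budget_bytes // (2 * widest_row_bytes))


# Illustrative figure only: assume 400 bytes per row.
print(peak_bytes(250_000, 400) / 1e6)             # 200.0 (megabytes)
print(chunk_size_for_budget(512 * 1024**2, 400))  # 671088 (rows)
```

Inverting the formula this way lets the budget drive `chunk_size` directly instead of tuning it by trial and error.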

The location cache itself is the other scaling constraint. Resolving way geometry requires every node reference to map to a coordinate, and a planet-scale node index may not fit in RAM. The default idx="flex_mem" keeps the location store entirely in memory and only adapts its layout between sparse and dense, so it does not spill to disk; for planet extracts use a disk-backed index such as idx="dense_file_array,locations.cache". When even a single bounded stream is too slow, scale horizontally rather than vertically: split the source with osmium extract by administrative boundary using the tiling strategies in Spatial Indexing for OSM Extracts, run one ChunkedOSMHandler per tile, and merge the chunk directories with a lazy pl.scan_parquet("**/osm_chunk_*.parquet") so no stage ever materializes all tiles at once.
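The tiling step can be planned before anything is executed. The sketch below only builds the `osmium extract` argument lists (using the tool's `-p` polygon and `-o` output options) and pairs each with a chunk directory; the file names, directory layout, and helper names are assumptions of this example.

```python
from pathlib import Path


def extract_command(source: Path, boundary: Path, tile: Path) -> list[str]:
    """Build the argument list for clipping `source` to one boundary polygon."""
    return ["osmium", "extract", "-p", str(boundary), str(source), "-o", str(tile)]


def tile_jobs(source: Path, boundaries: list[Path], out_dir: Path) -> list[tuple[list[str], Path]]:
    """Pair each extract command with the chunk directory its handler would write to."""
    jobs = []
    for boundary in boundaries:
        tile = out_dir / f"{boundary.stem}.osm.pbf"
        chunk_dir = out_dir / boundary.stem / "chunks"
        jobs.append((extract_command(source, boundary, tile), chunk_dir))
    return jobs


jobs = tile_jobs(
    Path("region.osm.pbf"),
    [Path("boundaries/north.geojson"), Path("boundaries/south.geojson")],
    Path("tiles"),
)
for command, chunk_dir in jobs:
    print(" ".join(command), "->", chunk_dir)
```

Each command could then be run with `subprocess.run(command, check=True)`, followed by a handler constructed with `output_dir=chunk_dir`, so that every tile writes to its own directory and the final merge is a single glob over them.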

Failure modes & gotchas

Inferring an Arrow Struct for tags is the silent killer. It works on a single chunk and breaks only at merge time when a later chunk introduces a key the first did not. Always store tags as one JSON UTF-8 column; decode structure downstream where the consumer controls the schema.
Writing straight to the final Parquet path invites torn reads. A consumer globbing the directory can pick up a file with no footer. The .tmp → rename is atomic on POSIX and is the cheapest possible safety guarantee — never skip it.
finalize() is mandatory. The last buffer is almost never exactly chunk_size rows, so without the explicit final flush you silently drop the tail of the extract.
super().__init__() is not optional. The pyosmium C++ binding requires it; omitting it fails inside the binding layer, with an error (or, depending on the version, a crash) that does not point at the missing call.
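Forgetting locations=True breaks any code that reads way-node coordinates. The handler above reads only nr.ref, which is always populated, but a way node's location is resolved through the location cache; without it, accessing the coordinate raises osmium.InvalidLocationError. Set locations=True (and an appropriate idx) at apply_file time.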
chunk_size measured in rows, not bytes, drifts with tag density. A buffer tuned on sparse rural data will overshoot on a dense city where tag strings are long; size against the widest expected row, not the average.
Spill-to-disk on a slow volume becomes the new bottleneck. If flushes back up, your disk — not your CPU — is the limit; route the chunk directory to NVMe before widening any parallelism.
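One way to guard against the row-count drift described above is to close a batch on whichever limit is hit first, rows or bytes. This is a hedged sketch, not part of the handler on this page: `iter_byte_bounded` is a hypothetical helper, and it approximates row width by the length of the JSON tag string alone.

```python
from collections.abc import Iterable, Iterator


def iter_byte_bounded(rows: Iterable[dict], max_bytes: int, max_rows: int) -> Iterator[list[dict]]:
    """Yield batches that close when either the row cap or the byte estimate is reached."""
    batch: list[dict] = []
    batch_bytes = 0
    for row in rows:
        batch.append(row)
        # Rough estimate: the JSON tag string dominates row width in this schema.
        batch_bytes += len(row["tags"].encode("utf-8"))
        if len(batch) >= max_rows or batch_bytes >= max_bytes:
            yield batch
            batch, batch_bytes = [], 0
    if batch:
        yield batch


sparse = [{"id": i, "tags": "{}"} for i in range(4)]
dense = [{"id": i, "tags": '{"name":"' + "x" * 90 + '"}'} for i in range(4)]
print([len(b) for b in iter_byte_bounded(sparse, max_bytes=200, max_rows=3)])  # [3, 1]
print([len(b) for b in iter_byte_bounded(dense, max_bytes=200, max_rows=3)])   # [2, 2]
```

Sparse rows fill batches up to the row cap, while dense rows trip the byte limit earlier, so the same settings stay within budget on both rural and urban data.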

Integration points

The chunk directory this stage produces is a lazy, columnar staging area that the next pipeline stage scans without re-reading the source extract. Network topology construction is the most common consumer: the OSMnx Graph Conversion Techniques stage rebuilds routable graphs from these chunks, resolving way node_refs against node coordinates only for the rows a query touches. The wiring below merges the chunks lazily, filters to ways that carry a highway tag, and hands the result to the graph stage:

python

import logging

import polars as pl

logger = logging.getLogger(__name__)


def stage_for_graph(chunk_dir: str = "./chunks") -> pl.DataFrame:
    """Lazily merge chunk Parquet files and select normalized ways."""
    lf = pl.scan_parquet(f"{chunk_dir}/osm_chunk_*.parquet")
    ways = (
        lf.filter(pl.col("type") == "way")
        .filter(pl.col("tags").str.json_path_match("$.highway").is_not_null())
        .select(["id", "node_refs", "tags"])
    )
    # The streaming engine keeps the merge bounded. Recent Polars releases
    # deprecate this flag in favour of collect(engine="streaming").
    df = ways.collect(streaming=True)
    logger.info("staged %d ways for graph conversion", df.height)
    return df

Because collect(streaming=True) runs Polars’ out-of-core engine, even the merge step honours the memory budget rather than materializing every chunk. Before geometry is rebuilt and weighted for routing, the stored WGS 84 coordinates (floating-point degrees as written by the handler above, or fixed-point integers if you adopt an integer coordinate schema) should be reprojected following Coordinate Reference Systems in OSM so distance and area measurements are correct in the analytical CRS.

Frequently Asked Questions

How do I choose the right chunk_size?

Size it from your memory budget, not by trial and error. Peak memory is roughly 2 × chunk_size × average_row_bytes, where the factor of two covers the buffer and its DataFrame copy during a flush. For typical OSM rows with a few-hundred-byte JSON tag string, 250,000 rows stays under a gigabyte. Tune against the widest expected row (dense urban tags), not the average, so a city extract does not overshoot a limit calibrated on rural data.

Why serialize tags to JSON instead of keeping them as a nested column?

OSM tags are an unbounded, contributor-defined key space. If you let Polars or Arrow infer a Struct per chunk, the schema drifts between windows and any later concat or scan_parquet merge fails on a type mismatch. A single JSON UTF-8 column keeps every chunk schema identical and defers structural decoding to the consumer that actually needs it — which is also cheaper for the garbage collector during buffering.

What makes the chunk writes safe to resume after a crash?

Two things: atomic writes and a manifest. Each chunk is written to a .parquet.tmp file and only renamed to its final name once complete, so a downstream glob never sees a torn file. Logging each committed (chunk_idx, row_count) lets an interrupted run skip already-written chunks and resume from the last good one instead of restarting from zero.

When should I switch from a single stream to parallel tiles?

When a single bounded stream is throughput-limited rather than memory-limited. Split the source with osmium extract by administrative boundary, run one ChunkedOSMHandler per tile into its own directory, and merge with a lazy pl.scan_parquet. If memory is still the binding constraint within each worker, keep tiles small; if you simply need to overlap I/O and CPU, the process-pool approach in Async PBF Parsing with Pyrosm is the better fit.

Why do my way rows raise an error on node coordinates?

Most likely you parsed without a location cache. Way node IDs (nr.ref) are always available, but coordinates are resolved against libosmium’s node-location store, so call apply_file(path, locations=True, idx="flex_mem") (or a disk-backed index for planet extracts). Ways clipped at the extract boundary still reference nodes outside the file; pyosmium reports those as invalid locations (osmium.InvalidLocationError on access) rather than as a KeyError, so check location.valid() and quarantine such ways instead of letting the exception abort the stream.

In this section

The focused guides below drill into the two levers that keep a stream inside its budget:

A Bounded LRU Node Cache for OSM Streaming: capping the node-location store so way reconstruction stays within a fixed memory ceiling.
Sizing PBF Chunk Batches to a Memory Budget: deriving a safe batch size from element width and the RAM you can spend.

Related

Async PBF Parsing with Pyrosm: process-pool concurrency for when throughput, not memory, is the binding constraint.
Batch Attribute Mapping Strategies: the controlled vocabularies each chunk normalizes against.
Error Handling in Large OSM Extracts: triaging the boundary and malformed records this stage quarantines.
OSMnx Graph Conversion Techniques: turning the staged chunk directory into routing-ready graphs.
PBF File Structure Deep Dive: why a block, not a byte offset, is the unit libosmium streams.
Spatial Indexing for OSM Extracts: boundary tiling that lets you scale a single stream horizontally.

This guide is part of Parsing & Tag Normalization Workflows; return to that overview to follow the data through async ingestion, attribute mapping, error triage, and routing-graph conversion.
