Async PBF Parsing with Pyrosm

OpenStreetMap PBF extracts routinely run to several gigabytes, which makes a single synchronous parse the bottleneck of a production spatial ETL pipeline. When one thread reads a 4 GB country extract end to end before a single feature is normalized, the CPU sits idle during disk reads and the disk sits idle during tag validation. The two costliest stages never overlap, and a planet-scale ingest that should finish overnight runs for days. This page shows how to keep both resources busy by running Pyrosm’s blocking reader in a ProcessPoolExecutor and streaming results through a bounded asyncio queue, so parsing and downstream transformation proceed concurrently without exhausting memory.

Pyrosm is a Cython-backed library that reads PBF files into GeoDataFrames; it ships its own protobuf-based parser rather than wrapping libosmium. Its native API is synchronous and reads the full file in a single pass. It does not support seeking to arbitrary byte offsets mid-stream, so you cannot naively shard one file across threads. The pattern below instead achieves concurrency at file granularity: each worker process owns one regional tile, and asyncio orchestrates the handoff so that parsing, normalization, and writing overlap under strict back-pressure.

Prerequisite concepts

This workflow sits in the ingestion stage of the Parsing & Tag Normalization Workflows pipeline, and it assumes three foundations are already in place. First, you should understand how PBF lays bytes on disk — the PBF File Structure Deep Dive explains why a Blob boundary, not a byte offset, is the only safe place to split a file, which is precisely why tile-level (not offset-level) parallelism is the workable model here. Second, because Pyrosm hands you ways with geometry already reconstructed, the reference-resolution rules in the Node-Way-Relation Data Model determine which tiling strategy keeps way members intact at tile boundaries. Third, the canonicalization targets you apply in Value Standardization & Regex Cleaning define the schema each worker must emit, so the normalization step in this page stays consistent with the rest of the pipeline.

Specification & API reference

Pyrosm exposes a small, mostly synchronous surface. The fields and limits that matter for a concurrent design are summarized below.

| Surface | Signature / value | Concurrency relevance |
| --- | --- | --- |
| `OSM(filepath, bounding_box=None)` | constructor | Bounding box clips at read time; cheaper than a separate `osmium extract` only for small clips. |
| `OSM.get_network(network_type=...)` | returns `GeoDataFrame` | Blocking and single-pass; materializes the full result before returning. Do not rely on the GIL being released during the parse. |
| `OSM.get_pois`, `get_buildings`, `get_landuse` | returns `GeoDataFrame` | One full pass per call; call once per worker, not once per feature class, to avoid re-reading the file. |
| `OSM.get_network(..., nodes=True)` | flag on `get_network`; returns a `(nodes, edges)` pair of GeoDataFrames | Required if you must re-attach point geometry downstream; roughly doubles peak worker memory. |
| PBF `Blob` max size | 32 MiB uncompressed (spec ceiling) | Sets the lower bound on how finely a reader can stream; you cannot split below a block. |
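
To make the block-granularity limit concrete, the sketch below walks the Blob boundaries of a PBF stream without decompressing anything. It is not part of Pyrosm; `iter_blob_spans` and `_read_varint` are hypothetical helper names, and the code assumes the standard PBF framing (a 4-byte big-endian BlobHeader length, then a BlobHeader carrying `type` in field 1 and `datasize` in field 3, then the Blob payload).

```python
import struct
from typing import BinaryIO, Iterator


def _read_varint(buf: bytes, pos: int) -> tuple[int, int]:
    """Decode one protobuf varint starting at pos; return (value, next_pos)."""
    value, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def iter_blob_spans(fh: BinaryIO) -> Iterator[tuple[str, int, int]]:
    """Yield (blob_type, data_offset, data_size) for each Blob in a PBF stream."""
    while True:
        prefix = fh.read(4)
        if len(prefix) < 4:
            return  # clean end of file
        (header_len,) = struct.unpack(">I", prefix)  # network byte order
        header = fh.read(header_len)
        blob_type, data_size, pos = "", 0, 0
        while pos < len(header):
            key, pos = _read_varint(header, pos)
            field, wire = key >> 3, key & 0x7
            if wire == 2:  # length-delimited: type (1) or indexdata (2)
                length, pos = _read_varint(header, pos)
                if field == 1:
                    blob_type = header[pos:pos + length].decode("ascii")
                pos += length
            elif wire == 0:  # varint: datasize (3)
                value, pos = _read_varint(header, pos)
                if field == 3:
                    data_size = value
            else:
                raise ValueError(f"unexpected wire type {wire}")
        yield blob_type, fh.tell(), data_size
        fh.seek(data_size, 1)  # skip the Blob payload without decoding it
```

The offsets it yields are the only positions at which a file could be cut safely, and even then a data block may reference nodes stored in an earlier block, which is why this page parallelizes over pre-cut tiles instead.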

The decisive constraint is that Pyrosm materializes one GeoDataFrame per feature type. Serializing those frames back across the ProcessPoolExecutor IPC boundary is expensive (pickle + copy), so each worker should convert its result to a compact pyarrow.Table and drop heavyweight geometry before returning. This keeps the inter-process payload small while preserving the tag columns the next stage needs.

Step-by-step implementation

The architecture is a bounded queue of futures: a producer submits one parse task per tile, and the consumer drains the results in submission order. The bound on the queue, not the worker count alone, is what caps peak memory.

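Tile the extract. Split the source by bounding box with osmium extract --strategy=smart so every tile carries the way members needed for geometry reconstruction at its edges. The default complete_ways strategy also keeps boundary-spanning ways intact (smart additionally completes multipolygon relations); fall back to simple only if you do not need boundary-spanning ways.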
Define an isolated worker. Each process opens its own OSM instance — no shared state, no lock contention — parses one tile, and returns an Arrow table with geometry dropped.
Run a producer task. Submit tiles to the pool and push the returned Future objects onto a bounded asyncio.Queue; await queue.put(...) blocks once the queue is full, applying back-pressure on submission.
Drain in the consumer. await queue.get(), resolve each future off the event loop with asyncio.to_thread(future.result), and yield non-empty tables downstream.
Terminate cleanly. A None sentinel signals end-of-stream so the consumer’s while loop exits and the pool context manager joins all workers.
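
The tiling step can be sketched as a small command builder. This is an illustration under assumptions, not the pipeline's actual tiler: it cuts a regular lon/lat grid as a stand-in for whatever tiling scheme you use, `grid_bboxes` and `osmium_extract_cmd` are hypothetical names, and the commands are only built, not executed.

```python
from pathlib import Path

BBox = tuple[float, float, float, float]  # left, bottom, right, top


def grid_bboxes(extent: BBox, cols: int, rows: int) -> list[BBox]:
    """Split an extent into cols x rows equal bounding boxes, row by row."""
    left, bottom, right, top = extent
    dx, dy = (right - left) / cols, (top - bottom) / rows
    return [
        (left + c * dx, bottom + r * dy, left + (c + 1) * dx, bottom + (r + 1) * dy)
        for r in range(rows)
        for c in range(cols)
    ]


def osmium_extract_cmd(
    src: Path, dst: Path, bbox: BBox, strategy: str = "smart"
) -> list[str]:
    """Build (but do not run) one `osmium extract` invocation for a tile."""
    bbox_arg = ",".join(f"{v:.6f}" for v in bbox)
    return [
        "osmium", "extract",
        f"--bbox={bbox_arg}",
        f"--strategy={strategy}",
        f"--output={dst}",
        "--overwrite",
        str(src),
    ]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`. Running one command per tile re-reads the source for every tile, so for many tiles an osmium config file that lists all extracts is likely the better fit.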

```python
import asyncio
import logging
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path
from typing import AsyncIterator

import pyarrow as pa
from pyrosm import OSM

MAX_WORKERS = 4
QUEUE_MAXSIZE = 8

logger = logging.getLogger(__name__)


def _parse_tile_worker(tile_path: str) -> pa.Table:
    """Isolated worker: parse one PBF tile and return an Arrow Table.

    Each worker process owns its own OSM instance and GeoDataFrame, so
    there is no shared-memory contention. Tags are preserved as columns.
    """
    try:
        reader = OSM(tile_path)
        gdf = reader.get_network(network_type="driving")
        if gdf is None or gdf.empty:
            return pa.table({})
        # Drop geometry to keep the IPC payload small. Once dropped it cannot
        # be recovered downstream; to keep coordinates, add a WKB column
        # (for example gdf.geometry.to_wkb()) before this line.
        df = gdf.drop(columns="geometry")
        return pa.Table.from_pandas(df, preserve_index=False)
    except Exception as e:
        logger.error("Worker failed for tile %s: %s", tile_path, e)
        return pa.table({})


async def async_tile_stream(
    tile_paths: list[Path],
) -> AsyncIterator[pa.Table]:
    """Yield one Arrow table per non-empty tile; at most QUEUE_MAXSIZE futures queue up."""
    queue: asyncio.Queue = asyncio.Queue(maxsize=QUEUE_MAXSIZE)

    async def producer(executor: ProcessPoolExecutor) -> None:
        try:
            for path in tile_paths:
                future = executor.submit(_parse_tile_worker, str(path))
                await queue.put(future)  # suspends while the queue is full
        except Exception as e:  # e.g. BrokenProcessPool raised by submit()
            logger.error("Producer stopped early: %s", e)
        await queue.put(None)  # sentinel: end of stream

    # The pool wraps the consumer loop. By the time the sentinel arrives every
    # future has been resolved, so the context-manager exit joins idle workers
    # instead of blocking the event loop on pending parses.
    with ProcessPoolExecutor(max_workers=MAX_WORKERS) as executor:
        # Keep a reference so the task cannot be garbage-collected mid-stream.
        producer_task = asyncio.create_task(producer(executor))
        try:
            while True:
                future = await queue.get()
                if future is None:
                    break
                try:
                    # future.result() blocks, so resolve it off the event loop.
                    result = await asyncio.to_thread(future.result)
                except Exception as e:
                    logger.warning("Tile processing failed: %s", e)
                    continue
                if result.num_rows:
                    yield result
        finally:
            producer_task.cancel()  # no-op if the producer already finished
```
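
The back-pressure claim is easy to check in isolation. The sketch below reproduces the same producer/consumer shape with standard-library pieces only: a `ThreadPoolExecutor` stands in for the process pool and `fake_parse` is a hypothetical placeholder for a tile parse, so it runs without Pyrosm or any PBF data. It records the deepest the queue ever gets.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def fake_parse(tile_id: int) -> int:
    """Hypothetical stand-in for a tile parse: sleep briefly, return the id."""
    time.sleep(0.01)
    return tile_id


async def bounded_stream(n_tiles: int, queue_maxsize: int) -> tuple[list[int], int]:
    """Return (results in submission order, deepest queue size observed)."""
    queue: asyncio.Queue = asyncio.Queue(maxsize=queue_maxsize)
    executor = ThreadPoolExecutor(max_workers=2)
    max_depth = 0

    async def producer() -> None:
        nonlocal max_depth
        for tile_id in range(n_tiles):
            # put() suspends once queue_maxsize futures are waiting.
            await queue.put(executor.submit(fake_parse, tile_id))
            max_depth = max(max_depth, queue.qsize())
        await queue.put(None)  # sentinel

    task = asyncio.create_task(producer())  # keep a reference to the task
    results: list[int] = []
    while (future := await queue.get()) is not None:
        results.append(await asyncio.to_thread(future.result))
    await task
    executor.shutdown(wait=False)
    return results, max_depth
```

Running `asyncio.run(bounded_stream(20, 3))` should never report a depth above 3, whatever the number of tiles, which is the property that keeps memory flat in the real pipeline.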

Tag normalization on the Arrow table

Raw OSM tags exhibit inconsistent casing, localized abbreviations, and deprecated keys. Normalize on the Arrow table before yielding it downstream — this is the cheapest point because the table is already in columnar memory and has not yet been re-serialized. The mapping here is a minimal example of the controlled vocabularies defined in full by Batch Attribute Mapping Strategies.

```python
import pyarrow as pa


def normalize_highway_column(table: pa.Table) -> pa.Table:
    """Map raw highway values to canonical routing classes.

    Values missing from HIGHWAY_MAP, and nulls, become null.
    """
    HIGHWAY_MAP = {
        "motorway": "motorway", "trunk": "trunk",
        "primary": "arterial", "secondary": "arterial",
        "tertiary": "collector", "residential": "local",
        "unclassified": "local", "service": "access",
    }
    if "highway" not in table.schema.names:
        return table
    # Arrow has no single-call dictionary remap, so map in Python. Using
    # dict.get keeps unmapped values as None, which Arrow stores as null
    # (a pandas .map round-trip would produce NaN floats that pa.array
    # rejects for a string column).
    raw = table.column("highway").to_pylist()
    new_col = pa.array([HIGHWAY_MAP.get(v) for v in raw], type=pa.string())
    idx = table.schema.get_field_index("highway")
    return table.set_column(idx, "highway", new_col)
```

When a worker encounters malformed geometries or missing mandatory attributes, it should log the failure, quarantine the record to a dead-letter Parquet partition, and continue — never raise into the event loop. That contract is shared with Error Handling in Large OSM Extracts, which triages the quarantined records this stage produces and decouples schema enforcement from raw ingestion.
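
A minimal sketch of that log-quarantine-continue contract, working on plain dictionaries so it stays independent of any particular table library. `split_quarantine` and `MANDATORY_KEYS` are hypothetical names, and the mandatory keys shown are an assumption to replace with your own schema.

```python
import logging
from typing import Any, Iterable

logger = logging.getLogger(__name__)

MANDATORY_KEYS = ("id", "highway")  # assumption: adjust to your schema


def split_quarantine(
    records: Iterable[dict[str, Any]],
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Separate valid records from ones bound for the dead-letter store."""
    valid: list[dict[str, Any]] = []
    dead: list[dict[str, Any]] = []
    for rec in records:
        missing = [k for k in MANDATORY_KEYS if rec.get(k) in (None, "")]
        if missing:
            # Log and continue: a bad record never raises out of the worker.
            logger.warning("quarantined record %r: missing %s", rec.get("id"), missing)
            dead.append({**rec, "_error": f"missing: {','.join(missing)}"})
        else:
            valid.append(rec)
    return valid, dead
```

The `dead` list keeps the original fields plus an `_error` reason, so it can be written to the dead-letter Parquet partition as-is (for instance via `pa.Table.from_pylist` and `pyarrow.parquet.write_table`) while the valid records continue downstream.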

Validation & error-handling matrix

| Error condition | Root cause | Detection | Remediation |
| --- | --- | --- | --- |
| `BrokenProcessPool` | Worker process died, for example a crash in Pyrosm's compiled parser on a corrupt blob | `future.result()` raises in consumer | Recreate the pool, isolate and re-tile the offending file, skip on second failure |
| Silent empty result | Tile has no features of the requested `network_type` | `result.num_rows == 0` guard | Drop quietly; log only if every tile in a batch is empty (mis-set filter) |
| `MemoryError` in worker | `nodes=True` on a dense urban tile | OOM-killed worker → `BrokenProcessPool` | Reduce tile size (raise H3 resolution), lower `MAX_WORKERS` |
| Schema drift between tiles | Optional tag columns absent in sparse tiles | `pa.concat_tables` raises on schema mismatch | `promote_options="permissive"` or unify schema before concat |
| Event loop stalls | `future.result()` called directly (blocking) | Consumer throughput drops to one tile at a time | Wrap with `asyncio.to_thread(future.result)` |
| Producer outruns consumer | `QUEUE_MAXSIZE` too large; memory climbs | RSS grows linearly with tiles read | Lower `QUEUE_MAXSIZE`; the bounded `put` then throttles submission |
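
The `BrokenProcessPool` row can be sketched as follows. This is a deliberately sequential illustration of the recreate-and-skip policy, not a drop-in replacement for the streaming consumer: `parse_with_pool_recovery` is a hypothetical helper, it retries the same tile rather than re-tiling it, and it resolves each future before submitting the next one.

```python
from concurrent.futures import Executor, ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def parse_with_pool_recovery(
    tiles: Iterable[str],
    worker: Callable[[str], T],
    make_pool: Callable[[], Executor] = lambda: ProcessPoolExecutor(max_workers=4),
    max_attempts: int = 2,
) -> tuple[dict[str, T], list[str]]:
    """Parse tiles one at a time, replacing the pool whenever it breaks."""
    results: dict[str, T] = {}
    skipped: list[str] = []
    pool = make_pool()
    try:
        for tile in tiles:
            for _ in range(max_attempts):
                try:
                    results[tile] = pool.submit(worker, tile).result()
                    break
                except BrokenProcessPool:
                    # A broken pool rejects all further submissions: replace it.
                    pool.shutdown(wait=False)
                    pool = make_pool()
            else:
                skipped.append(tile)  # failed max_attempts times: give up
    finally:
        pool.shutdown(wait=False)
    return results, skipped
```

The `skipped` list is what you would hand to the re-tiling or dead-letter path; everything else keeps flowing after the pool is rebuilt.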

Performance & scale considerations

Peak resident memory is dominated not by the number of CPU cores but by how many parsed tables can exist at once. With w worker processes each holding one in-flight GeoDataFrame and a queue admitting up to q completed Arrow tables, the working set is bounded by:

$$M_{\text{peak}} \approx (w + q) \times \bar{s}_{\text{table}}$$

where $\bar{s}_{\text{table}}$ is the average serialized table size. This is why the queue bound (QUEUE_MAXSIZE) is the primary memory dial: doubling MAX_WORKERS buys throughput up to your core count, but doubling the queue depth only buys latency smoothing at a linear memory cost. For a typical European country extract tiled at H3 resolution 5, tables average tens of megabytes, so MAX_WORKERS = 4, QUEUE_MAXSIZE = 8 keeps the working set comfortably under a few gigabytes.
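
As a quick sanity check of that bound, the arithmetic can be wrapped in two small helpers. Both function names are hypothetical, and the 50 MB average table size used in the usage note is an assumed figure standing in for "tens of megabytes", not a measurement.

```python
def peak_working_set_mb(workers: int, queue_size: int, avg_table_mb: float) -> float:
    """Upper-bound estimate: (w + q) tables of average size in memory at once."""
    return (workers + queue_size) * avg_table_mb


def max_queue_for_budget(budget_mb: float, workers: int, avg_table_mb: float) -> int:
    """Largest queue depth that keeps the estimate within budget_mb.

    Never returns less than 1: asyncio.Queue(maxsize=0) means *unbounded*,
    which would remove the back-pressure entirely.
    """
    return max(1, int(budget_mb // avg_table_mb) - workers)
```

With `MAX_WORKERS = 4`, `QUEUE_MAXSIZE = 8` and the assumed 50 MB tables, `peak_working_set_mb(4, 8, 50)` gives 600 MB, and a 2 GiB budget would allow a queue depth of up to `max_queue_for_budget(2048, 4, 50)`, which is 36. Treat both as estimates: an in-flight GeoDataFrame inside a worker is usually larger than the Arrow table it returns.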

Tile granularity controls the parallelism-to-overhead ratio. H3 resolution 5 (average cell area ~252 km²) is a reasonable default for European country-scale extracts; drop to resolution 4 for continental processing where per-tile fixed costs would otherwise dominate. Tiles that are too small spend more time in process spin-up and IPC than in parsing; tiles that are too large defeat the back-pressure budget and risk OOM. When memory rather than throughput is the binding constraint, prefer the streaming generators in Memory-Efficient Chunk Processing over widening the pool.

Failure modes & gotchas

asyncio.create_task without a reference can be garbage-collected. Hold the producer task (task = asyncio.create_task(producer())) so it is not silently dropped mid-stream; otherwise the queue never receives its sentinel and the consumer hangs.
Calling future.result() directly blocks the event loop. It looks harmless because the future is “probably done,” but a slow tile freezes every other coroutine. Always offload with asyncio.to_thread.
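ProcessPoolExecutor may re-import the module in each worker. Under the spawn start method (the default on macOS and Windows), heavy module-level side effects (opening files, configuring logging handlers) run once per process; keep worker imports lazy and idempotent, and guard the entry point with if __name__ == "__main__":.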
Geometry dropped at the worker is gone. If a downstream stage needs coordinates, serialize a WKB column inside the worker rather than re-running the parse — re-parsing a tile is far costlier than carrying bytes.
Schema-permissive concat hides real drift. promote_options="permissive" will paper over a tile that is genuinely missing a mandatory column; assert the expected schema explicitly when correctness matters more than completion.
Bounding-box clipping at read time is not free. OSM(..., bounding_box=...) still scans the whole file; for repeated runs, pre-tile once with osmium extract and cache the tiles.
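
For the schema-drift item above, an explicit check before concatenation might look like the sketch below. The helper names are hypothetical, and the functions take plain column names so they work with `table.schema.names` from any Arrow table.

```python
from typing import Iterable


def missing_mandatory(schema_names: Iterable[str], mandatory: Iterable[str]) -> list[str]:
    """Return the mandatory columns that are absent, in declaration order."""
    present = set(schema_names)
    return [name for name in mandatory if name not in present]


def assert_mandatory(
    schema_names: Iterable[str], mandatory: Iterable[str], tile: str = ""
) -> None:
    """Raise before a permissive concat can hide a missing required column."""
    missing = missing_mandatory(schema_names, mandatory)
    if missing:
        raise ValueError(f"tile {tile or '<unknown>'} is missing mandatory columns: {missing}")
```

Calling `assert_mandatory(table.schema.names, ["id", "highway"], tile=name)` on each table before `pa.concat_tables` keeps permissive promotion for the optional columns only; the column list here is an assumption to replace with your own contract.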

Integration points

Once normalized, the streaming Arrow tables flow into network analysis. Feeding OSMnx Graph Conversion Techniques requires reconstructing GeoDataFrames from the Arrow tables — re-attaching geometry from a WKB column, or re-running the parse with nodes=True — before passing them to ox.graph_from_gdfs. The wiring below consumes the async stream and hands each normalized table to the next stage:

```python
import logging
from pathlib import Path

import pyarrow as pa

logger = logging.getLogger(__name__)

# Assumes async_tile_stream and normalize_highway_column from the sections
# above are defined in (or imported into) this module.


async def run_ingest(tile_paths: list[Path]) -> pa.Table:
    """Drive the async stream, normalize, and accumulate for graph conversion."""
    batches: list[pa.Table] = []
    async for table in async_tile_stream(tile_paths):
        normalized = normalize_highway_column(table)
        batches.append(normalized)
        logger.info("ingested tile: %d rows", normalized.num_rows)
    if not batches:
        return pa.table({})
    # Permissive promotion tolerates sparse tiles missing optional columns
    # (promote_options needs a recent pyarrow release).
    return pa.concat_tables(batches, promote_options="permissive")
```

The combined table is then ready for projection to a working CRS following Coordinate Reference Systems in OSM before geometry is rebuilt and weighted for routing.

Frequently Asked Questions

Why not just use threads instead of processes for parsing?

Threads help only where the parser releases the GIL, which Pyrosm does not guarantee, and the pandas/GeoDataFrame construction afterward is largely GIL-bound. Processes give you true parallel CPU for both the parse and the per-tile DataFrame build, at the cost of IPC serialization, which is why each worker returns a compact Arrow table with geometry dropped.

Can Pyrosm read a single huge file in parallel without tiling?

No. The native API reads a file in one synchronous pass and cannot seek to an arbitrary byte offset mid-stream, and the smallest safe split point is a PBF Blob boundary, not a byte. Parallelism is therefore achieved at the file granularity — pre-tile the extract with osmium extract and assign one tile per worker.

How do I stop the producer from exhausting memory?

Bound the queue. await queue.put(future) blocks once QUEUE_MAXSIZE futures are pending, so the producer cannot submit faster than the consumer drains. Peak memory scales with (MAX_WORKERS + QUEUE_MAXSIZE) × average table size, so tune the queue depth first when RSS climbs.

My consumer processes tiles one at a time — what went wrong?

You almost certainly called future.result() directly, which blocks the event loop until that tile finishes and serializes the whole stream. Resolve futures off the loop with await asyncio.to_thread(future.result) so other coroutines keep running.
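
The difference is easy to observe with a heartbeat coroutine. In the sketch below, which uses only the standard library, `time.sleep` on a thread pool stands in for a slow tile and `count_heartbeats` is a hypothetical helper that counts how often the event loop gets to run while the future resolves.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


async def count_heartbeats(offload: bool) -> int:
    """Count event-loop ticks that occur while one slow future resolves."""
    ticks = 0

    async def heartbeat() -> None:
        nonlocal ticks
        while True:
            await asyncio.sleep(0.01)
            ticks += 1

    with ThreadPoolExecutor(max_workers=1) as pool:
        beat = asyncio.create_task(heartbeat())
        await asyncio.sleep(0)  # let the heartbeat reach its first sleep
        future = pool.submit(time.sleep, 0.2)  # stand-in for a slow tile
        if offload:
            await asyncio.to_thread(future.result)  # loop stays responsive
        else:
            future.result()  # blocks the loop: the heartbeat cannot run
        beat.cancel()
    return ticks
```

`asyncio.run(count_heartbeats(False))` records no ticks at all, because nothing else runs while `future.result()` holds the loop, whereas `asyncio.run(count_heartbeats(True))` records a steady stream of them.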

Why does concatenating tile tables raise a schema mismatch?

Sparse tiles omit optional tag columns that dense tiles include, so the Arrow schemas differ. Use pa.concat_tables(..., promote_options="permissive") to unify them, but assert the mandatory columns explicitly first — permissive promotion will otherwise mask a genuinely missing required field.

In this section

The focused guides below extend this concurrent-parsing pattern:

Streaming PBF Blocks Through an Asyncio Queue — a bounded producer-consumer queue that decouples block decoding from downstream processing.
Tuning Pyrosm Worker Count for PBF Parsing — sizing the process pool to cores, memory, and I/O so throughput scales without thrashing.
Speed Up OSM Parsing with Multiprocessing in Python — fanning independent fileblocks across a process pool with a final reduce.

Related

Memory-Efficient Chunk Processing — streaming generators and spill-to-disk when memory, not throughput, is the binding constraint.
Batch Attribute Mapping Strategies — the schema registries and controlled vocabularies each worker emits against.
Error Handling in Large OSM Extracts — triaging the records this stage quarantines.
OSMnx Graph Conversion Techniques — turning normalized tables into routing-ready NetworkX graphs.
PBF File Structure Deep Dive — why blocks, not byte offsets, bound how finely a file can be split.
Spatial Indexing for OSM Extracts — H3 and R-tree tiling strategies that drive the parallelism granularity.

This guide is part of Parsing & Tag Normalization Workflows; return to that overview to follow the data through normalization, error triage, and routing-graph conversion.
