Why use processes instead of threads to parse OSM PBF files?

Reading the protobuf releases the GIL, but the regex tag cleaning, attribute mapping, and geometry validation that follow are pure-Python and CPU-bound, so threads serialize behind the GIL. Processes give true parallel CPU at the cost of pickling data across the IPC boundary, which is why the unit of work is a chunk of elements rather than a single feature.

How do I stop worker memory from climbing during a long PBF parse?

Disable automatic GC in a worker initializer and call gc.collect() only at chunk boundaries, and set max_tasks_per_child so each worker is retired after a fixed number of chunks. Retiring a worker hands whatever memory its C extensions were holding back to the OS, which keeps resident memory flat instead of rising monotonically.

What causes a BrokenProcessPool error and how do I fix it?

A worker was OOM-killed or segfaulted in a C extension such as GEOS. Lower max_workers, shrink the chunk size so payloads fit, pin shapely 2.0 or newer, and re-tile any single file that fails twice so the offending blob is isolated.

Why is my multiprocessing speedup sub-linear?

Large feature dictionaries are being pickled across the IPC boundary, so serialization dominates. Shrink the chunks and drop geometry to WKB bytes before returning from the worker, and set OMP_NUM_THREADS=1 so BLAS and GEOS do not oversubscribe cores with their own threads.

Speed up OSM parsing with multiprocessing in Python

Task: parse a multi-gigabyte OpenStreetMap .pbf extract across all available CPU cores by submitting pre-chunked feature batches to a ProcessPoolExecutor, so tag normalization and geometry validation run in parallel instead of being serialized behind the Python GIL.

Prerequisites

Python 3.10+ (the snippet uses int | None union syntax; max_tasks_per_child was added in 3.11, so on 3.10 drop that one kwarg)
psutil>=5.9 for live memory probing inside the driver
A parser that yields pre-filtered element batches — pyrosm>=0.6 or pyosmium>=3.6 feeding the chunk_generator
shapely>=2.0 if workers validate geometry (the 1.x branch has GEOS serialization regressions under fork)
Environment: set OMP_NUM_THREADS=1 so BLAS/GEOS C libraries do not spawn threads that fight the process pool
Enough free RAM for (max_workers + queue depth) × chunk size — confirm with free -m before a planet-scale run
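
The RAM rule in the last prerequisite can be checked with a few lines of arithmetic before launching a long run. The sketch below is illustrative only: the function names and the 0.8 headroom factor are assumptions, and available_mb would typically come from psutil.virtual_memory().available.

```python
def estimated_peak_mb(max_workers: int, queue_depth: int, chunk_mb: float) -> float:
    """Rough upper bound: each worker and each queued slot holds one chunk."""
    return (max_workers + queue_depth) * chunk_mb


def fits_in_ram(
    available_mb: float,
    max_workers: int,
    queue_depth: int,
    chunk_mb: float,
    headroom: float = 0.8,  # assumed safety fraction, not a measured value
) -> bool:
    """True when the estimate stays under a safety fraction of free RAM."""
    return estimated_peak_mb(max_workers, queue_depth, chunk_mb) <= available_mb * headroom


# Example: 8 workers, 16 queued chunks, roughly 64 MB per chunk.
print(estimated_peak_mb(8, 16, 64))   # 1536
print(fits_in_ram(4096, 8, 16, 64))   # True
```

The estimate ignores interpreter overhead and the pickled copies in transit, so treat it as a floor on what the run needs rather than a guarantee.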

Why processes, not threads

Streaming the binary protobuf is I/O-bound and releases the GIL inside the C extension, but the work that follows — regex tag cleaning, attribute mapping, and geometry checks — is pure-Python and CPU-bound, so threads serialize behind the GIL and buy you nothing. Processes give true parallelism at the cost of pickling data across the IPC boundary, which is why the unit of work here is a chunk of elements rather than a single feature. This page is the per-core execution layer beneath Async PBF Parsing with Pyrosm; where that workflow overlaps disk reads with compute at the file granularity, this one fans the compute itself across cores. The canonical tag targets each worker emits are defined by Value Standardization & Regex Cleaning, and when memory rather than CPU is the binding constraint you should reach for the streaming generators in Memory-Efficient Chunk Processing instead of widening the pool.
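
Because the unit of work is a chunk, something upstream has to group elements into fixed-size batches. A minimal sketch of such a batching generator follows; chunked is a hypothetical helper name, and it assumes the parser exposes elements as an iterable of plain dicts.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


def chunked(
    elements: Iterable[Dict[str, Any]], size: int
) -> Iterator[List[Dict[str, Any]]]:
    """Group a stream of elements into lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(elements)
    while True:
        batch = list(islice(it, size))  # pulls lazily, never the whole stream
        if not batch:
            return
        yield batch


batches = list(chunked(({"id": i} for i in range(5)), size=2))
print([len(b) for b in batches])  # [2, 2, 1]
```

Larger chunks amortize the pickling cost per element but raise peak memory per worker, so the size is worth tuning against both.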

Runnable solution

The driver submits each chunk as a Future, drains results with as_completed, and recycles workers periodically to contain C-extension leaks. Workers return a structured {"normalized": [...], "errors": [...]} payload so a single bad element is quarantined instead of failing its whole chunk.

python

import os
import gc
import logging
import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor, as_completed
from typing import Iterator, Dict, List, Any
import psutil

logger = logging.getLogger(__name__)


def worker_initializer() -> None:
    """Disable automatic GC in workers; we trigger it manually at chunk boundaries."""
    gc.disable()


def parse_chunk(chunk_data: List[Dict[str, Any]]) -> Dict[str, Any]:
    """CPU-bound transformation: geometry validation + tag normalization.

    chunk_data holds pre-filtered OSM elements from one bounding box or feature
    class. Returns 'normalized' and 'errors' lists so failures surface without
    crashing the worker process.
    """
    normalized: List[Dict[str, Any]] = []
    errors: List[Dict[str, Any]] = []
    for idx, elem in enumerate(chunk_data):
        try:
            if not elem.get("tags"):
                continue
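            # Pass-through placeholder: apply tag normalization and geometry
            # validation here before appending the record.
            normalized.append({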
                "osm_id": elem["id"],
                "geometry": elem.get("geometry"),
                "tags": elem["tags"],
                "worker_pid": os.getpid(),
            })
        except Exception as e:  # noqa: BLE001 — quarantine, never raise into the pool
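            # elem may not be a dict at all, so guard the lookup inside the handler too.
            osm_id = elem.get("id") if isinstance(elem, dict) else None
            errors.append({"index": idx, "osm_id": osm_id, "error": str(e)})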

    gc.collect()  # Manual collection at the chunk boundary to stabilize RSS.
    return {"normalized": normalized, "errors": errors}


def run_parallel_pipeline(
    chunk_generator: Iterator[List[Dict[str, Any]]],
    max_workers: int | None = None,
) -> Iterator[Dict[str, Any]]:
    """Submit chunks to a process pool and yield results as they complete."""
    if max_workers is None:
        max_workers = min(os.cpu_count() or 1, 8)

    available_mb = psutil.virtual_memory().available // (1024 ** 2)
    logger.info("Spawning %d workers; %d MB RAM available.", max_workers, available_mb)
    # Stop C/BLAS/GEOS libraries from spawning threads that compete with the pool.
    os.environ["OMP_NUM_THREADS"] = "1"

    with ProcessPoolExecutor(
        max_workers=max_workers,
        initializer=worker_initializer,
        mp_context=mp.get_context("spawn"),   # reproducible across Linux/macOS/Windows
        max_tasks_per_child=50,               # recycle workers to bound C-ext leaks
    ) as executor:
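        # All chunks are submitted up front, which drains chunk_generator and keeps
        # every pending chunk in memory until a worker picks it up.
        futures = {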
            executor.submit(parse_chunk, chunk): i
            for i, chunk in enumerate(chunk_generator)
        }
        for future in as_completed(futures):
            chunk_idx = futures[future]
            try:
                yield future.result()
            except Exception as e:  # noqa: BLE001
                logger.error("Chunk %d failed unrecoverably: %s", chunk_idx, e)
                continue

Keep the regex patterns and lookup tables at module scope so each worker compiles them once, when it imports the module, rather than rebuilding them on every call:

python

import re

# Module-level: compiled once per worker process at import time
# (spawn re-imports the module; fork copies the parent's compiled objects).
_SPEED_RE = re.compile(r"^(\d+(?:\.\d+)?)(?:\s*(?:km/h|kmh|kph))?$", re.IGNORECASE)
_SURFACE_RE = re.compile(r"[^a-z0-9_]", re.IGNORECASE)

HIGHWAY_MAP = {
    "motorway": "motorway", "trunk": "trunk",
    "primary": "arterial", "secondary": "arterial",
    "tertiary": "collector", "residential": "local",
    "unclassified": "local", "service": "access",
}


def normalize_tags(tags: dict) -> dict:
    out: dict = {}
    out["road_class"] = HIGHWAY_MAP.get(tags.get("highway", ""))

    m = _SPEED_RE.match(str(tags.get("maxspeed", "")))
    out["maxspeed_kmh"] = float(m.group(1)) if m else None

    surface = _SURFACE_RE.sub("", tags.get("surface", "").lower())
    out["surface_clean"] = surface or None
    return out

Step-by-step walkthrough

worker_initializer disables GC. The cyclic garbage collector runs whenever allocation counts cross its thresholds, which happens constantly in a tight loop over millions of short-lived element dicts. Disabling it in the initializer and calling gc.collect() once at the end of parse_chunk keeps resident memory flat without those repeated scans; reference counting still frees acyclic objects immediately.
parse_chunk returns, never raises. Each element is wrapped in try/except; a malformed geometry or missing key is appended to errors with its osm_id so it can be routed to a dead-letter store, while the worker keeps going. A raised exception would not break the pool, but future.result() would re-raise it in the driver and every other element in that chunk would be lost.
max_workers is capped. Defaulting to min(cpu_count, 8) avoids spawning 64 workers on a large host where IPC and memory, not cores, become the ceiling.
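mp.get_context("spawn") is explicit. Spawn re-imports the module in a clean interpreter, so module-level state is deterministic across platforms and you avoid fork-after-threads deadlocks in libosmium/GEOS. Because of that re-import, the code that starts the pool must sit behind an if __name__ == "__main__": guard.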
max_tasks_per_child=50 recycles workers. Long-lived processes accumulate memory in C extensions; retiring each worker after 50 chunks releases that arena back to the OS.
as_completed yields out of order. Results stream back as soon as any worker finishes, so a single slow chunk never blocks the others. The futures dict maps each future back to its chunk index for logging.
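
When the chunk stream is too large to submit all at once, one option is to cap the number of in-flight futures and pull the next chunk only when a slot frees up. The sketch below is one way to do that with concurrent.futures.wait; bounded_submit and max_in_flight are hypothetical names, and the demo uses a thread pool only to keep it cheap to run.

```python
from concurrent.futures import (
    FIRST_COMPLETED,
    Executor,
    Future,
    ThreadPoolExecutor,
    wait,
)
from typing import Any, Callable, Iterable, Iterator, Set


def bounded_submit(
    executor: Executor,
    fn: Callable[[Any], Any],
    chunks: Iterable[Any],
    max_in_flight: int,
) -> Iterator[Any]:
    """Yield results as they finish, keeping at most max_in_flight chunks pending."""
    pending: Set[Future] = set()
    for chunk in chunks:
        pending.add(executor.submit(fn, chunk))
        if len(pending) >= max_in_flight:
            # Block until at least one future finishes before pulling more input.
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                yield future.result()
    # The input is exhausted; drain whatever is still running.
    while pending:
        done, pending = wait(pending, return_when=FIRST_COMPLETED)
        for future in done:
            yield future.result()


# Demo with threads; any Executor works as long as fn and the chunks are
# picklable when a process pool is used.
with ThreadPoolExecutor(max_workers=2) as pool:
    lengths = sorted(bounded_submit(pool, len, [[1, 2], [3], [4, 5, 6]], max_in_flight=2))
print(lengths)  # [1, 2, 3]
```

Results still arrive out of order, and future.result() re-raises any exception from the worker, so callers that want the log-and-continue behavior of the driver need their own try/except around the loop.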

Verification

Confirm the pool is genuinely parallel and bounded:

Distinct PIDs. Aggregate worker_pid across returned normalized records: at any moment you should see close to max_workers distinct PIDs, and the set should change over the run as max_tasks_per_child recycles workers.
CPU saturation. htop (or psutil.cpu_percent(percpu=True)) should show all worker cores near 100% during the CPU-bound phase, not one core pinned while the rest idle.
Flat RSS. Watch psutil.Process().memory_info().rss for the driver and ps --ppid <driver_pid> -o rss for workers; resident memory should plateau, not climb monotonically. A steady climb means GC tuning or max_tasks_per_child is not taking effect.
Error accounting. Sum len(result["errors"]) across chunks; it should match the count of quarantined records in your dead-letter partition. A nonzero count with zero log lines means an exception is being swallowed silently.
Throughput. Expect a near-linear speedup up to physical core count; if doubling workers barely moves wall-clock time, the bottleneck is IPC serialization of large feature dicts, not CPU.
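
The PID and error-accounting checks can be folded into one pass over the yielded payloads. This is a small sketch under the assumption that each result has the {"normalized": [...], "errors": [...]} shape returned by parse_chunk; summarize_run is a hypothetical helper name.

```python
from typing import Any, Dict, Iterable


def summarize_run(results: Iterable[Dict[str, Any]]) -> Dict[str, int]:
    """Count distinct worker PIDs, normalized records, and quarantined errors."""
    pids = set()
    normalized = 0
    errors = 0
    for result in results:
        normalized += len(result["normalized"])
        errors += len(result["errors"])
        pids.update(rec["worker_pid"] for rec in result["normalized"])
    return {"distinct_pids": len(pids), "normalized": normalized, "errors": errors}


sample = [
    {"normalized": [{"worker_pid": 101}, {"worker_pid": 101}], "errors": []},
    {"normalized": [{"worker_pid": 102}], "errors": [{"index": 0}]},
]
print(summarize_run(sample))
# {'distinct_pids': 2, 'normalized': 3, 'errors': 1}
```

Over a full run the distinct PID count will exceed max_workers once recycling kicks in, so compare it against the pool size only for short test runs.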

Common errors & fixes

| Error | Root cause | One-line fix |
| --- | --- | --- |
| `BrokenProcessPool` | A worker was OOM-killed or segfaulted in a C extension | Lower `max_workers`, shrink chunk size, and pin `shapely>=2.0` |
| `PicklingError: Can't pickle ...` | A chunk holds an unpicklable object (open file, lambda, GEOS handle) | Pass plain dicts/WKB only; build heavy objects inside the worker |
| RSS climbs until OOM | GC still running, or workers never recycled | Keep `gc.disable()` in the initializer and set `max_tasks_per_child` |
| Speedup is sub-linear | Giant feature dicts serialized across IPC | Shrink chunks; drop geometry to WKB bytes before returning |
| No output, or a hang | `chunk_generator` yields nothing, or blocks before its first yield | Verify the upstream parser yields lists; log the chunk count first |
| `RuntimeError: ... fork before exec` | Default fork context after threads were started | Force `mp.get_context("spawn")` as shown |
| BLAS oversubscription stalls | C libs spawning threads per worker | Export `OMP_NUM_THREADS=1` before the pool starts |
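
To judge whether serialization is the bottleneck behind a sub-linear speedup, it helps to measure what a record costs to pickle before and after compacting its geometry. The sketch below uses only the standard library; pack_coords is a hypothetical stand-in that packs coordinates into raw doubles, not real WKB, and the coordinates are made-up sample data.

```python
import pickle
import struct
from typing import Any, List, Tuple


def pickled_size(obj: Any) -> int:
    """Bytes the object occupies when pickled for the trip across IPC."""
    return len(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))


def pack_coords(coords: List[Tuple[float, float]]) -> bytes:
    """Flatten (x, y) pairs into little-endian doubles. A stand-in for WKB."""
    flat = [value for pair in coords for value in pair]
    return struct.pack(f"<{len(flat)}d", *flat)


# Made-up sample line with 1000 vertices.
coords = [(13.0 + i * 1e-4, 52.0 + i * 1e-4) for i in range(1000)]
as_tuples = {"osm_id": 1, "geometry": coords}
as_bytes = {"osm_id": 1, "geometry": pack_coords(coords)}

print(pickled_size(as_tuples), pickled_size(as_bytes))
```

The size gap varies with the data, and pickling time matters as much as byte count, so measure on a representative chunk rather than relying on this toy comparison.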

For jobs that stall mid-run, checkpoint chunk offsets to a SQLite WAL file so a restart resumes from the last committed offset instead of re-streaming the whole PBF, and profile suspected C-extension leaks with py-spy or tracemalloc rather than guessing. Records that workers quarantine should flow to the triage path described in Error Handling in Large OSM Extracts.
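
A checkpoint store along those lines might look like the sketch below. It assumes chunk indices are stable between runs, and because as_completed finishes chunks out of order it records the set of completed indices rather than a single high-water offset; the table and function names are hypothetical.

```python
import sqlite3
from typing import Set


def open_checkpoint(path: str) -> sqlite3.Connection:
    """Open (or create) the checkpoint database in WAL journal mode."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS done_chunks (chunk_idx INTEGER PRIMARY KEY)"
    )
    conn.commit()
    return conn


def mark_done(conn: sqlite3.Connection, chunk_idx: int) -> None:
    """Record one finished chunk; repeated calls for the same index are no-ops."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR IGNORE INTO done_chunks (chunk_idx) VALUES (?)", (chunk_idx,)
        )


def completed_chunks(conn: sqlite3.Connection) -> Set[int]:
    """Indices that a restarted run can skip."""
    return {row[0] for row in conn.execute("SELECT chunk_idx FROM done_chunks")}
```

On restart, the driver would filter the chunk stream against completed_chunks before submitting, and call mark_done only after a chunk's output has been durably written downstream.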

Spec reference

The Python concurrent.futures.ProcessPoolExecutor runs callables in a pool of worker processes that sidestep the GIL, but every argument and return value is pickled across the process boundary — keep them small. max_tasks_per_child (added in 3.11) restarts a worker after the given number of tasks to release accumulated resources. See the Python concurrent.futures documentation and the multiprocessing start methods reference for the spawn vs fork trade-offs.
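
Since every argument and return value crosses the boundary through pickle, a quick preflight on one sample chunk can catch unpicklable payloads before the pool starts. This is a minimal sketch; pickling_problem is a hypothetical helper, and the exception types it catches are the ones pickle commonly raises for lambdas, open files, and local objects.

```python
import pickle
from typing import Any, Optional


def pickling_problem(obj: Any) -> Optional[str]:
    """Return None if obj survives pickle.dumps, else a short failure description."""
    try:
        pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    except (pickle.PicklingError, TypeError, AttributeError) as exc:
        return f"{type(exc).__name__}: {exc}"
    return None


print(pickling_problem({"id": 1, "tags": {"highway": "primary"}}))       # None
print(pickling_problem({"id": 2, "callback": lambda x: x}) is not None)  # True
```

Running this on the first chunk in the driver costs one extra serialization and turns a mid-run failure into an immediate, readable message.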

Related

Async PBF Parsing with Pyrosm — the file-granularity async layer this pool plugs into.
Memory-Efficient Chunk Processing — sizing chunks and spilling to disk when memory is the limit.
Value Standardization & Regex Cleaning — the canonical tag targets each worker normalizes against.
Error Handling in Large OSM Extracts — triaging the records workers quarantine to the dead-letter store.
OSMnx Graph Conversion Techniques — feeding normalized features into routing graphs after parsing.
Spatial Indexing for OSM Extracts — H3/R-tree tiling that decides chunk boundaries.

This guide is part of Async PBF Parsing with Pyrosm; return to that overview to see how per-core parsing fits the full async ingestion pipeline.
