Error Handling in Large OSM Extracts

A continental OpenStreetMap (OSM) extract is not a clean dataset — it is a multi-gigabyte stream of community-contributed primitives in which a single corrupt block, a stray control character in a tag value, or a way that references a node missing from the extract can abort an overnight ingest at hour six. The failure that hurts most is the silent one: a decoder that swallows an unresolved reference and emits a way with a truncated geometry, which then poisons a routing graph that looks structurally valid until a journey planner returns a route through a wall. Error handling in this stage is therefore not defensive boilerplate around the happy path; it is the contract that decides which records reach the sink, which are quarantined for review, and when the pipeline must stop rather than commit garbage. This page shows how to wrap the parse-and-normalize loop in deterministic exception boundaries, route defective records to a dead-letter queue, halt on systematic corruption with a circuit breaker, and resume from the last committed checkpoint without reprocessing the whole archive.

The flow has a single organizing principle: every record either commits, quarantines, or trips the breaker — there is no fourth state in which a defect is silently absorbed. The sections below build each branch.

Prerequisite concepts

This page sits in the resilience layer of the Parsing & Tag Normalization Workflows pipeline and assumes three foundations are already in place. First, you must know where it is safe to draw an error boundary: the PBF File Structure Deep Dive explains that a Blob is the smallest independently decodable unit, which is why a decode failure is scoped to a block — you discard the block, log its offset, and keep streaming rather than aborting the file. Second, the reference-resolution rules in the Node-Way-Relation Data Model define what “missing node” and “dangling member” actually mean, and therefore which exceptions are recoverable (skip the feature) versus fatal (the extract itself is truncated). Third, the canonical schema you validate against is the one produced by Value Standardization & Regex Cleaning — error handling enforces that schema; it does not invent one. Readers ingesting concurrently should also pair this with Async PBF Parsing with Pyrosm, whose workers produce the quarantined records this stage triages.

Specification & failure-surface reference

Before writing handlers, classify the failure surface against what the format actually guarantees. OSM PBF and XML each have well-defined points where corruption surfaces, and each maps to a different remediation tier.

| Failure surface | Where it originates | Spec guarantee | Recoverable scope |
| --- | --- | --- | --- |
| Blob decode error | zlib stream truncated or `Blob` size exceeds the 32 MiB uncompressed ceiling | Each `Blob` is independently decodable | Block: discard and continue |
| Unresolved node reference | Way references a node ID absent from the extract | Geometry is a join, not inline | Feature: skip way, log ID |
| Dangling relation member | Relation references a way/node outside the bbox clip | Members may legitimately span clips | Feature: partial-build or skip |
| Malformed tag value | Free-form string: bad casing, control chars, locale separators | No enforced tag schema | Field: normalize or null |
| Encoding anomaly | Non-UTF-8 bytes in a string-table entry | PBF mandates UTF-8 string tables | Field: replace/strip, quarantine |
| Schema non-conformance | Required key absent after normalization | No spec requirement; your contract | Feature: quarantine to DLQ |

The decisive distinction is scope. A blob decode error costs you one block; an unresolved reference costs you one feature; a malformed tag costs you one field. Collapsing these tiers — for example treating every exception as fatal — turns a 0.01% defect rate into a 100% failure. The string-table UTF-8 mandate is the one place where the spec is strict: a non-UTF-8 byte sequence indicates either a corrupt extract or a non-conformant producer, and it should always be quarantined rather than silently decoded with errors="replace", because a replacement character in a name tag is itself a data defect.
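
The tiering can be made explicit in code. The sketch below is one possible encoding, not this pipeline's actual implementation: the `Scope` enum and the `classify` helper are hypothetical names, and the mapping from exception types to tiers is an assumption based on the table above.

```python
from __future__ import annotations

import zlib
from enum import Enum


class Scope(Enum):
    """How much data a failure costs, smallest to largest."""
    FIELD = "field"      # normalize, null, or quarantine one tag value
    FEATURE = "feature"  # skip or quarantine one way/relation
    BLOCK = "block"      # discard one Blob and keep streaming
    FATAL = "fatal"      # stop the ingest


def classify(exc: BaseException) -> Scope:
    """Map an exception to the narrowest scope that contains the damage."""
    # UnicodeDecodeError subclasses ValueError, so it must be tested first.
    if isinstance(exc, UnicodeDecodeError):
        return Scope.FIELD
    if isinstance(exc, KeyError):  # assumed: raised by a missed node lookup
        return Scope.FEATURE
    if isinstance(exc, (zlib.error, ValueError)):
        return Scope.BLOCK
    return Scope.FATAL  # anything unrecognized is never silently absorbed
```

Defaulting to `FATAL` is the deliberate choice here: an exception nobody anticipated should stop the run rather than be downgraded to a skipped record.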

Step-by-step implementation

The core loop is a generator that yields one of two record types — committed or quarantined — and never raises across the chunk boundary. Build it in five steps.

1. Define a typed outcome. Model each record’s fate explicitly so the consumer cannot accidentally treat a quarantined record as clean.
2. Wrap decode and validation in scoped boundaries. Catch decode errors at block scope and validation errors at feature scope; let only genuinely fatal conditions (truncated archive, unreadable file) propagate.
3. Emit structured JSON logs. Capture chunk id, byte offset, exception type, and the offending key/ID so failures are queryable, not buried in a stack trace.
4. Trip a circuit breaker on systematic corruption. Track a rolling error rate; halt when it crosses a threshold so a corrupt region does not burn hours producing quarantine noise.
5. Checkpoint after every committed chunk. Persist the last committed chunk id so a restart resumes rather than reprocesses.

```python
from __future__ import annotations

import logging
from dataclasses import dataclass
from enum import Enum

logger = logging.getLogger(__name__)

REQUIRED_KEYS: frozenset[str] = frozenset({"highway"})
# U+FFFD REPLACEMENT CHARACTER: left behind when bytes were decoded leniently.
REPLACEMENT_CHAR = "\ufffd"


class Outcome(Enum):
    COMMIT = "commit"
    QUARANTINE = "quarantine"


@dataclass(slots=True)
class Record:
    chunk_id: int
    osm_id: int
    tags: dict[str, str]
    outcome: Outcome
    reason: str | None = None


def validate_feature(chunk_id: int, osm_id: int, tags: dict[str, str]) -> Record:
    """Field- and feature-scoped validation; never raises on data defects."""
    # Encoding anomaly: string table must be UTF-8 (PBF spec mandate).
    # A U+FFFD in a key or value means an upstream decode substituted bad bytes.
    for k, v in tags.items():
        if REPLACEMENT_CHAR in k or REPLACEMENT_CHAR in v:
            return Record(chunk_id, osm_id, tags, Outcome.QUARANTINE,
                          reason="non_utf8_string_table")
    # Schema non-conformance: required key absent after normalization.
    missing = REQUIRED_KEYS - tags.keys()
    if missing:
        return Record(chunk_id, osm_id, tags, Outcome.QUARANTINE,
                      reason=f"missing_keys:{','.join(sorted(missing))}")
    return Record(chunk_id, osm_id, tags, Outcome.COMMIT)
```

Block-scoped boundaries and the circuit breaker

Decode errors are scoped to the block, so the breaker counts block failures, not feature failures — one corrupt blob should not be amplified by the thousands of features it would have contained.

```python
from __future__ import annotations

import logging
import zlib
from collections import deque
from collections.abc import Callable, Iterable, Iterator

logger = logging.getLogger(__name__)

# Record, Outcome, and validate_feature come from the previous listing.


class CircuitBreaker:
    """Halt when the rolling block-error rate exceeds a threshold."""

    def __init__(self, threshold: float = 0.05, window: int = 200) -> None:
        self.threshold = threshold
        self.window = window
        # Bounded deque: the oldest result falls off once the window is full.
        self._results: deque[bool] = deque(maxlen=window)  # True == error

    def record(self, *, error: bool) -> None:
        self._results.append(error)

    @property
    def tripped(self) -> bool:
        # Never trips until a full window has been observed.
        if len(self._results) < self.window:
            return False
        rate = sum(self._results) / len(self._results)
        return rate > self.threshold


def stream_blocks(blocks: Iterator[tuple[int, bytes]],
                  decode: Callable[[bytes], Iterable[tuple[int, dict[str, str]]]],
                  breaker: CircuitBreaker) -> Iterator[Record]:
    """Yield validated records; discard bad blocks; halt on systematic corruption."""
    for chunk_id, raw in blocks:
        try:
            features = decode(raw)  # may raise on a truncated/corrupt blob
        except (zlib.error, ValueError) as exc:  # block-scoped, recoverable
            breaker.record(error=True)
            logger.warning(
                "block decode failed",
                extra={"chunk_id": chunk_id, "byte_len": len(raw),
                       "error_type": type(exc).__name__},
            )
            if breaker.tripped:
                logger.error("circuit breaker tripped at chunk %d", chunk_id)
                raise RuntimeError("error rate exceeded threshold") from exc
            continue
        breaker.record(error=False)
        for osm_id, tags in features:
            rec = validate_feature(chunk_id, osm_id, tags)
            if rec.outcome is Outcome.QUARANTINE:
                logger.info("quarantined feature",
                            extra={"chunk_id": chunk_id, "osm_id": osm_id,
                                   "reason": rec.reason})
            yield rec
```

The extra= dictionary is what makes failures queryable: with a JSON log formatter, each warning becomes a structured event you can aggregate (GROUP BY reason) to distinguish a one-off corrupt blob from a region-wide tagging problem. Configuring the Python logging framework to emit JSON — rather than free text — is the difference between forensic analysis and grep archaeology.
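
As an illustration of that configuration, here is a minimal JSON formatter built only on the standard library. `JsonFormatter` is a hypothetical name, and the choice of output fields (`level`, `logger`, `message`, plus whatever arrived through `extra=`) is an assumption rather than a fixed schema.

```python
from __future__ import annotations

import json
import logging

# Attribute names every LogRecord carries by default, derived from a blank
# record; "message" and "asctime" are added later by the stock Formatter.
_STANDARD_ATTRS = frozenset(
    vars(logging.LogRecord("", 0, "", 0, "", None, None))
) | {"message", "asctime"}


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        event = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Anything not in the standard set was supplied through extra=.
        for key, value in vars(record).items():
            if key not in _STANDARD_ATTRS:
                event[key] = value
        return json.dumps(event, default=str, sort_keys=True)


def configure_json_logging(level: int = logging.INFO) -> None:
    """Attach the formatter to the root logger's stream handler."""
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    root = logging.getLogger()
    root.addHandler(handler)
    root.setLevel(level)
```

With this in place, the `chunk_id`, `error_type`, and `reason` fields passed via `extra=` become top-level JSON keys that a log store can group on.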

Routing quarantine to a dead-letter partition

Quarantined records are not garbage; they are the review queue. Write them to a partitioned Parquet dead-letter store keyed by reason, so analysts can triage missing_keys separately from non_utf8_string_table.

```python
from __future__ import annotations

import json
import logging
from collections.abc import Iterator
from pathlib import Path

import pyarrow as pa
import pyarrow.parquet as pq

logger = logging.getLogger(__name__)


def split_and_sink(records: Iterator[Record], dlq_root: str) -> dict[str, int]:
    """Collect clean records; partition quarantined ones by reason."""
    # NOTE: this listing only counts committed rows; the clean-sink write
    # is not shown here.
    committed: list[dict] = []
    quarantined: dict[str, list[dict]] = {}
    for rec in records:
        if rec.outcome is Outcome.COMMIT:
            committed.append({"osm_id": rec.osm_id, **rec.tags})
        else:
            bucket = (rec.reason or "unknown").split(":")[0]
            # Tags go into one JSON column: Table.from_pylist takes its
            # column names from the first row, so spreading tags into
            # columns would drop any key the first row lacks.
            quarantined.setdefault(bucket, []).append(
                {"osm_id": rec.osm_id, "reason": rec.reason,
                 "tags": json.dumps(rec.tags, sort_keys=True)})
    for reason, rows in quarantined.items():
        part_dir = Path(dlq_root) / f"reason={reason}"
        part_dir.mkdir(parents=True, exist_ok=True)  # ensure partition dir exists
        pq.write_table(pa.Table.from_pylist(rows), str(part_dir / "part.parquet"))
        logger.warning("quarantined %d records under reason=%s", len(rows), reason)
    return {"committed": len(committed),
            **{f"dlq_{k}": len(v) for k, v in quarantined.items()}}
```

Validation & error-handling matrix

| Error condition | Root cause | Detection | Remediation |
| --- | --- | --- | --- |
| `zlib.error` on blob | Truncated download, corrupt `Blob`, or `Blob` exceeding the 32 MiB ceiling | `decode(raw)` raises | Discard block, log offset, increment breaker; re-fetch extract if rate climbs |
| `KeyError` on node ref | Way references a node absent from the extract | Reference lookup misses | Skip feature, log `osm_id`; if pervasive, the extract is truncated, so abort |
| Dangling relation member | Member outside the bbox clip | Member resolve returns `None` | Partial-build with present members or skip; never raise |
| `UnicodeDecodeError` / U+FFFD | Non-UTF-8 bytes in string table | Replacement-character scan in `validate_feature` | Quarantine to `reason=non_utf8_string_table`; do not coerce |
| Missing required key | Tag dropped upstream or never mapped | `REQUIRED_KEYS - tags.keys()` | Quarantine to `reason=missing_keys`; route to manual review |
| `MemoryError` mid-chunk | Chunk too large for worker heap | OOM kill or allocation failure | Lower chunk size, force `gc.collect()` after commit |
| Runaway error rate | Systematically corrupt region | `CircuitBreaker.tripped` | Halt, alert, isolate offending byte range, re-tile |
| Duplicate commit on restart | No checkpoint; reprocessed chunks | Sink row count exceeds source | Read checkpoint manifest, skip committed chunk ids |
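
Two rows of the matrix, the unresolved node reference and the dangling relation member, have no code elsewhere on this page. The following is a minimal sketch of both, assuming a plain `dict` as the node index; `resolve_way` and `resolve_members` are hypothetical helper names, not part of any parser library.

```python
from __future__ import annotations

Coord = tuple[float, float]


def resolve_way(node_ids: list[int],
                node_index: dict[int, Coord]) -> list[Coord] | None:
    """Return the full geometry, or None if any referenced node is missing.

    A partial geometry is never returned: a truncated way is exactly the
    silent failure this stage exists to prevent.
    """
    try:
        return [node_index[n] for n in node_ids]
    except KeyError:
        return None  # caller skips the way and logs its osm_id


def resolve_members(member_ids: list[int],
                    ways: dict[int, list[Coord]],
                    ) -> tuple[list[list[Coord]], list[int]]:
    """Partial-build a relation: keep present members, report absent ones."""
    present = [ways[m] for m in member_ids if m in ways]
    absent = [m for m in member_ids if m not in ways]
    return present, absent  # never raises on a member outside the clip
```

The asymmetry is intentional: a way with a missing node is rejected whole, while a relation may still be useful with the members that fall inside the clip, so the caller receives both the partial build and the list of absent ids to log.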

Performance & scale considerations

The cost of error handling is dominated by two choices: chunk size and where garbage collection runs. Chunk size sets the granularity of both memory pressure and checkpoint recovery. Too large, and a single MemoryError discards the whole in-flight chunk and forces a long replay on restart; too small, and per-chunk fixed costs (logging, breaker bookkeeping, Parquet flush) dominate. For dense urban extracts, where multipolygon relations and high node density inflate per-feature memory, 250,000–750,000 features per chunk is a workable band. Trigger an explicit gc.collect() after each successful commit, because GeoPandas/Shapely retain references in internal geometry caches that the cyclic collector will not otherwise reclaim promptly.
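
A rough sketch of that chunk-then-collect rhythm, using only the standard library. `chunked` and `process_in_chunks` are hypothetical helpers, and the `commit` callback stands in for whatever sink write the pipeline actually performs.

```python
from __future__ import annotations

import gc
from collections.abc import Callable, Iterable, Iterator
from itertools import islice
from typing import TypeVar

T = TypeVar("T")


def chunked(items: Iterable[T], size: int) -> Iterator[list[T]]:
    """Yield lists of at most `size` items without materializing the input."""
    if size < 1:
        raise ValueError("size must be >= 1")
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk


def process_in_chunks(items: Iterable[T], size: int,
                      commit: Callable[[list[T]], None]) -> int:
    """Commit each chunk, then collect once per chunk (never per feature)."""
    n_chunks = 0
    for chunk in chunked(items, size):
        commit(chunk)
        del chunk      # drop this function's reference before collecting
        gc.collect()   # one collection per committed chunk
        n_chunks += 1
    return n_chunks
```

In practice `size` would sit somewhere in the 250,000–750,000 band discussed above; the function returns the chunk count so the caller can log it alongside the ingest summary.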

The circuit breaker’s window and threshold trade detection latency against false trips. A window of 200 blocks at a 5% threshold tolerates the occasional corrupt blob (expected on large downloads) while still halting within ~200 blocks of entering a systematically corrupt region. Setting the threshold too low makes a noisy-but-usable extract un-ingestable; too high lets the pipeline burn hours writing quarantine noise before stopping. The expected wasted work before a trip is bounded by:

$$W_{\text{wasted}} \approx \text{window} \times \bar{t}_{\text{block}}$$

where $\bar{t}_{\text{block}}$ is the mean per-block processing time — which is why a tighter window, not a lower threshold, is the right dial when you need to fail fast. When memory rather than corruption is the binding constraint, prefer the streaming generators in Memory-Efficient Chunk Processing over enlarging chunks to amortize fixed costs.
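
To make the two dials concrete, the helpers below compute how many failed blocks a full window needs before the rate exceeds the threshold, and evaluate the wasted-work bound. Both function names are hypothetical, and the 0.5 s mean block time in the usage note is an illustrative assumption, not a measurement.

```python
from __future__ import annotations


def min_errors_to_trip(threshold: float, window: int) -> int:
    """Smallest error count in a full window whose rate exceeds the threshold."""
    for k in range(window + 1):
        if k / window > threshold:  # same strict comparison the breaker uses
            return k
    raise ValueError("a threshold of 1.0 or more can never be exceeded")


def wasted_seconds(window: int, mean_block_seconds: float) -> float:
    """Evaluate the bound W_wasted ~= window * mean per-block time."""
    return window * mean_block_seconds
```

With the defaults used on this page (`threshold=0.05`, `window=200`), `min_errors_to_trip` gives 11, because ten failures is exactly 5% and the comparison is strict. At an assumed 0.5 s per block, `wasted_seconds(200, 0.5)` gives 100 seconds; halving the window halves that bound, which is the fail-fast lever the paragraph describes.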

Failure modes & gotchas

Catching Exception at the chunk boundary hides truncation. A blanket except Exception swallows the KeyError storm that signals a truncated extract, converting a fatal “your file is incomplete” into an endless quarantine stream. Catch only the specific recoverable types (zlib.error and ValueError at block scope, KeyError at feature scope), count them so that a pervasive rate can escalate to an abort, and let everything else propagate.
errors="replace" is data corruption, not error handling. Decoding a non-UTF-8 string table with replacement characters produces a record that looks valid and passes schema checks while carrying a corrupted name. Quarantine the record instead so the defect is visible.
The breaker must count blocks, not features. One corrupt blob can represent thousands of features. If every lost feature counts as an error, a single bad block can trip the breaker spuriously; if the bad block counts as one error among thousands of feature successes, a slow-burn corruption rate is diluted and never reaches the threshold.
Checkpoints written before the sink flush are lies. Persist the checkpoint after the Parquet write is durably flushed, never before; otherwise a crash between checkpoint and flush leaves a chunk marked as committed that never reached the sink, and the restart silently skips it.
Garbage collection after every feature, not every chunk, tanks throughput. gc.collect() is expensive; call it once per committed chunk, not in the inner loop.
Locale-dependent number parsing corrupts silently. A maxspeed of 1.200 means 1200 in some locales and 1.2 in others; never feed tag values through a locale-aware parser — the ambiguity belongs in the quarantine queue, not in a coerced float.
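
For the locale gotcha, one way to keep ambiguity out of a coerced float is a strict, locale-independent parser that refuses anything it cannot read unambiguously. This is a sketch under stated assumptions: `parse_number` and its reason strings are hypothetical, units such as `mph` are treated as out of scope, and only ASCII digits are accepted.

```python
from __future__ import annotations

import re

_INTEGER = re.compile(r"[0-9]+")
_DOT_DECIMAL = re.compile(r"[0-9]+\.[0-9]+")
# 1.200 or 12,345: could be digit grouping or a decimal, depending on locale.
_AMBIGUOUS = re.compile(r"[0-9]{1,3}(?:[.,][0-9]{3})+")


def parse_number(raw: str) -> tuple[float | None, str | None]:
    """Return (value, None) when unambiguous, else (None, reason)."""
    text = raw.strip()
    if _INTEGER.fullmatch(text):
        return float(text), None
    if _AMBIGUOUS.fullmatch(text):
        return None, "ambiguous_separator"  # belongs in the quarantine queue
    if _DOT_DECIMAL.fullmatch(text):
        return float(text), None            # dot-decimal, not a grouping shape
    return None, "unsupported_format"
```

Under these rules `"50"` and `"1.5"` parse, while `"1.200"` and `"1,200"` come back with `ambiguous_separator`, and a value like `"50 mph"` is reported as `unsupported_format` so a unit-aware normalizer can handle it separately.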

Integration points

Error handling is a middleware stage: it consumes the raw feature stream from the parser and emits a clean stream plus a dead-letter store. Downstream, the committed records feed topology assembly. The most common defect class quarantined here — malformed tags — has its own dedicated remediation procedure in Fixing malformed OSM tags during ETL ingestion, which reads the reason= partitions this stage writes and applies targeted regex repairs before re-submitting records. The wiring below couples the breaker-guarded stream to the sink and an idempotent checkpoint manifest:

```python
from __future__ import annotations

import json
import logging
import sqlite3
from collections.abc import Iterator

logger = logging.getLogger(__name__)


def load_checkpoint(db: sqlite3.Connection) -> int:
    """Return the highest committed chunk id, or -1 if none."""
    db.execute("CREATE TABLE IF NOT EXISTS ckpt(chunk_id INTEGER PRIMARY KEY)")
    row = db.execute("SELECT MAX(chunk_id) FROM ckpt").fetchone()
    return row[0] if row and row[0] is not None else -1


def run_ingest(blocks: Iterator[tuple[int, bytes]], decode,
               dlq_root: str, ckpt_path: str) -> dict[str, int]:
    """Drive the resilient stream with resumable, idempotent checkpointing."""
    db = sqlite3.connect(ckpt_path, isolation_level=None)
    db.execute("PRAGMA journal_mode=WAL")
    last = load_checkpoint(db)
    breaker = CircuitBreaker()
    newest = last  # highest chunk id pulled from the source during this run

    def fresh() -> Iterator[tuple[int, bytes]]:
        nonlocal newest
        for cid, raw in blocks:
            if cid > last:  # skip chunks committed by an earlier run
                newest = max(newest, cid)
                yield cid, raw

    stats = split_and_sink(stream_blocks(fresh(), decode, breaker), dlq_root)
    # Checkpoint only AFTER the sink flush above has returned durably.
    # split_and_sink returns counts only, so the chunk id is tracked here.
    if newest > last:
        db.execute("INSERT OR IGNORE INTO ckpt VALUES (?)", (newest,))
    db.close()
    logger.info("ingest summary %s", json.dumps(stats))
    return stats
```
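
`run_ingest` above records a single checkpoint at the end of the run. Where the sink can flush chunk by chunk, the flush-then-checkpoint ordering can be applied per chunk instead. The sketch below assumes chunk ids arrive in ascending order and that the hypothetical `flush` callback returns only once the write is durable; `open_manifest`, `last_committed`, and `ingest_chunks` are illustrative names.

```python
from __future__ import annotations

import sqlite3
from collections.abc import Callable, Iterable


def open_manifest(path: str) -> sqlite3.Connection:
    """Open (or create) the checkpoint manifest."""
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS ckpt(chunk_id INTEGER PRIMARY KEY)")
    db.commit()
    return db


def last_committed(db: sqlite3.Connection) -> int:
    """Highest checkpointed chunk id, or -1 when the manifest is empty."""
    row = db.execute("SELECT MAX(chunk_id) FROM ckpt").fetchone()
    return row[0] if row[0] is not None else -1


def ingest_chunks(chunks: Iterable[tuple[int, list[dict]]],
                  flush: Callable[[int, list[dict]], None],
                  db: sqlite3.Connection) -> int:
    """Flush each chunk, then checkpoint it; return how many chunks ran."""
    resume_after = last_committed(db)
    done = 0
    for chunk_id, rows in chunks:
        if chunk_id <= resume_after:
            continue              # committed by an earlier run
        flush(chunk_id, rows)     # assumed to return only once durable
        db.execute("INSERT OR IGNORE INTO ckpt VALUES (?)", (chunk_id,))
        db.commit()               # checkpoint strictly after the flush
        done += 1
    return done
```

If `flush` raises partway through, every earlier chunk is already checkpointed, so a restart replays only the failed chunk and those after it. Replaying the failed chunk is safe only if the sink write is idempotent, for example by overwriting a file named after the chunk id.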

The committed Parquet output is then ready for projection to a working CRS per Coordinate Reference Systems in OSM and conversion into a routing graph via OSMnx Graph Conversion Techniques, whose own topology validation forms the second line of defense against the dangling nodes and self-intersecting geometries that slip past tag-level checks.

Deeper procedures in this area

Fixing malformed OSM tags during ETL ingestion — diagnostic profiling and targeted regex repairs for the malformed-tag records this stage quarantines.

Frequently Asked Questions

When should the pipeline skip a record versus halt entirely?

Scope decides. A blob decode error is block-scoped — discard the block and continue. An unresolved reference or schema violation is feature-scoped — quarantine the feature and continue. The pipeline only halts when the circuit breaker detects a systematic pattern (a sustained error rate above threshold), which signals a corrupt region or a truncated archive rather than isolated defects.

Why quarantine non-UTF-8 tags instead of decoding with errors="replace"?

A replacement character in a name or addr:street tag is itself a data defect that passes every schema check and silently corrupts downstream output. Quarantining keeps the defect visible and reviewable. The PBF spec mandates UTF-8 string tables, so a non-UTF-8 sequence indicates a corrupt extract or a non-conformant producer — both warrant human triage, not a silent coercion.

How do I make the ingest resumable after a crash?

Persist the last committed chunk id to a lightweight SQLite WAL manifest after the sink flush is durable, and on restart skip every chunk id at or below that checkpoint. Because chunk processing is idempotent — the same chunk always produces the same committed and quarantined records — replaying a not-yet-checkpointed chunk is safe, while replaying a checkpointed one is avoided entirely.

Should the circuit breaker count failed features or failed blocks?

Blocks. One corrupt blob can stand in for thousands of features, so counting features either amplifies a single bad block into a spurious trip or dilutes a real corruption rate. Tracking block-level success/failure over a rolling window gives a stable error-rate signal that maps to the actual recoverable unit.

What chunk size balances memory against restart cost?

For dense urban extracts, 250,000–750,000 features per chunk is a practical band. Larger chunks reduce per-chunk fixed overhead but lose more work to a single MemoryError and force longer replays; smaller chunks recover faster but pay more in logging, breaker bookkeeping, and Parquet flushes. Force gc.collect() once per committed chunk to release GeoPandas/Shapely cache references.

Related

Async PBF Parsing with Pyrosm: the concurrent ingest whose workers produce the records this stage triages.
Value Standardization & Regex Cleaning: the canonical schema this stage validates against.
Batch Attribute Mapping Strategies: controlled vocabularies and fallback tables for cross-region tag harmonization.
Memory-Efficient Chunk Processing: streaming generators when memory, not corruption, is the binding constraint.
OSMnx Graph Conversion Techniques: topology validation that catches defects slipping past tag-level checks.
Fixing malformed OSM tags during ETL ingestion: targeted repairs for the malformed-tag records quarantined here.

This guide is part of Parsing & Tag Normalization Workflows; return to that overview to follow the data through normalization, error triage, and routing-graph conversion.
