Why not just use fillna() for missing OSM tags?

fillna collapses three distinct cases — legitimately absent, unmapped, and extraction artifact — into one fabricated value, which biases every downstream aggregate. A priority-ordered fallback chain backfills only from keys carrying the same signal and quarantines what it cannot resolve, instead of inventing data.

How do I tell a genuinely missing tag from an extraction artifact?

Treat null, empty strings, and coercion sentinels like 'nan' or 'None' as missing through a single shared mask, then measure coverage on the raw extract before any cleaning. A sudden coverage drop on a key that is normally well-populated signals an artifact or survey gap rather than legitimate absence.

When is it safe to apply a default value to a missing tag?

Only when the absence has a documented meaning, such as oneway defaulting to no. Keys whose absence is ambiguous, like maxspeed, must never be defaulted here; convert and infer them in the cleaning and mapping stages or quarantine the row instead.

What should the quarantine partition contain?

Rows still missing a required key after fallbacks and defaults, retaining their raw key columns plus a quarantine_reason string. Keeping the original payload lets a reviewer diagnose the failure without re-joining the source extract, and a stable quarantine count batch-to-batch confirms the fallback table is current.

Handling missing tags in OSM data pipelines

Resolve absent OSM keys — highway, surface, maxspeed, oneway, lanes — through deterministic fallback chains and route the unresolvable to quarantine, so a sparse contributor edit never silently downgrades a routing graph three stages downstream.

Prerequisites

Python 3.10+ (the snippet uses built-in generic type hints such as dict[str, list[str]])
pandas>=2.1.0 and geopandas>=1.0.0 installed (pip install "pandas>=2.1.0" "geopandas>=1.0.0")
pyrosm>=0.6.2 for reading the .osm.pbf extract into a GeoDataFrame
psutil>=5.9 if you intend to gate ingestion on memory pressure
A regional extract to test against (any .osm.pbf from Geofabrik works)
A writable quarantine directory for the dead-letter Parquet partition

Why tags go missing

OpenStreetMap’s schemaless model guarantees contributor flexibility, but that freedom means any key can be absent on any element. Critical keys go missing for three distinct reasons, and they must not be treated the same way: a key is legitimately absent (a footpath has no maxspeed), it is unmapped (a road that simply has not been surveyed for surface), or it is an extraction artifact (a value clipped to an empty string or coerced to NaN during a spatial join). The first justifies a documented default; the second and third must be inferred or quarantined, never guessed. Distinguishing them is the whole job of this stage, which sits inside Batch Attribute Mapping Strategies and receives the quarantine routing that page defines.

A naive .fillna() violates OSM tagging semantics by collapsing all three cases into one fabricated value. The correct approach is a priority-ordered chain: try the primary key, then ranked secondary keys that carry the same signal, then a region-appropriate default, and only if all fail, quarantine the row. This presupposes that values have already been trimmed and case-resolved — that cleaning belongs to Value Standardization & Regex Cleaning, and the diagnostic below treats a whitespace-only or "nan" string as missing precisely because uncleaned input would otherwise read as present.

The complete solution

Run a coverage diagnostic first, then resolve fallbacks, apply regional defaults, and split valid rows from a quarantine partition. The module is self-contained against pandas>=2.1.0 / geopandas>=1.0.0:

python

"""Detect and resolve missing OSM tags, quarantining the unresolvable.

Requires: pandas>=2.1.0, geopandas>=1.0.0, pyrosm>=0.6.2, Python 3.10+.
"""
import logging

import geopandas as gpd
import pandas as pd

logger = logging.getLogger(__name__)

# Strings that *look* present but are extraction artifacts, not real values.
# _missing_mask lowercases before comparing, so one entry covers every casing.
SENTINELS = ["", "nan", "none"]

# Priority-ordered fallback chains: primary key -> ranked secondary keys.
FALLBACK_RULES: dict[str, list[str]] = {
    "highway": ["route", "railway", "waterway"],
    "surface": ["tracktype"],
    "maxspeed": ["maxspeed:forward", "maxspeed:backward", "zone:maxspeed"],
}

# Defaults applied ONLY where absence has a documented meaning per region.
REGION_DEFAULTS: dict[str, dict[str, object]] = {
    "EU": {"oneway": "no"},
    "US": {"oneway": "no"},
}


def _missing_mask(col: pd.Series) -> pd.Series:
    """True where a value is null, empty, or a coercion sentinel."""
    cleaned = col.astype("string").str.strip()
    return cleaned.isna() | cleaned.str.lower().isin([s.lower() for s in SENTINELS])


def diagnose_tag_coverage(gdf: gpd.GeoDataFrame, keys: list[str]) -> pd.DataFrame:
    """Quantify present/missing counts per key before any imputation runs."""
    total = len(gdf)
    rows = []
    for key in keys:
        # A key with no column at all is missing on every row.
        missing = int(_missing_mask(gdf[key]).sum()) if key in gdf.columns else total
        present = total - missing
        rows.append({
            "key": key,
            "present": present,
            "missing": missing,
            # Guard the denominator so an empty extract reports 0.0 coverage.
            "coverage_pct": round(present / max(total, 1) * 100, 2),
        })
    report = pd.DataFrame(rows).set_index("key")
    logger.info("tag coverage:\n%s", report)
    return report


def resolve_missing_tags(
    gdf: gpd.GeoDataFrame, rules: dict[str, list[str]] = FALLBACK_RULES
) -> gpd.GeoDataFrame:
    """Backfill each primary key from its ranked fallback chain, on a copy."""
    gdf = gdf.copy()
    for primary, chain in rules.items():
        if primary not in gdf.columns:
            gdf[primary] = pd.NA
        mask = _missing_mask(gdf[primary])
        for fallback_key in chain:
            if fallback_key not in gdf.columns or not mask.any():
                continue
            donor_ok = ~_missing_mask(gdf[fallback_key])
            fill_here = mask & donor_ok
            gdf.loc[fill_here, primary] = gdf.loc[fill_here, fallback_key]
            logger.debug("filled %d %r from %r", int(fill_here.sum()), primary, fallback_key)
            mask = mask & ~fill_here  # only still-missing rows need the next link
    return gdf


def apply_regional_defaults(
    gdf: gpd.GeoDataFrame, region_code: str
) -> gpd.GeoDataFrame:
    """Backfill documented defaults (e.g. oneway=no) for the given region."""
    gdf = gdf.copy()
    defaults = REGION_DEFAULTS.get(region_code, REGION_DEFAULTS["EU"])
    for col, value in defaults.items():
        if col not in gdf.columns:
            gdf[col] = pd.NA
        filled = _missing_mask(gdf[col])
        gdf.loc[filled, col] = value
        logger.info("region %s: defaulted %d rows of %r to %r",
                    region_code, int(filled.sum()), col, value)
    return gdf


def split_quarantine(
    gdf: gpd.GeoDataFrame, required: list[str]
) -> tuple[gpd.GeoDataFrame, gpd.GeoDataFrame]:
    """Send rows still missing a required key to a dead-letter partition."""
    unresolved = pd.Series(False, index=gdf.index)
    for key in required:
        unresolved |= _missing_mask(gdf.get(key, pd.Series(index=gdf.index, dtype="object")))
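    # Keep the identifier, required keys, primaries, their donor columns, and
    # the geometry, so the reviewer sees the raw payload.
    wanted = ["id", *required, *FALLBACK_RULES]
    for chain in FALLBACK_RULES.values():
        wanted.extend(chain)
    wanted.append("geometry")
    # dict.fromkeys de-duplicates while preserving order; a key that is both
    # required and a fallback primary would otherwise be selected twice.
    keep_cols = [c for c in dict.fromkeys(wanted) if c in gdf.columns]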
    quarantine = gdf.loc[unresolved, keep_cols].assign(
        quarantine_reason="missing_required_after_fallback"
    )
    valid = gdf.loc[~unresolved]
    logger.info("resolved %d valid, %d quarantined", len(valid), len(quarantine))
    return valid, quarantine

A typical driver wires the stages together, reading the extract once and returning the valid and quarantine frames:

python

from pyrosm import OSM

def process_extract(pbf_path: str, region: str = "EU"):
    gdf = OSM(pbf_path).get_network(network_type="driving")
    diagnose_tag_coverage(gdf, ["highway", "surface", "maxspeed", "oneway"])
    gdf = resolve_missing_tags(gdf)
    gdf = apply_regional_defaults(gdf, region)
    valid, quarantine = split_quarantine(gdf, required=["highway"])
    return valid, quarantine
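The driver logs the coverage report but does not act on it. One way to turn the report into a gate is sketched below. `check_min_coverage` and the `minimums` mapping are hypothetical names, and the report is built by hand here in the shape `diagnose_tag_coverage` returns (indexed by key, with a `coverage_pct` column) so the example runs without an extract.

```python
import pandas as pd


def check_min_coverage(report: pd.DataFrame, minimums: dict[str, float]) -> list[str]:
    """Return the keys whose coverage_pct is below the required minimum.

    A key that is absent from the report counts as failing.
    """
    failing = []
    for key, minimum in minimums.items():
        if key not in report.index:
            failing.append(key)
        elif float(report.loc[key, "coverage_pct"]) < minimum:
            failing.append(key)
    return failing


# Hand-built stand-in for the diagnostic's output.
report = pd.DataFrame(
    {"present": [98, 40], "missing": [2, 60], "coverage_pct": [98.0, 40.0]},
    index=pd.Index(["highway", "surface"], name="key"),
)

# 95.0 mirrors the rough highway threshold this page uses for driving networks.
failing = check_min_coverage(report, {"highway": 95.0})
```

A driver could raise or skip the batch when the returned list is non-empty, which stops an extraction bug from reaching the fallback stage.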

Step-by-step walkthrough

_missing_mask defines “missing” once. Every other function depends on it, so the policy that a whitespace-only or "None" string counts as absent lives in exactly one place. Casting to the nullable "string" dtype first gives every column the same pd.NA semantics and lets the .str methods run even when the extract left a column with numeric or mixed values.
diagnose_tag_coverage measures before it mutates. Run it on the raw extract and log the result. If highway coverage on a driving network drops below ~95%, that is a survey gap or an extraction bug to investigate — not something to paper over with defaults.
resolve_missing_tags walks the chain in rank order. For each primary key it recomputes the still-missing mask after every donor, so a row is only ever filled by the highest-priority fallback that actually has a value. The order of the list in FALLBACK_RULES is the policy; reordering it changes results, which is why it is data, not control flow.
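apply_regional_defaults is deliberately separate. Defaults are the one place data is invented, so they are isolated, logged with a count, and keyed by region. oneway=no is safe to default on ordinary roads because its absence has a documented meaning in OSM. The OSM wiki also documents an implied oneway=yes on features such as junction=roundabout and highway=motorway, so resolve those rows before this step runs. maxspeed has no such documented absence, which is why it never appears here.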
split_quarantine refuses to guess. Any row still missing a required key after fallbacks and defaults is routed to a dead-letter frame that retains its raw payload and a reason string, so a reviewer can diagnose it without re-joining the source extract. This is the quarantine partition that Error Handling in Large OSM Extracts triages.
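The rank-order behaviour described in the walkthrough can be checked on a four-row toy frame. The sketch below is an illustration under stated assumptions: `missing_mask` stands in for `_missing_mask`, and `backfill_with_source` is a hypothetical variant of `resolve_missing_tags` that also records which column supplied each value (the module itself keeps no such provenance column).

```python
import pandas as pd


def missing_mask(col: pd.Series) -> pd.Series:
    """Stand-in for _missing_mask: null, blank, or sentinel counts as missing."""
    cleaned = col.astype("string").str.strip()
    return cleaned.isna() | cleaned.str.lower().isin(["", "nan", "none"])


def backfill_with_source(df: pd.DataFrame, primary: str, chain: list[str]) -> pd.DataFrame:
    """Fill `primary` from `chain` in rank order and record the donor used."""
    out = df.copy()
    still_missing = missing_mask(out[primary])
    source = pd.Series(primary, index=out.index, dtype="object")
    source[still_missing] = "unresolved"
    for donor in chain:
        if donor not in out.columns:
            continue
        # Only rows that are still missing AND have a usable donor value.
        fill_here = still_missing & ~missing_mask(out[donor])
        out.loc[fill_here, primary] = out.loc[fill_here, donor]
        source[fill_here] = donor
        still_missing &= ~fill_here
    out[f"{primary}_source"] = source
    return out


ways = pd.DataFrame({
    "maxspeed": ["50", None, "nan", None],
    "maxspeed:forward": [None, "30", "40", None],
    "maxspeed:backward": ["70", "60", None, None],
})
result = backfill_with_source(ways, "maxspeed", ["maxspeed:forward", "maxspeed:backward"])
```

The second row has both donors populated and takes the first-ranked one. Swapping the order of the chain changes that row's value, which is the sense in which the list order is the policy.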

For planetary or continental files that exceed RAM, drive the same functions over bounded slices rather than one monolithic frame, gating on psutil.virtual_memory().percent and flushing intermediate Parquet between chunks — the streaming and spill patterns are covered by Memory-Efficient Chunk Processing.
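A rough sketch of the gating idea follows. It slices a frame that is already in memory purely for illustration; a real out-of-core run would read each slice from disk instead. `iter_bounded_slices`, `memory_percent`, and the 80% ceiling are hypothetical choices, and the probe is injectable so the gate can be exercised without psutil installed.

```python
from collections.abc import Callable, Iterator

import pandas as pd


def memory_percent() -> float:
    """Current system memory use in percent; assumes psutil is installed."""
    import psutil  # imported lazily so the helper below works without it

    return psutil.virtual_memory().percent


def iter_bounded_slices(
    df: pd.DataFrame,
    chunk_rows: int,
    max_memory_pct: float = 80.0,
    probe: Callable[[], float] = memory_percent,
) -> Iterator[pd.DataFrame]:
    """Yield row slices of at most chunk_rows, checking memory before each."""
    if chunk_rows <= 0:
        raise ValueError("chunk_rows must be positive")
    for start in range(0, len(df), chunk_rows):
        used = probe()
        if used > max_memory_pct:
            # Stop rather than push the host into swap; the caller decides
            # whether to flush intermediate output and retry.
            raise MemoryError(f"memory at {used:.0f}% exceeds {max_memory_pct:.0f}% gate")
        yield df.iloc[start:start + chunk_rows]


frame = pd.DataFrame({"highway": ["residential"] * 5})
sizes = [len(part) for part in iter_bounded_slices(frame, 2, probe=lambda: 10.0)]
```

Each yielded slice can be passed through `resolve_missing_tags`, `apply_regional_defaults`, and `split_quarantine` unchanged, since none of them depends on rows outside the slice.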

Verification

Confirm the stage behaved before handing the result to a graph builder:

The coverage log shows present + missing == len(gdf) for every key, and coverage_pct for highway is near 100 on a network_type="driving" extract.
After resolve_missing_tags, re-running diagnose_tag_coverage on maxspeed shows coverage at least as high as before, and strictly higher wherever the maxspeed:forward/backward donors filled real gaps.
split_quarantine returns a valid frame with zero missing highway values: assert _missing_mask(valid["highway"]).sum() == 0.
The quarantine frame’s row count is small and stable batch-to-batch. A sudden spike usually points to a stale fallback table after a large import rather than a code bug, so check FALLBACK_RULES first.
Defaulted rows carry the region value: (apply_regional_defaults(g, "EU")["oneway"] == "no").sum() equals the pre-default missing count for oneway plus the rows that were already tagged oneway=no.
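The default-count check is easy to get wrong, because rows that already carried oneway=no also match the comparison. A small sketch of the arithmetic on a toy frame, with `missing_mask` standing in for `_missing_mask` and the defaulting step written inline instead of calling `apply_regional_defaults`:

```python
import pandas as pd


def missing_mask(col: pd.Series) -> pd.Series:
    """Stand-in for _missing_mask: null, blank, or sentinel counts as missing."""
    cleaned = col.astype("string").str.strip()
    return cleaned.isna() | cleaned.str.lower().isin(["", "nan", "none"])


# Two explicit values and three flavours of "missing".
ways = pd.DataFrame({"oneway": ["yes", None, "no", "", "nan"]})

missing_before = int(missing_mask(ways["oneway"]).sum())
already_no = int((ways["oneway"] == "no").sum())

# Inline equivalent of defaulting oneway to "no" on a copy.
defaulted = ways.copy()
defaulted.loc[missing_mask(defaulted["oneway"]), "oneway"] = "no"

no_after = int((defaulted["oneway"] == "no").sum())
# Rows equal to "no" afterwards = rows that were missing + rows already "no".
assert no_after == missing_before + already_no
# Nothing is left missing once the default has been applied.
assert int(missing_mask(defaulted["oneway"]).sum()) == 0
```

Capturing both counts before the default runs is what makes the check possible, so take them from the pre-default frame, not from the log line.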

Common errors and fixes

Error / symptom	Root cause	One-line fix
Every row reads as “present” despite blanks	`.notna()` alone misses `""` and `"nan"` strings	Use `_missing_mask`, which strips and matches the sentinel set
`KeyError` on a fallback key	The donor column is absent in this regional extract	Guard with `if fallback_key not in gdf.columns: continue`
Routing graph handles ways with no `oneway` tag inconsistently	`oneway` left null, so each builder applies its own implicit default	Apply `apply_regional_defaults` before graph conversion so the value is explicit
`maxspeed` filled with imperial numbers	Defaulted instead of cleaned/converted	Never default `maxspeed`; convert units in the cleaning stage
Quarantine count grows every run	Fallback table stale after an import	Audit recent changesets; add the new key variants to `FALLBACK_RULES`
`SettingWithCopyWarning` on `.loc` writes	Operating on a slice view	Call `.copy()` once at function entry (the snippet already does)
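The first row of the table is easy to reproduce. The sketch below counts "present" values on a toy Series two ways: with `.notna()` alone, and with the strip-and-sentinel test that `_missing_mask` applies. The sample values are made up for the illustration.

```python
import pandas as pd

# One real value, four strings that only look present, and one true null.
raw = pd.Series(["asphalt", "", "  ", "nan", "None", None], dtype="object")

# .notna() only catches the true null.
naive_present = int(raw.notna().sum())

# Strip, lowercase, and compare against the sentinel set.
cleaned = raw.astype("string").str.strip()
missing = cleaned.isna() | cleaned.str.lower().isin(["", "nan", "none"])
real_present = int((~missing).sum())
```

Here `naive_present` is 5 while `real_present` is 1, which is the gap that makes a sparse column look well populated.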

Spec reference

OSM places no schema constraint on which keys an element carries (any key may be absent), so “missing” is a pipeline concept, not a format error. The authoritative meaning of each key, and whether its absence is significant, is defined in the OpenStreetMap Map Features and Tags documentation; treat those pages as the source of truth for which defaults are legitimate. Sentinel detection in the snippet relies on pandas nullable string dtype semantics (str.strip, str.lower, isin) rather than regular expressions.
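One way to keep the defaults table honest against that documentation is to validate it at import time. This is a sketch under assumptions: `DEFAULTABLE_KEYS` and `validate_region_defaults` are hypothetical, and the allow-list contains only `oneway` because that is the only key this page treats as safely defaultable. Any other entry would need checking against the OSM documentation first.

```python
# Hypothetical allow-list: keys whose absence has a documented meaning.
DEFAULTABLE_KEYS = frozenset({"oneway"})


def validate_region_defaults(
    region_defaults: dict[str, dict[str, object]],
    allowed: frozenset[str] = DEFAULTABLE_KEYS,
) -> None:
    """Raise ValueError if any region defaults a key outside the allow-list."""
    offenders = sorted(
        (region, key)
        for region, defaults in region_defaults.items()
        for key in defaults
        if key not in allowed
    )
    if offenders:
        raise ValueError(f"defaults without a documented absence: {offenders}")


# Passes silently: both regions default only oneway.
validate_region_defaults({"EU": {"oneway": "no"}, "US": {"oneway": "no"}})
```

Calling it on `REGION_DEFAULTS` right after the table is defined turns an accidental `maxspeed` default into a startup error instead of a silent data change.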

Related

Batch Attribute Mapping Strategies: the mapping stage whose quarantine routing this page implements.
Value Standardization & Regex Cleaning: the cleaning that must precede missing-value detection.
Error Handling in Large OSM Extracts: triaging the dead-letter partition this stage emits.
Memory-Efficient Chunk Processing: streaming the same logic over extracts larger than RAM.
OSMnx Graph Conversion Techniques: where missing oneway/lanes silently corrupt topology if not backfilled first.
Tag Taxonomy & Key-Value Standards: the controlled vocabulary that decides which absences are meaningful.

This how-to belongs to the Batch Attribute Mapping Strategies guide — head back there for the full mapping stage, or up to Parsing & Tag Normalization Workflows for the broader pipeline.
