How do I keep proper nouns and route references from being lowercased?

Mark those keys as preserve in the YAML rule set. The code also includes a regex guard that refuses to lowercase keys matching name, int_name, ref, website, wikidata, or source and logs an error instead of mutating them.

Why use vectorized .str accessors instead of apply()?

Pandas .str methods push iteration below the Python interpreter into C, while a row-wise apply() evaluates a Python callable per row. On a continental extract of millions of rows that difference is minutes versus hours, so vectorization is mandatory for ETL throughput.

How do I normalize an extract larger than RAM?

Process the file in chunks. Slice the source into fixed-size row windows, normalize each one, and append it through a single ParquetWriter so peak memory tracks one chunk rather than the whole dataset, keeping a sub-8 GB footprint on standard runners.

Why convert columns to category dtype after normalizing?

Once casing variants collapse to canonical forms, columns like highway and surface have very low cardinality. Casting them to category cuts memory 60-85% and lets pyarrow write dictionary-encoded Parquet, which preserves that efficiency on disk and on re-read.

Automating tag case normalization with Pandas

Collapse casing variants such as highway=Residential, Building=yes, and surface=Asphalt to their canonical lowercase form across a multi-gigabyte OSM extract in a single vectorized pandas pass — while leaving case-sensitive keys like ref, website, and name:en untouched.

Prerequisites

Confirm each item before running the code below; a skipped step is the usual reason a “normalized” frame still groups Asphalt and asphalt as two surfaces.

pandas ≥ 2.1.0 installed (pip install "pandas>=2.1") — the .str accessor behaviour and StringDtype semantics below assume the 2.x string backend.
pyyaml ≥ 6.0 (pip install "pyyaml>=6.0") for loading the declarative rule set.
pyarrow ≥ 14.0 installed, so categorical columns serialize to dictionary-encoded Parquet.
A tag-bearing DataFrame already extracted from PBF — produced upstream by Async PBF Parsing with Pyrosm — with one column per tag key.
A tag_normalization_rules.yaml file (template below) co-located with the script.
Python 3.10+ for the built-in generic (dict[str, str]) and X | None union annotations used here.
Optional: psutil if you want the adaptive chunk-resizing guard shown at the end.

Conceptual minimum

OpenStreetMap stores attributes as a free-form key-value map on every element, and nothing in the format enforces a casing convention — so the same real-world value arrives as Asphalt, ASPHALT, and asphalt from three different editors. Casing must therefore be resolved per key, not globally, because the correct strategy depends on what the key means: enumerated values defined in Tag Taxonomy & Key-Value Standards (highway, surface, amenity) are conventionally lowercase, whereas ref route numbers (A1, M25), website URLs, and name:* labels are case-sensitive and must be preserved verbatim. A blanket .str.lower() corrupts exactly the fields downstream joins and routing engines depend on.
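
As a small illustration of why casing is resolved per key, the sketch below lowercases a toy frame both ways. The frame and its values are invented for the example, and it assumes nothing beyond pandas itself.

```python
import pandas as pd

# Toy frame invented for illustration: one enum column, one case-sensitive column.
df = pd.DataFrame(
    {
        "surface": ["Asphalt", "ASPHALT", "asphalt"],
        "ref": ["A1", "M25", "A1"],
    }
).astype("string")

# Blanket approach: every column is lowercased, including the route references.
blanket = df.copy()
for col in blanket.columns:
    blanket[col] = blanket[col].str.lower()

# Per-key approach: only the enumerated key is lowercased; ref is left verbatim.
per_key = df.copy()
per_key["surface"] = per_key["surface"].str.lower()

print(blanket["ref"].tolist())        # route refs have lost their case
print(per_key["ref"].tolist())        # route refs unchanged
print(per_key["surface"].nunique())   # casing variants collapsed
```

Both frames end up with a single surface value, but only the per-key version still carries refs that a routing engine or join could match against their original form.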

This page is the dataframe-side counterpart to the streaming rewrite in Value Standardization & Regex Cleaning: it operates after parsing has already widened tags into columns, and it produces case-resolved strings that the registry lookups in Batch Attribute Mapping Strategies can then match exactly. Two requirements govern the implementation. First, the transform must be vectorized: pandas .str accessors push iteration below the Python interpreter, so choosing them over a row-wise .apply() is the difference between minutes and hours on a continental extract. Second, it must be declarative: the key→strategy mapping lives in YAML, version-controlled and editable without touching code, so adding a new lowercase key never risks an accidental mutation of a case-sensitive one.
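
To make the vectorized requirement concrete, here is a minimal sketch of the two styles on a hypothetical three-value column. It only shows that they produce the same result and how each treats missing values; the timing gap depends on data size and hardware, so no numbers are claimed here.

```python
import pandas as pd

# Hypothetical sample column with a missing value, as OSM tag columns usually have.
s = pd.Series(["Residential", None, "PRIMARY"], dtype="string")

# Vectorized: one call over the whole column; missing values stay missing.
vectorized = s.str.lower()

# Row-wise: a Python function runs once per element and must handle <NA> itself.
row_wise = s.apply(lambda v: v.lower() if pd.notna(v) else v).astype("string")

print(vectorized.tolist())
print(row_wise.tolist())
```

The row-wise version needs its own missing-value branch, because calling .lower() on a missing cell would raise; the .str accessor propagates the missing value without extra code.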

Runnable solution

This module loads a YAML rule set, applies the correct casing strategy to each named column using vectorized string operations and boolean masking, and casts the resulting low-cardinality columns to category before returning. It targets pandas>=2.1.0, pyyaml>=6.0, and Python 3.10+.

python

from __future__ import annotations

import logging
import re
from pathlib import Path
from typing import Any

import pandas as pd
import yaml

logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")
logger = logging.getLogger("osm.tag_case_normalizer")

# Enforce Copy-on-Write semantics (default in pandas 3.0; opt-in for 2.x).
pd.options.mode.copy_on_write = True

# Load the declarative rule set once at module import.
CONFIG_PATH = Path("tag_normalization_rules.yaml")
NORM_RULES: dict[str, Any] = yaml.safe_load(CONFIG_PATH.read_text(encoding="utf-8"))

# Precompiled patterns for the regex_clean strategy.
WHITESPACE_RE = re.compile(r"\s+")

# Keys that must never be lowercased regardless of the rule file, as a safety net.
PRESERVE_GUARD = re.compile(r"^(name(:[a-z]{2,3})?|int_name|ref|website|wikidata|source)$")


def normalize_osm_tags(df: pd.DataFrame) -> pd.DataFrame:
    """Vectorized, per-column case normalization for OSM tag columns.

    Each rule names a strategy: ``lowercase``, ``titlecase``, ``regex_clean``,
    or ``preserve``. Only columns listed in the rule set are touched, so
    geometry and metadata columns pass through untouched.
    """
    df = df.copy()
    rules: dict[str, str] = NORM_RULES.get("rules", {})
    target_cols = [c for c in rules if c in df.columns]
    if not target_cols:
        logger.warning("no rule columns present in frame; nothing to normalize")
        return df

    # Nullable string dtype avoids object-array memory bloat and keeps <NA> distinct from "".
    df[target_cols] = df[target_cols].astype("string")

    for col, strategy in rules.items():
        if col not in df.columns:
            continue
        # regex_clean also lowercases, so the guard has to cover both strategies.
        if strategy in {"lowercase", "regex_clean"} and PRESERVE_GUARD.match(col):
            logger.error("refusing to lowercase case-sensitive key %r; treating as preserve", col)
            continue

        mask = df[col].notna()
        if not mask.any():
            continue

        if strategy == "lowercase":
            df.loc[mask, col] = df.loc[mask, col].str.lower()
        elif strategy == "titlecase":
            df.loc[mask, col] = df.loc[mask, col].str.title()
        elif strategy == "regex_clean":
            df.loc[mask, col] = (
                df.loc[mask, col]
                .str.strip()
                .str.replace(WHITESPACE_RE, " ", regex=True)
                .str.lower()
            )
        elif strategy == "preserve":
            continue
        else:
            logger.warning("unknown strategy %r for column %r; skipping", strategy, col)

    # Cast low-entropy enums to category for a 60-85% memory reduction.
    # Guarded keys are skipped so a misconfigured rule is treated as preserve throughout.
    for col in target_cols:
        if rules[col] in {"lowercase", "regex_clean"} and not PRESERVE_GUARD.match(col):
            df[col] = df[col].astype("category")

    return df


def stream_normalize(src: Path, dst: Path, chunksize: int = 500_000) -> None:
    """Normalize an extract chunk-by-chunk and append to a single Parquet file."""
    import pyarrow as pa
    import pyarrow.parquet as pq

    writer: pq.ParquetWriter | None = None
    rows = 0
    try:
        # iter_batches reads the file incrementally, so only one window is resident
        # at a time; pd.read_parquet would load the whole file before any slicing.
        for batch in pq.ParquetFile(src).iter_batches(batch_size=chunksize):
            out = normalize_osm_tags(batch.to_pandas())
            table = pa.Table.from_pandas(out, preserve_index=False)
            if writer is None:
                writer = pq.ParquetWriter(dst, table.schema, use_dictionary=True)
            writer.write_table(table)
            rows += len(out)
            logger.info("normalized %d rows (cumulative)", rows)
    finally:
        if writer is not None:
            writer.close()  # flush the final row group


if __name__ == "__main__":
    stream_normalize(Path("tags-raw.parquet"), Path("tags-normalized.parquet"))
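
The prerequisites mention an optional psutil-based guard for resizing chunks. This section does not show it, so the following is only one possible shape for such a guard: the function name pick_chunksize, the bytes_per_row estimate, and the 25% memory budget are all assumptions made for illustration. It takes the available memory as a plain integer so it can be exercised without psutil installed.

```python
def pick_chunksize(
    available_bytes: int,
    bytes_per_row: int = 512,        # assumed average row footprint; measure your own frame
    budget_fraction: float = 0.25,   # assumed share of free memory one chunk may use
    floor: int = 50_000,
    ceiling: int = 500_000,
) -> int:
    """Return a rows-per-chunk value that fits the budget, clamped to [floor, ceiling]."""
    if available_bytes <= 0 or bytes_per_row <= 0:
        return floor
    rows = int(available_bytes * budget_fraction) // bytes_per_row
    return max(floor, min(ceiling, rows))


# With psutil installed, the live figure would come from:
#   import psutil
#   available = psutil.virtual_memory().available
available = 256 * 1024**2  # pretend 256 MiB is free
print(pick_chunksize(available))
```

The returned value could be passed as the chunksize argument of stream_normalize, recomputed before each run rather than hard-coded.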

An example tag_normalization_rules.yaml that matches the pipeline:

yaml

rules:
  highway: lowercase
  surface: lowercase
  building: lowercase
  amenity: lowercase
  oneway: lowercase
  operator: titlecase
  description: regex_clean
  name: preserve        # Free-text label; never alter case
  ref: preserve         # Route references stay upper-case ("A1", "M25")
  website: preserve     # URLs are case-sensitive on many servers
  "name:en": preserve

Step-by-step walkthrough

Copy-on-Write up front: pd.options.mode.copy_on_write = True makes the .loc[mask, col] = ... writes predictable across pandas 2.x and 3.0, eliminating SettingWithCopyWarning and the silent no-op assignments it warns about.
Rules load once: the YAML is read at import, so the key→strategy map is a single source of truth that can be edited without touching the logic. target_cols intersects the rule keys with the frame’s actual columns, so geometry and metadata never get mutated.
Nullable StringDtype: casting target columns to "string" keeps <NA> distinct from the empty string and avoids the per-object overhead of the default object dtype, which matters across millions of rows.
Boolean masking, not apply: mask = df[col].notna() restricts each vectorized .str call to non-null cells, so .str.lower() / .str.title() run in C rather than row-by-row in Python.
The preserve guard: PRESERVE_GUARD is a defence-in-depth check. Even if a rule file mistakenly assigns lowercase to ref or a name:* key, the code refuses and logs an error instead of corrupting case-sensitive data.
regex_clean composition: strip, collapse internal whitespace to a single space via the precompiled WHITESPACE_RE, then lowercase, all chained on the .str accessor so the intermediate Series are never materialized as Python lists.
Categorical downcast: only the lowercase/regex_clean enums (low cardinality after normalization) are cast to category, cutting memory 60-85% depending on tag entropy and feeding dictionary-encoded Parquet.
Chunked streaming: stream_normalize slices the source into chunksize row windows and appends each normalized chunk through a single ParquetWriter, so peak memory tracks one chunk rather than the whole planet. This is the same memory discipline detailed in Memory-Efficient Chunk Processing.
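
The memory effect of the categorical downcast can be measured directly. The sketch below uses a synthetic column invented for the demo; the saving it prints depends on the pandas version, the string storage backend, and the real cardinality of the data, so it may well land outside the 60-85% range quoted above.

```python
import pandas as pd

# Synthetic column: 300,000 rows drawn from three canonical values (made up for the demo).
surface = pd.Series(["asphalt", "gravel", "paved"] * 100_000, dtype="string")

# deep=True counts the string payloads, not just the pointers to them.
as_string = surface.memory_usage(deep=True)
as_category = surface.astype("category").memory_usage(deep=True)

saving = 1 - as_category / as_string
print(f"string: {as_string:,} bytes, category: {as_category:,} bytes, saving: {saving:.0%}")
```

Running the same measurement on a real normalized column is a quick way to decide whether the cast is worth applying to a given key.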

Verification

Confirm the normalization is correct before handing the frame downstream:

Count the distinct surfaces. df["surface"].nunique() should drop after normalization; if Asphalt and asphalt still both appear, the rule for surface did not load.
Prove preservation. Assert that df.loc[df["ref"].notna(), "ref"].str.isupper().any() is still True — upper-case route refs must survive.
Check the log line. A refusing to lowercase case-sensitive key error means a rule file mistakenly targeted a protected key; fix the YAML, not the data.
Confirm dtype. df["highway"].dtype should report category, and df["name"].dtype should remain string.
Round-trip Parquet. Re-read tags-normalized.parquet and run the normalizer again; the second pass must return a frame equal to its input, proving the transform is idempotent.
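
Several of the checks above can be bundled into assertions. This is a sketch under assumptions: verify_normalized is a hypothetical helper, the column names highway, surface, and ref are taken from the example rule file, and the sample frame is hand-built to stand in for real normalizer output.

```python
import pandas as pd


def verify_normalized(df: pd.DataFrame) -> None:
    """Raise AssertionError if a normalized frame violates the checks listed above."""
    # Enum columns must be category dtype and hold no value that differs from its lowercase form.
    for col in ("highway", "surface"):
        if col in df.columns:
            values = df[col].dropna().astype("string")
            assert (values == values.str.lower()).all(), f"{col} still has mixed case"
            assert isinstance(df[col].dtype, pd.CategoricalDtype), f"{col} is not category"
    # Case-sensitive route references must survive untouched.
    if "ref" in df.columns and df["ref"].notna().any():
        refs = df["ref"].dropna().astype("string")
        assert refs.str.isupper().any(), "upper-case refs were lost"


# Tiny hand-built frame standing in for real normalizer output.
sample = pd.DataFrame(
    {
        "highway": pd.Series(["residential", "primary"], dtype="category"),
        "surface": pd.Series(["asphalt", None], dtype="category"),
        "ref": pd.Series(["A1", None], dtype="string"),
    }
)
verify_normalized(sample)
print("all checks passed")
```

A helper like this could run on the first chunk of each job, so a rule file that failed to load is caught before the full extract is written.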

Common errors and fixes

Symptom	Root cause	One-line fix
`ref` values lowercased to `a1`	Rule file set `ref: lowercase`	Set `ref: preserve`; the guard also blocks this and logs it.
`SettingWithCopyWarning`	Copy-on-Write disabled on pandas 2.x	Add `pd.options.mode.copy_on_write = True` before edits.
`AttributeError: Can only use .str accessor with string values`	Column still object/float dtype	Cast targets with `.astype("string")` before `.str` calls.
`<NA>` became the literal string `"<NA>"`	Column cast with `.astype(str)`, which stringifies missing values	Cast with `.astype("string")` so `<NA>` stays missing.
Memory climbs to OOM on a planet file	Whole frame read before normalizing	Use `stream_normalize`; process one `chunksize` window at a time.
Categorical column rejected by Parquet	Mixed `<NA>` and category on old pyarrow	Upgrade `pyarrow>=14` or downcast after, not before, write.
`oneway` graph edges flipped	Casing of `Yes`/`-1` not normalized before graph build	Lowercase `oneway` here, before OSMnx graph conversion.

For extracts dirty enough that casing is the least of the problems, hand malformed rows to the quarantine path in Error Handling in Large OSM Extracts before this stage rather than letting .astype("string") raise.

Specification reference

OpenStreetMap tag values are free-form UTF-8 strings with no enforced casing; canonical lowercase enumeration is a community convention documented per key on the OSM Wiki — see Map features for the expected values and Key:ref for why reference values keep their original case. For the exact semantics of the patterns used in regex_clean, consult the official Python re documentation.

Related

Value Standardization & Regex Cleaning: the parent stage covering whitespace, control-character, and vocabulary cleaning this casing pass complements.
Tag Taxonomy & Key-Value Standards: the canonical key-value reference that decides which keys are lowercase enums versus case-sensitive.
Batch Attribute Mapping Strategies: exact-match registry lookups that assume case-resolved input.
Memory-Efficient Chunk Processing: the chunk-and-stream discipline behind stream_normalize.
Best Practices for OSM Tag Standardization Across Regions: a streaming pyosmium approach to the same casing variance.
OSMnx Graph Conversion Techniques: the routing stage that breaks on un-normalized oneway and maxspeed casing.
