Value Standardization & Regex Cleaning

Value standardization is the stage where contributor-typed strings become machine-comparable values, and the failure it prevents is the kind that never raises an exception. Consider a single road surface tagged surface=Asphalt in one regional import, surface=asphalt in another, and surface=asphalt with an invisible zero-width space appended by a copy-paste from a wiki table in a third. To Python these are three distinct strings, so a group_by("surface") reports three categories instead of one, a paved/unpaved reclassification misses two of them, and a routing cost surface built downstream assigns different edge weights to identical pavement. The zero-width variant is undetectable to a human reviewer and survives schema validation untouched; it only surfaces as a quietly wrong isochrone or an inflated category count weeks later. This guide builds the deterministic cleaning layer that collapses those variants to one canonical value before any join, aggregation, or graph build can inherit the defect.

OpenStreetMap (OSM) data exhibits high semantic variance because of decentralized contribution, localized mapping conventions, and evolving community guidance. Within the broader architecture of Parsing & Tag Normalization Workflows, value standardization and regular-expression cleaning are the deterministic bridge between raw contributor input and production-ready geospatial assets. Mapping engineers, OSM contributors, GIS analysts, and Python ETL developers implement strict cleaning routines to resolve casing inconsistencies, strip non-printable control characters, and enforce controlled vocabularies before downstream spatial joins, routing calculations, or network analysis.

Prerequisite concepts

Three foundations should be in place before any cleaning rule runs. First, cleaning operates on the free-form key-value dictionary attached to each element, so the structure described in the Node-Way-Relation Data Model determines which values exist to clean — a way carries surface and maxspeed, a node carries different keys, and a relation carries others again. Second, value cleaning is strictly the step before mapping: this page produces trimmed, case-resolved strings, and Batch Attribute Mapping Strategies assumes those strings arrive clean so its registry lookups can be exact rather than fuzzy. Third, the canonical forms your rules emit should match the controlled vocabulary defined in Tag Taxonomy & Key-Value Standards; normalizing Asphalt to asphalt only helps if asphalt is the form the rest of the pipeline already targets.

Deterministic cleaning principles

Production spatial ETL requires idempotent transformations: applying the same cleaning sequence twice to identical input must yield byte-identical output. This requirement rules out operations whose result depends on the runtime environment. Locale-sensitive case folding is the classic example: Python 3's str.lower() applies fixed Unicode rules regardless of locale, but lowercasing delegated to another layer (a database LOWER() under a Turkish collation, or Java's String.toLowerCase() under a Turkish default locale) can map I to dotless ı, so one input yields two outputs depending on where the fold runs. Cleaning routines must also prioritize memory efficiency by leaning on precompiled patterns and vectorized operations rather than row-by-row evaluation, a constraint that becomes acute during Async PBF Parsing with Pyrosm, where bounded memory and strict serialization boundaries demand minimal intermediate object creation.

Error handling must be explicit and fail-safe. Malformed dictionaries, unexpected data types, and missing values should trigger controlled fallbacks rather than unhandled exceptions that terminate a planetary-scale run. Reproducibility is reinforced by documenting the cleaning sequence, versioning the controlled-vocabulary maps, and isolating transformation logic from I/O boundaries so the pure cleaning function can be unit-tested against fixed inputs.
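A fixed-input test of this kind might look like the sketch below. The `clean` function and its patterns are simplified stand-ins written for this example, not the pipeline's actual cleaner, and it trims the edges last so that nothing removed earlier can re-expose edge whitespace.

```python
from __future__ import annotations

import re

# Illustrative patterns; a real pipeline would import its own compiled set.
_CONTROL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")
_SPACE_RUN = re.compile(r"\s+")
_EDGES = re.compile(r"^[\s\u200b\u200c\u200d\ufeff]+|[\s\u200b\u200c\u200d\ufeff]+$")


def clean(value: object) -> str | None:
    """Pure function: no I/O, no locale, no clock. Non-strings become None."""
    if not isinstance(value, str):
        return None
    out = _CONTROL.sub("", value)
    out = _SPACE_RUN.sub(" ", out)
    out = _EDGES.sub("", out)  # trim last so earlier removals cannot leave edge whitespace
    return out or None


def is_idempotent(samples: list[object]) -> bool:
    """True when cleaning an already-cleaned value changes nothing for every sample."""
    return all(clean(clean(s)) == clean(s) for s in samples)


# Fixed inputs covering edge whitespace, zero-width suffixes, control chars, and non-strings.
FIXED_INPUTS = ["  Asphalt ", "asphalt\u200b", "a \x00", "two  words", "", None, 42]
print(is_idempotent(FIXED_INPUTS))
```

Because the function touches no I/O, the same list of inputs can be checked in CI on every change to the patterns or the vocabulary version.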

Specification & character-class reference

Cleaning is only as trustworthy as the character classes it names, so the ranges each pattern targets deserve to be pinned down as precisely as a binary format. OSM tag values are UTF-8 strings with no length or content schema, which means every category below can and does appear in real extracts.

Character class	Range / example	Why it appears	Cleaning action
ASCII control chars	`\x00`–`\x08`, `\x0b`, `\x0c`, `\x0e`–`\x1f`, `\x7f`	Pasted from spreadsheets, stray editor bytes	Remove entirely
Tab / newline in value	`\t`, `\n`, `\r`	Multi-line text fields, import artifacts	Collapse to single space
Zero-width characters	`\u200b` ZWSP, `\u200c` ZWNJ, `\u200d` ZWJ, `\ufeff` BOM	Copy-paste from rich text, RTL editing	Strip from both ends
Repeated whitespace	`"two  spaces"`	Hand entry, concatenation	Collapse to one space
Mixed casing	`Asphalt`, `ASPHALT`, `asphalt`	No casing convention enforced	Resolve via vocabulary map
Trailing unit suffix	`50 mph`, `30 km/h`	Locale habits	Route to unit parsing (not bare lowercase)

Two rules govern the whole stage. Anchor every boundary-sensitive pattern with ^ and $ so a partial match cannot corrupt a value — an unanchored numeric extractor will pull 50 out of 50 mph and silently treat it as km/h. And never strip a unit suffix by deleting it: a value carrying units belongs in a dedicated unit parser, because dropping the suffix fabricates a measurement system. For authoritative pattern semantics consult the official Python re documentation, and validate target vocabularies against the OSM Wiki Tagging Guidelines.
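As a rough illustration of the anchoring rule, the sketch below detects the unit explicitly instead of extracting a bare number. The function name, the pattern, and the accepted unit spellings are assumptions made for this example. It uses `fullmatch` rather than `^...$` because `$` also matches just before a trailing newline, and it treats a bare number as km/h, following the OSM default for maxspeed.

```python
from __future__ import annotations

import re

# Hypothetical pattern: the whole value must be "<number>" or "<number> <unit>".
_MAXSPEED = re.compile(r"(?P<num>\d+(?:\.\d+)?)(?:\s*(?P<unit>mph|km/h|kmh))?")
_MPH_TO_KMH = 1.609344  # exact: one international mile is 1609.344 m


def parse_maxspeed_kmh(value: str) -> float | None:
    """Return the speed in km/h, or None when the value is not a plain speed."""
    match = _MAXSPEED.fullmatch(value)  # fullmatch anchors both ends of the value
    if match is None:
        return None  # e.g. "signals" or "50 mph zone": leave the raw value for triage
    number = float(match["num"])
    if match["unit"] == "mph":
        return round(number * _MPH_TO_KMH, 1)
    return number  # bare numbers and km/h are taken as km/h (assumed default)


for sample in ["50 mph", "30 km/h", "50", "50 mph zone"]:
    print(repr(sample), "->", parse_maxspeed_kmh(sample))
```

A value that fails the whole-string match returns None here instead of a guessed number, which keeps the raw string available for the audit path.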

Implementation: regex compilation & vectorized cleaning

The foundation of a robust routine is precompiled pattern objects, explicit type validation, and chunk-aware processing. Compiling patterns once at module load avoids per-call overhead across millions of records, while pandas string methods push iteration below the Python interpreter. The routine below cleans a single value, then a DataFrame chunk, then streams an entire extract to Parquet without ever holding the file in memory.

python

from __future__ import annotations

import gc
import logging
import re
from typing import Any

import pandas as pd

logger = logging.getLogger(__name__)

# Precompile patterns once — recompiling per row dominates wall-clock time at scale.
# Zero-width chars masquerade as "empty" yet defeat equality joins and group_by.
_ZERO_WIDTH = "\u200b\u200c\u200d\ufeff"  # ZWSP, ZWNJ, ZWJ, BOM (escaped so they stay visible)
STRIP_PATTERN = re.compile(rf"^[\s{_ZERO_WIDTH}]+|[\s{_ZERO_WIDTH}]+$")
MULTI_SPACE_PATTERN = re.compile(r"\s+")
NON_PRINTABLE_PATTERN = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")

# Controlled vocabulary for deterministic case resolution (versioned alongside the pipeline).
CASE_NORMALIZATION_MAP: dict[str, dict[str, str]] = {
    "highway": {"Residential": "residential", "Primary": "primary", "Secondary": "secondary"},
    "surface": {"Asphalt": "asphalt", "Concrete": "concrete", "Gravel": "gravel"},
    "oneway": {"Yes": "yes", "No": "no", "True": "yes", "False": "no"},
}


def clean_tag_value(value: Any) -> str | None:
    """Sanitize a single tag value with explicit, fail-safe error handling."""
    if not isinstance(value, str):
        return None  # non-strings (None, ints) become a typed null, never a crash
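    cleaned = STRIP_PATTERN.sub("", value)
    cleaned = NON_PRINTABLE_PATTERN.sub("", cleaned)
    cleaned = MULTI_SPACE_PATTERN.sub(" ", cleaned)
    # Removing a control char can expose new edge whitespace ("a \x07" -> "a "),
    # so trim once more; without this a second pass would change the value.
    cleaned = STRIP_PATTERN.sub("", cleaned)
    return cleaned or None  # empty after cleaning is a null, not ""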


def normalize_osm_tags_chunk(
    chunk: pd.DataFrame,
    tag_column: str = "tags",
    vocab_map: dict[str, dict[str, str]] | None = None,
) -> pd.DataFrame:
    """Apply deterministic regex cleaning and vocabulary mapping to a DataFrame chunk."""
    if vocab_map is None:
        vocab_map = CASE_NORMALIZATION_MAP
    if tag_column not in chunk.columns:
        raise ValueError(f"Missing required column: {tag_column!r}")

    def _clean_tag_dict(x: Any) -> dict[str, str | None]:
        if not isinstance(x, dict):
            return {}
        return {k: clean_tag_value(v) for k, v in x.items()}

    cleaned_tags = chunk[tag_column].map(_clean_tag_dict)

    # Apply the controlled vocabulary via direct dict lookup — exact, not fuzzy.
    remapped = 0
    for tag_key, mapping in vocab_map.items():
        for tag_dict in cleaned_tags:
            current = tag_dict.get(tag_key)
            if current in mapping:
                tag_dict[tag_key] = mapping[current]
                remapped += 1

    logger.info("cleaned %d rows, remapped %d values", len(chunk), remapped)
    chunk = chunk.copy()
    chunk[tag_column] = cleaned_tags
    return chunk


def process_large_osm_extract(
    df_generator,
    output_path: str = "normalized_osm.parquet",
) -> None:
    """Memory-efficient pipeline for processing large OSM extracts in chunks.

    Parquet is not an append-mode format, so we use pyarrow's ParquetWriter to
    append row groups inside one file. All chunks must share a compatible schema.
    """
    import pyarrow as pa
    import pyarrow.parquet as pq

    writer: pq.ParquetWriter | None = None
    try:
        for chunk in df_generator:
            normalized = normalize_osm_tags_chunk(chunk)
            table = pa.Table.from_pandas(normalized, preserve_index=False)
            if writer is None:
                writer = pq.ParquetWriter(output_path, table.schema, compression="zstd")
            writer.write_table(table)
            del normalized, table
            gc.collect()  # reclaim per-chunk buffers before the next iteration
    except Exception:
        logger.exception("extract cleaning failed; partial output at %s", output_path)
        raise
    finally:
        if writer is not None:
            writer.close()

The numbered sequence below is what these functions execute end to end:

1. Validate the input type. clean_tag_value returns None for any non-string, so malformed dictionaries degrade to typed nulls instead of raising mid-chunk.
2. Strip the edges. STRIP_PATTERN removes leading and trailing whitespace and zero-width characters in one pass, eliminating the invisible-suffix defect that fractures joins.
3. Remove control characters. NON_PRINTABLE_PATTERN deletes the ASCII control range while deliberately preserving \t, \n, \r for the next step to collapse rather than drop.
4. Collapse internal whitespace. MULTI_SPACE_PATTERN reduces any run of whitespace, including the tabs and newlines just preserved, to a single space.
5. Resolve casing via vocabulary. Each value is looked up in the versioned map by exact key, so Asphalt becomes asphalt deterministically and unmapped values pass through for audit.
6. Stream to Parquet. process_large_osm_extract writes ZSTD-compressed row groups one chunk at a time, calling gc.collect() between chunks so resident memory stays flat at planetary scale.
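To make the per-value steps concrete, the sketch below traces one made-up value through them and records the result of each stage. The patterns are illustrative copies, `trace_cleaning` is a hypothetical helper, and the single vocabulary entry is an example and not part of any real map.

```python
import re

# Illustrative copies of the patterns; the vocabulary entry is a made-up example.
EDGES = re.compile(r"^[\s\u200b\u200c\u200d\ufeff]+|[\s\u200b\u200c\u200d\ufeff]+$")
CONTROL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")
SPACE_RUN = re.compile(r"\s+")
VOCAB = {"surface": {"Fine gravel": "fine_gravel"}}


def trace_cleaning(key: str, raw: str) -> list[tuple[str, str]]:
    """Return (stage name, value) pairs so each transformation is inspectable."""
    stages = [("raw", raw)]
    value = EDGES.sub("", raw)
    stages.append(("strip edges", value))
    value = CONTROL.sub("", value)  # \t, \n, \r are outside this class and survive
    stages.append(("remove control chars", value))
    value = SPACE_RUN.sub(" ", value)
    stages.append(("collapse whitespace", value))
    value = VOCAB.get(key, {}).get(value, value)  # unmapped values pass through
    stages.append(("resolve vocabulary", value))
    return stages


for stage, value in trace_cleaning("surface", "  Fine\x07 \t gravel\u200b"):
    print(f"{stage:22} {value!r}")
```

Printing with `!r` matters: the control character and the zero-width space are invisible in a plain print but show up as escapes in the repr.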

Validation & error-handling matrix

A cleaning stage is only trustworthy if it names the ways it fails and how each is caught. The matrix below is the minimum set of conditions a production routine should detect before any value is committed to an analytical store.

Failure condition	Root cause	Detection method	Remediation
Duplicate categories after group_by	Trailing zero-width char or whitespace	Compare cleaned vs raw distinct counts	Apply `STRIP_PATTERN`; assert stable cardinality
Environment-dependent case folding	Lowercasing delegated to a locale-sensitive layer (database `LOWER()`, JVM `toLowerCase()`)	Run cleaning under two locales, diff output	Use explicit vocabulary maps, not blanket lowercasing
Fabricated unit	Unanchored numeric regex matched `"50 mph"`	Anchored pattern returns the raw string instead	Route unit-bearing values to a dedicated parser
`TypeError` on non-string value	Tag value is `None` or numeric	`isinstance` guard returns `None`	Always type-check before regex; emit typed null
Empty string vs null ambiguity	Value reduces to `""` after stripping	`cleaned or None` collapses both to null	Treat absence as null, never `""`
Schema mismatch across chunks	A sparse chunk infers a different Parquet schema	`ParquetWriter` raises on `write_table`	Define schema explicitly from the first chunk
Silent pass-through of garbage	Value absent from vocabulary map	Log unmapped values; track audit count	Add canonical form to map; bump map version
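Two of the detection methods in the matrix, the distinct-count comparison and the unmapped-value audit, can be sketched with the standard library alone. The function names and the sample values are hypothetical; a production check would run against a real column and the versioned vocabulary.

```python
from __future__ import annotations

from collections import Counter


def cardinality_report(raw: list[str | None], cleaned: list[str | None]) -> dict[str, int]:
    """Compare distinct counts before and after cleaning."""
    return {"raw_distinct": len(set(raw)), "cleaned_distinct": len(set(cleaned))}


def unmapped_values(cleaned: list[str | None], canonical: set[str]) -> Counter:
    """Count cleaned, non-null values that the vocabulary does not recognise."""
    return Counter(v for v in cleaned if v is not None and v not in canonical)


# Made-up sample: three spellings of one surface plus a typo that must not pass silently.
raw = ["Asphalt", "asphalt", "asphalt\u200b", "gravle"]
cleaned = ["asphalt", "asphalt", "asphalt", "gravle"]

print(cardinality_report(raw, cleaned))
print(unmapped_values(cleaned, canonical={"asphalt", "gravel"}))
```

A drop in distinct count is the expected signature of successful cleaning; a non-empty audit counter is the signal to extend the map and bump its version.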

Performance & scale considerations

The dominant cost in cleaning is rarely the regex engine itself but how often patterns are compiled and how data is laid out when they run. Three factors govern throughput. First, compile every pattern once at module load: Python's re module caches recently compiled patterns, but a per-row re.compile or module-level re.sub call still pays a cache lookup on every row, and that overhead adds up across a multi-million-row chunk. Second, chunk size trades memory against scheduling overhead: chunks of roughly 1–5 million rows keep buffers in cache-friendly ranges while amortizing the fixed cost of map dispatch and Parquet row-group framing. Third, casting cleaned categorical columns such as surface to a pandas category dtype after normalization can shrink memory substantially when the column holds few distinct values relative to its row count, and it accelerates the downstream group_by that mapping and validation perform.

When memory rather than CPU is the binding constraint, prefer narrowing the chunk and streaming over widening parallelism: the patterns in Memory-Efficient Chunk Processing keep each worker’s resident set bounded, whereas each additional parallel worker holds its own copy of the in-flight chunk. The explicit gc.collect() between chunks is a backstop here. CPython's reference counting frees most of the small tag dictionaries as soon as the chunk is deleted, but objects caught in reference cycles wait for the cyclic collector, and forcing a collection at the chunk boundary keeps that garbage from carrying over into the next iteration.
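Narrowing the chunk is mostly a matter of how the record stream is sliced. A minimal sketch, assuming records arrive as an iterable of dictionaries (`iter_chunks` is a hypothetical helper, not part of pandas or Pyrosm):

```python
from __future__ import annotations

from collections.abc import Iterable, Iterator
from itertools import islice


def iter_chunks(records: Iterable[dict], size: int) -> Iterator[list[dict]]:
    """Yield lists of at most `size` records without materialising the whole stream."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    iterator = iter(records)
    # islice pulls only the next `size` items, so one chunk is resident at a time.
    while chunk := list(islice(iterator, size)):
        yield chunk


# Made-up stream of five elements, sliced into chunks of two.
stream = ({"id": i, "tags": {"surface": "asphalt"}} for i in range(5))
print([len(chunk) for chunk in iter_chunks(stream, size=2)])
```

Lowering `size` reduces peak memory at the cost of more per-chunk overhead, which is the trade-off described above.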

Failure modes & gotchas

Zero-width characters survive a naive .strip(). Python’s str.strip() removes whitespace but not U+200B (zero-width space), U+200C, U+200D, or the BOM (U+FEFF), so the invisible suffix persists and the join still fractures. The combined regex class is what catches them.
Blanket lowercasing is not a substitute for a vocabulary. Python 3's str.lower() is locale-independent, but the same fold pushed into a database or JVM layer can vary with locale, and no lowercase call can encode domain rules like True → yes. An explicit vocabulary map is both deterministic and expressive.
Unanchored patterns corrupt numeric fields. A pattern without ^/$ extracts a partial match from 50 mph and treats it as already-normalized, fabricating a unit. Anchor every value-shaping pattern.
Empty string and null are not the same. A value that cleans to "" should become a typed null, otherwise downstream null-rate assertions and is_not_null filters miscount it as present.
Parquet has no append mode. Re-opening a file per chunk corrupts it; use a single ParquetWriter and append row groups, deriving the schema from the first chunk so sparse later chunks cannot drift it.
Mutating the source value in place destroys traceability. Keep the raw tag alongside the cleaned form where contributor feedback or quality reporting may need to reverse-engineer the original string.
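The first gotcha is easy to reproduce. The snippet below is a small demonstration, with an illustrative pattern name, showing that `str.strip()` leaves a trailing zero-width space in place while a combined edge class removes it.

```python
import re

# Edge class covering Unicode whitespace plus ZWSP, ZWNJ, ZWJ, and the BOM.
ZERO_WIDTH_EDGES = re.compile(r"^[\s\u200b\u200c\u200d\ufeff]+|[\s\u200b\u200c\u200d\ufeff]+$")

raw = "asphalt\u200b"                  # trailing zero-width space, invisible in most editors
naive = raw.strip()                    # strip() only removes characters where isspace() is true
fixed = ZERO_WIDTH_EDGES.sub("", raw)

print(len(raw), len(naive), len(fixed))          # 8 8 7
print(naive == "asphalt", fixed == "asphalt")    # False True
print("\u200b".isspace(), "\ufeff".isspace())    # False False
```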

Integration points

Cleaned values feed directly into the mapping stage, where exact registry lookups depend on the casing and whitespace having already been resolved. The wiring below shows the handoff: a stream of raw chunks is cleaned, the cleaned tag struct is expanded into typed columns, and the result is handed to mapping for vocabulary resolution and routing-graph preparation via OSMnx Graph Conversion Techniques. Because cleaning is a pure function of its input, the same chunk can be replayed safely on retry, and any value the vocabulary map cannot resolve is preserved for triage shared with Error Handling in Large OSM Extracts.

python

from __future__ import annotations

import logging
from collections.abc import Iterator

import pandas as pd

logger = logging.getLogger(__name__)


def clean_then_expand(
    df_generator: Iterator[pd.DataFrame],
) -> Iterator[pd.DataFrame]:
    """Clean each chunk, then expand the tag struct into flat columns for mapping."""
    for chunk in df_generator:
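        # normalize_osm_tags_chunk comes from the implementation listing above.
        normalized = normalize_osm_tags_chunk(chunk)
        # Promote the cleaned dict into addressable columns the mapping stage expects,
        # keeping the original struct for the audit trail. Passing a plain list gives
        # `flat` a fresh 0..n-1 index that lines up with the reset index below.
        flat = pd.json_normalize(normalized["tags"].tolist()).add_prefix("tag_")
        out = pd.concat([normalized.reset_index(drop=True), flat], axis=1)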
        logger.debug("expanded %d cleaned rows into %d columns", len(out), out.shape[1])
        yield out

The companion guide on automating tag case normalization with pandas shows the fully vectorized form of the casing step, replacing the per-dict loop above with column-level replace over the pandas C backend for high-throughput pipelines.

In this section

The guide below goes deeper into the highest-throughput form of the casing step:

Automating tag case normalization with Pandas — vectorized, C-backed casing resolution that replaces the per-dictionary loop for bulk extracts.

Frequently Asked Questions

Why not just call str.strip() and str.lower() on every value?

str.strip() removes whitespace but leaves zero-width characters and the BOM in place, so the invisible-suffix defect that fractures joins survives. str.lower() cannot express domain rules such as mapping True to yes, and the equivalent fold in a database or JVM layer can vary with locale. A combined whitespace/zero-width regex plus an explicit, versioned vocabulary map is deterministic wherever it runs and able to encode the canonical forms your pipeline actually targets.

How do I keep cleaning idempotent?

Make the cleaning function pure — no locale-dependent operations, no random sampling, no time-based logic — so the same input always produces byte-identical output. Run it twice in a test and assert equality. Idempotency is what makes retries safe and lets a partial failure resume from the last committed chunk rather than restarting, which matters when a planetary extract takes hours to process.

Should an empty cleaned value become "" or null?

A typed null. A value that reduces to an empty string after stripping carries no information, and storing "" makes it count as present in is_not_null filters and null-rate assertions, biasing every downstream completeness metric. The cleaned or None idiom collapses both empty strings and falsy results to a single null representation so absence is unambiguous.

Where do unit-bearing values like "50 mph" belong?

Not in the lowercase/strip path. Stripping the unit suffix fabricates a measurement system, so values carrying units must go to a dedicated unit parser that converts to a canonical SI form. Use an anchored regex to detect the unit explicitly and convert; never let an unanchored pattern extract the bare number, because it will treat 50 mph as 50 km/h.

How large should each cleaning chunk be?

Roughly 1–5 million rows keeps pandas and Arrow buffers in cache-friendly ranges while amortizing the fixed cost of map dispatch and Parquet row-group framing. When memory is the binding constraint, narrow the chunk and stream rather than widening parallelism, since each parallel worker holds its own copy of the in-flight chunk, and call gc.collect() between chunks to reclaim the many small tag dictionaries each one allocates.

Related

Batch Attribute Mapping Strategies — the registry mapping stage that consumes these cleaned values.
Automating tag case normalization with Pandas — the vectorized form of the casing step.
Memory-Efficient Chunk Processing — streaming and spill-to-disk when memory bounds the cleaning stage.
Error Handling in Large OSM Extracts — triaging values the vocabulary map cannot resolve.
Tag Taxonomy & Key-Value Standards — the controlled vocabulary the cleaning output targets.
OSMnx Graph Conversion Techniques — turning cleaned, mapped attributes into a routing graph.

This guide is part of Parsing & Tag Normalization Workflows; return to that overview to follow the data through ingestion, normalization, error triage, and routing-graph conversion.
