Extracting metadata from OSM planet files

Stream the per-element provenance fields — uid, user, timestamp, version, changeset, and visible — out of a multi-gigabyte OSM PBF planet file into a flat table, without loading the file into memory and without resolving a single coordinate.

Prerequisites

Confirm each item before running the code below. A missing version pin typically surfaces as an API or parse error, and a coordinate index switched on by mistake is the usual cause of a runaway memory spike of roughly 60 GB on a planet-scale file.

pyosmium ≥ 3.6.0 installed (pip install "osmium>=3.6") — it bundles libosmium and resolves PBF delta encoding internally.
Python 3.10+ (the list[...] builtin generics used in the annotations need at least 3.9).
osmium-tool available on PATH (apt install osmium-tool) for the osmium fileinfo pre-flight integrity check.
A source extract on local disk: planet-latest.osm.pbf for the full archive or a regional .osm.pbf. Use a history file (.osh.pbf) if the visible field needs to be meaningful.
pyarrow ≥ 14.0.0 installed only if you intend to emit Parquet instead of CSV (pip install "pyarrow>=14").
Enough free disk for the output: a flat metadata table for the full planet is tens of gigabytes as CSV, roughly a third of that as Parquet.

Conceptual minimum

OpenStreetMap metadata is not spatial data: it is the provenance layer that records who edited each primitive, when, and in which changeset. These fields drive attribution tracking, contributor analytics, and licensing compliance, and they sit beside the geometry rather than inside it. In the PBF wire format covered by the PBF File Structure Deep Dive, each PrimitiveBlock carries metadata in an optional layer: dense nodes pack it into a single DenseInfo message of parallel arrays, while ways and relations carry a per-element Info message. Inside DenseInfo, timestamp, changeset, uid, and user_sid are delta-encoded against the preceding node (version is stored as-is); Info stores plain values. In both cases the user name is an index (user_sid) into the block's shared StringTable, timestamps are counted in units of the block's date_granularity, and the block is usually zlib-compressed, which is exactly why a naive byte scan or regex over the binary payload returns garbage.
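
To make the encoding concrete, here is a minimal sketch of what a decoder undoes for dense nodes. It assumes the varints have already been unpacked into Python integers and that the StringTable has been decoded to text; the function name and the sample values are hypothetical, and real files store the strings as bytes.

```python
from itertools import accumulate


def decode_dense_info(string_table, versions, ts_deltas, cs_deltas,
                      uid_deltas, sid_deltas, date_granularity_ms=1000):
    """Rebuild per-node metadata from DenseInfo-style parallel arrays.

    version is stored as-is; the other four arrays hold the difference
    from the previous node, so a running sum restores the real values.
    """
    timestamps = accumulate(ts_deltas)
    changesets = accumulate(cs_deltas)
    uids = accumulate(uid_deltas)
    sids = accumulate(sid_deltas)
    rows = []
    for version, ts, cs, uid, sid in zip(versions, timestamps, changesets, uids, sids):
        rows.append({
            "version": version,
            # timestamps are counted in units of date_granularity (milliseconds)
            "timestamp_s": ts * date_granularity_ms // 1000,
            "changeset": cs,
            "uid": uid,
            "user": string_table[sid],  # index into the block's StringTable
        })
    return rows


# Hypothetical block: index 0 of the StringTable is the empty string.
table = ["", "alice", "bob"]
rows = decode_dense_info(
    table,
    versions=[3, 1, 2],
    ts_deltas=[1_700_000_000, 60, -30],
    cs_deltas=[500, 0, 7],
    uid_deltas=[42, 0, -35],
    sid_deltas=[1, 0, 1],
)
for row in rows:
    print(row)
```

The third node ends up with uid 7 and user "bob" even though neither value appears anywhere in the raw arrays, which is the reason pattern-matching the payload cannot work.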

You do not have to decode any of that by hand. pyosmium materializes the delta chains and resolves StringTable offsets for you, handing each callback a fully reconstructed object. The fields map directly onto the three element types of the Node-Way-Relation data model, so the handler below simply reads attributes off whatever object it is given. The one decision that governs memory is whether to build a coordinate location index: metadata extraction never needs geometry, so you switch that index off and the parser stays under a couple of gigabytes even on the full planet. When the metadata layer is absent — anonymized or stripped extracts, redacted history — uid is 0 and user is the empty string, and your code must treat that as a first-class case rather than a bug.

Runnable solution

The handler streams every primitive through pyosmium, captures the six provenance fields plus the element id and type, normalizes timestamps to ISO 8601 UTC, maps anonymized edits to a sentinel, and writes batched rows to CSV so memory stays flat regardless of file size.

```python
import csv
import logging
import sys

import osmium

logger = logging.getLogger("osm.metadata")

CSV_HEADER = ["id", "type", "uid", "user", "timestamp", "version", "changeset", "visible"]


class MetadataExtractor(osmium.SimpleHandler):
    """Stream provenance metadata from an OSM PBF/XML extract into a CSV file.

    pyosmium resolves PBF delta encoding and StringTable offsets internally,
    so the handler only captures the materialised attributes for each
    primitive. Run apply_file with locations=False so no coordinate index is
    built: metadata needs no geometry, and the index is what would otherwise
    blow memory on a planet-scale file.
    """

    def __init__(self, csv_file, batch_size: int = 150_000):
        super().__init__()
        self.writer = csv.writer(csv_file)
        self.writer.writerow(CSV_HEADER)
        self._file = csv_file
        self.batch_size = batch_size
        self.buffer: list[list] = []
        self.processed = 0

    def _flush_buffer(self) -> None:
        """Write the buffered rows to disk and empty the buffer."""
        if not self.buffer:
            return
        self.writer.writerows(self.buffer)
        self._file.flush()
        self.buffer.clear()

    def _extract_meta(self, obj_type: str, obj_id: int, obj) -> None:
        """Copy the provenance fields of one primitive into the row buffer."""
        ts = obj.timestamp
        uid = obj.uid
        # uid == 0 is the canonical signal for an anonymized / redacted edit.
        user = obj.user if uid != 0 else "anonymous"
        self.buffer.append([
            obj_id,
            obj_type,
            uid,
            user,
            ts.isoformat() if ts is not None else "",   # ISO 8601, UTC
            obj.version,
            obj.changeset,
            obj.visible,
        ])
        self.processed += 1
        if len(self.buffer) >= self.batch_size:
            self._flush_buffer()

    def node(self, n) -> None:
        self._extract_meta("node", n.id, n)

    def way(self, w) -> None:
        self._extract_meta("way", w.id, w)

    def relation(self, r) -> None:
        self._extract_meta("relation", r.id, r)


def extract(input_pbf: str, output_csv: str) -> int:
    """Extract metadata from input_pbf into output_csv; return primitive count."""
    # newline="" lets the csv module control line endings itself.
    with open(output_csv, "w", encoding="utf-8", newline="") as fh:
        handler = MetadataExtractor(fh)
        # locations=False => no location index is created, saving tens of GB.
        handler.apply_file(input_pbf, locations=False)
        handler._flush_buffer()  # write the final partial batch
    logger.info("Extraction complete: %d primitives -> %s", handler.processed, output_csv)
    return handler.processed


if __name__ == "__main__":
    # Requires: pyosmium>=3.6.0, Python 3.10+
    # Usage: python extract_osm_meta.py planet-latest.osm.pbf osm_metadata.csv
    logging.basicConfig(level=logging.INFO)
    extract(sys.argv[1], sys.argv[2])
```

Step-by-step walkthrough

SimpleHandler subclass: pyosmium calls node, way, and relation once per primitive as it streams the file. Each callback sees one element and copies out plain values, so memory is bounded by the batch size rather than by the file size.
locations=False: passed to apply_file, this keeps the coordinate location cache switched off. It is already pyosmium's default, but stating it explicitly protects against a copy-pasted locations=True. Metadata carries no geometry, so skipping the index keeps peak memory at roughly 1–2 GB on the full planet instead of tens of gigabytes.
_extract_meta capture: each callback forwards (type, id, obj) to one shared method that reads the six provenance fields directly off the materialised object; the delta decoding and StringTable lookup already happened inside libosmium.
Anonymized fallback: uid == 0 is the canonical marker for a redacted or anonymous edit, so user is replaced with the "anonymous" sentinel rather than emitting an empty cell that downstream joins would mishandle.
Timestamp normalization: obj.timestamp is a timezone-aware datetime; isoformat() yields an unambiguous UTC ISO 8601 string, and the guard turns a None timestamp, should a stripped extract produce one, into an empty field instead of raising.
Batched writes: rows accumulate in self.buffer and flush every batch_size primitives, amortizing I/O while keeping the live buffer bounded; the final partial batch is flushed after apply_file returns.
visible semantics: the field is always True in regular planet files (deleted elements are absent), and only varies in historical .osh.pbf files where deletions are recorded as visible=False.
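
The row-building and batching steps can be exercised without pyosmium or a PBF file. The sketch below re-implements them in isolation against stand-in objects; meta_row, BatchedWriter, and the sample values are hypothetical names used only for illustration, and SimpleNamespace merely imitates the attributes a callback would see.

```python
import csv
import io
from datetime import datetime, timezone
from types import SimpleNamespace


def meta_row(obj_type, obj):
    """Build one output row using the same field handling as the walkthrough."""
    user = obj.user if obj.uid != 0 else "anonymous"
    ts = obj.timestamp.isoformat() if obj.timestamp is not None else ""
    return [obj.id, obj_type, obj.uid, user, ts,
            obj.version, obj.changeset, obj.visible]


class BatchedWriter:
    """Buffer rows and write them out in batches of batch_size."""

    def __init__(self, fh, batch_size):
        self.writer = csv.writer(fh)
        self.batch_size = batch_size
        self.buffer = []
        self.flushes = 0

    def add(self, row):
        self.buffer.append(row)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.writer.writerows(self.buffer)
            self.buffer.clear()
            self.flushes += 1


# Stand-in objects with made-up values, shaped like what a callback receives.
when = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
named = SimpleNamespace(id=1, uid=42, user="alice", timestamp=when,
                        version=3, changeset=500, visible=True)
anon = SimpleNamespace(id=2, uid=0, user="", timestamp=None,
                       version=1, changeset=0, visible=True)

out = io.StringIO()
sink = BatchedWriter(out, batch_size=2)
for obj in (named, anon, named):
    sink.add(meta_row("node", obj))
sink.flush()  # final partial batch, as after apply_file returns
print(out.getvalue())
```

With a batch size of 2 and three rows, the writer flushes once when the buffer fills and once more for the leftover row, which is the same pattern the handler follows at planet scale.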

Verification

Confirm the output is correct before feeding it into an attribution or analytics pipeline:

Row count. The logged processed count must equal the sum of nodes, ways, and relations reported by osmium fileinfo -e planet-latest.osm.pbf — a shortfall means callbacks were silently skipped.
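Header and arity. Every output row has exactly eight columns. csv.writer emits None as an empty cell rather than dropping it, so a ragged row points to a truncated write or to a reader that ignores CSV quoting (user names can contain commas and quotes).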
Anonymized rows. Spot-check that every row with uid equal to 0 carries user equal to anonymous, and that no non-zero uid maps to an empty user string.
Timestamp window per changeset. Within a single changeset id, timestamps should fall inside that changeset's open/close window (available from the /api/0.6/changeset/{id} endpoint); gross outliers indicate a parse misalignment.
Version sanity. For a current planet file every primitive has version >= 1. A version of 0 usually means the element carried no metadata at all, as in a fully stripped extract; confirm that against the source before suspecting a misread.
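
Several of these checks can be scripted against the output file. The following is a minimal sketch under the assumption that the CSV has the eight-column layout produced above; the verify function and the sample rows are hypothetical, and the changeset-window check is left out because it needs external changeset data.

```python
import csv
import io

EXPECTED_COLUMNS = 8


def verify(csv_file, expected_rows=None):
    """Return a list of problems found in a metadata CSV; empty means clean."""
    problems = []
    reader = csv.reader(csv_file)
    header = next(reader, None)
    if header is None or len(header) != EXPECTED_COLUMNS:
        return ["missing or malformed header"]
    count = 0
    for record_no, row in enumerate(reader, start=1):
        count += 1
        if len(row) != EXPECTED_COLUMNS:
            problems.append(f"record {record_no}: {len(row)} columns")
            continue
        _id, _type, uid, user, _ts, version, _cs, _visible = row
        if uid == "0" and user != "anonymous":
            problems.append(f"record {record_no}: uid 0 without sentinel")
        if uid != "0" and user == "":
            problems.append(f"record {record_no}: uid {uid} has empty user")
        if not version.isdigit() or int(version) < 1:
            problems.append(f"record {record_no}: suspicious version {version!r}")
    if expected_rows is not None and count != expected_rows:
        problems.append(f"row count {count} != expected {expected_rows}")
    return problems


# Made-up sample: record 2 lacks the sentinel, record 3 has version 0.
sample = io.StringIO(
    "id,type,uid,user,timestamp,version,changeset,visible\r\n"
    "1,node,42,alice,2024-05-01T12:00:00+00:00,3,500,True\r\n"
    "2,node,0,,2024-05-01T12:00:00+00:00,1,501,True\r\n"
    "3,way,7,bob,2024-05-01T12:00:00+00:00,0,502,True\r\n"
)
issues = verify(sample, expected_rows=3)
for issue in issues:
    print(issue)
```

Pass the total reported by osmium fileinfo as expected_rows to cover the row-count check; on a planet-sized table you would run this over a sample or stream it once rather than collecting every problem in memory.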

Common errors and fixes

| Symptom | Root cause | One-line fix |
| --- | --- | --- |
| Memory climbs to tens of GB and the process is OOM-killed | Coordinate location index enabled (`locations=True`) | Pass `locations=False` (the default) to `apply_file`. |
| `RuntimeError` partway through the parse | Corrupted `Blob` or truncated `PrimitiveBlock` | Run `osmium fileinfo -e file.pbf` first; re-download if it errors. |
| All `user` cells empty, `uid` all `0` | Anonymized or metadata-stripped extract | Expected; map to the `"anonymous"` sentinel as shown. |
| `visible` is `True` for every row | Reading a regular planet file, not a history file | Use a `.osh.pbf` history extract if you need deletions. |
| `AttributeError` on `obj.timestamp` | `pyosmium` release predating the 3.x API | Upgrade to `osmium>=3.6.0`. |
| Garbage values from a hand-rolled regex parser | Metadata is stored as delta-encoded varints and `StringTable` indexes inside compressed blobs, not plain text | Decode through `pyosmium`, never scan the raw bytes. |

For very large deployments, swap the CSV writer for an Apache Parquet writer via pyarrow: columnar storage cuts the on-disk footprint by roughly 60–70% versus CSV and lets downstream contributor analytics use column pruning and predicate pushdown. Tag-based attribution (source=*, attribution=*, license=*) is a separate concern: those values live in element tags rather than in the metadata layer, so handle them with the conventions in the OSM XML vs PBF Comparison and cross-reference the OSM API /api/0.6/changeset/{id} endpoint for changeset-level details.
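
One way the Parquet swap could look is sketched below. It is an illustration rather than a drop-in replacement: rows_to_columns and write_parquet are hypothetical helpers, the schema types are a reasonable guess for these fields, the timestamp is kept as ISO 8601 text to match the CSV, and zstd compression is an assumption you may want to change.

```python
COLUMNS = ["id", "type", "uid", "user", "timestamp", "version", "changeset", "visible"]


def rows_to_columns(rows, names):
    """Transpose a batch of row lists into a {column: values} mapping."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def write_parquet(batches, path):
    """Write an iterable of row batches to a single Parquet file.

    pyarrow is imported lazily so the CSV path keeps working without it.
    """
    import pyarrow as pa
    import pyarrow.parquet as pq

    schema = pa.schema([
        ("id", pa.int64()),
        ("type", pa.string()),
        ("uid", pa.int64()),
        ("user", pa.string()),
        ("timestamp", pa.string()),   # ISO 8601 text, same as the CSV column
        ("version", pa.int32()),
        ("changeset", pa.int64()),
        ("visible", pa.bool_()),
    ])
    with pq.ParquetWriter(path, schema, compression="zstd") as writer:
        for rows in batches:
            columns = rows_to_columns(rows, COLUMNS)
            # each batch becomes its own row group in the output file
            writer.write_table(pa.Table.from_pydict(columns, schema=schema))
```

In the handler, the buffered batch that currently goes to writerows would be passed to the Parquet writer instead, so the memory profile stays the same while the output becomes columnar.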

Specification reference

In PBF, element metadata lives in the optional Info message (ways, relations) and the DenseInfo message (dense nodes), both defined in osmformat.proto. In DenseInfo, timestamp, changeset, uid, and user_sid are delta-encoded across the nodes of the group, while version is stored without delta coding; Info stores every field as a plain value. In both messages user_sid indexes the block-level StringTable. When metadata is omitted, uid is 0 and the user string index points at the empty entry. See the OSM Wiki PBF Format specification and the upstream osmformat.proto for the authoritative field definitions, and the pyosmium documentation for the handler API used above.

Frequently asked questions

Why disable the location index when extracting metadata?

The location index exists only to attach coordinates to nodes so way and relation geometry can be reconstructed. Metadata extraction reads provenance fields, never geometry, so the index is pure overhead — and on a planet file it is the single largest memory consumer. Passing locations=False keeps peak memory around 1–2 GB instead of tens of gigabytes.
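
A back-of-envelope calculation shows where the tens of gigabytes come from. The sketch assumes a dense, array-style index that reserves one 8-byte slot (two 32-bit coordinates) per node id up to the highest id in the file; the node-id ceiling used here is illustrative, not a measured value, and other index types have different costs.

```python
BYTES_PER_LOCATION = 8  # two 32-bit fixed-point coordinates per node


def dense_index_gib(max_node_id):
    """Approximate size in GiB of a dense location index up to max_node_id."""
    return (max_node_id + 1) * BYTES_PER_LOCATION / 2**30


# Illustrative ceiling of 10 billion node ids.
print(round(dense_index_gib(10_000_000_000), 1))  # 74.5
```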

How do I tell an anonymized edit from a missing field?

They are the same signal at the wire level: an anonymized or metadata-stripped element reports uid == 0 and an empty user string. Treat uid == 0 as the canonical test and substitute a sentinel such as "anonymous" so downstream joins and group-bys behave predictably.

Why is the visible field always true on a normal planet file?

A current planet snapshot contains only live elements; deleted primitives are simply absent, so visible is always True. The field only varies in historical .osh.pbf files, where each version of an element — including deletions recorded as visible=False — is retained.
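
When the table does come from a history file, deletions can be pulled straight out of the CSV. This is a small sketch assuming the column layout used throughout this page; deleted_versions and the sample rows are hypothetical.

```python
import csv
import io


def deleted_versions(csv_file):
    """Yield (type, id, version) for every row recorded as a deletion."""
    for row in csv.DictReader(csv_file):
        # the csv module stores the Python bool as the text "True" / "False"
        if row["visible"] == "False":
            yield row["type"], int(row["id"]), int(row["version"])


# Made-up history of one node: created in version 1, deleted in version 2.
sample = io.StringIO(
    "id,type,uid,user,timestamp,version,changeset,visible\r\n"
    "10,node,42,alice,2023-01-01T00:00:00+00:00,1,100,True\r\n"
    "10,node,7,bob,2024-01-01T00:00:00+00:00,2,200,False\r\n"
)
deleted = list(deleted_versions(sample))
print(deleted)  # [('node', 10, 2)]
```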

Can I parse a planet file's metadata in parallel?

Yes, but not by seeking arbitrary byte offsets. The smallest safe split point is a PBF Blob boundary, so the practical route is to pre-tile the source with osmium extract, run one handler per tile, and merge the per-tile tables afterward. Ways and relations that cross a tile boundary can land in more than one tile, so deduplicate on (type, id, version) when merging. The single-pass streaming handler above is already fast enough for most planet-scale metadata jobs.
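
The merge step might look like the sketch below. It assumes geographic tiles can share elements that straddle a boundary, so it keeps each (type, id, version) key once; merge_tables and the sample tiles are hypothetical, and the in-memory set is only suitable for modest inputs.

```python
import csv
import io


def merge_tables(tile_files, out_file):
    """Concatenate per-tile CSVs, keeping each (type, id, version) once."""
    seen = set()
    writer = None
    kept = 0
    for fh in tile_files:
        reader = csv.reader(fh)
        header = next(reader, None)
        if header is None:
            continue  # empty tile
        if writer is None:
            writer = csv.writer(out_file)
            writer.writerow(header)  # emit the header a single time
        for row in reader:
            key = (row[1], row[0], row[5])  # type, id, version
            if key in seen:
                continue
            seen.add(key)
            writer.writerow(row)
            kept += 1
    return kept


# Made-up tiles: way 5 crosses the boundary and appears in both.
head = "id,type,uid,user,timestamp,version,changeset,visible\r\n"
tile_a = io.StringIO(head + "1,node,42,alice,t,1,100,True\r\n5,way,42,alice,t,2,101,True\r\n")
tile_b = io.StringIO(head + "5,way,42,alice,t,2,101,True\r\n2,node,7,bob,t,1,102,True\r\n")
merged = io.StringIO()
kept = merge_tables([tile_a, tile_b], merged)
print(kept)  # 3
```

At planet scale the set of seen keys would itself run to billions of entries, so a sort-based pass or a database with a unique key is the more realistic choice; the logic stays the same.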

Related

PBF File Structure Deep Dive: how DenseInfo, Info, and the StringTable encode the fields this handler reads.
How to Decode OSM PBF Headers in Python: validate required_features and replication state before streaming data blocks.
Node-Way-Relation Data Model: the three primitive types each metadata row is keyed against.
OSM XML vs PBF Comparison: why the binary format, not XML, is the practical source for planet-scale extraction.
Error Handling in Large OSM Extracts: triage corrupted blocks and quarantine bad records at scale.
OSM Data Fundamentals & Architecture: the foundation this extraction stage sits within.

