Why must I read the OSMHeader before any data block?

The HeaderBlock carries required_features, the bounding box, and replication metadata. It is the validation gate: a reader confirms it implements every required feature before touching a primitive. A malformed or out-of-spec header is a hard stop, because the rest of the file cannot be trusted.

Why are my bounding-box coordinates off by a factor of a billion?

The bbox fields are stored in nanodegrees (integer units of 1e-9 degrees). Multiply each edge by 1e-9 to get decimal degrees in WGS 84. Skipping the conversion leaves every value 1e9 times too large, so the coordinates land nowhere near the source region.

Is the 4-byte length prefix part of the protobuf message?

No. The leading 4 bytes are raw big-endian framing read with struct.unpack('>I') and are not encoded as protobuf. Everything after them — the BlobHeader and Blob — is standard protobuf decoded through the compiled bindings.

Which compression field should I decompress for the header Blob?

Branch on Blob.HasField rather than assuming. zlib_data covers the overwhelming majority of files, with raw as the uncompressed fallback and lzma_data defined but rare. Decompressing the wrong field raises a zlib header-check error.

How to decode OSM PBF headers in Python

Decode the leading OSMHeader blob of a .osm.pbf file in Python to validate required_features and read the bounding box before you stream a single data block — getting this pre-flight step right is what stops an incompatible or corrupt extract from silently poisoning everything downstream.

Prerequisites

Python 3.10+ (the snippets below use built-in generic type hints such as dict[str, float] and need no match statements)
protobuf>=4.21.0 installed in the runtime (pip install "protobuf>=4.21.0")
protoc compiler, version 3.21.12 or higher, on your build machine
The canonical fileformat.proto and osmformat.proto from the OSM-binary repository
Generated fileformat_pb2.py and osmformat_pb2.py vendored beside your pipeline code
A sample extract (any regional .osm.pbf from Geofabrik works)

What the header actually is

A .osm.pbf file is a sequence of length-prefixed blocks, and the very first one is always a single OSMHeader blob. As the PBF File Structure Deep Dive sets out, every block is framed identically: a 4-byte big-endian uint32 giving the BlobHeader length, the BlobHeader message itself, then the compressed Blob payload whose size lives in BlobHeader.datasize. The header’s Blob, once decompressed, deserializes into a HeaderBlock — and that block is the contract for the rest of the file.
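
The framing can be exercised without any protobuf bindings. The following is a minimal sketch, assuming a hypothetical helper name `read_header_length` and a synthetic in-memory block rather than a real extract. It demonstrates only the 4-byte prefix and treats the BlobHeader bytes as opaque.

```python
import io
import struct
from typing import BinaryIO

MAX_BLOB_HEADER_SIZE = 64 * 1024  # 64 KiB ceiling, as used in this article


def read_header_length(stream: BinaryIO) -> int:
    """Read the 4-byte big-endian prefix and return the BlobHeader length.

    Hypothetical helper for illustration; not part of any library.
    """
    prefix = stream.read(4)
    if len(prefix) != 4:
        raise ValueError("stream too short for a length prefix")
    (length,) = struct.unpack(">I", prefix)
    if length > MAX_BLOB_HEADER_SIZE:
        raise ValueError(f"BlobHeader length {length} exceeds 64 KiB")
    return length


# Synthetic block: 13 opaque bytes standing in for a serialized BlobHeader.
fake_blob_header = b"\x00" * 13
framed = struct.pack(">I", len(fake_blob_header)) + fake_blob_header

stream = io.BytesIO(framed)
header_len = read_header_length(stream)
header_bytes = stream.read(header_len)
print(header_len, len(header_bytes))  # 13 13
```

In a real file the bytes read after the prefix would be handed to the compiled BlobHeader binding, whose datasize then says how many bytes of Blob follow.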

The fields you must read are required_features (capabilities your parser is obligated to implement, typically OsmSchema-V0.6 and DenseNodes), the bbox bounding box stored in nanodegrees, and the osmosis_replication_* provenance fields. Because PBF stores coordinates as scaled integers rather than floats — a detail covered under Coordinate Reference Systems in OSM — every bbox edge must be converted with $\text{degrees} = \text{nanodegrees} \times 10^{-9}$ before it means anything in WGS 84. That is the entire conceptual surface; the rest is binary framing and one decompression call.
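
As a worked instance of that formula, here is a minimal sketch with made-up edge values (the integers are illustrative, not read from a real header, and the helper name is hypothetical). It divides by 10^9 instead of multiplying by 1e-9. The two agree to within float rounding, but 1e-9 has no exact binary representation, so division is the marginally more accurate choice.

```python
import math

NANODEGREES_PER_DEGREE = 1_000_000_000


def nanodegrees_to_degrees(value: int) -> float:
    """Convert an integer nanodegree value to decimal degrees."""
    return value / NANODEGREES_PER_DEGREE


# Illustrative edges roughly around Berlin; not taken from a real file.
left_nano = 13_088_345_000
bottom_nano = 52_338_261_000

left = nanodegrees_to_degrees(left_nano)
bottom = nanodegrees_to_degrees(bottom_nano)

# Multiplying by 1e-9 lands within float rounding error of the division.
assert math.isclose(left, left_nano * 1e-9, rel_tol=1e-12)
print(f"left={left:.6f} bottom={bottom:.6f}")  # left=13.088345 bottom=52.338261
```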

The complete solution

First compile the schema once on your build machine:

```bash
# protoc >= 3.21.12; run from a directory containing ./proto/*.proto
protoc --python_out=. --proto_path=./proto \
    ./proto/fileformat.proto ./proto/osmformat.proto
# -> generates fileformat_pb2.py and osmformat_pb2.py
```

Then the runnable decoder. Drop this beside the generated *_pb2.py modules and run it against any extract:

```python
"""Decode and validate the OSMHeader blob of an OSM PBF file.

Requires: protobuf>=4.21.0, Python 3.10+, and compiled
fileformat_pb2 / osmformat_pb2 modules on the import path.
"""
import struct
import zlib
import logging

import fileformat_pb2
import osmformat_pb2

logger = logging.getLogger(__name__)

# Hard ceilings straight from the PBF specification. Validate the declared
# sizes against these *before* allocating, so a truncated or hostile file
# cannot trigger an unbounded read.
MAX_BLOB_HEADER_SIZE = 64 * 1024          # 64 KiB
MAX_BLOB_PAYLOAD_SIZE = 32 * 1024 * 1024  # 32 MiB
NANODEGREE = 1e-9


def decode_pbf_header(filepath: str) -> osmformat_pb2.HeaderBlock:
    """Read, frame-check and decompress the leading OSMHeader blob."""
    with open(filepath, "rb") as f:
        # 1. The 4-byte big-endian length prefix is raw framing, NOT protobuf.
        prefix = f.read(4)
        if len(prefix) != 4:
            raise ValueError("File too short to contain a BlobHeader length prefix")
        header_len = struct.unpack(">I", prefix)[0]
        if header_len > MAX_BLOB_HEADER_SIZE:
            raise MemoryError(f"BlobHeader length {header_len} exceeds 64 KiB ceiling")

        # 2. Parse the BlobHeader and confirm it really is the OSMHeader.
        header_data = f.read(header_len)
        if len(header_data) != header_len:
            raise ValueError("Truncated BlobHeader")
        blob_header = fileformat_pb2.BlobHeader()
        blob_header.ParseFromString(header_data)
        if blob_header.type != "OSMHeader":
            raise ValueError(
                f"Expected BlobHeader type 'OSMHeader', got '{blob_header.type}'"
            )
        if blob_header.datasize > MAX_BLOB_PAYLOAD_SIZE:
            raise MemoryError(
                f"Blob datasize {blob_header.datasize} exceeds 32 MiB ceiling"
            )

        # 3. Read exactly datasize bytes for the Blob payload.
        blob_data = f.read(blob_header.datasize)
        if len(blob_data) != blob_header.datasize:
            raise ValueError("Truncated Blob payload")

        header_block = _decompress_header_blob(blob_data)
        logger.info(
            "Decoded OSMHeader: features=%s, writingprogram=%r",
            list(header_block.required_features),
            header_block.writingprogram,
        )
        return header_block


def _decompress_header_blob(raw_blob: bytes) -> osmformat_pb2.HeaderBlock:
    """Select the active compression field and deserialize the HeaderBlock."""
    blob = fileformat_pb2.Blob()
    blob.ParseFromString(raw_blob)

    # Exactly one payload field is set. zlib_data dominates in practice.
    if blob.HasField("zlib_data"):
        decompressed = zlib.decompress(blob.zlib_data)
    elif blob.HasField("raw"):
        decompressed = blob.raw
    elif blob.HasField("lzma_data"):
        import lzma
        decompressed = lzma.decompress(blob.lzma_data)
    else:
        raise ValueError("Blob has no recognized compression or raw payload")

    header_block = osmformat_pb2.HeaderBlock()
    header_block.ParseFromString(decompressed)
    return header_block


def extract_bounding_box(header_block: osmformat_pb2.HeaderBlock) -> dict[str, float]:
    """Return the bbox in decimal degrees (EPSG:4326) from the HeaderBlock."""
    bbox = header_block.bbox
    return {
        "left":   bbox.left   * NANODEGREE,
        "right":  bbox.right  * NANODEGREE,
        "top":    bbox.top    * NANODEGREE,
        "bottom": bbox.bottom * NANODEGREE,
    }


def validate_header(header_block: osmformat_pb2.HeaderBlock,
                    supported: set[str]) -> None:
    """Reject the file if it requires a feature this parser cannot honour."""
    missing = set(header_block.required_features) - supported
    if missing:
        raise ValueError(f"Unsupported required_features: {sorted(missing)}")


if __name__ == "__main__":
    import sys
    logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")

    SUPPORTED = {"OsmSchema-V0.6", "DenseNodes"}
    hb = decode_pbf_header(sys.argv[1])
    validate_header(hb, SUPPORTED)
    print("required_features:", list(hb.required_features))
    print("optional_features:", list(hb.optional_features))
    print("bbox (deg):", extract_bounding_box(hb))
    print("replication_seq:", hb.osmosis_replication_sequence_number)
    print("replication_ts: ", hb.osmosis_replication_timestamp)
```

Step-by-step walkthrough

1. Read the length prefix (struct.unpack(">I", prefix)). The first four bytes are in network byte order and sit outside any protobuf message; >I is a big-endian unsigned 32-bit integer. The header_len > MAX_BLOB_HEADER_SIZE guard runs before the next read, so a bogus length never drives an oversized allocation.
2. Parse and identify the BlobHeader. After ParseFromString, the type field must equal OSMHeader; anything else means you are not at the start of the file or the file pointer is misaligned. datasize is checked against the 32 MiB ceiling here, again before reading.
3. Read the Blob payload of exactly datasize bytes. A short read means truncation; treat it as fatal, not recoverable.
4. Decompress by field, not by guess. _decompress_header_blob inspects which payload field is set with HasField. The fields are mutually exclusive; zlib_data covers the overwhelming majority of real extracts, with raw and lzma_data as fallbacks. The decompressed bytes deserialize straight into a HeaderBlock.
5. Convert the bounding box. extract_bounding_box multiplies each nanodegree edge by $10^{-9}$. Skipping this step leaves every value $10^{9}$ times too large, so coordinates land nowhere near the source region.
6. Gate on required_features. validate_header subtracts your supported set from the file’s required_features; a non-empty remainder is a hard stop, because parsing a file that requires a feature you have not implemented, such as DenseNodes, leads to silently skipped or misread nodes in the primitive graph described in the Node-Way-Relation Data Model.
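
The feature gate in the last step is ordinary set arithmetic and can be tried on plain lists, without a HeaderBlock. This is a minimal sketch, assuming a hypothetical helper `missing_features`; `HistoricalInformation` appears only as an example of a feature this parser does not claim to support.

```python
from collections.abc import Iterable

SUPPORTED = {"OsmSchema-V0.6", "DenseNodes"}


def missing_features(required: Iterable[str], supported: set[str]) -> list[str]:
    """Return the required features the parser does not implement, sorted."""
    return sorted(set(required) - supported)


# A file that only needs what we support passes the gate.
print(missing_features(["OsmSchema-V0.6", "DenseNodes"], SUPPORTED))  # []

# A file that additionally requires history support must be rejected.
print(missing_features(
    ["OsmSchema-V0.6", "DenseNodes", "HistoricalInformation"], SUPPORTED
))  # ['HistoricalInformation']
```

Returning the sorted remainder instead of a boolean keeps the error message deterministic, which helps when the failure is logged or compared in tests.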

Verification

Run the script against a known-good extract and confirm the output:

The log line reads Decoded OSMHeader: features=['OsmSchema-V0.6', 'DenseNodes'], ... — the two features present on virtually every modern file.
bbox (deg) values fall inside valid WGS 84 ranges: longitude in [-180, 180], latitude in [-90, 90]. For a Berlin extract, expect left/bottom near 13.0 / 52.3.
replication_seq is a non-zero integer for files cut from the replication stream (regional Geofabrik extracts include it).

Cross-check against the reference tool: osmium fileinfo -e your-extract.osm.pbf reports the same bounding box and header options. If your decoded bbox and osmium’s disagree, the nanodegree conversion is the first suspect.
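
The range check on the bounding box can be automated. The following is a minimal sketch, assuming the dict shape returned by extract_bounding_box (keys left, right, top, bottom in decimal degrees); the helper name `bbox_problems` and the sample values are hypothetical.

```python
def bbox_problems(bbox: dict[str, float]) -> list[str]:
    """Return human-readable problems; an empty list means the box is plausible."""
    problems = []
    for key in ("left", "right"):
        if not -180.0 <= bbox[key] <= 180.0:
            problems.append(f"{key}={bbox[key]} outside [-180, 180]")
    for key in ("bottom", "top"):
        if not -90.0 <= bbox[key] <= 90.0:
            problems.append(f"{key}={bbox[key]} outside [-90, 90]")
    if bbox["bottom"] > bbox["top"]:
        problems.append("bottom is north of top")
    return problems


# Illustrative values roughly around Berlin.
good = {"left": 13.08, "right": 13.77, "top": 52.68, "bottom": 52.33}
# The same box with the nanodegree conversion forgotten.
unscaled = {key: value * 1e9 for key, value in good.items()}

print(bbox_problems(good))           # []
print(len(bbox_problems(unscaled)))  # 4
```

An all-zero box passes this check, so it does not detect a header that simply omits the optional bbox.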

Common errors and fixes

| Error / symptom | Root cause | One-line fix |
| --- | --- | --- |
| `struct.error: unpack requires a buffer of 4 bytes` | File opened in text mode or empty | Open with `"rb"` and check `len(prefix) == 4` |
| `Expected BlobHeader type 'OSMHeader'` | Reading mid-file or wrong offset | Seek to byte 0; decode only the first block as the header |
| `zlib.error: incorrect header check` | Decompressing the wrong field (e.g. `raw` as zlib) | Branch on `Blob.HasField(...)`, never assume zlib |
| Coordinates off by a factor of ~$10^{9}$ | Forgot the nanodegree scale | Multiply every `bbox` edge by `1e-9` |
| `MemoryError: ... exceeds 64 KiB / 32 MiB` | Corrupt length prefix, or a file that is not PBF | Check that the input is a complete `.osm.pbf`; the size ceilings and the `type == "OSMHeader"` check are the early rejection signals |
| `DecodeError: Error parsing message` | Stale or mismatched `*_pb2.py` | Recompile with `protoc >= 3.21.12` against the current `.proto` |
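
The ceiling checks in the decoder bound the compressed payload; the decompressed size can be capped as well. The following is a minimal sketch, not the decoder's actual code: it assumes a hypothetical helper `bounded_zlib_decompress` and reuses the 32 MiB figure for the output, relying on the `max_length` argument of a `zlib` decompression object so that a small but hostile payload cannot expand without limit.

```python
import zlib

# Assumption: apply the same 32 MiB figure to the decompressed output.
MAX_UNCOMPRESSED_SIZE = 32 * 1024 * 1024


def bounded_zlib_decompress(data: bytes, limit: int = MAX_UNCOMPRESSED_SIZE) -> bytes:
    """Decompress a zlib stream, refusing to produce more than `limit` bytes."""
    decompressor = zlib.decompressobj()
    # Ask for one byte more than the limit so an overrun is detectable.
    output = decompressor.decompress(data, limit + 1)
    if len(output) > limit:
        raise ValueError(f"decompressed payload exceeds {limit} bytes")
    if not decompressor.eof:
        raise ValueError("truncated or incomplete zlib stream")
    return output


payload = zlib.compress(b"OSMHeader" * 100)
print(len(bounded_zlib_decompress(payload)))  # 900

try:
    bounded_zlib_decompress(payload, limit=100)
except ValueError as exc:
    print("rejected:", exc)
```

Data that is not a zlib stream at all still raises `zlib.error` here, so the "wrong field" row in the table is unaffected by this guard.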

Spec reference

The 4-byte big-endian length prefix, the BlobHeader / Blob framing, the mutually exclusive compression fields, and the 64 KiB / 32 MiB ceilings are all defined in the OpenStreetMap PBF Format specification. The required_features, bbox, and osmosis_replication_* fields are declared in osmformat.proto’s HeaderBlock message. Varint and length-prefix mechanics follow the Protocol Buffers encoding guide, and the struct format codes are in the Python struct documentation.

Related

PBF File Structure Deep Dive — the full block-by-block wire format this header sits at the front of.
Extracting metadata from OSM planet files — reading provenance fields once the header validates.
OSM XML vs PBF Comparison — why the binary header exists at all.
Coordinate Reference Systems in OSM — the nanodegree-to-WGS 84 scaling applied to the bbox.
Spatial Indexing for OSM Extracts — using the header bbox to size an index before streaming data.
Error Handling in Large OSM Extracts — turning the validation failures above into quarantine and remediation.

