Spatial Indexing for OSM Extracts

OpenStreetMap extracts are large, densely interconnected geospatial datasets, and the pipeline challenge this guide solves is blunt: without a spatial index, every bounding-box query, point-in-polygon test, and topology join degenerates into a full linear scan of millions of primitives. A single highway=* clip against an unindexed European extract can mean reading and testing hundreds of millions of geometries per request, turning a sub-second lookup into minutes of wasted I/O, and that cost recurs on every query thereafter. The failure is silent until scale arrives: a prototype that runs fine on a city extract grinds to a halt on a continent, because scan cost grows linearly with feature count and a continent holds orders of magnitude more features than a city. The defence is to build a deterministic spatial index once, so that index lookups take roughly logarithmic time in the number of features and downstream analytics, quality checks, and exports read only the candidates that can possibly match. This guide sits within the broader OSM Data Fundamentals & Architecture layer, which frames why fast spatial access underpins every serious OSM workflow.

Prerequisites: Concepts to Anchor First

This guide assumes three foundations. First, the Node-Way-Relation Data Model: you index reconstructed geometries, and a way only becomes a line or polygon once its node references are resolved into coordinates, so reference closure governs what is indexable at all. Second, the PBF File Structure Deep Dive, because the dense, delta-encoded node arrays and block framing of .osm.pbf are what let you stream geometries into an index without loading the whole file. Third, the WGS 84 storage model in Coordinate Reference Systems in OSM — OSM stores unprojected latitude and longitude, and the choice to index in geographic versus projected coordinates changes how every bounding box is computed. Readers comfortable with those three can treat the rest of this page as an indexing reference.

Indexing Architectures for OSM Primitives

OSM’s node-way-relation model demands indexing at multiple geometric granularities. Nodes store raw WGS 84 coordinates; ways define linear or polygonal geometries through ordered node references; relations encode multipolygon, route, and administrative hierarchies. A production index strategy must align with these primitives while accounting for tag-based filtering and the wide density gradient between dense urban cores and sparse rural extents. Three families of index dominate OSM work, and they are complementary rather than competing.

R-tree (bounding-box hierarchies): Optimized for arbitrary polygon, line, and point queries. Disk-backed implementations via libspatialindex (exposed in Python through rtree) provide efficient pruning for spatial joins and bounding-box filters by storing a tree of minimum bounding rectangles (MBRs). R-trees excel when query extents are irregular or when feature density varies sharply across a region, because the tree adapts its node shapes to the data.
Quadkey / grid partitioning: Divides the geographic extent into fixed-resolution cells along power-of-two splits — the same scheme that underpins slippy-map tiles and Bing-style quadkeys. Grid cells are cheap to compute (a pure function of coordinate and zoom), trivially parallelizable, and ideal for density analysis, point-in-polygon pre-filtering, and tile generation. The weakness is boundary fragmentation: a feature straddling a cell edge must be registered in every cell it touches.
Hierarchical spatial grids (H3 / S2): Provide near-uniform area coverage and deterministic neighbour traversal over the whole globe. H3’s hexagonal cells and S2’s spherical quadrilaterals mitigate the latitude distortion inherent in planar projections, which makes them the right tool for spatial aggregation, completeness sampling, and distributed validation where every cell should cover a comparable ground area.

Selecting an architecture depends on the downstream query, not on taste. Topology-validation pipelines that test precise intersections favour R-trees; tag-based analytics and completeness sampling favour hierarchical grids; tile rendering favours quadkeys. Many production stacks run two indexes side by side — an R-tree for exact spatial joins and an H3 column for coarse aggregation — because the two answer different questions. For a criterion-by-criterion decision matrix, see R-tree vs H3 vs Quadkey: Spatial Index Selection, the companion guide to this one that maps a workload to the index family that fits it.
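To make the "pure function of coordinate and zoom" property of quadkeys concrete, here is a minimal sketch that assumes the standard slippy-map tiling in Web Mercator. The function names `lonlat_to_tile` and `tile_to_quadkey` are illustrative and do not come from any library named in this guide.

```python
import math

# Web Mercator is undefined at the poles, so tiling schemes clamp latitude.
MAX_MERCATOR_LAT = 85.05112878


def lonlat_to_tile(lon: float, lat: float, zoom: int) -> tuple[int, int]:
    """Map a WGS 84 coordinate to slippy-map tile (x, y) at a zoom level."""
    lat = max(min(lat, MAX_MERCATOR_LAT), -MAX_MERCATOR_LAT)
    n = 1 << zoom  # number of tiles along each axis at this zoom
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so lon == 180 and the southern limit stay inside the grid.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


def tile_to_quadkey(x: int, y: int, zoom: int) -> str:
    """Interleave the bits of x and y into a base-4 quadkey string."""
    digits = []
    for level in range(zoom, 0, -1):
        mask = 1 << (level - 1)
        digit = 0
        if x & mask:
            digit += 1
        if y & mask:
            digit += 2
        digits.append(str(digit))
    return "".join(digits)


def quadkey(lon: float, lat: float, zoom: int) -> str:
    """Convenience wrapper: coordinate straight to quadkey."""
    return tile_to_quadkey(*lonlat_to_tile(lon, lat, zoom), zoom)
```

For example, `tile_to_quadkey(3, 5, 3)` returns `"213"`. Because each extra zoom level appends one digit, the quadkey of a point at a coarse zoom is a prefix of its quadkey at any finer zoom, which is what makes prefix scans a cheap way to collect everything inside a parent cell.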

Specification & Encoding Reference

The numbers that constrain index construction come straight from the formats OSM uses.

| Property | Value / rule | Why it matters for indexing |
| --- | --- | --- |
| Coordinate CRS | WGS 84, EPSG:4326 (implicit) | Bounding boxes are computed in degrees unless you reproject first |
| Coordinate storage | 64-bit signed integers, nanodegree-scaled | Convert to float only at index insertion; integers stay exact |
| Default granularity | 100 nanodegrees (≈1.1 cm at equator) | Storage precision far exceeds index precision needs |
| Longitude / latitude range | lon ∈ [−180, 180], lat ∈ [−90, 90] | Out-of-range values signal corrupt or clipped data; reject pre-insert |
| Way closure rule | first node ref == last, ≥4 nodes | Determines Polygon vs LineString before bounds are taken |
| R-tree MBR ordering | (minx, miny, maxx, maxy) | `rtree` and `libspatialindex` expect this exact tuple order |
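The rules in the table can be sketched as a few small helpers. This is an illustration under stated assumptions, not a parser: it assumes the raw integers have already been delta-decoded, that the block uses the default granularity of 100 with zero offset unless told otherwise, and all function names here are hypothetical.

```python
NANODEGREES_PER_DEGREE = 1_000_000_000


def decode_coord(raw: int, granularity: int = 100, offset: int = 0) -> float:
    """Convert one raw PBF coordinate to degrees with a single float conversion.

    The integer arithmetic stays exact; only the final division produces a float.
    """
    return (offset + granularity * raw) / NANODEGREES_PER_DEGREE


def in_wgs84_range(lon: float, lat: float) -> bool:
    """Reject coordinates that cannot be valid WGS 84 positions."""
    return -180.0 <= lon <= 180.0 and -90.0 <= lat <= 90.0


def is_closed_ring(node_refs: list[int]) -> bool:
    """Apply the closure rule: first ref equals last, at least four refs."""
    return len(node_refs) >= 4 and node_refs[0] == node_refs[-1]


def mbr(coords: list[tuple[float, float]]) -> tuple[float, float, float, float]:
    """Bounding box of (lon, lat) pairs in (minx, miny, maxx, maxy) order."""
    lons = [lon for lon, _ in coords]
    lats = [lat for _, lat in coords]
    return (min(lons), min(lats), max(lons), max(lats))
```

Keeping `decode_coord` as the only place where integers become floats is one way to honour the "convert once" rule from the table.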

A bounding box for an R-tree is the tuple (min_lon, min_lat, max_lon, max_lat) — Shapely’s geometry.bounds already returns it in that order, which is why coordinates are stored as (lon, lat) pairs rather than the human-readable (lat, lon). For grid and quadkey schemes the controlling quantity is ground resolution at a given zoom level and latitude. In Web Mercator, the metres-per-pixel that sizes a quadkey cell is:

$$\text{resolution}(\varphi, z) = \frac{\cos(\varphi)\,\cdot\,2\pi R}{256 \cdot 2^{z}}$$

where $\varphi$ is latitude, $z$ the zoom level, and $R = 6{,}378{,}137\,\text{m}$ the Earth’s equatorial radius. The $\cos(\varphi)$ term is exactly the latitude distortion that hexagonal H3 cells are designed to avoid, and it is why a quadkey cell near the poles covers a tiny fraction of the ground area of one at the equator.
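As a quick numerical check of the formula, the sketch below evaluates it directly. The function name `ground_resolution` and the `tile_size` parameter are illustrative; 256 is the tile width in pixels that appears in the denominator above.

```python
import math

EARTH_RADIUS_M = 6_378_137.0  # equatorial radius R from the formula


def ground_resolution(lat_deg: float, zoom: int, tile_size: int = 256) -> float:
    """Metres per pixel in Web Mercator at a given latitude and zoom level."""
    circumference = 2.0 * math.pi * EARTH_RADIUS_M
    return math.cos(math.radians(lat_deg)) * circumference / (tile_size * 2**zoom)


if __name__ == "__main__":
    for lat in (0.0, 45.0, 60.0, 80.0):
        print(f"lat {lat:>4}: {ground_resolution(lat, 12):.2f} m/px at zoom 12")
```

At the equator and zoom 0 this gives roughly 156,543 m per pixel, and at 60° latitude the value is half the equatorial one at the same zoom, since cos(60°) = 0.5.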

Step-by-Step: Building a Disk-Backed R-tree

The following pattern turns a streamed OSM extract into a persistent, query-ready R-tree using pyosmium for sequential PBF parsing and rtree for the index. It uses Python 3.10+ type hints and the project’s standard logger pattern, and it prioritizes memory efficiency, deterministic output, and non-blocking error recovery.

1. Select a location store. Pass locations=True to apply_file so pyosmium resolves node coordinates into each way automatically. Choose idx='flex_mem' (in-memory) for regional extracts under ~1 GB, or idx='sparse_file_array' (disk-backed) for continental and planet files to avoid an out-of-memory kill.
2. Reconstruct geometry per way. Read each node reference’s resolved location, decide Polygon versus LineString from the closure rule, and repair self-intersections with a zero-width buffer.
3. Insert the MBR. Take geometry.bounds and insert it under a monotonically increasing feature id, carrying the geometry as the stored object for later exact tests.
4. Finalize deterministically. Flush the index to disk and return the captured error log so defective primitives can be reviewed or quarantined.

```python
import logging
import osmium
import rtree
from shapely.geometry import LineString, Polygon
from shapely.errors import TopologicalError

logger = logging.getLogger(__name__)


class OSMWayIndexer(osmium.SimpleHandler):
    """Build a disk-backed R-tree of way geometries from an OSM extract.

    Pass ``locations=True`` to ``apply_file`` so pyosmium resolves node
    coordinates automatically via its internal location store. The node()
    callback is not needed in that configuration; coordinates are available
    directly on each NodeRef in ``w.nodes`` via ``nr.location``.

    For regional extracts, ``idx='flex_mem'`` (in-memory location store) is
    the fastest choice. For continental or planet extracts, use
    ``idx='sparse_file_array'`` (disk-backed) to avoid exhausting RAM.
    """

    def __init__(self, index_path: str) -> None:
        super().__init__()
        self.index = rtree.index.Index(
            index_path, properties=rtree.index.Property(dimension=2)
        )
        self.feature_id: int = 0
        self.errors: list[dict] = []

    def way(self, w: osmium.osm.Way) -> None:
        try:
            coords: list[tuple[float, float]] = []
            for nr in w.nodes:
                loc = nr.location
                if not loc.valid():
                    self.errors.append(
                        {"type": "way", "id": w.id, "error": f"invalid location for node {nr.ref}"}
                    )
                    return
                # Store as (lon, lat) so bounds map to (minx, miny, maxx, maxy).
                coords.append((loc.lon, loc.lat))

            if len(coords) < 2:
                return

            # Closure rule: first coordinate equals last, at least four nodes.
            if coords[0] == coords[-1] and len(coords) >= 4:
                geom = Polygon(coords)
            else:
                geom = LineString(coords)

            if not geom.is_valid:
                geom = geom.buffer(0)  # Standard self-intersection repair.

            # buffer(0) can collapse a degenerate geometry to an empty one,
            # and an empty geometry has no usable bounds to insert.
            if geom.is_empty:
                self.errors.append(
                    {"type": "way", "id": w.id, "error": "empty geometry after repair"}
                )
                return

            self.index.insert(self.feature_id, geom.bounds, obj=geom)
            self.feature_id += 1

        except (TopologicalError, ValueError) as exc:
            # Non-blocking recovery: record the failure and keep streaming.
            self.errors.append({"type": "way", "id": w.id, "error": str(exc)})
            logger.warning("Geometry error on way %s: %s", w.id, exc)

    def finalize(self) -> list[dict]:
        """Flush the index to disk and return the error log."""
        self.index.close()
        logger.info(
            "Index finalized. %d features indexed, %d errors logged.",
            self.feature_id, len(self.errors),
        )
        return self.errors


# Usage:
# indexer = OSMWayIndexer("/tmp/osm_rtree")
# indexer.apply_file("extract.osm.pbf", locations=True, idx="flex_mem")
# errors = indexer.finalize()
```

Querying: Coarse Filter, Exact Refine

An R-tree returns candidates, not answers. Every spatial query is two stages: the index prunes the search to features whose MBRs overlap the query window, then an exact geometric predicate removes the false positives that a rectangle inevitably admits. Skipping the refine step is a common correctness bug — the index will happily return a feature whose bounding box overlaps but whose actual geometry does not.

```python
import rtree
from shapely.geometry import box


def query_window(index: rtree.index.Index, bbox: tuple[float, float, float, float]) -> list:
    """Return geometries that genuinely intersect a (minx, miny, maxx, maxy) window."""
    window = box(*bbox)
    hits = index.intersection(bbox, objects=True)  # coarse: MBR overlap
    # exact: intersects() is a boolean predicate, whereas intersection() would
    # build a new geometry for every candidate just to test whether it is empty.
    return [item.object for item in hits if item.object.intersects(window)]
```

Because the geometry was stored as the index obj, the refine step never has to revisit the source file. For workloads dominated by tag filters — extracting only highway=* or building=* — apply the conventions in Tag Taxonomy & Key-Value Standards at insertion time and index each feature class into its own tree, so a query touches only the relevant subset rather than one monolithic index.
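The per-class split described above might be sketched as a routing step that runs before insertion. The class list, the function names, and the `(feature_id, bounds, tags)` tuple shape are all assumptions made for this illustration; the real key set should come from the tag conventions the pipeline adopts.

```python
from collections import defaultdict
from typing import Iterable

Bounds = tuple[float, float, float, float]

# Hypothetical feature classes: one index would be built per key listed here.
FEATURE_CLASS_KEYS = ("highway", "building")


def feature_classes(tags: dict[str, str], class_keys: tuple[str, ...] = FEATURE_CLASS_KEYS) -> list[str]:
    """Return every feature class whose key is present on the feature."""
    return [key for key in class_keys if key in tags]


def route_by_class(
    features: Iterable[tuple[int, Bounds, dict[str, str]]],
) -> dict[str, list[tuple[int, Bounds]]]:
    """Group (feature_id, bounds) pairs by feature class before indexing.

    A feature carrying several class keys is routed to each matching bucket;
    features with no class key are dropped from all buckets.
    """
    buckets: dict[str, list[tuple[int, Bounds]]] = defaultdict(list)
    for feature_id, bounds, tags in features:
        for cls in feature_classes(tags):
            buckets[cls].append((feature_id, bounds))
    return dict(buckets)
```

Each bucket can then be inserted into its own tree, so a highway query never pays for building candidates. This sketch tests only key presence; values such as `building=no` would need an extra rule if they occur in the data.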

Validation & Error-Handling Matrix

Error handling in spatial ETL must be non-blocking: OSM data routinely contains orphaned nodes clipped by extract boundaries, unclosed rings, and self-intersecting polygons, and a single malformed primitive must never abort an hours-long index build. Each row below is a real failure seen in production indexing, with how to detect and remediate it.

| Error condition | Root cause | Detection | Remediation |
| --- | --- | --- | --- |
| `invalid location for node` | Way references a node clipped by the extract boundary | `nr.location.valid()` is false | Skip and log the way; rebuild from a larger extract if closure matters |
| Self-intersecting polygon | Mapper error or ring digitized in a figure-eight | `geom.is_valid` is false | Repair with `geom.buffer(0)`; quarantine if area changes drastically |
| Coordinates swapped (lat/lon) | Bounds inserted as (lat, lon) not (lon, lat) | Reconstructed bbox falls outside region | Always insert `geometry.bounds`; store coords as (lon, lat) |
| Query returns non-overlapping features | Coarse MBR result used without exact refine | Visual or predicate spot-check | Re-test each candidate with an exact `intersects`/`contains` predicate |
| `MemoryError` on continental file | In-memory location store on a planet-scale extract | RSS climbs until OOM kill | Switch `idx` to `sparse_file_array`; index in disk-backed mode |
| Non-reproducible index across runs | Feature ids assigned by nondeterministic ordering | Diff of two builds differs | Insert in PBF id order; partition only on non-overlapping bboxes |
| Empty / single-point geometry | Degenerate way with < 2 distinct nodes | `len(coords) < 2` | Skip silently; these carry no spatial extent |

OSM stores coordinates in unprojected WGS 84 (EPSG:4326). The recommended practice is to index in native WGS 84 and defer projection to the query or export stage, reprojecting with the pyproj Transformer API — a pattern walked through in converting OSM coordinates to a local CRS with pyproj. Indexing in geographic degrees keeps bounding boxes exact and avoids re-indexing whenever a downstream consumer needs a different projection.

Performance & Scale Considerations

Memory efficiency hinges on streaming consumption and the right location-store choice. When using apply_file(..., locations=True), pyosmium manages the node-location store internally. For regional extracts (under ~1 GB) idx='flex_mem' keeps the store resident and is fastest; for continental or planet-scale extracts idx='sparse_file_array' spills the store to disk, trading I/O throughput for a bounded memory ceiling and avoiding OOM kills. The R-tree itself, when constructed via a persistent rtree.index.Index(path, ...), lives in on-disk index files and is read in pages rather than held wholly in memory, so the resident set stays modest even as the index grows to gigabytes.

Two tuning levers matter at scale. First, bulk loading: building an R-tree by streaming inserts is acceptable, but packing it from a pre-sorted geometry stream (Sort-Tile-Recursive, or STR, packing) produces a flatter, better-balanced tree with materially faster queries, which is worth the extra pass on read-heavy indexes. Second, parallel construction: PBF blocks decode independently, which helps parallel reading, but blocks are not organized spatially, so a planet file should first be split into geographic partitions; then build one R-tree per partition on a separate worker and union the partition results at query time. Partition on non-overlapping bounding boxes, never on arbitrary block offsets, so each worker holds a self-contained spatial region and the merged result has no duplicated or torn features. Query latency on a well-packed R-tree over a country-scale extract is often sub-millisecond for small bounding-box lookups once the relevant pages are cached, though the figure depends on hardware and window size; a full-scan baseline grows linearly with feature count.
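The packing order behind Sort-Tile-Recursive bulk loading can be illustrated with a short pure-Python sketch. It shows only how leaves are grouped (sort by x-centre, cut into vertical slices, sort each slice by y-centre, then chunk); it is not how libspatialindex implements bulk loading internally, and `str_pack`, `leaf_mbr`, and the tiny `leaf_capacity` default are names and values chosen for this example.

```python
import math

Box = tuple[float, float, float, float]  # (minx, miny, maxx, maxy)
Item = tuple[int, Box]                   # (feature_id, bounds)


def str_pack(items: list[Item], leaf_capacity: int = 4) -> list[list[Item]]:
    """Group items into leaves using Sort-Tile-Recursive ordering."""
    if not items:
        return []
    leaf_count = math.ceil(len(items) / leaf_capacity)
    slice_count = math.ceil(math.sqrt(leaf_count))
    slice_size = slice_count * leaf_capacity

    # Ties are broken on feature id so two runs produce the same leaves.
    by_x = sorted(items, key=lambda it: ((it[1][0] + it[1][2]) / 2.0, it[0]))

    leaves: list[list[Item]] = []
    for start in range(0, len(by_x), slice_size):
        vertical = sorted(
            by_x[start:start + slice_size],
            key=lambda it: ((it[1][1] + it[1][3]) / 2.0, it[0]),
        )
        for i in range(0, len(vertical), leaf_capacity):
            leaves.append(vertical[i:i + leaf_capacity])
    return leaves


def leaf_mbr(leaf: list[Item]) -> Box:
    """Bounding rectangle that a parent node would store for this leaf."""
    return (
        min(b[0] for _, b in leaf),
        min(b[1] for _, b in leaf),
        max(b[2] for _, b in leaf),
        max(b[3] for _, b in leaf),
    )
```

On a 4×4 grid of unit squares with `leaf_capacity=4`, this yields four leaves that each cover a compact 2×2 block, whereas insertion in arbitrary order tends to leave overlapping, elongated leaf rectangles that prune less effectively.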

Failure Modes & Gotchas

Beyond the matrix, a few edge cases catch even careful implementations:

Coordinate tuple order. R-trees, libspatialindex, and Shapely’s bounds all use (minx, miny, maxx, maxy) — that is (lon, lat) order. Store node coordinates as (lon, lat) from the start; the lat/lon swap is the single most common indexing bug and it silently relocates every feature.
MBRs over-select on diagonal geometries. A long diagonal road has an MBR that covers a huge empty area, so it appears as a candidate for many unrelated queries. The exact-refine stage absorbs this, but on diagonal-heavy data the candidate set can be large — measure refine cost, not just index hits.
Boundary fragmentation in grids. Quadkey and grid schemes must register a feature in every cell its bounds touch, so a feature spanning a cell edge appears multiple times; deduplicate query results by feature id.
Float conversion drift. Coordinate precision must be normalized early. Keep the nanodegree integers exact through reconstruction and convert to float once, at insertion — converting repeatedly invites floating-point drift that makes two index builds diverge.
buffer(0) can change area. The zero-width buffer repairs self-intersections but on pathological rings it can drop a lobe or merge them. Compare pre/post area and quarantine geometries whose area shifts beyond a tolerance rather than indexing a silently mangled polygon.
Relations are not free geometry. Multipolygon relations require assembling member ways into rings before they have indexable bounds; index them in a relation-aware pass, not the way pass shown above.
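The boundary-fragmentation point above can be made concrete with a toy grid index. This sketch assumes a plain fixed-size degree grid rather than quadkeys, and `GridIndex`, `cells_for_bounds`, and `cell_deg` are hypothetical names; the deduplication step is the part that carries over to any cell-based scheme.

```python
import math
from collections import defaultdict

Bounds = tuple[float, float, float, float]  # (minx, miny, maxx, maxy)


def cells_for_bounds(bounds: Bounds, cell_deg: float) -> list[tuple[int, int]]:
    """List every grid cell that a bounding box touches."""
    minx, miny, maxx, maxy = bounds
    x0, x1 = math.floor(minx / cell_deg), math.floor(maxx / cell_deg)
    y0, y1 = math.floor(miny / cell_deg), math.floor(maxy / cell_deg)
    return [(cx, cy) for cx in range(x0, x1 + 1) for cy in range(y0, y1 + 1)]


class GridIndex:
    """Toy fixed-grid index: a feature is registered in every cell it touches."""

    def __init__(self, cell_deg: float = 1.0) -> None:
        self.cell_deg = cell_deg
        self.cells: dict[tuple[int, int], list[int]] = defaultdict(list)

    def insert(self, feature_id: int, bounds: Bounds) -> None:
        for cell in cells_for_bounds(bounds, self.cell_deg):
            self.cells[cell].append(feature_id)

    def query(self, bounds: Bounds) -> list[int]:
        """Return candidate feature ids, each reported once."""
        seen: set[int] = set()
        for cell in cells_for_bounds(bounds, self.cell_deg):
            # .get avoids creating empty cells as a side effect of querying.
            seen.update(self.cells.get(cell, ()))
        return sorted(seen)
```

As with an R-tree, the ids returned are candidates only: they share a cell with the query window and still need the exact geometric refine step.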

Reproducibility & Pipeline Automation

Deterministic indexing is non-negotiable for reproducible geospatial workflows, and it rests on three practices:

Input checksum verification. Validate PBF integrity with a SHA-256 hash before parsing, so a partial download or storage bit-rot cannot silently poison the index.
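Deterministic feature ordering. Extracts written by the standard tools are conventionally sorted by type and then by id, although the PBF format itself does not require it, so sort the file first (for example with osmium sort) if the source is unknown. Index insertion should respect that sequence, and any parallelization must use partitioned, non-overlapping bounding boxes to avoid race conditions and nondeterministic id assignment during concurrent writes.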
Environment pinning. Lock Python dependencies, the C-extension version of libspatialindex, and the index parameters (leaf capacity, fill factor) in a configuration manifest, alongside the exact CRS, precision thresholds, and tag filters applied during ingestion — so a downstream analyst can replicate the index bit-for-bit.
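The checksum practice above can be sketched with the standard library alone. The helper names `sha256_of` and `verify_extract` are illustrative, and where the expected digest comes from (a sidecar file, a manifest entry) is left to the pipeline.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so large extracts never load whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_extract(path: str, expected_hex: str) -> None:
    """Raise before parsing if the file does not match its recorded digest."""
    actual = sha256_of(path)
    if actual != expected_hex.strip().lower():
        raise ValueError(
            f"checksum mismatch for {path}: expected {expected_hex}, got {actual}"
        )
```

Calling `verify_extract` as the first step of the build means a truncated download fails fast, before any index files are written.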

Spatial indexing also accelerates compliance and licensing automation. By pre-computing bounding boxes and spatial relationships, a pipeline can rapidly identify features intersecting jurisdictional boundaries, apply region-specific licensing tags, or flag data requiring contributor attribution. For teams running continuous OSM updates, combining osmium apply-changes with incremental R-tree inserts turns raw extracts into query-ready assets without a full index rebuild on every cycle.

Integration Points: Feeding the Next Stage

The finished index is an input, not an endpoint. Its output is a fast candidate-resolution service that the normalization and analytics stages consume without knowing anything about R-tree internals, which is the clean boundary that the parsing and tag-normalization workflows build on. The wiring below joins two OSM datasets spatially and routes geometry failures to a quarantine, following the discipline detailed in error handling in large OSM extracts:

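```python
import logging

import rtree
from shapely.errors import TopologicalError

logger = logging.getLogger(__name__)

# quarantine(probe, exc) is supplied by the normalization stage and is not
# defined in this guide; import it from wherever that stage exposes it.


def spatial_join(index: rtree.index.Index, probes: list) -> list[dict]:
    """Attach each probe geometry to the indexed features it intersects."""
    results: list[dict] = []
    for probe in probes:
        try:
            hits = index.intersection(probe.bounds, objects=True)  # coarse
            matched = [h.object for h in hits if h.object.intersects(probe)]  # exact
            results.append({"probe": probe, "matches": matched})
        except (TopologicalError, ValueError) as exc:
            logger.warning("quarantining probe geometry: %s", exc)
            quarantine(probe, exc)  # provided by the normalization stage
    return results
```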

For ingestion that produces the geometry stream this index consumes, the concurrent reader in async PBF parsing with pyrosm and the windowed approach in memory-efficient chunk processing apply the same non-overlapping bounding-box partitioning that the parallel-construction discussion recommends.

Frequently Asked Questions

Should I index OSM data in WGS 84 or a projected CRS?

Index in native WGS 84 (EPSG:4326) and defer projection to the query or export stage. OSM stores coordinates in geographic degrees, so indexing in degrees keeps bounding boxes exact and avoids re-indexing whenever a consumer needs a different projection. Reproject candidate results on demand with the pyproj Transformer API.

R-tree, quadkey, or H3 — which spatial index should I use?

Match the index to the query. R-trees are best for exact spatial joins and irregular bounding-box queries against varying-density data. Quadkey and grid schemes suit tiling and point-in-polygon pre-filtering. H3 and S2 give near-uniform global cells for aggregation and completeness sampling. Many pipelines run an R-tree for exact joins alongside an H3 column for coarse aggregation, because they answer different questions.

Why does my query return features that do not actually intersect?

An R-tree query is a coarse filter: it returns every feature whose minimum bounding rectangle overlaps the query window, including false positives a rectangle inevitably admits. Always run an exact geometric predicate (intersects, contains) on the candidate geometries to remove them. Storing the geometry as the index object lets the refine step run without rereading the source file.

How do I keep memory bounded when indexing a continental or planet extract?

Use a disk-backed node-location store by passing idx='sparse_file_array' to apply_file, and construct the R-tree as a persistent, disk-backed rtree.index.Index on a file path rather than in memory. For regional extracts under ~1 GB, idx='flex_mem' is faster and fits in RAM. Partition planet files into non-overlapping geographic regions and build one index per partition in parallel.

How do I make index builds reproducible?

Verify the PBF SHA-256 before parsing, insert features in PBF id order, pin the libspatialindex version and index parameters in a manifest, and partition any parallel build on non-overlapping bounding boxes so feature-id assignment is deterministic. Record the CRS, precision thresholds, and tag filters used so the index can be rebuilt bit-for-bit.

Related

PBF File Structure Deep Dive: the streamed, delta-encoded source the index is built from.
Node-Way-Relation Data Model: the primitives whose reconstructed geometries are indexed.
Coordinate Reference Systems in OSM: why indexing in WGS 84 and reprojecting on query is the default.
Tag Taxonomy & Key-Value Standards: filtering feature classes into per-class indexes before insertion.
OSM XML vs PBF Comparison: why PBF is the practical input for scalable index construction.
Error Handling in Large OSM Extracts: quarantine and remediation for the defective geometries indexing surfaces.

This guide is part of the OSM Data Fundamentals & Architecture section — return there for the full map of the OSM data model, serialization formats, and ingestion foundations.
