Grid systems for spatial analysis

This document explains the purpose and methods of using geospatial grid systems (such as S2 and H3) in BigQuery to organize spatial data in standardized geographic areas. It also explains how to choose the right grid system for your application. This document is useful for anyone who works with spatial data and performs spatial analysis in BigQuery.

Overview and challenges of using spatial analysis

Spatial analytics helps to show the relationship between entities (such as shops or houses) and events in a physical space. Spatial analytics that uses the surface of the earth as the physical space is called geospatial analytics. BigQuery includes geospatial features and functions that enable you to perform geospatial analysis at scale.

Many geospatial use cases involve aggregating data within localized areas, and comparing statistical aggregations of those areas with each other. These localized areas are represented as polygons in a spatial database table. In some contexts, this method is called statistical geography. The method of determining the extent of the geographic areas needs to be standardized for better reporting, analysis, and spatial indexing. For example, a retailer might want to analyze the changes in demographics over time in areas where their stores are located, or in areas where they are contemplating building a new store. Or, an insurance company might want to improve their understanding of property risks by analyzing prevailing natural hazard risks in a particular area.

Due to strict data privacy regulations in many areas, datasets that contain location information need to be de-identified or partially anonymized to help protect the privacy of individuals represented in the data. For example, you might need to perform a geographic credit concentration risk analysis on a dataset that contains data about outstanding mortgage loans. To de-identify the dataset to make it suitable for compliant analysis, you need to retain relevant information about the location of the properties, but avoid using a specific address or longitude and latitude coordinates.
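As a rough illustration of keeping location information while dropping exact coordinates, the following Python sketch replaces each latitude and longitude with the key of a coarse grid cell and then sums loan balances per cell. The fixed-size latitude/longitude grid, the grid_cell_key and balance_by_cell helpers, the 0.01-degree cell size, and the sample loans are all assumptions made for illustration. They are not part of BigQuery, S2, or H3, and the cell size that is coarse enough depends on the applicable regulation.

```python
import math
from collections import defaultdict


def grid_cell_key(lat, lng, cell_size_deg=0.01):
    """Map a coordinate to the key of a fixed-size lat/lng cell (illustrative scheme)."""
    row = math.floor((lat + 90.0) / cell_size_deg)
    col = math.floor((lng + 180.0) / cell_size_deg)
    return f"{row}_{col}"


def balance_by_cell(loans, cell_size_deg=0.01):
    """Sum outstanding balances per cell, discarding the exact coordinates."""
    totals = defaultdict(float)
    for lat, lng, balance in loans:
        totals[grid_cell_key(lat, lng, cell_size_deg)] += balance
    return dict(totals)


# Made-up sample loans: (latitude, longitude, outstanding balance).
sample_loans = [
    (37.7749, -122.4194, 250_000.0),
    (37.7751, -122.4199, 400_000.0),  # falls in the same cell as the first loan
    (37.7951, -122.4194, 150_000.0),  # falls in a different cell
]
cell_totals = balance_by_cell(sample_loans)
```

Only the cell key and the aggregated balance leave this step, so downstream analysis can still compare areas without access to individual property locations.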

In the preceding examples, the designers of these analyses are presented with the following challenges:

How do you draw the area boundaries within which you analyze changes over time?
Should you use existing administrative boundaries, such as census tracts, or a multi-resolution grid system?

This document aims to answer these questions by explaining each option, describing best practices, and helping you avoid common pitfalls.

Common pitfalls while choosing statistical areas

Business datasets such as real estate sales, marketing campaigns, ecommerce shipments, and insurance policies are suitable for spatial analysis. Often these datasets contain what appears to be a convenient spatial join key, such as a census tract, a zip code, or the name of a city. Public datasets that contain representations of census tracts, zip codes, and cities are readily available, making them tempting to use as administrative boundaries for statistical aggregation.

While nominally convenient, these and other administrative boundaries come with drawbacks. These boundaries might work well in the early stages of an analytics project, but their drawbacks tend to surface in the later stages.

Postal codes

Postal codes are used to route mail in various countries around the world, and due to this ubiquity, are often used to reference locations and areas in both spatial and non-spatial datasets. Referring to the preceding example about the mortgage loan, a dataset often needs to be de-identified before downstream analysis can be performed. Because each property address contains a zip code and zip code reference tables are readily accessible, the zip code is a convenient option for a join key for spatial analysis.

A pitfall in using postal codes is that they are not represented as polygons, and there is no single correct source of truth for postal code areas. Additionally, postal codes are not a good representation of real human behavior. The most commonly used postal code data in the US is from the US Census Bureau TIGER/Line Shapefiles, which contain a dataset called ZCTA5 (Zip Code Tabulation Area). This dataset represents an approximation of zip code boundaries that are derived from mail delivery routes. However, some zip codes that represent individual buildings have no boundary at all. This problem is present in other countries as well, making it difficult to form a single global fact table that contains an authoritative set of postal code boundaries that can be used across systems and across datasets.

Additionally, there's no standardized postal code format used around the world. Some are numeric, ranging from three to ten digits, while some are alphanumeric. There is also an overlap between countries, making it necessary to store the country of origin in a separate column along with the postal code. Some countries don't use postal codes, further complicating the analysis.
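The need for a separate country column can be sketched in Python as a composite key. The postal_key helper, the normalization rules (trimming, removing spaces, uppercasing), and the shipment records below are hypothetical and chosen only to show that the same code string under two country codes must remain two distinct areas.

```python
from collections import Counter


def postal_key(country_code, postal_code):
    """Build a composite join key from a country code and a postal code."""
    normalized_country = country_code.strip().upper()
    # Keep the code as a string so that leading zeros and letters survive.
    normalized_code = postal_code.replace(" ", "").upper()
    return (normalized_country, normalized_code)


# Made-up shipment records: (country code, postal code as entered).
shipments = [
    ("FR", "75001"),
    ("us", "75001"),
    ("US", " 75001"),
    ("GB", "sw1a 1aa"),
]
shipments_per_area = Counter(postal_key(c, p) for c, p in shipments)
```

Joining on the postal code alone would merge the FR and US rows into one area; joining on the pair keeps them apart.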

Census tracts, cities, and counties

There are some administrative units, such as census tracts, cities, and counties, that don't suffer from the lack of an authoritative boundary. The boundaries of cities, for example, are in most cases well established by government authorities. Census tracts are well-defined by the US Census Bureau, and by their analogous institutions in most other countries.

A drawback of using these and other administrative boundaries is that they change over time, and are not geographically consistent with one another. Counties and cities merge or break apart from one another and are occasionally renamed. Census tracts are updated once each decade in the US, and at different times in other countries. Confusingly, in some cases the geographic boundary can change but its unique identifier remains the same, making it difficult to analyze and understand changes over time.

Another drawback that is common to some administrative boundaries is that they are discrete areas with no geographic hierarchy. In addition to comparing individual areas with one another, a common requirement is to compare aggregations of the areas themselves to other aggregations. For example, a retailer implementing the Huff model might want to run this analysis using multiple distances, which might not correspond to administrative areas that are used elsewhere in the business.
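To make the Huff model example more concrete, the following Python sketch uses a commonly cited form of the model, in which the probability that a customer visits a store is proportional to the store's attractiveness raised to a power alpha divided by the distance raised to a power beta. The exponent values, the max_distance_km cutoff, the store names, and all sample figures are assumptions for illustration and are not taken from this document.

```python
def huff_probabilities(attractiveness, distance_km, max_distance_km,
                       alpha=1.0, beta=2.0):
    """Return visit probabilities for the stores within max_distance_km."""
    weights = {
        store: attractiveness[store] ** alpha / distance_km[store] ** beta
        for store in attractiveness
        if 0 < distance_km[store] <= max_distance_km
    }
    total = sum(weights.values())
    if total == 0:
        return {}
    return {store: weight / total for store, weight in weights.items()}


# Made-up inputs for one customer location: floor area and distance per store.
attractiveness = {"store_a": 1000.0, "store_b": 1000.0, "store_c": 4000.0}
distance_km = {"store_a": 1.0, "store_b": 2.0, "store_c": 10.0}

within_5_km = huff_probabilities(attractiveness, distance_km, max_distance_km=5.0)
within_20_km = huff_probabilities(attractiveness, distance_km, max_distance_km=20.0)
```

Running the same calculation with different distance cutoffs changes which stores compete, and those cutoffs rarely line up with city or county boundaries.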

Single and multi-resolution grids

Single-resolution grids consist of discrete units that have no geographic relation to larger areas that contain those units. For example, postal codes have an inconsistent geographic relationship with the boundaries of larger administrative units, such as cities or counties that might contain zip codes. For spatial analysis, it is important to understand how different areas are related to each other without deep knowledge of the history and legislation that define the area polygon.

Multi-resolution grids are sometimes called hierarchical grids because cells at each zoom level are subdivided into smaller cells at higher zoom levels. Multi-resolution grids consist of a well-defined hierarchy of units that are contained within larger units. Census tracts, for example, contain block groups, which in turn contain blocks. This consistent hierarchical relationship can be useful for statistical aggregation. For example, by taking an average of incomes of all the block groups contained in a tract, you can show the average income for the census tract that contains those block groups. This wouldn't be possible with postal codes because all postal areas are located at a single resolution. It would be difficult to compare the income of a tract with its surrounding tracts as there's no standardized way of defining adjacency, or comparing income in different countries.
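The block group to tract roll-up described above can be sketched in Python. This sketch relies on the US Census GEOID convention in which a 12-character block group identifier starts with the 11-character identifier of its tract. The GEOID values and income figures are made up, and the helper names are hypothetical. The sketch takes a plain mean, as in the text; a mean weighted by the number of households in each block group would usually be more representative.

```python
from collections import defaultdict
from statistics import mean

TRACT_GEOID_LENGTH = 11  # state (2) + county (3) + tract (6)


def tract_of(block_group_geoid):
    """Return the tract GEOID that contains a block group GEOID."""
    return block_group_geoid[:TRACT_GEOID_LENGTH]


def average_income_by_tract(block_group_income):
    """Average block group incomes within each containing tract."""
    grouped = defaultdict(list)
    for geoid, income in block_group_income.items():
        grouped[tract_of(geoid)].append(income)
    return {tract: mean(values) for tract, values in grouped.items()}


# Made-up block group incomes keyed by 12-character GEOID.
block_group_income = {
    "060750101001": 60000,
    "060750101002": 80000,
    "060750102001": 50000,
}
tract_income = average_income_by_tract(block_group_income)
```

The parent of each unit is derived from the identifier alone, which is the property that a single-resolution scheme such as postal codes lacks.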

S2 and H3 grid systems

This section provides an overview of S2 and H3 grid systems.

S2

S2 geometry is an open source hierarchical grid system developed by Google and released to the public in 2011. You can use the S2 grid system to organize and index spatial data by assigning a unique 64-bit integer to each cell. There are 31 levels of resolution. Each cell is represented as a square and is designed for operations on spherical geometries (sometimes called geographies). Each square is subdivided into four smaller squares. Neighbor traversal, which is the ability to identify neighboring S2 cells, is less well-defined because squares can have either four or eight relevant neighbors depending on the type of analysis. The following is an example of multi-resolution S2 grid cells:

Example of S2 grid cells.

BigQuery uses S2 cells to index spatial data and exposes multiple functions. For example, S2_CELLIDFROMPOINT returns the S2 cell ID that contains a point on earth's surface at a given level.
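The 64-bit IDs and the four-way subdivision can be explored with a few bit operations. The following Python sketch assumes the unsigned cell ID layout used by the open source S2 library: three bits for one of six cube faces, two bits per level for the position, and a single trailing 1 bit that marks the level. The function names are hypothetical and this is not the BigQuery or S2 library API. Converting from a signed 64-bit value is left out.

```python
MAX_LEVEL = 30   # levels run from 0 (a whole cube face) to 30 (a leaf cell)
FACE_COUNT = 6   # the sphere is covered by six level-0 cells


def cell_count(level):
    """Number of cells at a level: each cell splits into four children."""
    return FACE_COUNT * 4 ** level


def lowest_set_bit(cell_id):
    """Isolate the trailing marker bit that encodes the level."""
    return cell_id & -cell_id


def cell_level(cell_id):
    """Derive the level from the position of the marker bit."""
    return MAX_LEVEL - (lowest_set_bit(cell_id).bit_length() - 1) // 2


def cell_parent(cell_id, level):
    """Return the ancestor at a coarser level by truncating position bits."""
    marker = 1 << (2 * (MAX_LEVEL - level))
    return (cell_id & -marker) | marker


def cell_children(cell_id):
    """Return the four child cell IDs one level finer."""
    if cell_level(cell_id) == MAX_LEVEL:
        raise ValueError("leaf cells have no children")
    child_marker = lowest_set_bit(cell_id) >> 2
    return [cell_id + (2 * k - 3) * child_marker for k in range(4)]


face0 = 1 << 60                  # unsigned ID of the level-0 cell on face 0
children = cell_children(face0)  # its four level-1 cells
```

Because a parent ID is a prefix of its children's IDs, aggregating from a fine level to a coarse level needs no geometry, only integer arithmetic on the IDs.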

H3

H3 is an open source hierarchical grid system developed by Uber and used by Overture Maps. There are 16 levels of resolution. Each cell is represented as a hexagon, and like S2, each cell is assigned a unique 64-bit integer. In the example about visualization of H3 cells covering the Gulf of Mexico, the smaller H3 cells are not perfectly contained by the larger cells.

Each cell subdivides into seven smaller hexagons. The subdivision isn't exact, but it is adequate for many use cases. Each cell shares an edge with six neighboring cells, simplifying neighbor traversal. The exception is that at each level, there are 12 pentagons, which share an edge with five neighbors instead of six. Although H3 is not supported in BigQuery, you can add H3 support to BigQuery using the Carto Analytics Toolbox for BigQuery.
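The effect of the seven-way subdivision and the 12 pentagons on cell counts can be worked through in Python. The sketch assumes two facts about H3 that are not stated in this document: resolution 0 has 122 cells, and a pentagon has six children (one pentagon and five hexagons) rather than seven. The function names are hypothetical and are not part of the H3 library.

```python
H3_MAX_RESOLUTION = 15    # 16 resolutions, numbered 0 through 15
BASE_CELL_COUNT = 122     # assumed number of cells at resolution 0
PENTAGON_COUNT = 12       # pentagons present at every resolution


def h3_cell_count(resolution):
    """Count cells by repeatedly subdividing hexagons and pentagons."""
    if not 0 <= resolution <= H3_MAX_RESOLUTION:
        raise ValueError("resolution must be between 0 and 15")
    count = BASE_CELL_COUNT
    for _ in range(resolution):
        hexagons = count - PENTAGON_COUNT
        # Hexagons yield seven children each, pentagons yield six.
        count = 7 * hexagons + 6 * PENTAGON_COUNT
    return count


def h3_cell_count_closed_form(resolution):
    """Closed form of the same recurrence."""
    return 2 + 120 * 7 ** resolution


finest_count = h3_cell_count(H3_MAX_RESOLUTION)
```

At every resolution only 12 of the cells are pentagons, so the five-neighbor exception affects a vanishing share of cells at the finer resolutions.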

While both S2 and H3 libraries are open source and available under the Apache 2 license, the H3 library has more detailed documentation.

HEALPix

An additional scheme to grid the sphere, commonly used in the astronomy field, is known as Hierarchical Equal Area isoLatitude Pixelation (HEALPix). With HEALPix, compute time remains constant regardless of the hierarchical pixel depth.

HEALPix is a hierarchical equal-area pixelization scheme for the sphere. It is used to represent and analyze data on the celestial (or other) sphere. In addition to constant compute time, the HEALPix grid has the following characteristics:

The grid cells are hierarchical, where parent-child relationships are maintained.
At a given level of the hierarchy, cells have equal area.
The cells follow an iso-latitude distribution, allowing higher performance for spectral methods.
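The hierarchical and equal-area characteristics listed above can be expressed numerically. The following Python sketch assumes the standard HEALPix definitions, which are not given in this document: 12 base pixels, a side resolution of 2 raised to the order, and a pixel count of 12 times the side resolution squared. The function names are hypothetical and do not come from any HEALPix library.

```python
import math

BASE_PIXEL_COUNT = 12  # assumed number of HEALPix base pixels


def healpix_pixel_count(order):
    """Number of pixels at a hierarchy order (each pixel has four children)."""
    n_side = 2 ** order
    return BASE_PIXEL_COUNT * n_side ** 2


def healpix_pixel_area_sr(order):
    """Area of every pixel at an order, in steradians on the unit sphere."""
    return 4.0 * math.pi / healpix_pixel_count(order)


# Pixel count and per-pixel area for the first few orders.
summary = [(k, healpix_pixel_count(k), healpix_pixel_area_sr(k)) for k in range(4)]
```

Because every pixel at an order has the same area, a per-pixel count is directly comparable to a density without an area correction.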

BigQuery does not support HEALPix, but there are numerous implementations across a variety of languages, including JavaScript, which makes it convenient for use in BigQuery user-defined functions (UDFs).

Example use cases for each indexing strategy

This section provides some examples that help you evaluate which grid system is best for your use case.

Many analytics and reporting use cases involve visualization, either as part of the analysis itself or for reporting to business stakeholders. These visualizations are commonly presented in Web Mercator, which is the planar projection that is used by Google Maps and many other web mapping applications. In cases where visualization plays a vital role, H3 cells deliver a subjectively better visualization experience. S2 cells, especially at higher latitudes, tend to appear more distorted than H3, and don't appear consistent with cells of lower latitudes when presented in a planar projection.
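One reason that cells look inconsistent across latitudes on a web map is the projection itself. The following Python sketch assumes the spherical Mercator relationship in which lengths are stretched by 1 / cos(latitude) relative to the equator, so areas are stretched by the square of that factor. The function names are hypothetical, and the sketch describes the projection only. It does not measure the shape distortion of S2 or H3 cells.

```python
import math


def mercator_linear_scale(lat_deg):
    """Length stretch relative to the equator at a latitude."""
    return 1.0 / math.cos(math.radians(lat_deg))


def mercator_area_scale(lat_deg):
    """Area stretch relative to the equator at a latitude."""
    return mercator_linear_scale(lat_deg) ** 2


# Stretch factors at a few latitudes, from the equator toward the poles.
scales = {lat: mercator_area_scale(lat) for lat in (0, 30, 45, 60)}
```

Under this assumption, two cells of the same ground area drawn at the equator and at 60 degrees latitude differ in displayed area by roughly a factor of four.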

H3 cells simplify implementation where neighbor comparison plays an important role in the analysis. For example, a comparative analysis between sections of a city might help to decide which location is suitable for opening a new retail store or distribution center. The analysis requires statistical calculations for attributes of a given cell that are compared with those of its neighboring cells.
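A neighbor comparison of this kind can be sketched in Python on a flat hexagonal grid. The sketch uses axial (q, r) coordinates as a simplified stand-in for H3 cell IDs and does not call the H3 library. The neighbor_contrast helper and the sample values are hypothetical.

```python
from statistics import mean

# The six edge-sharing neighbors of a hexagon in axial coordinates.
AXIAL_NEIGHBOR_OFFSETS = [(1, 0), (1, -1), (0, -1), (-1, 0), (-1, 1), (0, 1)]


def neighbors(cell):
    """Return the six cells that share an edge with the given cell."""
    q, r = cell
    return [(q + dq, r + dr) for dq, dr in AXIAL_NEIGHBOR_OFFSETS]


def neighbor_contrast(values):
    """For each cell, subtract the mean of its known neighbors from its value."""
    contrast = {}
    for cell, value in values.items():
        nearby = [values[n] for n in neighbors(cell) if n in values]
        if nearby:
            contrast[cell] = value - mean(nearby)
    return contrast


# Made-up attribute per cell, for example weekly foot traffic.
foot_traffic = {(0, 0): 100, (1, 0): 40, (0, 1): 60}
contrast = neighbor_contrast(foot_traffic)
```

Every hexagon has one kind of neighbor, so a single offset table is enough; a square grid would first require a choice between four edge neighbors and eight edge-or-corner neighbors.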

S2 cells can work better in analyses that are global in nature, such as analyses that involve measurements of distances and angles. Pokemon Go by Niantic utilizes S2 cells to determine where game assets are placed and how they are distributed. The exact subdivision property of S2 cells ensures that game assets can be evenly distributed across the globe.

What's next

For best practices for spatial clustering, see Spatial Clustering on BigQuery - Best Practices.
Learn to create a spatial hierarchy from imperfect data.
Learn about S2 geometry on GitHub.
Learn about H3 geometry on GitHub.
See examples that use H3, BigQuery, and Earth Engine.

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2025-12-15 UTC.
