Back

Nov 08, 2023

Why TileDB for Geometries

Geospatial
15 min read
Norman Barker

Norman Barker

VP of Geospatial

Spatial data is dominant in numerous applications, such as earth observation, location-based services, defense, biomedical and many more. Such data is very diverse and comes in various forms, including two important classes: (1) 2D or 3D points (e.g., LiDAR and SONAR) and (2) arbitrary 2D or 3D polygons (e.g., buildings, cell and nucleus boundaries and areas of interest). Querying such data entails submitting a rectangular or other arbitrary multi-polygon regions with a spatial operator such as contains or intersects with the points or polygons. These geospatial data and queries are commonly referred to as “geometries” that represent geographic data objects in the Geospatial world. Providing an efficient mechanism for handling querying and storing geometries at grand scale (hundreds of millions of billions of points and polygons) is a challenging task.

The Geospatial domain has well known query methods and storage encodings for geometries so we are focusing on geospatial applications in this blog. However, look out for a blog post on our innovative work with spatially resolved omics data where we combine the domain knowledge from the biomedical and geospatial experts within TileDB Inc.

Today we are announcing full geometries support in TileDB! TileDB is a multi-dimensional array database management system and in this blog I explain why it is a natural, super efficient choice for handling geometries. Databases such as PostgreSQL (and specifically its extension, PostGIS) and Oracle have had spatial capabilities for many years, whereas recently new open formats for storing geometries have come up, such as GeoParquet. TileDB brings a lot more to the table as compared to these other approaches. Here are a few takeaways:

  • TileDB integrates with GDAL and MariaDB, so you can plug it into a huge geospatial open-source ecosystem or use it as a replacement for PostGIS if you prefer SQL.
  • TileDB is 4x faster than GeoParquet when slicing a geometry dataset. GeoParquet requires a spatial partitioning scheme whereas TileDB internally implements spatial indexes (such as R-trees).
  • TileDB works on any storage backend, including scalable and inexpensive cloud object stores (AWS S3, Google Cloud Storage and Azure Blob Storage are all supported).
  • TileDB has a completely serverless, massively distributed compute infrastructure and can handle billions of geometries and tens of thousands of queries per second.
  • TileDB is a single, unified solution that manages the geospatial data objects along with the raw original data (e.g., images, text files, etc), the ML embedding models, and all the other data modalities in your application (tables, rasters, point clouds, etc).
  • You can get a lot of value from TileDB as a geospatial database either from our open-source offerings (MIT License), or our enterprise-grade commercial product (TileDB Cloud).

We worked with Even Rouault, the primary maintainer of GDAL, to implement support for geometries in the GDAL OGR driver which we use to ingest geospatial data into TileDB arrays.

In the remainder of the article, I will first cover the very basics on Geometries and TileDB. I will then explain how TileDB supports geometries and why it is a very powerful solution. Next, I will outline the competitive advantage of TileDB versus other approaches. Finally, I will conclude with our upcoming work on geometries.

For those who wish to jump directly on some action, read our TileDB 101: Geometries blog post.

Enjoy!

The Basics

What are Geometries?

Geospatial geometries are defined in ISO-19125 parts 1 and 2 and are primarily points, polygons, curves, and surfaces. The figure below shows some examples.

geometry type examples: point, linestring, multi-polygonpolygon bounding box over many building geometries
(b) ST_Intersects query over multiple building geometries
Map data by © OpenStreetMap, under ODbL. Buildings from Overture Maps Foundation, under ODbL.

Geometries can be applied as queries on 2D or 3D point cloud data, or they can be stored as the underlying data and queried by other geometries. In every case, storage compression and computational performance are very important.

Software that supports geometries includes GDAL, Fiona, R-Spatial and the broader open-source ecosystem that integrates with it (e.g., such as GeoPandas with pyogrio in Python), as well as PostGIS where you can use SQL statements to ingest and query your data. GeoParquet is another alternative, which can integrate with other database systems like DuckDB.

What is TileDB?

TileDB is a database system architected around dense and sparse multi-dimensional arrays. It consists of (1) an open-source C++ library called TileDB Embedded and a broad open-source ecosystem of APIs and integrations, and (2) a commercial product called TileDB Cloud. TileDB Embedded is the core array engine responsible for the open-spec array format, whereas TileDB Cloud is a full-fledged, secure database management system, which comes with a powerful serverless distributed computing platform, a holistic catalog, governance capabilities and interactive analysis tools.

The TileDB array model serves as the basis for many vertical applications, such as in Life Sciences, Geospatial and Machine Learning. We were fortunate to attract and build geospatial expertise within our team and dedicated resources to build vertical solutions for the Geospatial sector. Geometries is one of our geospatial vertical products.

The next section explains how we implemented geometries at TileDB, and why it is a great fit for such a use case.

Geometries in TileDB

TileDB has native support for sparse arrays, which can effectively store any multi-dimensional point in space. But how would TileDB store arbitrary shapes? The trick is to calculate and store the center of the envelope of the 2D or 3D geometry. In addition, for each geometry, we calculate the distances in X, Y and Z from the center of the envelope of that geometry to the envelope boundary. We finally store the maximum of all those distances as part of the array metadata. This value will be used during the query phase. The geometry itself is stored as a TileDB attribute in Well Known Binary format, and the geospatial metadata is maintained in the TileDB array or group.

geometries1

During any arbitrary spatial query, we first calculate the minimum bounding box of the query geometry, and then expand it by the maximum distance we calculated during ingestion. Any point in the array included in this expanded bounding box is either a result (i.e., an actual data geometry intersected or included in the query geometry) or not included in the actual query geometry. We then filter the returned results from the expanded bounding box query in a fast post-processing step.

geometries2

The cool thing about this implementation is that the centers of the geometry bounds, represented as sparse TileDB array cells, are natively indexed in TileDB with R-trees. That allows for super efficient pruning of unnecessary points that are guaranteed not to contribute the result, leading to superb query performance.

You can use the TileDB geometries in a variety of ways:

  • Vanilla TileDB arrays via a large number of APIs (C, C++, Python, Go, Java, C#, R, …)
  • Via GDAL, Fiona and using GeoPandas
  • Via MariaDB

Here are some code examples with GDAL / GeoPandas and MariaDB:

import fiona
fiona.drvsupport.supported_drivers["TileDB"] = "rw"
fiona.vfs.SCHEMES["tiledb"] = "tiledb"
import geopandas as gpd
import matplotlib.pyplot as plt
import shapely

x = 172.63
y = -43.53
circle = shapely.buffer(shapely.Point(x, y), 0.0025)
gdf = gpd.read_file(building_array, engine="pyogrio", use_arrow=True, bbox=circle.bounds)
for r in range(gdf.shape[0]):
    if gdf.geometry[r].intersects(circle):
        for g in gdf.geometry[r].geoms:
            plt.plot(*g.exterior.xy)

plt.plot(*circle.exterior.xy)
plt.show()

The same query with MariaDB and SQL:

tiledb.cloud.sql.exec(
    query="""
            SELECT names, ST_GeometryFromWkb(geometry) as geom
            FROM `tiledb://norman/overture_buildings`
            WHERE ST_Intersects(
                ST_GeometryFromWkb(geometry),
                ST_Buffer(ST_GeometryFromText('POINT(172.63 -43.53)'), 0.0025)
            );    
          """
    )
radius query
TileDB ST_INTERSECTS radius query over the Overture Map Foundation Building dataset

Other benefits with open-source TileDB:

  • Efficient object store support
  • Versioning and time traveling
  • Support for multiple different compressors
  • Schema evolution (e.g., for adding classification of geometries)

Extra benefits in TileDB Cloud:

  • A secure way to catalog and share your geometry data
  • Public datasets and notebooks
  • Easy notebook and dashboard environments for day-to-day analysis
  • Serverless, large-scale ingestion and analysis via task graphs

As a nice demonstration of the power of TileDB Cloud’s computational infrastructure, we worked with the buildings dataset from the Overture Maps Foundation that contains buildings from the entire world. Currently this dataset consists of 120 Parquet files that contain 785,524,851 buildings.

It was possible to ingest the whole dataset in parallel and completely serverless with the task graph shown in the figure below. The ingestion took 20 minutes, and the cost was just a few dollars. If you are interested in learning more about task graphs, I recommend reading the blog post TileDB Cloud: Task Graphs.

task graph ingestion on TileDB Cloud, spread across hundreds of serverless instances
TileDB task graph for ingesting the global Overture building dataset

After the ingestion, all buildings reside as geometries stored in a singleTileDB array on AWS S3. We use the spatial filter from this article to measure performance of the spatial queries against TileDB and others.

This array can be read through any one of our APIs or integrations. Querying the array through GDAL or our Python API for all buildings WITHIN the polygon boundary below takes about 12 seconds and returns 9,037 buildings (whereas querying using a specialized alternative like Apache Sedona takes about 48 seconds).

overhead view of Bellevue building geometries
Spatial query for buildings within a polygon boundary for the City of Bellevue, Washington

Yet one more benefit for adopting TileDB:

  • TileDB stores more than geometries. Its dense and sparse multi-dimensional arrays allow it to morph and capture any data modality, such as point clouds, rasters and tables – even flat files.

TileDB Differentiation

TileDB’s competitive advantage for geometries can be summarized as follows:

  • Performance: Geometries are represented as TileDB sparse array cells, which are indexed with R-trees. That yields superb performance.
  • Serverless: All deployment modes supported by TileDB are serverless, even when you need to outgrow your local machine and scale in the cloud. This has a direct impact on the operational cost, which TileDB minimizes.
  • Cloud-native: TileDB is uber optimized for object stores. As such, you can scale your vectors to cloud storage, while enjoying superb performance. Again, this leads to significant cost savings.
  • Multiple modalities: TileDB is not just a spatial database. TileDB envisions to store, manage and analyze all your geospatial, tabular, ML data and files.
  • Geospatial domain expertise: The team has industry experts in managing and processing geospatial data who can provide valuable data modeling and optimization insights.

In addition to the above, here is a brief qualitative and quantitative comparison versus other solutions you might know. In our experiments we used the building dataset from Overture Maps Foundation and tested with storing on S3 as well as local disk on a 8-core machine and 32 GB of RAM (m5a.2xlarge on AWS EC2) with the polygon boundary query above as the actual spatial filter. Note that the system caches were cleared before each test run and with the exception of PostGIS all datasets used ZSTD compression.

We include the preliminary benchmarks below and summarize the results in turn for each solution immediately after that.

performance testing with GDAL bench_ogr_batchperformance testing with GDAL bench_ogr_c_api
GDAL performance metrics

Notes:

  1. bench_ogr_c_api iterates over features with calls to GetNextFeature() in the GDAL API
  2. bench-ogr_batch can be used to measure the raw performance of the Arrow API within GDAL OGR.
  3. bench_ogr_batch could not be used with a partitioned GeoParquet dataset.
  4. GDAL_NUM_THREADS was set to be ALL_CPUS.
  5. PostGIS required using the local file system and was restarted and the disk cache cleared before each test.
  6. GeoParquet used the Wherobots GeoHash dataset with ZSTD compression so that the geometries could be read by GDAL 3.7.1. The GDAL client does not use Hive Partitioning.
  7. The “-spat” input argument to the test utility was the 2-D bounding box of the spatial filter (minx, miny, maxx, maxy) of (-122.235128, 47.574044, -122.181229, 47.657588).
TileDB performance metrics alonside PostGIS, Geopandas, Postgeese, Apache Sedona
Custom client performance metrics

Notes:

  1. PostGeese tested against the S3 public release from Wherobots.
  2. Apache Sedona was tested against the S3 public release from Wherobots
  3. PostGIS required using the local file system and was restarted before each test.
  4. GeoPandas+TileDB is using pyogrio and Apache Arrow.
  5. Server-less MyTile is required to be run in the cloud.
  6. TileDB-Py results will be as fast as geopandas+tiledb when we add support for geometry types in the library.
  • GeoParquet: GeoParquet is an incubating Open Geospatial Consortium standard that adds interoperable geospatial types (e.g. Point, Line, Polygon) to Parquet. It has early adoption despite not having an internal spatial index. The current approach is to index multiple parquet files using Hive partitioning which reintroduces the data management issue when dealing with data at scale. TileDB enables you to store your geospatial data objects in a single array with a performant spatial index. To be used efficiently GeoParquet requires spatial dataset partitioning, which is not supported by GDAL but is by Apache Sedona and Postgeese, without partitioning a full scan needs to be performed. Many tools cannot use Hive partitioning and in these cases TileDB is over 40x faster than GeoParquet. When using a spatial partition, such as when using PostGeese, TileDB is 4x faster.
  • ESRI: ESRI is a commercial vendor for GIS data, in particular analysis of geospatial data objects either as online maps or using desktop tooling along with cloud pipelines. The knowledge required to operate ESRI products requires specialist hires for your team. We did not purchase a license to perform benchmarks at this stage.
  • PostGIS: The elephant in the room! PostGIS extends the capabilities of the PostgreSQL relational database by adding support for storing, indexing and querying geographic data. TileDB goes further and indexes, versions and stores all of your data as TileDB arrays. TileDB is a “column store” allowing faster processing on the attributes of the data that you care about. It is also possible to use the Foreign Data Wrapper in PostgreSQL to migrate your data to TileDB arrays. When used through GDAL, TileDB is nearly 3x faster when performing a spatial query on the buildings dataset when using the new GDAL Arrow API, even when the data is stored on S3 TileDB outperforms PostGIS with local storage by 1.6x. When PostGIS is queried directly through the PostgreSQL CLI TileDB is still faster (1.4x). Additionally TileDB allows storage of your data in your object storage. Serverless SQL performance with MyTile is fast and does not require setting up PostGIS infrastructure. The OGR C API is the older way of accessing geometries and PostGIS outperforms TileDB, but this access approach is not recommended.
  • PostGeese: This is a spatial extension to DuckDB to enable geospatial processing. In our testing of loading the building outlines GeoParquet datasets into a DuckDB table and testing a spatial intersection TileDB outperforms DuckDB by 4x.
  • Apache Sedona: This is a cluster computing system for processing large-scale spatial data. In our experiments with the building dataset and testing a spatial intersection, TileDB is 4x faster.

TileDB Roadmap

Currently, the primary method of easily accessing geometries that are stored in TileDB arrays is through GDAL and libraries built on top of it such as Fiona/Shapely or R-Spatial. The spatial queries can also be performed using many of our integrations in C++, Java and Python but require decoding the Well Known Binary (WKB) and interpreting the geospatial metadata. In the near future, we plan to do the following:

  1. Store geometries directly in the R-tree as bounding boxes to prevent calculating centers of mass and expanding the query by the maximum distances from the centers. That will speed up the query, especially for datasets with skewed geometry sizes.
  2. Adding additional geometry types to our core TileDB library so that users outside of the GDAL ecosystem will know that the data stored in the blob is WKB.
  3. Support for geospatial queries directly within our own APIs, the spatial filtering of data will happen in the core TileDB library.
  4. Multi-dimensional spatial indexing – we will extend full (more than 2.5D) spatial filtering over multiple dimensions, e.g. X, Y and Z as used in collision detection between point clouds and CAD models.
  5. Spatial-temporal indexing such as those found in predictive and historical R-trees.
  6. More advanced visualizations with our integration into BabylonJS.

Supporting geometries in TileDB is exciting and enables users to support simpler workflows for processing their geospatial data. Stay tuned for more updates from the TileDB team!

What’s Next?

In this article I described the support for geometries in TileDB and how you can now store geographic data objects as TileDB arrays. You can get kickstarted with TileDB geometries by reading the blog TileDB 101: Geometries. You can find a plethora of more information about TileDB on our official website.

We’d love to connect with you and hear what you think! Feel free to contact us, join our Slack community, or let us know on Twitter and LinkedIn.

Want to see TileDB in action?
Norman Barker

Norman Barker

VP of Geospatial