Customer Case Study
Domain
Geospatial
Datatypes
Seismic
Geodetic
Challenge
Efficient distribution of large historical datasets
As the operator of the NSF's GAGE and SAGE Facilities, EarthScope Consortium supports a vast network of instruments generating a variety of geodetic, seismic, and related data. (Geodesy is the science of measuring the geometric shape of the Earth, frequently using satellite positioning networks like GPS.) EarthScope collects geodetic and seismic data from thousands of terrestrial stations every day and manages decades of historical data that are vital for open Earth science research.
In recent years, EarthScope began outgrowing the petabyte of local storage capacity at each of its two legacy SANs, and management of its on-prem systems had accrued significant technical debt. Historically, data in various specialized file formats (multi-dimensional RINEX, time-series miniSEED, and many other shapes and formats) were processed by a series of cron jobs to prepare them for researchers to download via FTP.
This legacy workflow created two main problems: large downloads and duplicated data.
Solution
Enter TileDB: A petabyte-scale DBMS at a fraction of the cost
As part of the merger of IRIS and UNAVCO, the newly formed EarthScope Consortium planned its move from on-prem data facilities to a new common cloud platform. Amazon S3 was the obvious solution to its storage capacity problems, but issues with efficient data distribution and collaborative access remained.
The EarthScope engineering team initially evaluated dense Zarr arrays as a format that could replace RINEX files for their GNSS (Global Navigation Satellite System) data, but the team ultimately decided on TileDB's flexible array storage to optimize their use of S3. Sparse TileDB arrays captured the multi-dimensional aspects of GNSS data (multiple frequency bands and satellites that, in turn, produce multiple measurements) and can be queried in place on cloud object storage. These capabilities significantly reduced large downloads and the need to host multiple versions of datasets.
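To make the sparse-array idea concrete, here is a minimal conceptual sketch in plain Python. It is not TileDB's API and not EarthScope's actual schema: the dimension names (epoch, satellite, band), the satellite IDs, and the signal values are all hypothetical, chosen only to show why storing just the observed cells and slicing by coordinates can return far less data than a whole file.

```python
from typing import NamedTuple


class Observation(NamedTuple):
    """One cell of a hypothetical sparse GNSS array."""
    epoch: int       # seconds since start of day (dimension)
    satellite: str   # satellite ID such as "G01" (dimension)
    band: str        # frequency band such as "L1" (dimension)
    snr: float       # signal-to-noise ratio (attribute, made-up value)


# Only (epoch, satellite, band) combinations that were actually observed
# are stored; a dense grid would also hold every empty combination.
OBSERVATIONS = [
    Observation(0, "G01", "L1", 45.0),
    Observation(0, "G01", "L2", 39.5),
    Observation(0, "G07", "L1", 41.0),
    Observation(30, "G01", "L1", 44.5),
    Observation(30, "G07", "L1", 40.5),
    Observation(60, "G07", "L1", 42.0),
]


def slice_cells(cells, epoch_range, satellites=None, bands=None):
    """Return the cells whose coordinates fall inside the requested slice.

    epoch_range is an inclusive (start, end) pair; satellites and bands
    are optional sets, where None means "no restriction".
    """
    start, end = epoch_range
    return [
        c for c in cells
        if start <= c.epoch <= end
        and (satellites is None or c.satellite in satellites)
        and (bands is None or c.band in bands)
    ]


# A request for one satellite and one band over the first 30 seconds.
subset = slice_cells(OBSERVATIONS, (0, 30), satellites={"G01"}, bands={"L1"})

# Size of the equivalent dense grid: 3 epochs x 2 satellites x 2 bands.
dense_cell_count = 3 * 2 * 2
```

In this toy example the request touches two of the six stored cells, and the six stored cells are themselves half of what a dense 3 x 2 x 2 grid would hold. A real array engine applies the same idea with indexing and tiling on object storage rather than a linear scan in memory.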
Today, EarthScope's new cloud platform, architected around a modernized backend, Apache Kafka, TileDB arrays on S3, and Amazon CloudFront, is facilitating optimized data distribution that will fuel new ML and other techniques in the Earth sciences.
As of early 2024, here are some key highlights and early results: