Case Study: EarthScope Consortium

EarthScope unlocks collaborative ML using TileDB for cloud-based geophysical data

EarthScope Consortium formed in 2023 from the merger of IRIS and UNAVCO to focus on how both organizations could collaborate to advance geophysics. From 2003 to 2018, IRIS and UNAVCO worked together as part of the EarthScope community project, an NSF-funded (U.S. National Science Foundation) program that deployed thousands of geophysical instruments to study the structure and evolution of the North American continent. EarthScope Consortium aims to provide cutting-edge geophysical instrumentation and data services, along with resources for education and workforce development of the next generation of scientists. Today, that data supports a wide range of academic research and use cases, from earthquake early warning systems to weather forecasts.

Domain

Geospatial

Datatypes

Seismic

Geodetic

Challenge

Efficient distribution of large historical datasets

As the operator of the NSF's GAGE and SAGE Facilities, EarthScope Consortium supports a vast network of instruments generating a variety of geodetic, seismic, and related data. (Geodesy is the science of measuring the geometric shape of the earth, frequently using satellite positioning networks like GPS). EarthScope collects this geodetic and seismic data from thousands of terrestrial stations every day, and manages decades of historical data that is vital for open Earth science research.

In recent years, EarthScope began outgrowing the petabyte of local storage capacity at each of its two legacy SANs, and management of its on-prem systems had accrued significant technical debt. Historically, data in various specialized file formats (multi-dimensional RINEX, time-series miniSEED, and many other shapes & formats) were processed using a series of cron jobs in order to prepare them for researchers to download via FTP.

These challenges posed two main problems: large downloads and duplication of data.

Researchers were functionally limited in how much archived data they could download, preventing them from collaborating on ambitious new machine learning and other data intensive projects.
For recently collected data, not all researchers want data sampled at the same rate or chunked to the same timeframe. This led to time-consuming re-processing and re-hosting of multiple versions of the same datasets, often across multiple formats.

Solution

Enter TileDB: A petabyte-scale DBMS at a fraction of the cost

As part of the merger process between IRIS and UNAVCO, the newly formed EarthScope Consortium planned its move from on-prem data facilities to a new common cloud platform. Amazon S3 was the obvious solution to their storage capacity problems, but issues with efficient data distribution and collaborative access remained.

The EarthScope engineering team initially evaluated dense Zarr arrays as a format that could replace RINEX files for their GNSS data (Global Navigation Satellite System), but the team ultimately decided on TileDB's flexible array storage to optimize their use of S3. Sparse TileDB arrays perfectly captured the multi-dimensional aspects of GNSS data — multiple frequency bands and satellites that, in turn, produce multiple measurements — and with query-ability designed to analyze data in-place on cloud object storage. These capabilities significantly reduced large downloads and the need to host multiple versions of datasets.

Today, their new cloud platform — architected around a modernized backend, Apache Kafka, TileDB arrays on S3, and Amazon CloudFront — is facilitating optimized data distribution that will fuel new ML and other techniques in the Earth sciences.

As of early 2024, here are some key highlights and early results:

EarthScope has already stored 12TB of GNSS data using TileDB, and is writing all new GNSS data to TileDB in near real time. The transition for active source seismic (PH5) data to TileDB is under way.
The engineering team is currently developing pipelines for moving its historical GNSS data archive (~500TB) to TileDB as well.
Early testing of this ETL process has resulted in the ability to convert 1 year´s worth of historical GNSS data to TileDB in roughly 4 minutes, a process that previously took weeks or months to complete.
The new architecture supports parallel processing of TileDB arrays, allowing researchers to harness the cloud to scale computations.
TileDB time-traveling and decoupling of data & metadata promote reproducibility and FAIR data access, essential for collaborative science.
EarthScope is also considering TileDB to store additional data types, including SAR rasters, fiber-optic DAS data, and readings from borehole strainmeters.

Ideally, we'd have a relational database that scales to many petabytes, but we can't. So TileDB sits at the happy medium of storing files in object storage and a relational database — but at a cost factor of 10x less to run a relational database that big. TileDB gives us many of the data management and distribution features of a relational database, but at a fraction of the cost. quotation-last

Henry Berglund

Engineering Manager III, EarthScope Consortium

Want to learn more about TileDB Cloud?

Capabilities Schedule a demo

Financial Services

Healthcare

Telecommunications

Reinsurance

Defense

Maritime

Earth observation

Oil and Gas

Single-cell

Population genomics

LiDAR

AIS

Weather

SAR imaging

Location data

Hyperspectral

Biomedical imaging

Dataframes

Financial data

Time-series data

Video data

Want to learn more about TileDB Cloud?