Open SAR data and scalable analytics

TileDB joins forces with Capella Space on building SAR developer communities

Jun 23, 2020
Norman Barker

Principal Software Engineer, TileDB

SAR stands for Synthetic Aperture Radar. NASA has been using it since the 1970s to detect sub-meter changes to the Earth’s surface, stunningly through clouds, smoke, fog, and darkness. Capella Space is now taking SAR a step further, enabling high-cadence global monitoring for government and commerce.

The velocity and volume of SAR data collections call for a new data management approach among geospatial application developers and scientists. Specifically, there is a need for a new multi-dimensional, cloud-native data format and storage engine that can be directly queried spatially and temporally, and easily analyzed using familiar tools.

This blog post will cover how we at TileDB, Inc. innovate in the space of SAR data management and why we partnered with Capella Space as they grow and expand their open data program and SAR developer community.

TLDR?

We introduce two innovations: (1) TileDB, a powerful open-source storage engine that can efficiently model SAR data collections as multi-dimensional arrays, and (2) TileDB Cloud, an elastic, serverless platform that enables planet-scale sharing and scalable analysis. In a nutshell, these offer the following benefits:

  • Ease of use: Once the data is in the TileDB format, it is analysis-ready and accessible with multiple programming languages and data science tools.
  • Planet-scale sharing: TileDB Cloud allows you to share data and code within or outside your organization without reingestion or extra copying while auditing all activity via detailed logs.
  • Scalability: You can run anything from simple slicing queries to custom user-defined functions and task graphs, scaling easily to thousands of parallel tasks.
  • Cost Reduction: No egress costs to download and host copies of files. No overhead maintaining a cloud infrastructure. No idle compute that leads to cumulative losses. The TileDB storage engine and the TileDB Cloud serverless platform comprise the most affordable solution to SAR data management.

We have partnered with Capella Space to offer their SAR data on TileDB Cloud and address the scalable analytics needs of the SAR developer community. Join the Capella Developer Community today and sign up for TileDB Cloud to access the Capella Space SAR datasets and example notebooks.

Why TileDB and TileDB Cloud for SAR data?

SAR data files often need to be modeled as time-series datasets with complex values (amplitude and phase) for analysis, such as in change detection functions. TileDB is an open-source storage engine that introduces a cloud-native universal format that is ideal for this type of data. It models the collected data as chunked, compressed, versioned, multi-dimensional arrays and offers easy and efficient access. The TileDB storage engine offers extreme interoperability by being integrated with the broader data science ecosystem and geospatial tooling.
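To make "complex values (amplitude and phase)" and "change detection" concrete, here is a minimal sketch in plain Python. The two tiny scenes are made-up values, and the log-ratio test is one common way to compare amplitudes between acquisitions, chosen here for illustration; it is not necessarily the method used in the Capella notebooks, and the function names are hypothetical.

```python
import cmath
import math

# Hypothetical 2x2 scene observed at two acquisition times.
# Each pixel is a complex number: magnitude = amplitude, angle = phase.
scene_t0 = [[cmath.rect(1.0, 0.0), cmath.rect(2.0, 0.5)],
            [cmath.rect(1.5, 1.0), cmath.rect(1.0, -0.5)]]
scene_t1 = [[cmath.rect(1.0, 0.1), cmath.rect(2.0, 0.6)],
            [cmath.rect(6.0, 1.0), cmath.rect(1.0, -0.4)]]


def amplitude_phase(pixel):
    """Split a complex SAR sample into (amplitude, phase in radians)."""
    return abs(pixel), cmath.phase(pixel)


def log_ratio_change(before, after, threshold_db=3.0):
    """Flag pixels whose amplitude changed by more than threshold_db decibels."""
    changed = []
    for row_before, row_after in zip(before, after):
        changed_row = []
        for b, a in zip(row_before, row_after):
            # Amplitude ratio expressed in decibels (assumed threshold: 3 dB).
            ratio_db = 20.0 * math.log10(abs(a) / abs(b))
            changed_row.append(abs(ratio_db) > threshold_db)
        changed.append(changed_row)
    return changed


change_mask = log_ratio_change(scene_t0, scene_t1)
```

Only the pixel whose amplitude jumps from 1.5 to 6.0 is flagged; the small phase shifts in the other pixels do not affect an amplitude-only test.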

TileDB Cloud is a serverless platform built on top of the TileDB storage engine that aims to alleviate engineering hassles for geospatial developers, such as cluster sizing, deployment and monitoring. TileDB Cloud allows these developers to focus on reproducible and shareable analytics while enjoying extreme scalability. Accessing data and performing geospatial analytics on a platform like TileDB Cloud is all about ease of use and performance.

TileDB and TileDB Cloud collectively offer four main capabilities, described in more detail below.

Analysis-ready data, not files

A SAR dataset has two main characteristics: (1) it is a collection of numerous files, and (2) it has a very large size. To avoid the potentially enormous data hosting costs, it is preferable to store the data on inexpensive cloud object stores (such as AWS S3). Accessing and analyzing such data at petabyte scale becomes very challenging. While each individual SAR file might be defined perfectly, SAR analysis typically involves accessing numerous files, and potentially selective portions of these files, based on different parameters and metadata. Metadata handling and indexing currently fall solely on the user’s shoulders. Moreover, to access the data, you typically need to download every SAR file relevant to the analysis in full and wrangle the data into some other format that your favorite tool understands. This process is cumbersome and costly.

TileDB allows you to store the numerous SAR files as a coherent multi-dimensional array. Metadata, indexing, updates, slicing, compression, cloud backend optimizations, etc. are all abstracted by TileDB, so you no longer need to worry about files. TileDB arrays can be updated with versioning, which allows you to time-travel, that is, to read the array as it was at an earlier point. Most importantly, the TileDB arrays can be queried directly by your favorite language (C, C++, Python, R, Go, Java) and tools (GDAL, Spark, Dask, even databases like MariaDB and PrestoDB). In other words, your data in TileDB is analysis-ready; there is no need for copying, downloading or wrangling.
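The benefit of a chunked array over whole-file downloads can be illustrated with a toy model. The sketch below is not TileDB's implementation or API; `TiledArray2D` and its `tiles_read` counter are hypothetical names, used only to show that a slice over a tiled layout needs to touch just the tiles it overlaps.

```python
class TiledArray2D:
    """Toy dense 2D array stored as fixed-size tiles keyed by tile coordinates."""

    def __init__(self, data, tile):
        self.tile = tile
        self.tiles = {}       # (tile_row, tile_col) -> {(row, col): value}
        self.tiles_read = 0   # number of tiles touched by reads so far
        for r, row in enumerate(data):
            for c, value in enumerate(row):
                key = (r // tile, c // tile)
                self.tiles.setdefault(key, {})[(r, c)] = value

    def read(self, r0, r1, c0, c1):
        """Return rows [r0, r1) x cols [c0, c1), touching only overlapping tiles."""
        t = self.tile
        cells = {}
        for tr in range(r0 // t, (r1 - 1) // t + 1):
            for tc in range(c0 // t, (c1 - 1) // t + 1):
                self.tiles_read += 1
                cells.update(self.tiles[(tr, tc)])
        return [[cells[(r, c)] for c in range(c0, c1)] for r in range(r0, r1)]


# An 8x8 raster where each cell holds 10*row + col, split into 4x4 tiles.
raster = [[10 * r + c for c in range(8)] for r in range(8)]
array = TiledArray2D(raster, tile=4)
window = array.read(1, 3, 5, 7)   # rows 1-2, cols 5-6
```

Here the 2x2 window falls entirely inside one of the four tiles, so only that tile is read. On an object store, the same idea means fetching a few compressed chunks rather than entire files.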

Easier sharing

Sharing your SAR files with your peers is a pain, especially if you wish to impose access policies. This is true even on cloud services like AWS. The main reason is that the cloud services understand file semantics, whereas what you need is spatiotemporal semantics (e.g., I want to share this geospatial region in this time interval). The information you wish to share may be located in numerous files, and in numerous, potentially non-contiguous byte regions within those files. It is simply impractical to reason about file-based access policies.

This means that you need to build your own scalable infrastructure to handle the metadata, your customers/collaborators, the efficient querying, auditing, etc. This is extremely expensive in terms of resources for your organization, and also wasteful in terms of human hours.

TileDB allows you to store your data in arrays (instead of files), and TileDB Cloud provides all the cloud infrastructure you need to share those arrays at planet scale. This includes scalable slicing, enforcing access policies, and auditing all access, all at the array level (a natively spatiotemporal object) instead of the file level. The way it works is that you store your data as TileDB arrays in your AWS S3 bucket and “register” your arrays with TileDB Cloud. Then everything is handled by TileDB Cloud without copying or moving the data. In practice, you give us the means to govern access to your data; you continue to own your data and you can access it anytime outside of TileDB Cloud with the open-source TileDB engine. TileDB Cloud allows you to define access policies for any other TileDB Cloud user and enforces them on all access from that point on, providing you with detailed logs. TileDB Cloud charges in a pay-as-you-go manner for data access, not data hosting. Finally, you can always export subsets of TileDB arrays to COG files for your traditional workflows.
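The difference between file semantics and spatiotemporal semantics can be sketched in a few lines. This is purely illustrative and is not TileDB Cloud's API or its actual policy model: `Region`, `ArrayPolicy`, the user names and the coordinates are all hypothetical. The point is only that a grant expressed as "this region in this time interval" is easy to check and audit when the shared object is an array.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Axis-aligned spatiotemporal box; all bounds are inclusive."""
    x_min: float
    x_max: float
    y_min: float
    y_max: float
    t_min: int   # e.g. a date encoded as YYYYMMDD (assumption)
    t_max: int

    def contains(self, other):
        return (self.x_min <= other.x_min and other.x_max <= self.x_max
                and self.y_min <= other.y_min and other.y_max <= self.y_max
                and self.t_min <= other.t_min and other.t_max <= self.t_max)


class ArrayPolicy:
    """Hypothetical per-array policy: grants per user plus an audit trail."""

    def __init__(self):
        self.grants = {}      # user -> list of granted Regions
        self.audit_log = []   # (user, requested Region, allowed)

    def share(self, user, region):
        self.grants.setdefault(user, []).append(region)

    def check(self, user, query):
        allowed = any(g.contains(query) for g in self.grants.get(user, []))
        self.audit_log.append((user, query, allowed))
        return allowed


policy = ArrayPolicy()
policy.share("alice", Region(0, 10, 0, 10, 20200101, 20200131))

inside = policy.check("alice", Region(2, 4, 2, 4, 20200110, 20200112))
outside = policy.check("alice", Region(2, 4, 2, 4, 20200201, 20200205))
stranger = policy.check("bob", Region(2, 4, 2, 4, 20200110, 20200112))
```

Every check is recorded, whether it succeeds or not, which mirrors the idea of auditing access at the array level rather than per file.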

Serverless data management

In addition to easy planet-scale sharing, we strive to help the SAR community make fast discoveries and alleviate all pains related to deploying scalable analysis. With TileDB Cloud, you can perform your geospatial analysis on the SAR data you have access to at extreme scale, via user-defined tasks and workflow graphs, which are handled by TileDB Cloud in a serverless, pay-as-you-go manner. This means that you do not need to deploy a cluster, decide on its size, monitor it, etc. You just define the tasks and TileDB Cloud will execute them in parallel by elastically scaling its compute. TileDB Cloud charges in a pay-as-you-go fashion, eliminating costs from idle compute. Analyzing SAR data has never been simpler than what TileDB Cloud offers.
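The "define tasks, let the platform run them in parallel" pattern can be approximated locally. The sketch below uses Python's standard `concurrent.futures` thread pool as a stand-in for a serverless backend; it is not the TileDB Cloud API, and `mean_amplitude`, `run_tasks` and the sample tiles are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def mean_amplitude(tile):
    """User-defined task: mean amplitude of one tile of complex samples."""
    samples = [abs(value) for row in tile for value in row]
    return sum(samples) / len(samples)


def run_tasks(func, tiles, max_workers=4):
    """Apply func to every tile in parallel, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(func, tiles))


# Two made-up tiles of complex SAR samples.
tiles = [
    [[3 + 4j, 3 + 4j], [3 + 4j, 3 + 4j]],   # amplitude 5 everywhere
    [[1 + 0j, 0 + 1j], [1 + 0j, 0 + 1j]],   # amplitude 1 everywhere
]
per_tile = run_tasks(mean_amplitude, tiles)

# A final reduction step combines the per-tile results.
overall = sum(per_tile) / len(per_tile)
```

The per-tile function plus a final reduction is a small instance of a task graph: independent tasks fan out over chunks of the array, and a single task combines their results. In a serverless setting the pool size would be managed by the platform rather than by the caller.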

Notebooks for geospatial developers

The most recent release of TileDB Cloud offers hosted Jupyter Notebooks, which enable developers to query public and permissioned datasets without having to spin up their own separate compute environment. This effectively allows you to sign up, sign in and code within seconds. Jupyter notebooks are also easily shareable, so it is extremely easy to reproduce a scientific analysis with somebody else’s code on any data you have permission to access.

Capella’s open SAR imagery within TileDB Cloud with array-based data access in a Jupyter Notebook

Partnering with Capella Space: an industry pace setter

Proving out the value of TileDB Cloud for the SAR community

For the past year, TileDB has been working closely with Capella Space, a US space company founded by Payam Banazadeh that delivers 24-hour all-weather Earth observation imagery of anywhere on the globe. Capella is developing Earth observation satellites equipped with synthetic aperture radar and delivering innovative SAR imagery data products to the market. Frequent revisit and rapid delivery do not in themselves make a significant impact unless the SAR imagery is high quality and easily accessible via self-serve online access. Towards this end, Capella is pioneering the next generation of an open developer program and we are excited that TileDB Cloud is a key capability to help scale self-service data access and analytics to thousands of geospatial developers.

According to Scott Soenen, VP Product Engineering, “The partnership with TileDB gelled perfectly with our desire to deliver a new level of innovation in open data programs aimed at the geospatial community. Open data alone isn’t enough - it’s also about easy access to compute resources and versatility of analytics. TileDB Cloud removes multiple manual steps in data access for the geospatial developer community and offers intuitive self-service and interactive analytics. It enables users of Capella SAR data to dive straight into deriving insights from the data at scale.”

Conclusion

These are exciting times in the SAR community. We are happy to be able to contribute with TileDB and TileDB Cloud, and we are thankful for Capella’s collaboration and leadership. We are eager to hear your feedback and ideas; you can reach out to us by email at [email protected]. The best way to understand how to work with SAR data on TileDB Cloud is to sign up and explore the public datasets and notebooks (we offer $10 credit upon signing up). We also encourage you to register with the Capella Space developer community and ask for access to their data.