May 05, 2020

TileDB Cloud: Data Science with Less Engineering

Seth Shelnutt

CTO, TileDB

The Helicopter View

TileDB Cloud is a serverless solution that lets data scientists collaborate and work faster, focusing on their science without the engineering hassles. Today we announce the next release of TileDB Cloud, which includes capabilities such as hosted Jupyter notebooks and serverless execution of user-defined task graphs with arbitrary dependencies.

We hear data scientists and developers complain about the amount of engineering effort required to run scalable computations on large volumes of data. A lot happens before you even get to executing a single line of analysis code! Too often you spend your time wrangling legacy data formats, spinning up clusters, deploying open-source tools and perhaps overpaying for cloud resources you do not use. These engineering challenges ring true across the board, regardless of your industry or application domain (e.g., genomics, geospatial, finance).

For the past 3 years we have been developing a powerful, open-source, cloud-native storage engine based on multi-dimensional arrays, a universal data format suitable for all applications and data science tools. Today we also announced TileDB 2.0, a huge milestone in addressing the storage and access challenges for data scientists. But the TileDB storage engine was just the beginning. It serves as the foundation of a bigger vision: helping scientists and analysts focus on accelerating science and innovation.

The design themes for TileDB Cloud are all about delivering ease-of-use and performance at scale with lower costs. In this blog post I will highlight the most important features.

Data Organization and Sharing

Do you find it difficult to see what data lives on your cloud store, along with its URIs, metadata or descriptions? Have you tried looking for public datasets related to your work, or sharing your data with your peers under access policies you define? What if those datasets could be accessed directly from your favorite data science tools and you never had to download, copy or convert the data? TileDB Cloud does exactly that and solves common problems around data organization and sharing.

In TileDB Cloud, you continue to own your data. Simply create an array with the TileDB open-source storage engine and store it in your own AWS S3 bucket, and then register it with our service. This effectively gives us the means to list it for you and govern its access by your organization and other external users. Freely add metadata, descriptions and access policies to your arrays to control who has access to the data. Most importantly, you no longer need to manage a set of files; TileDB abstracts all the details behind arrays that are ready to access directly from multiple APIs (C, C++, Python, R, Java, Go) and integrations (Spark, Dask, MariaDB, PrestoDB, PDAL, GDAL), all without moving any data around. All you need is the array `tiledb://` URI registered with TileDB Cloud, and your code will work just like with the open-source TileDB library. All array data remains in its open-source form, so there is no extra coding effort or vendor lock-in.

Sharing an array is only a few clicks away

Another great feature of TileDB Cloud is the ability to make your datasets public and discoverable by anyone. Open data ecosystems are just getting started across different industries. Stay tuned: we will share and post many public datasets curated by us or by our customers.

Introducing Jupyter Notebooks

An exciting new capability of TileDB Cloud is hosted Jupyter notebooks. Jupyter notebooks have become the de facto choice in the data science community for exploratory data analysis, because they make it easy to create and share documents that contain live code, equations, visualizations, and narrative text. You can choose from multiple images with different installed libraries and various server configurations. Start and stop a notebook anytime and pay only for the time you use it. I’ve included a number of notebook examples for different applications so that you can kickstart your work within seconds. No painful onboarding: just sign up and go.

Jupyter notebooks are offered in several types and sizes

Serverless SQL

One of the biggest milestones of TileDB 2.0 is the generalized support of dataframes by adding heterogeneous and string dimensions to sparse arrays. In practice, this allows you to slice dataframes efficiently on a subset of columns (the “dimensions”). Our integration with MariaDB allows you to perform any SQL query on TileDB arrays. TileDB Cloud takes advantage of this integration and offers you the ability to submit SQL queries on arrays stored on AWS S3, in a totally serverless fashion. No need to provision a cluster in advance or register a specific SQL query. Simply submit any SQL expression and TileDB Cloud will take care of it, charging you based on the CPU usage and the data moved outside the service.
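
To make the idea of slicing on dimensions concrete, here is a minimal local sketch. It does not use TileDB, MariaDB or TileDB Cloud: it uses Python's built-in `sqlite3` module as a stand-in, and the `quotes` table, its columns and its rows are all hypothetical. The `symbol` (string) and `day` (integer) columns play the role of heterogeneous dimensions, and `price` plays the role of an attribute.

```python
import sqlite3

# In-memory SQLite database used purely as a local stand-in for illustration.
conn = sqlite3.connect(":memory:")

# Hypothetical table: (symbol, day) act as the "dimensions", price as an attribute.
conn.execute(
    """
    CREATE TABLE quotes (
        symbol TEXT    NOT NULL,
        day    INTEGER NOT NULL,
        price  REAL    NOT NULL,
        PRIMARY KEY (symbol, day)
    )
    """
)

rows = [
    ("AAA", 1, 10.0),
    ("AAA", 2, 11.0),
    ("AAA", 3, 12.5),
    ("BBB", 1, 20.0),
    ("BBB", 2, 19.0),
]
conn.executemany("INSERT INTO quotes VALUES (?, ?, ?)", rows)

# Slice on both dimensions: one string value and an integer range.
sliced = conn.execute(
    "SELECT day, price FROM quotes "
    "WHERE symbol = ? AND day BETWEEN ? AND ? ORDER BY day",
    ("AAA", 2, 3),
).fetchall()

print(sliced)  # [(2, 11.0), (3, 12.5)]
```

The query shape is the point here: a predicate on the dimension columns selects a slice, and the remaining columns come along as values. How such a query is submitted to TileDB Cloud is covered in the docs, not in this sketch.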

Serverless SQL is only one command away

Serverless User-Defined Task Graphs

Any distributed computation, no matter how complex, can be modeled as a directed acyclic graph where nodes represent subtasks and edges represent dependencies. In other words, the graph explains which tasks can be computed in parallel and which must run in sequence. And each task can be arbitrary: a SQL query, a slicing operation, a system command or a user-defined function. This has been captured very well by Dask and its dask.delayed package. Have you ever imagined a serverless version? This is exactly what TileDB Cloud unveils in this release!

Specifically, write any set of Python functions (support for other languages is on the roadmap), define dependencies by creating a task graph, and submit this task graph to TileDB Cloud, all without having to explicitly spin up a cluster of a specific size. TileDB Cloud takes your graph and dispatches each task to its resources, running independent tasks in parallel and respecting the dependencies among the rest. In addition, TileDB Cloud maximizes parallelism by scaling out until it completes all your tasks. And you only pay for the time it takes to finish each task: the amount is the same whether TileDB Cloud ends up processing your graph on 10 or 100 machines. This eliminates the extra costs of idle compute, delivering maximum performance at a minimum cost.
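
If the task graph model feels abstract, the following minimal sketch runs one on a single machine. It is not the TileDB Cloud API and says nothing about how the service schedules work internally: it only uses the Python standard library (`graphlib`, which needs Python 3.9+, and `concurrent.futures`), and the helper `run_task_graph` and the task names are hypothetical. Tasks with no unmet dependencies are submitted together, and a task only starts once everything it depends on has finished.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter


def run_task_graph(tasks, deps, max_workers=4):
    """Run `tasks` (name -> callable) while respecting `deps`
    (name -> set of prerequisite task names).

    Each callable receives a dict holding the results of its prerequisites.
    This is a local illustration only, not a TileDB Cloud function.
    """
    sorter = TopologicalSorter({name: deps.get(name, ()) for name in tasks})
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle

    results = {}
    running = {}  # future -> task name
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # Submit every task whose prerequisites have all completed.
            for name in sorter.get_ready():
                inputs = {dep: results[dep] for dep in deps.get(name, ())}
                running[pool.submit(tasks[name], inputs)] = name
            # Block until at least one running task finishes.
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in done:
                name = running.pop(future)
                results[name] = future.result()
                sorter.done(name)  # may unlock dependent tasks
    return results


# Hypothetical graph: two independent branches that meet in a final task.
tasks = {
    "load_a": lambda _: [1, 2, 3],
    "load_b": lambda _: [10, 20, 30],
    "sum_a": lambda r: sum(r["load_a"]),
    "sum_b": lambda r: sum(r["load_b"]),
    "total": lambda r: r["sum_a"] + r["sum_b"],
}
deps = {
    "sum_a": {"load_a"},
    "sum_b": {"load_b"},
    "total": {"sum_a", "sum_b"},
}

results = run_task_graph(tasks, deps)
print(results["total"])  # 66
```

Changing `max_workers` changes how many tasks can run at once, not the result: the `load_a`/`sum_a` and `load_b`/`sum_b` branches are independent and may overlap, while `total` always waits for both.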

Easily create a task graph of many delay types and visualize the DAG

Summary

We envision a world where data scientists spend most of their time on what matters: innovating, collaborating with their peers and delivering brilliant science and insight. TileDB Cloud enables that vision with data in multi-dimensional arrays, coupled with easy sharing and serverless compute. The features in this blog post are just a subset of what TileDB Cloud offers. Check out the docs for additional details. Also, stay tuned: we will soon share how our customers are using TileDB Cloud.

Are you ready to kickstart your data science? Join our community today by signing up to TileDB Cloud.
