TileDB Cloud allows you to easily create, run and share Jupyter notebooks. The notebooks are internally represented as multi-dimensional arrays (TileDB’s data currency) and inherit all the array functionality, including versioning and time traveling. Coupled with the ability to efficiently manage and analyze a large variety of data modalities (such as tables, images, genomics, point clouds and more), TileDB Cloud offers a unique holistic data management experience to organizations, universities and citizen scientists. This article covers TileDB Cloud’s notebook feature in detail.
TileDB Cloud does not store your data; you retain full ownership at all times. Therefore, you need to create your own AWS S3 bucket, along with access keys or an IAM role that allows access to that bucket. Next, you need to register your access keys or IAM role from Profile → Cloud credentials → Add credentials → Access and secret key or ARN Role.
Finally, go to Profile → Storage paths and set the default storage paths where any new or uploaded notebook will be stored.
You are all set and ready to start enjoying Jupyter notebooks on TileDB Cloud!
You can create a new empty notebook from Notebooks → +. You will need to provide a storage location for the notebook (a path inside your S3 bucket), the credentials you created that have write access to that path, and a name.
Once you create the notebook, you can click on it to see some basic information under Overview. You can see the S3 URI that describes the physical location of the notebook, along with the TileDB URI that is an alias for this notebook. In TileDB Cloud, I can refer to this notebook as stavros/my_first_notebook or using the full TileDB URI path that contains a unique UUID to disambiguate any potential naming conflicts.
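As a rough illustration of the two ways of referring to a notebook, the sketch below splits a TileDB URI into its namespace and name (or UUID) and rebuilds the shorthand form. It assumes the tiledb://namespace/name layout used in the client examples later in this article; the helper names split_tiledb_uri and short_name are made up for this sketch and are not part of any TileDB client.

```python
from urllib.parse import urlparse


def split_tiledb_uri(uri):
    """Split a tiledb:// URI into (namespace, name_or_uuid).

    Assumes the layout tiledb://<namespace>/<name-or-uuid>.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "tiledb":
        raise ValueError(f"not a TileDB URI: {uri!r}")
    namespace = parsed.netloc
    name = parsed.path.lstrip("/")
    if not namespace or not name:
        raise ValueError(f"expected tiledb://<namespace>/<name>, got {uri!r}")
    return namespace, name


def short_name(uri):
    """Return the 'namespace/name' shorthand for a TileDB URI."""
    namespace, name = split_tiledb_uri(uri)
    return f"{namespace}/{name}"


print(short_name("tiledb://stavros/my_first_notebook"))
```

The same helper works whether the last path component is a human-readable name or a UUID, since it only looks at the position of each part.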
I can also click on Add description to add some descriptive information about this notebook (especially useful if I wish to share this with someone else, as we will see below). The description supports full markdown.
We are ready to launch the notebook and start doing some work!
To launch the notebook we just created, click on the Launch button. That will prompt you to choose an image and a server instance. Each image comes with its own pre-installed packages, which you can see by following the provided links. TileDB and its ancillary packages are always installed and up to date in all the available images.
The Jupyter notebook runs on an AWS EC2 instance in the us-east-1 region. Every user gets 2GB of persistent storage on an EBS volume (also in us-east-1). This volume is mounted as the home directory in the notebook server, so all contents of the home directory persist across server restarts. The user does not get charged for storage! In that server, you can install your own packages, as well as create and upload any file.
From this point onwards you can use the Jupyter notebook as you would ordinarily do on your laptop. You can terminate the server at any time by clicking on the Shut down button at the top right corner.
The next time you launch this notebook, it will be launched on the same type of server instance and with the same image you selected the first time. You can change this configuration in the notebook Settings.
You can upload an existing notebook that you have on your machine. Similar to creating a new empty notebook, click on the + button in Notebooks, but now select the second option to upload your notebook.
Next, follow the instructions to locate your notebook on your machine and click on Next. In the next screen, you need to select the path in your S3 bucket to store the notebook, along with the AWS credentials that allow you to write to the bucket.
When you click on Upload, TileDB will fetch your notebook, convert it to an array, store it in your S3 bucket, and register it with TileDB Cloud. From this point, you can see its Preview, launch it, share it, etc.
Let me add a couple of lines of code to my notebook and save it. If you go to Preview, you will see the latest version of the notebook, rendered for easy consumption.
A cool thing about TileDB is that it stores everything, including the code of Jupyter notebooks, as arrays. This allows TileDB to provide all the array functionality to notebooks, including versioning and time traveling, as well as access control and logging (covered later in this article).
If you click on the Latest version label next to the notebook name, you will see on the right all the versions created for this notebook. If you click on the previous version of the notebook, you will see that Preview renders the contents of that notebook before I made the latest change (the notebook was empty before). TileDB also allows you to download different versions of the notebook.
Every time you save your notebook, TileDB creates a new version (more specifically, a new “array fragment”, which we will cover in another article). TileDB allows you to prune past versions of your notebook in the Settings tab.
If I select Keep last version and click on Prune now, then all past versions but the latest will be deleted. You can see below that only one version exists in my notebook.
You can securely share a notebook with anyone on TileDB Cloud. Choose a notebook, click on Sharing and add any TileDB Cloud username (if the user exists) or email (if the user does not exist).
Users that do not exist in TileDB Cloud will receive an email invitation to join and gain access to your notebook. You can also easily revoke access from any user with a single click.
If you wish to make your notebook public, i.e., discoverable and usable by any TileDB Cloud user, go to Settings and click on Make public on the right of the tab. You can make the notebook private again by clicking on Make private in the Settings tab.
Once you make a notebook public, you can share its URL and any user (even those that are not TileDB Cloud users yet) can see its overview and preview. For example, try visiting https://cloud.tiledb.com/notebooks/details/stavros/5ed6011f-c13e-4418-8bd7-27d945557cbf/preview.
In addition, any user (again, even those that are not TileDB Cloud users yet) can discover your notebook in the Explore tab.
Finally, you can always download the notebook using the download button (arrow pointing down) from the notebook details. The notebook will be converted back to the Jupyter notebook format and it will be ready for consumption by your local Jupyter environment.
TileDB Cloud logs every single action that users perform with their notebooks. Since a notebook is internally represented as an array, it inherits all the logging functionality of arrays. You can see the log activity of your notebook by simply clicking on the Activity tab.
You can rename a notebook by clicking on the Rename notebook button in the Settings tab.
Note that renaming a notebook does not change the physical S3 path. It also does not change the TileDB URI. This gives you a convenient way to rename and reorganize your notebooks in TileDB Cloud while avoiding cumbersome and expensive data movement on S3 (S3 does not allow you to rename an object in place; it actually moves the entire object to the new location).
If you’d like to clone a notebook, simply click on the Copy notebook button (next to the Launch and Download buttons). You will be asked to provide a storage path and name for the clone.
The contents of this notebook are identical to the original notebook.
It is possible to perform any UI console task programmatically via a TileDB Cloud client. Here, I will use the Python client, but TileDB Cloud provides clients in other languages as well (e.g., R and Java). To install the TileDB Cloud Python client, run:
pip install tiledb-cloud
Next, from any Python environment, run the following to log in:
import tiledb.cloud
tiledb.cloud.login(username=..., password=...)
Alternatively, you can log in by creating an API token from the UI console and running:
tiledb.cloud.login(token=...)
You can create an API token from Profile → API tokens.
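To avoid hard-coding credentials in a script, one option is to read them from the environment and pass them to tiledb.cloud.login. The following is a minimal sketch of that idea. The environment variable names (MY_TILEDB_TOKEN, MY_TILEDB_USERNAME, MY_TILEDB_PASSWORD) and the helper functions are hypothetical choices for this example; only the login call itself, with the token, username and password arguments shown above, comes from the client.

```python
import os


def login_kwargs(env=os.environ):
    """Build keyword arguments for tiledb.cloud.login from a mapping.

    The variable names below are hypothetical and chosen for this sketch.
    A token is preferred over a username/password pair when both are set.
    """
    token = env.get("MY_TILEDB_TOKEN")
    if token:
        return {"token": token}
    username = env.get("MY_TILEDB_USERNAME")
    password = env.get("MY_TILEDB_PASSWORD")
    if username and password:
        return {"username": username, "password": password}
    raise RuntimeError("no TileDB Cloud credentials found in the environment")


def login_from_env():
    """Log in to TileDB Cloud using credentials found in the environment."""
    # Imported lazily so that login_kwargs can be used without the client.
    import tiledb.cloud

    tiledb.cloud.login(**login_kwargs())
```

Calling login_from_env() once at the top of a script would then replace the explicit login call, provided the variables are set in the shell that starts Python.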
I will show a few examples of using the TileDB Cloud client, but check the documentation for more information.
You can get the basic information about a notebook as follows:
>>> I = tiledb.cloud.array.info("tiledb://stavros/simple_1d_array")
>>> I.description
'# Simple 1D Array Example\n\nA simple example of creating a 1D array from a numpy array. '
>>> I.uri
's3://tiledb-stavros/notebooks/simple_1d_array'
You can see who this notebook is shared with:
>>> tiledb.cloud.array.list_shared_with("tiledb://stavros/simple_1d_array")
[{'actions': ['read', 'write', 'read_array_info', 'read_array_schema'],
  'namespace': 'seth',
  'namespace_type': 'user'},
 {'actions': ['read', 'read_array_info', 'read_array_schema'],
  'namespace': 'public',
  'namespace_type': 'organization'},
 {'actions': ['read', 'write', 'read_array_info', 'read_array_schema'],
  'namespace': 'ihnorton',
  'namespace_type': 'user'}]
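A sharing list like the one above is easy to post-process, for instance to find out who can modify the notebook. The sketch below works on a copy of the printed output and assumes each entry is a plain dictionary with the keys shown; if the client returns model objects instead, they would need to be converted first. The function namespaces_with is a hypothetical helper, not part of the client.

```python
# Copy of the sharing information printed above, as plain dictionaries.
shares = [
    {"actions": ["read", "write", "read_array_info", "read_array_schema"],
     "namespace": "seth", "namespace_type": "user"},
    {"actions": ["read", "read_array_info", "read_array_schema"],
     "namespace": "public", "namespace_type": "organization"},
    {"actions": ["read", "write", "read_array_info", "read_array_schema"],
     "namespace": "ihnorton", "namespace_type": "user"},
]


def namespaces_with(shares, action):
    """Return the namespaces whose share entry includes the given action."""
    return [entry["namespace"] for entry in shares if action in entry["actions"]]


print(namespaces_with(shares, "write"))  # ['seth', 'ihnorton']
print(namespaces_with(shares, "read"))   # ['seth', 'public', 'ihnorton']
```

Here the public entry has read-only actions, which matches a notebook that has been made public but is writable only by explicitly added users.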
You can even download the notebook as a Jupyter notebook on your machine:
>>> tiledb.cloud.notebook.download_notebook_to_file(
...     tiledb_uri="tiledb://stavros/simple_1d_array",
...     ipynb_file_name="./simple_1d_array.ipynb",
... )
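Since the downloaded file is a regular Jupyter notebook, it is a JSON document and can be inspected with the standard library alone. The sketch below counts the cells of each type in such a file. It assumes the file follows the usual notebook layout with a top-level "cells" list; summarize_notebook is a hypothetical helper written for this example.

```python
import json


def summarize_notebook(path):
    """Return the notebook format version and a count of cells per type."""
    with open(path, encoding="utf-8") as f:
        nb = json.load(f)
    counts = {}
    for cell in nb.get("cells", []):
        kind = cell.get("cell_type", "unknown")
        counts[kind] = counts.get(kind, 0) + 1
    return {"nbformat": nb.get("nbformat"), "cells": counts}
```

After the download call above, summarize_notebook("./simple_1d_array.ipynb") would give a quick check that the file was written and contains the expected number of code and markdown cells.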
You can use the TileDB Cloud client for literally everything you can do on the UI console.
One of the most important capabilities on TileDB Cloud is that it allows you to manage all complex data in a single, unified data platform. That means that you can manage and run your notebooks in the same platform as the one you use to organize and manage your data. I will show an example here, but the sky's the limit when it comes to adding more data and code to TileDB Cloud.
I will first create a group to store my data and notebooks. The data and metadata of an empty group are physically stored on S3, similar to notebooks, but they are usually tiny. Details about groups will be covered in a separate article. For now, I will go to Groups and press the + button to create an empty group. The process is similar to creating an empty notebook; you need to provide a name, the physical S3 storage path, and the AWS credentials to access the storage path.
The group is initially empty. In Overview you can see some basic information about the group (such as its S3 and TileDB URIs) and add a description. You can also share it with other users in the Sharing tab, just as you would a notebook.
Next, I will add an array to this group. We will cover creating arrays in a separate article, so here I will use an existing public one. I will go to Contents → Add asset → Add existing asset ... . Then I will select Arrays → Explore, type nyc and select nyc_tlc_yellow_trip_data_2019, which is an array containing data from the NYC taxi dataset for 2019.
The array becomes part of the group contents. Note that this action does not create a clone of the array. The array is only virtually added to the group by reference (any change to the array will be reflected inside the group as well).
Next, I will create an empty notebook by clicking on Contents → Add asset → Add new asset → Notebook, and then following a process similar to the one described above for creating an empty notebook. This will create the notebook as before, add it to your Notebooks assets, and virtually add it to the group. Similar to the NYC array, this notebook can be virtually included in multiple groups, without physically cloning the notebook.
I will create a little bit of code to show how to access the NYC taxi array, but advanced access of tabular data will be covered in a separate article.
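A minimal sketch of what such access code could look like is shown below. It assumes you have already logged in with the TileDB Cloud client, and that the TileDB-Py package is installed alongside it. The namespace in the URI is left as a placeholder because this article does not state which account owns the public array, and open_cloud_array is a hypothetical helper. Printing the schema is used here only because it does not require knowing the array's dimension or attribute names in advance.

```python
def open_cloud_array(uri):
    """Open an array registered with TileDB Cloud for reading.

    Assumes tiledb.cloud.login(...) has already been called in this session.
    """
    if not uri.startswith("tiledb://"):
        raise ValueError(f"expected a tiledb:// URI, got {uri!r}")
    # Imported lazily so the URI check above works without the packages.
    import tiledb
    import tiledb.cloud

    # tiledb.cloud.Ctx() carries the credentials set up at login time.
    return tiledb.open(uri, ctx=tiledb.cloud.Ctx())


def show_schema(uri):
    """Print the schema (dimensions and attributes) of a cloud array."""
    with open_cloud_array(uri) as A:
        print(A.schema)


# Example call; replace <namespace> with the owner shown in the array's
# Overview page before running:
# show_schema("tiledb://<namespace>/nyc_tlc_yellow_trip_data_2019")
```

Once the schema is known, slicing the opened array returns the selected data, and the group itself does not need to be referenced in the code since the array is only included in it by reference.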
Finally, I can make the entire group public from its Settings so that you can access it from Explore.
TileDB Cloud provides a powerful way for organizing, running and sharing Jupyter notebooks. You have full control over their physical storage, as well as their access and logging information. You can choose from a variety of server instances to run them on and a set of images with different installed packages. You are granted free EBS storage and you can upload your own files and install your own packages. You can even access all information about your notebooks, and manage all aspects of them programmatically via an easy API. Finally, you can keep your notebooks along with the data the notebooks create and access in a single platform, eliminating the need for having to juggle multiple different systems or building holistic data management capabilities yourself from scratch.
Join the TileDB Cloud community and enjoy the power of a unified data platform today!