May 14, 2024

TileDB newsletter - May 2024

Devika Garg

Director, Life Sciences Product Marketing

Hello TileDB Community! Here are the latest product updates:

  • Google Cloud Platform (GCP) for TileDB Cloud Self-hosted is now available for enterprise plans!
  • On TileDB Cloud, you can now choose and edit the default region for your notebooks, see in your settings which credentials can run code, and find new names for the former Catalog and Compute menus: Assets and Monitor.
  • TileDB engine improvements ensure sequential ordering for writes from a single process and improve authentication on Azure and GCP.
  • Population Genomics: You can now ingest VCF files on TileDB Cloud through the UI. During VCF file ingestion, TileDB-VCF can now collect sample QC statistics. Also, two new VCF datasets are available on TileDB Marketplace: Dog genomes and SARS-CoV-2.
  • Single Cell: Blockwise iteration for memory-efficient analysis of large SOMA datasets is now available.
  • Biomedical Imaging: NDPI images are now supported.
  • Geospatial: Task graph ingestion is available for TileDB Point Cloud and TileDB Rasters.
  • TileDB Vector Search: Introducing a new graph-based “Vamana” index that offers higher recall than IVF-FLAT. Support for int8 index types, which use scalar quantization to reduce index size, is also now available.
  • TileDB Tables: New Rust bindings with approximately 45% coverage of TileDB Array APIs are available.
You can check out the details on the latest changes below!

TileDB Cloud now supports Google Cloud Platform (GCP) for enterprise clients

The self-hosted version of TileDB Cloud now supports Google Cloud Platform (GCP) integration. Companies that use GCP as their preferred storage platform can now easily register their code and data assets through TileDB Cloud. Please don’t hesitate to reach out if you’re interested in an exploratory call with our sales team!

Set default region in notebook settings, on TileDB Cloud

Along with the existing default image and server size, you can now set a default region in the settings of your Jupyter notebook. Navigate to Assets > Code > Notebooks and choose a notebook for which you have admin rights. Then open Settings and scroll down to find the option to choose a default region.

Run code indicator for cloud credentials on TileDB Cloud

A new icon in the Cloud Credentials table of your namespace settings indicates, for each credential, whether it has permission to run tasks and code. This lets you know at a glance which credentials can be injected into a task graph environment. Please note that only certain types of credentials allow this, as shown in the UI when you create a credential.

“Catalog” is renamed to “Assets”, and “Compute” to “Monitor” on TileDB Cloud

Catalog has been renamed to Assets to communicate functionality beyond that of a typical catalog, such as previews, metadata editing, and sharing. Compute has been renamed to Monitor to better reflect the functionality exposed in this menu.

New TileDB Arrays features

TileDB 2.22 is now available, with the following improvements:

  • TileDB now ensures sequential ordering for writes from a single process within the same millisecond.
  • The TileDB Azure object store integration now supports Microsoft Entra ID authentication.
  • The TileDB Google Cloud Storage object store integration now supports service account authentication.
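To see why same-millisecond ordering needs special care, consider a writer that labels each write with a millisecond timestamp: two writes issued in quick succession can receive the same label and tie. The sketch below shows one plausible way to break such ties within a single process, by never handing out a timestamp that is not larger than the previous one. It is an illustration only and not necessarily how TileDB implements the guarantee; the class name `MonotonicMillisClock` and the clock readings are hypothetical.

```python
class MonotonicMillisClock:
    """Hands out strictly increasing millisecond timestamps within one process.

    Hypothetical helper for illustration; not part of any TileDB API.
    """

    def __init__(self, now_ms):
        self._now_ms = now_ms  # callable returning the current time in ms
        self._last = -1

    def next_timestamp(self):
        # If the clock has not advanced past the last issued value,
        # bump by one so that two writes never share a timestamp.
        ts = max(self._now_ms(), self._last + 1)
        self._last = ts
        return ts


# Made-up clock readings: three writes land in the same millisecond.
readings = iter([1000, 1000, 1000, 1005])
clock = MonotonicMillisClock(lambda: next(readings))
stamps = [clock.next_timestamp() for _ in range(4)]
print(stamps)  # [1000, 1001, 1002, 1005]
```

Because every issued timestamp is strictly greater than the one before it, sorting by timestamp recovers the order in which the process issued its writes.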

Population Genomics - Bulk VCF ingestion on TileDB Cloud

You can now bulk ingest and automatically register your bioinformatics variant data from its original raw format on cloud storage to a TileDB-VCF Group, with a few simple clicks on TileDB Cloud!

From Assets, navigate to Add Asset > Data > Life sciences > VCF and choose the Ingest VCF dataset option. You’ll need to have an ARN credential with the relevant permissions set up, as well as cloud storage paths to where your original data is stored & where the newly-created TileDB data will be stored. Finally, select Ingest, and keep an eye on the created task graph that will run your query end-to-end!

Population Genomics - High-level sample stats computed upon ingestion

During VCF file ingestion, TileDB-VCF can now collect sample QC statistics, including depth, genotype quality, genotype calls, and variant type. Sample QC is computed automatically upon ingestion. These can be used to identify low quality samples and batch effects.

To see an example of this in action, check out this notebook on the TileDB Marketplace that shows how the sample_stats array is created and stored upon ingestion!
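As a rough idea of how per-sample QC statistics can be used to spot low-quality samples, the sketch below applies simple thresholds to a few summary values. Everything here is an assumption made for illustration: the field names, the thresholds, and the sample values are made up and do not reflect the actual layout of the sample_stats array.

```python
from dataclasses import dataclass


@dataclass
class SampleQC:
    """Hypothetical per-sample summary; not the sample_stats schema."""
    sample: str
    mean_depth: float  # average read depth across called sites
    mean_gq: float     # average genotype quality
    n_called: int      # sites with a genotype call
    n_missing: int     # sites without a call

    @property
    def call_rate(self):
        total = self.n_called + self.n_missing
        return self.n_called / total if total else 0.0


def flag_low_quality(stats, min_depth=15.0, min_gq=30.0, min_call_rate=0.95):
    """Return {sample: [reasons]} for samples failing any threshold."""
    flagged = {}
    for s in stats:
        reasons = []
        if s.mean_depth < min_depth:
            reasons.append("low depth")
        if s.mean_gq < min_gq:
            reasons.append("low genotype quality")
        if s.call_rate < min_call_rate:
            reasons.append("low call rate")
        if reasons:
            flagged[s.sample] = reasons
    return flagged


# Made-up values, for illustration only.
stats = [
    SampleQC("S1", mean_depth=32.1, mean_gq=58.0, n_called=9900, n_missing=100),
    SampleQC("S2", mean_depth=8.4, mean_gq=21.5, n_called=9000, n_missing=1000),
    SampleQC("S3", mean_depth=29.7, mean_gq=55.2, n_called=9400, n_missing=600),
]
flagged = flag_low_quality(stats)
print(flagged)
```

In practice the thresholds would depend on the sequencing protocol, and comparing the same statistics across sequencing batches is one way to look for batch effects.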

Population Genomics - Two new open TileDB-VCF public datasets released

TileDB has made public TileDB-VCF variant stores of dog genomes.

TileDB has also released 5.67 million SARS-CoV-2 sample sequence variants collected as part of the NIH Accelerating COVID-19 Therapeutic Interventions and Vaccines Tracking Resistance and Coronavirus Evolution (ACTIV TRACE) initiative. The dataset is a single publicly available TileDB-VCF store!

To work with these datasets, you can sign up to TileDB Cloud and launch your first notebook. We have a quickstart notebook for the basics of TileDB-VCF you can launch to get started even quicker!

For more advanced workflows, check out our SARS-CoV-2 notebook example, which uses a distributed query to search for common variants like D614G and Omicron.

Reach out to us if you are interested in publishing notebooks on these open datasets!

Single Cell - Blockwise iteration for memory-efficient analysis of large datasets

TileDB-SOMA now supports block-processing of data, empowering users to efficiently process large datasets without loading the entire dataset into memory. This new API provides a flexible method to iterate over your data in blocks, specifying your preferred format and dimensions. Read more in the SOMA documentation here!

It supports both row-wise and column-wise data access, ensuring that you can process your data exactly how you need it. Iterators can be created directly from any tiledbsoma.SparseNDArray or via a query object, to process a specific subset of data. Access the data in the format you need, including Apache Arrow tables or re-indexed sparse matrices.
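The idea behind blockwise iteration can be sketched in plain Python. The toy class below is not the tiledbsoma API (please refer to the SOMA documentation for the real interface); `ToySparseMatrix`, its `read` method, and the data are hypothetical stand-ins. The sketch shows the access pattern only: each step fetches just the entries whose coordinate along the chosen axis falls in the current block, so only one block needs to be held at a time.

```python
class ToySparseMatrix:
    """Hypothetical stand-in for a sparse array kept in external storage."""

    def __init__(self, shape, entries):
        self.shape = shape
        self._entries = entries  # list of (row, col, value) triples

    def read(self, axis, start, stop):
        # Stand-in for a sliced read that fetches one coordinate range.
        return [e for e in self._entries if start <= e[axis] < stop]

    def blockwise(self, axis, size):
        """Yield ((start, stop), entries) for consecutive ranges along `axis`.

        axis=0 gives row-wise blocks, axis=1 gives column-wise blocks.
        """
        for start in range(0, self.shape[axis], size):
            stop = min(start + size, self.shape[axis])
            yield (start, stop), self.read(axis, start, stop)


m = ToySparseMatrix(
    (5, 3),
    [(0, 0, 1.0), (0, 2, 2.0), (1, 1, 3.0), (3, 0, 4.0), (4, 2, 5.0)],
)

# Row sums computed two rows at a time.
row_sums = {}
for (start, stop), block in m.blockwise(axis=0, size=2):
    for r, c, v in block:
        row_sums[r] = row_sums.get(r, 0.0) + v
print(row_sums)  # {0: 3.0, 1: 3.0, 3: 4.0, 4: 5.0}

# Column-wise blocks use the same iterator with axis=1.
col_ranges = [rng for rng, _ in m.blockwise(axis=1, size=2)]
print(col_ranges)  # [(0, 2), (2, 3)]
```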

To get started, sign up for TileDB Cloud, launch your first notebook on a Genomics image, and run this notebook example.

TileDB Bioimaging - NDPI ingestion support

TileDB Bioimaging now supports ingestion of NDPI images through the tiledb.bioimg.from_bioimg API in TileDB-Bioimaging v0.2.11.

To get started with your own NDPI images, launch a notebook on TileDB Cloud using our default Genomics image and this notebook example.

TileDB Point Cloud - Task graph ingestion for Point Clouds

Users can now automatically ingest point cloud data (LAS/LAZ formats) programmatically into TileDB Cloud as arrays, with just a few lines of code! The function linked below is flexible: it can be invoked either by listing the files to be ingested or by pointing to the object store location that you wish to ingest. It also works for ingesting multiple files from a single bucket, all in parallel! Check out our example notebook to see this in action. You can launch it on a Cloud server and try it out on your own data!

TileDB Rasters - Task graph ingestion for Rasters

Along with point cloud data, users can also automatically ingest raster formats programmatically to TileDB Cloud. Again, this function can be invoked by either listing the files to be ingested or pointing to the object store that you wish to ingest. The raster ingestion handles overlapping raster datasets and supports the specification of the NODATA values needed for indicating either transparency or no actual data.
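To illustrate why NODATA values matter when rasters overlap, here is a small plain-Python sketch that pastes two overlapping tiles onto one grid. It does not use or describe the TileDB ingestion function; the `mosaic` helper, the NODATA value of -9999, and the "later tile wins" rule are assumptions chosen for illustration.

```python
NODATA = -9999  # assumed sentinel value; real datasets define their own


def mosaic(tiles, width, height, nodata=NODATA):
    """Paste tiles onto a width x height grid, skipping nodata pixels.

    Each tile is (x_offset, y_offset, rows), where rows is a list of lists.
    Where valid pixels overlap, the later tile wins (an assumed convention).
    """
    grid = [[nodata] * width for _ in range(height)]
    for x0, y0, rows in tiles:
        for dy, row in enumerate(rows):
            for dx, value in enumerate(row):
                if value != nodata:
                    grid[y0 + dy][x0 + dx] = value
    return grid


# Two 2-row tiles that overlap in column 2.
left = (0, 0, [[1, 1, 1], [1, 1, NODATA]])
right = (2, 0, [[NODATA, 2], [2, 2]])
result = mosaic([left, right], width=4, height=2)
for row in result:
    print(row)
# [1, 1, 1, 2]
# [1, 1, 2, 2]
```

Without the NODATA check, the empty corner of the right-hand tile would overwrite a valid pixel from the left-hand tile in the overlapping column.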

To get started, head over to TileDB Cloud and launch our example notebook, and follow the steps to ingest your own raster data!

TileDB Tables - New Rust bindings with 45% coverage of TileDB Array APIs

TileDB has open-sourced new Rust bindings with approximately 45% coverage of the TileDB C API, including the key Array Schema, Query, and Config APIs. We have a set of usage examples covering array creation and querying, and extensive property-based tests for the wrapped APIs. Suggestions, bug reports, and contributions are very much welcome. We’d love to hear from you on GitHub or contact [email protected] if you have questions or interesting use cases for TileDB with Rust!

Thank you,

— The TileDB Team
