Scale & Simplify Discovery with Single-Cell Omics

Unlocking the potential of single-cell data with TileDB

Even the most aggressive cancer begins with a single malignant cell. With this in mind, the better we can study and understand data from individual cells, the better positioned we are to develop breakthrough cancer therapies. The large-scale, high-resolution data generated by single-cell sequencing has become critical to the discovery and development pipeline for cancer treatments.

But to unlock the potential of single-cell data for target discovery, you need technology that can manage and analyze this frontier data at scale. This is challenging for conventional tabular databases, which struggle with the higher computing demands and complexity of single-cell data. To address this critical need, TileDB created Carrara, a powerful and flexible database solution architected around multi-dimensional arrays, as well as TileDB-SOMA, a special-purpose database system for mastering single-cell data. Let’s walk through how TileDB empowers single-cell researchers to overcome their unique data challenges.

The vital role of SOMA in single-cell research

Like most areas of systems biology research, single-cell research lacks a universal standard for storing multimodal data. This leads many data toolkits to use their own format for single-cell data, making it difficult to share and aggregate data across teams or organizations. Adding to the complexity, these toolkit-specific formats typically require loading the entire dataset into memory, which is increasingly infeasible as datasets grow in size. Finally, these varying formats are not optimized for cloud object stores, which have become the preferred and most economical storage option for large-scale data. In short, the immense potential of single-cell data to drive oncology breakthroughs is often unrealized because of data technology shortfalls.

This is why TileDB partnered with the Chan Zuckerberg Initiative (CZI) to develop a scalable, efficient and user-friendly storage solution for single-cell genomics data. The collaboration aims to address the challenges posed by the rapidly growing volume and complexity of single-cell data, enabling researchers to focus more on scientific discovery and less on data management. The outcome of this collaboration is two projects:

The SOMA (Stack Of Matrices, Annotated) project is a language-agnostic data model and API specification for single-cell data, offering a scalable, efficient, and user-friendly solution for storing and processing single-cell omics data.
The TileDB-SOMA project is an implementation of SOMA that uses TileDB as the backend storage and processing engine. TileDB-SOMA is open source (under the MIT License) and takes advantage of TileDB Carrara’s powerful multi-dimensional array engine.
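
To make the "Stack Of Matrices, Annotated" idea more concrete, here is a minimal, purely illustrative sketch in plain Python. It is not the SOMA specification or the TileDB-SOMA API: the class name AnnotatedMatrixStack, its fields and its methods are hypothetical, and the cells, genes and counts are made up. It only shows the general shape of the model, in which per-cell and per-gene annotations sit alongside one or more sparse matrices that share the same cell and gene identifiers.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


# Hypothetical stand-in for the data model; NOT the tiledbsoma API.
@dataclass
class AnnotatedMatrixStack:
    obs: dict[str, dict[str, str]]  # cell id -> cell annotations
    var: dict[str, dict[str, str]]  # gene id -> gene annotations
    # layer name -> sparse matrix stored as {(cell id, gene id): value}
    layers: dict[str, dict[tuple[str, str], float]] = field(default_factory=dict)

    def add_layer(self, name: str, entries: dict[tuple[str, str], float]) -> None:
        """Add a matrix layer, checking that it only uses known cells and genes."""
        for cell_id, gene_id in entries:
            if cell_id not in self.obs or gene_id not in self.var:
                raise KeyError(f"unknown cell or gene: {(cell_id, gene_id)}")
        self.layers[name] = dict(entries)

    def query(
        self, layer: str, obs_filter: Callable[[dict[str, str]], bool]
    ) -> dict[tuple[str, str], float]:
        """Return only the entries of a layer whose cell passes the filter."""
        keep = {cell_id for cell_id, ann in self.obs.items() if obs_filter(ann)}
        return {key: value for key, value in self.layers[layer].items() if key[0] in keep}


# Made-up toy data: two cells, two genes, one layer of raw counts.
stack = AnnotatedMatrixStack(
    obs={"cell_1": {"cell_type": "T cell"}, "cell_2": {"cell_type": "B cell"}},
    var={"CD3E": {"feature_type": "gene"}, "CD19": {"feature_type": "gene"}},
)
stack.add_layer("raw", {("cell_1", "CD3E"): 5.0, ("cell_2", "CD19"): 3.0})

# Select the counts for T cells only, without touching the other cells.
t_cell_counts = stack.query("raw", lambda ann: ann["cell_type"] == "T cell")
```

The point of the sketch is the access pattern: a query names an annotation filter and a layer, and only the matching slice is returned, rather than the whole dataset being loaded first.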

How TileDB-SOMA empowers single-cell researchers

Built for TileDB Carrara, the TileDB-SOMA implementation is optimized for cloud object stores, interoperable with popular tools and languages, and highly scalable for atlas-scale data. Here’s how TileDB’s solutions address the challenges of working with single-cell genomics data:

Interoperable: TileDB-SOMA offers efficient implementations in both Python and R, and tightly integrates with Seurat, Bioconductor and scanpy. This enables single-cell data scientists and researchers to work with TileDB-SOMA in the languages they prefer.
Optimized for object stores: TileDB-SOMA inherits TileDB’s cloud-native array format, which is optimized for object stores (such as Amazon S3, Google Cloud Storage, Azure Blob Storage and MinIO). This helps TileDB-SOMA operate efficiently on whichever cloud or object storage infrastructure single-cell researchers use.
Highly scalable: TileDB-SOMA is proven to handle tens of millions of cells in the Chan Zuckerberg Initiative’s CellxGene, and can scale to hundreds of millions of cells when coupled with TileDB Cloud’s distributed computing engine. This gives research teams confidence that TileDB-SOMA can analyze the large data volumes that single-cell research requires.
Support for spatial transcriptomics: TileDB Carrara offers built-in support for spatial transcriptomics, combining the precision of single-cell data with the context of imaging data to determine spatial relationships. This means researchers can efficiently write and access large spatial datasets both locally and in cloud storage through a centralized data store optimized for long-term cost effectiveness.
Vector search for cell similarity and annotation: TileDB's vector search capabilities enable automated cell type annotation and interactive analysis for deeper biological insights. These advances transform single-cell research by streamlining reference mapping workflows and simplifying how researchers explore their data.
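
To illustrate the general idea behind similarity-based cell type annotation, here is a minimal sketch in plain Python. It is a brute-force nearest-neighbor vote, not TileDB's vector search API: the function names, the three-dimensional embeddings and the labels are all made up for illustration, and a production vector search would typically rely on an index rather than comparing a query against every reference cell.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def annotate(query, reference, k=3):
    """Label a query cell by majority vote among its k most similar reference cells.

    `reference` is a list of (embedding, cell type label) pairs.
    """
    ranked = sorted(
        reference,
        key=lambda item: cosine_similarity(query, item[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Made-up reference embeddings with known labels.
reference = [
    ([0.9, 0.1, 0.0], "T cell"),
    ([0.8, 0.2, 0.1], "T cell"),
    ([0.1, 0.9, 0.2], "B cell"),
    ([0.0, 0.8, 0.3], "B cell"),
]

# An unlabeled query cell whose embedding lies close to the T cell references.
predicted = annotate([0.85, 0.15, 0.05], reference, k=3)
```

In this toy setup, reference mapping reduces to two steps: embed the new cell in the same space as the labeled reference, then transfer the label of its closest neighbors.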

These capabilities led therapeutics company Cellarity to choose TileDB as their FAIR platform to empower their cell-centric approach to drug discovery. To support their deep learning models, Cellarity’s data science and visualization team needed to analyze transcriptomic data from hundreds of millions of single cells. However, their file-based storage approach failed to deliver the scale and functionality they needed, leading to tedious data wrangling across teams of engineers and scientists.

To learn how Cellarity unlocked the potential of their single-cell data with TileDB and built a single-cell atlas in less than an hour, read the full case study.
