
Dec 20, 2024

Training Models on Atlas-Scale Single-Cell Datasets

Aaron Wolen


Single Cell Product Manager

If you couldn’t join us live, you can watch the full webinar recording here.

In this session, we explored how TileDB, in collaboration with the Chan Zuckerberg Initiative (CZI), is addressing the challenges of managing large-scale single-cell data to enable cutting-edge research in life sciences. The webinar provided a deep dive into innovative approaches for storing, analyzing, and sharing multimodal data at scale, with a focus on the latest advancements in single-cell research.

Highlights from the Webinar

As datasets grow exponentially, especially in single-cell research, scientists face challenges with storage, access, and analysis. In the session, we highlighted TileDB’s unique capabilities in enabling researchers to work with datasets like the single-cell census, comprising nearly 90 million cells. By leveraging TileDB’s multidimensional array format and cloud-native architecture, researchers can easily query and analyze massive datasets without requiring local downloads.

Overcoming Key Data Challenges

Single-cell data brings several challenges, including:

  • Scalability: Handling datasets too large to fit in memory.
  • Interoperability: Allowing seamless collaboration between Python and R users.
  • Accessibility: Enabling direct analysis of cloud-hosted data without downloads.
  • Analysis Efficiency: Providing tools for faster, more efficient exploration and insights.

TileDB’s SOMA (Stack of Matrices, Annotated) platform offers solutions to these challenges through its powerful APIs and flexible, language-independent data model.

Introducing TileDB-SOMA-ML

We introduced tiledb-soma-ml, a new library that simplifies training machine learning models on single-cell data. This technical preview demonstrated how researchers can use PyTorch to train models at scale, with optimized data workflows for efficient shuffling, sampling, and analysis of complex datasets.
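To give a feel for why shuffling matters at this scale, here is a minimal sketch of one common approach to shuffling data that is too large to load at once: read contiguous chunks of rows, visit the chunks in random order, and shuffle rows within a small buffer of chunks. This is an illustration only, not the tiledb-soma-ml implementation or its API; the function name `chunk_shuffled_batches` and all of its parameters are hypothetical.

```python
import random
from typing import Iterator, List


def chunk_shuffled_batches(
    n_rows: int,
    chunk_size: int,
    chunks_per_buffer: int,
    batch_size: int,
    seed: int = 0,
) -> Iterator[List[int]]:
    """Yield mini-batches of row indices in an approximately random order.

    Hypothetical helper for illustration. Rows are grouped into contiguous
    chunks (cheap to read from disk or object storage), the chunk order is
    randomized, and rows are shuffled inside a buffer of a few chunks.
    """
    if min(n_rows, chunk_size, chunks_per_buffer, batch_size) < 1:
        raise ValueError("all size arguments must be positive")

    rng = random.Random(seed)

    # Contiguous chunks of row indices, e.g. [0..63], [64..127], ...
    chunks = [
        range(start, min(start + chunk_size, n_rows))
        for start in range(0, n_rows, chunk_size)
    ]
    rng.shuffle(chunks)  # randomize which chunks are read together

    carry: List[int] = []
    for i in range(0, len(chunks), chunks_per_buffer):
        # Only this buffer's rows need to be in memory at one time.
        buffer = [row for chunk in chunks[i : i + chunks_per_buffer] for row in chunk]
        rng.shuffle(buffer)  # randomize rows within the buffer
        carry.extend(buffer)
        while len(carry) >= batch_size:
            yield carry[:batch_size]
            carry = carry[batch_size:]

    if carry:
        yield carry  # final, possibly smaller, batch


if __name__ == "__main__":
    for batch in chunk_shuffled_batches(
        n_rows=10, chunk_size=2, chunks_per_buffer=2, batch_size=4
    ):
        print(batch)
```

Larger chunks favor read throughput, while a larger buffer gives a more thorough shuffle at the cost of memory. Each row index is still visited exactly once per pass.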

Vector Search for Cell Similarity and Annotation

We also discussed TileDB's vector search capabilities, which are transforming single-cell research by:

  • Allowing automated cell type annotation using nearest neighbor algorithms.
  • Enabling interactive analysis for deeper insights into cell similarity and tissue distribution.
  • Streamlining reference mapping workflows for new datasets.

These tools empower researchers to annotate and explore their data efficiently, helping them focus on driving discoveries.
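The idea behind nearest-neighbor annotation can be sketched in a few lines: embed each cell as a vector, find the labeled reference cells closest to an unlabeled query cell, and assign the majority label. The example below is a brute-force toy in plain Python, not TileDB's vector search API; the function name `annotate_by_knn`, the two-dimensional embeddings, and the labels are all made up for illustration. A real vector index would replace the exhaustive distance scan with an approximate search.

```python
import math
from collections import Counter
from typing import List, Sequence


def annotate_by_knn(
    query: Sequence[float],
    reference: List[Sequence[float]],
    labels: List[str],
    k: int = 3,
) -> str:
    """Return the majority label among the k reference cells nearest to `query`.

    Hypothetical helper for illustration: brute-force Euclidean distance
    over every reference embedding, followed by a majority vote.
    """
    if len(reference) != len(labels):
        raise ValueError("reference and labels must have the same length")
    if not 1 <= k <= len(reference):
        raise ValueError("k must be between 1 and the number of reference cells")

    # Rank reference cells by distance to the query embedding.
    nearest = sorted(
        range(len(reference)),
        key=lambda i: math.dist(query, reference[i]),
    )[:k]

    # Majority vote over the labels of the k nearest neighbors.
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy reference set: made-up 2-D embeddings with known cell types.
reference_embeddings = [
    (0.0, 0.1), (0.2, -0.1), (-0.1, 0.0),  # cluster labeled "T cell"
    (5.0, 5.1), (4.8, 5.2), (5.1, 4.9),    # cluster labeled "B cell"
]
reference_labels = ["T cell"] * 3 + ["B cell"] * 3

if __name__ == "__main__":
    print(annotate_by_knn((0.1, 0.0), reference_embeddings, reference_labels))
    print(annotate_by_knn((4.9, 5.0), reference_embeddings, reference_labels))
```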

Why This Matters

The explosion of single-cell and multimodal data is unlocking new possibilities in computational biology, but managing this data remains a significant challenge. TileDB’s scalable and cloud-native platform provides researchers with the tools they need to overcome these challenges, enabling faster workflows, seamless collaboration, and groundbreaking discoveries in life sciences.

Catch the Replay Now

See how TileDB is enabling the future of single-cell research by watching the webinar on-demand here.
