Sep 07, 2024

Training Models on Atlas-Scale Single Cell Datasets: Joint TileDB-CZI Workshop at scverse 2024

Devika Garg

Director, Life Sciences Product Marketing

CAMBRIDGE, MA - September 7, 2024 - As the field of single-cell RNA sequencing continues to evolve, researchers are increasingly interested in using these datasets to train foundation models for a wide range of applications. While training models on smaller datasets that fit into memory is relatively straightforward, scaling beyond a single machine presents significant technical challenges.

WHAT: TileDB, the database designed for scientific discovery, and the Chan Zuckerberg Initiative (CZI) will lead a workshop at the inaugural scverse Conference titled “Training Models on Atlas-Scale Single Cell Datasets.” Attendees will gain hands-on experience training models in Python, within laptop-sized memory, on CZI’s CELLxGENE Discover Census, a large dataset comprising 70 million cells, and will learn about the technologies and resources that make this possible.

WHO: This workshop will be co-led by Ryan Williams, SOMA Software Engineer at TileDB, who has extensive experience in analyzing bulk and single-cell genomics data; and Maximillan Lombardo, Senior Product Applications Scientist at the Chan Zuckerberg Initiative, who is responsible for collaborating with the CELLxGENE team to engage the single-cell community and enhance the adoption of CELLxGENE tools.

WHEN: Thursday, September 12, 2024, at 9 a.m. local time (60 minutes long).

WHERE: scverse Conference, Technical University of Munich, Munich, Germany; main conference room - MW 0350

STRUCTURE: The workshop will cover the following:

  • Section 1: TileDB. The open-source data format and storage engine that enables efficient indexing and retrieval of large datasets stored on remote object stores like AWS S3.
  • Section 2: SOMA. A language-agnostic data model and API specifically designed for storing and querying single-cell data using TileDB's format.
  • Section 3: CZ CELLxGENE Discover Census. The world's largest public resource providing standardized single-cell data to researchers worldwide.
  • Section 4: SOMA/Census PyTorch Loaders. Specialized data loaders for PyTorch that enable memory-efficient training through TileDB-SOMA's support for out-of-core data access.

About TileDB

TileDB is the foundational database designed for scientific discovery. Powered by shape-shifting arrays, TileDB resolves the complexity of multimodal data so scientists and data teams can effectively glean meaningful insights from it. For more information about TileDB, visit: tiledb.com.
