Director, Life Sciences Product Marketing
How can biotech organizations scale their data storage and analysis for single-cell data from tens of millions of cells? Why are standardization and data aggregation key to managing widely variable data across studies and repositories? What is the best way to prepare large-scale single-cell data for machine learning applications?
These were some of the complex questions tackled in TileDB’s recent tech talk with Sam Cooper, CTO and Co-Founder of Phenomic AI. Cooper shared Phenomic AI’s journey of transforming cancer therapies by scaling single-cell analysis, and walked through how TileDB delivered the robust data management their organization needed. In this recap blog, you will learn three key takeaways from the full webinar to guide small- to medium-sized biotech firms.
As a biotech company developing new therapeutics for solid cancers, Phenomic AI’s approach centers on single-cell biology. “Single cell biology is the first sort of technology that’s allowed us to understand the full set of cell states that exist inside solid human tissues,” Cooper said. Since its founding in 2017, Phenomic AI has built a large atlas of tissue data based on single-cell RNA sequencing from thousands of patient samples, 1,600 mouse samples and 500 spatial samples.
When Phenomic AI scaled its dataset to nearly 100 million cells, the team knew that the sheer size of the data demanded a technology change. Traditional tabular databases stored in flat files on Amazon S3 struggled to process single-cell data efficiently at this scale. What’s more, the combined datasets outgrew the memory of even large AWS instances, and the existing setup could not handle Phenomic AI’s complex metadata queries and single-cell access patterns.
These challenges led Phenomic AI to partner with TileDB as their database solution for storing and managing their massive dataset. TileDB enables Phenomic AI to handle the scale and complexity of their single-cell analysis and querying by storing the data in a shared and efficient multidimensional array. This has been immensely helpful in driving discovery from Phenomic AI’s single-cell data. As Cooper described, “We were excited to have a solution that lets us put it all in one place, in one format and be able to do big analysis without things breaking down.”
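To make the access pattern concrete, here is a minimal sketch of a metadata-filtered read against a sparse cell-by-gene matrix. It is a toy illustration in plain Python, not TileDB’s API or Phenomic AI’s schema: the gene names, metadata fields and helper functions (`select_cells`, `read_slice`) are all hypothetical.

```python
from collections import defaultdict

# Toy stand-in for a sparse cell-by-gene count matrix stored as
# (cell_index, gene_index, count) triples; zero counts are simply absent.
# All names and values below are hypothetical illustration data.
COUNTS = [
    (0, 0, 5), (0, 2, 1),
    (1, 1, 3),
    (2, 0, 2), (2, 1, 7),
    (3, 2, 4),
]

# Per-cell metadata, indexed by cell_index.
CELL_METADATA = [
    {"tissue": "lung", "cell_type": "fibroblast"},
    {"tissue": "lung", "cell_type": "T cell"},
    {"tissue": "colon", "cell_type": "fibroblast"},
    {"tissue": "colon", "cell_type": "T cell"},
]

GENES = ["COL1A1", "CD3E", "ACTA2"]


def select_cells(metadata, **filters):
    """Return the indices of cells whose metadata matches every filter."""
    return [
        i for i, row in enumerate(metadata)
        if all(row.get(key) == value for key, value in filters.items())
    ]


def read_slice(counts, cell_indices):
    """Collect only the non-zero entries that belong to the requested cells.

    This toy version scans every triple; an array engine would instead use
    its index to read just the relevant part of the stored data.
    """
    wanted = set(cell_indices)
    rows = defaultdict(dict)
    for cell, gene, value in counts:
        if cell in wanted:
            rows[cell][GENES[gene]] = value
    return dict(rows)


# Step 1: query the metadata. Step 2: read expression for matching cells only.
fibroblasts = select_cells(CELL_METADATA, cell_type="fibroblast")
expression = read_slice(COUNTS, fibroblasts)
```

The point of the two-step shape is that the metadata query decides which cells to load before any expression values are read, so the working set stays proportional to the slice rather than to the full atlas.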
Cooper unpacked Phenomic AI’s innovative approach to driving discovery in single-cell research, beginning with using AI to power transcriptomics analysis of human tissue samples and building a massive atlas of tissue data based on single-cell data from many different studies. By applying machine learning to integrate curated scRNA-seq data at scale, Phenomic AI is improving its discovery of novel stromal targets.
However, this growth in data led to huge increases in data processing demand, making it slower for Phenomic AI’s bioinformaticians and data scientists to query and analyze the data. To optimize their data infrastructure at the scale required for their massive single-cell dataset, Phenomic AI relies on TileDB’s platform and is transitioning to a dedicated data loader created by TileDB for added simplicity. “Optimizing our data infrastructure and increasing the amount of training data was the key thing in getting our models really accurate,” said Cooper. “We didn’t actually adjust the ML-architectures that much.”
Cooper also described how Phenomic AI created their massive combined dataset by drawing from 48 different studies as well as some smaller benchmark datasets. However, combining all these datasets led to challenges with data alignment and batch effects, which made it harder to create and share useful analyses. As TileDB’s Director of Product Marketing Devika Garg put it, “When there’s too much variability across different methodologies or different algorithms available for analysis, there’s a lot of custom formats happening. Every single experiment becomes a silo, and this lack of interoperability hinders collaboration within companies.”
To address this challenge, Phenomic AI developed an ML-powered data alignment pipeline designed to minimize errors from batch effects. Key to this pipeline was their modified scVI model built with adversarial training, which helped it outperform other methods like Geneformer in avoiding batch effect issues. This enabled Phenomic AI to use the combined dataset to predict cell type labels in held-out or missing datasets. Phenomic AI has also seen more effective collaboration with the trusted research environment that TileDB created, which manages all single-cell data, metadata and custom algorithms in a shared contextual platform for the Phenomic AI research team.
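Predicting labels in held-out datasets can be pictured as a leave-one-study-out evaluation: train on every study but one, then score predictions on the study that was left out. The sketch below is only an illustration of that evaluation loop. It assumes cells have already been embedded in a shared space, uses made-up two-dimensional embeddings, and swaps in a simple nearest-centroid classifier as a stand-in for Phenomic AI’s modified scVI model, which the talk recap does not detail.

```python
import math
from collections import defaultdict

# Hypothetical toy data: (study, cell_type, 2-D embedding of the cell).
CELLS = [
    ("study_a", "fibroblast", (0.9, 0.1)),
    ("study_a", "t_cell", (0.1, 0.8)),
    ("study_b", "fibroblast", (1.1, 0.0)),
    ("study_b", "t_cell", (0.0, 1.0)),
    ("study_c", "fibroblast", (1.0, 0.2)),
    ("study_c", "t_cell", (0.2, 0.9)),
]


def centroids(cells):
    """Average the embeddings of each cell type in the training cells."""
    grouped = defaultdict(list)
    for _, label, vector in cells:
        grouped[label].append(vector)
    return {
        label: tuple(sum(axis) / len(vectors) for axis in zip(*vectors))
        for label, vectors in grouped.items()
    }


def predict(vector, label_centroids):
    """Assign the label whose centroid is nearest (Euclidean distance)."""
    return min(
        label_centroids,
        key=lambda label: math.dist(vector, label_centroids[label]),
    )


def leave_one_study_out(cells):
    """Hold out each study in turn and report accuracy on the held-out cells."""
    scores = {}
    for held_out in sorted({study for study, _, _ in cells}):
        train = [cell for cell in cells if cell[0] != held_out]
        test = [cell for cell in cells if cell[0] == held_out]
        model = centroids(train)
        correct = sum(predict(vec, model) == label for _, label, vec in test)
        scores[held_out] = correct / len(test)
    return scores


accuracy_by_study = leave_one_study_out(CELLS)
```

If batch effects dominated the embedding, cells from a held-out study would sit far from the training centroids of their own type and accuracy on that study would drop, which is why this kind of split is a useful check on data alignment.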
To learn more about Phenomic AI’s scalable approach to single-cell data analysis with TileDB, watch the full webinar here.