Taming Frontier Data Part 1: How to efficiently process complex data queries at scale

Somewhere in your data could be a finding that leads to a breakthrough treatment. However, the sheer size and complexity of data from clinical trials, real world evidence, preclinical discovery, public data and other sources can make this exercise akin to finding a needle in a haystack—and the haystack is getting bigger by the year. Approximately 30% of the world’s data is being generated by the healthcare industry, and the compound annual growth rate of healthcare data is estimated at 36%. What’s more, a Tufts Center for the Study of Drug Development study found that a typical Phase III clinical trial creates 3.56 million data points.

In short, life sciences organizations are in a tough situation. Their frontier data, which is novel and highly valuable data drawn from new sources like genomics and transcriptomics, represents both significant opportunity and intimidating complexity. Regardless of their research focus, life sciences organizations face diverse challenges in mastering their unstructured data—but through new approaches to data management, innovative companies are finding a better way. In this first of four posts on solving tough data challenges in life sciences, we will explore how to efficiently process complex data queries at scale.

Life sciences organizations in areas like single-cell research face an unprecedented scale in the datasets they must manage and process, and those datasets are only getting larger. Data volumes that were once measured in tens of gigabytes have now grown beyond the memory constraints of single machines. To pursue cutting-edge approaches like multiomics and multimodal data, life sciences companies need to master their frontier data, which could contain a game-changing discovery that leads to a transformative new drug.

This is the challenge Phenomic AI faced. To improve their oncology target discovery, Phenomic AI scaled up their single-cell data volumes, growing from 2 million cells to approximately 30 million cells within one year. While this increased scale enabled more robust target discovery, the increased data processing demand slowed the workflows of their bioinformaticians and their ability to analyze new single-cell data. This made it difficult to scale complex metadata queries and support specific single-cell access patterns for Phenomic AI’s accelerated implementations of key tools such as differential gene expression.

The Phenomic AI bioinformatics team needed a platform to better manage their single-cell data workflows and master complex metadata queries at the scale of their research aims. Key requirements included:

A unified system with cataloging capabilities for all single-cell datasets, application metadata and experimental metadata.
A single platform for multiomics to support future plans spanning proteomics and spatial transcriptomic analysis.
Usability and ease of extracting, filtering and downsampling subsets of large datasets to accommodate analyses like differential gene expression at rapid speeds.
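
To make the third requirement concrete, here is a minimal sketch of the "filter by metadata, then downsample" pattern in plain Python. It is an illustration only and is not TileDB's API or Phenomic AI's implementation: the metadata fields (cell_type, tissue), the function names and the synthetic records are all assumptions made for this example.

```python
import random

# Hypothetical per-cell metadata; the field names and values are
# illustrative assumptions, not taken from any real dataset.
metadata = [
    {
        "cell_id": i,
        "cell_type": "T cell" if i % 3 == 0 else "fibroblast",
        "tissue": "tumor" if i % 2 == 0 else "normal",
    }
    for i in range(1000)
]


def filter_cells(records, **criteria):
    """Return the ids of cells whose metadata matches every criterion."""
    return [
        r["cell_id"]
        for r in records
        if all(r[key] == value for key, value in criteria.items())
    ]


def downsample(cell_ids, n, seed=0):
    """Pick at most n ids at random (reproducibly) and return them sorted."""
    if len(cell_ids) <= n:
        return sorted(cell_ids)
    return sorted(random.Random(seed).sample(cell_ids, n))


# Query the small metadata table first, then shrink the selection.
tumor_t_cells = filter_cells(metadata, cell_type="T cell", tissue="tumor")
subset = downsample(tumor_t_cells, n=50)
```

The point of the pattern is that only the cells named in subset would then be read from the expression store, so an analysis such as differential gene expression never needs the full dataset in memory.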

Phenomic AI found their solution in TileDB, which checked all these boxes for their current single-cell and future multiomics requirements. Instead of struggling under frontier data volumes that were too big to fit in memory, Phenomic AI is now able to efficiently process complex data queries at the scale of their research ambitions. “TileDB was the best database and platform out there for our cloud workflows and unique domain of single-cell research,” said Sam Cooper, CTO and Co-Founder at Phenomic AI. “TileDB delivered the analysis speed, scale and usability throughout our evaluations.”

Shifting all of Phenomic AI’s data into TileDB enabled them to scale from hundreds of thousands of single cells to tens of millions, creating an enormous, unified repository of data that they can very quickly query to identify new and exciting drug targets. In the next post on tackling tough life sciences data challenges, we will explore the best ways to ensure secure and effective collaboration across research and bioinformatics teams.

Explore our case study with Phenomic AI here.
