McKinsey reports one of the top challenges faced by pharma and medtech digital and analytics leaders is the lack of high-quality data sources and data integration. And as life sciences companies grow, aligning the work of their diverse research teams only becomes more complex. If these teams cannot work effectively together, they cannot unlock the full potential of their frontier data to power breakthroughs.
In this second of four posts on solving tough data challenges in life sciences, we will examine ways to simplify and improve collaboration across research and bioinformatics teams.
The chief cause of poor collaboration across research teams is not a failure to scale processes, nor clashing personalities or silos. Instead, it is the lack of a single, shared source of truth in their data.
Organizations that rely on a file-based approach to storing research data risk having different research teams work from different versions of the same dataset. This slows discovery by forcing teams to manually download and upload individual spreadsheets and to repeat work others have already done. Without concurrent access and version control, teams cannot efficiently query across multiple datasets or ensure traceability and reproducibility, making it difficult to make informed decisions and avoid mistakes. And because these datasets are not FAIR (findable, accessible, interoperable, and reusable), AI applications cannot simplify the collaboration process.
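To make the versioning gap concrete, here is a minimal, hypothetical Python sketch of a shared, versioned dataset catalog. This is an illustration of the concept only, not any real product's API: writes append immutable versions rather than overwriting files, every reader sees the same latest version, and older versions remain queryable for traceability and reproducibility.

```python
class Catalog:
    """Toy single-source-of-truth catalog: name -> append-only version history."""

    def __init__(self):
        self._versions = {}  # dataset name -> list of immutable versions

    def write(self, name, data):
        """Append a new version of a dataset; never overwrite in place."""
        history = self._versions.setdefault(name, [])
        history.append(data)
        return len(history) - 1  # version id, usable for reproducible reads

    def read(self, name, version=None):
        """Read the latest version by default, or 'time travel' to an older one."""
        history = self._versions[name]
        return history[-1] if version is None else history[version]


catalog = Catalog()
v0 = catalog.write("transcriptomics", {"cells": 100})
catalog.write("transcriptomics", {"cells": 250})  # an updated run

# Every team reads the same latest version...
assert catalog.read("transcriptomics") == {"cells": 250}
# ...while older versions stay available for reproducibility.
assert catalog.read("transcriptomics", v0) == {"cells": 100}
```

Contrast this with emailing spreadsheets: once a file is copied, there is no shared history, so two teams can silently diverge with no way to reconcile which copy is authoritative.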
These collaboration problems were holding back Cellarity’s cell-centric approach to drug discovery, which focuses on the entire cell instead of trying to reduce disease biology into a single molecular target. Cellarity’s data science and visualization team needed to analyze transcriptomic data from hundreds of millions of single cells to support their deep learning models. However, their file-based storage approach failed to deliver the scale and functionality they needed, leading to inefficient data wrangling across teams of engineers and scientists.
Cellarity needed to move beyond sharing individual datasets and instead create a single source of truth across all teams.
Cellarity chose TileDB to simplify data collaboration and reduce the data engineering burden across its teams. The new database enabled highly performant queries across datasets and experiments, as well as data and code collaboration, freeing Cellarity’s computational scientists to spend less time wrangling data and more time focusing on the science. “We believe that TileDB is a FAIR platform. With TileDB, we now have a catalog, a single source of truth, and we can always go back to it and update it at scale in parallel,” says James Gatter, software engineer at Cellarity. “TileDB’s compute power, and the ability to slice through TileDB arrays is really great. Before TileDB, reading across catalogs and trying to update them to conform to the newest standards of data would have taken the team a significant amount of time. With TileDB we are able to make those changes within an hour.”
Cellarity now enjoys improved computational performance and organizational efficiency through TileDB. With the data engineering burden reduced, its ML and computational scientists can focus on the science instead of tracking down the correct dataset. In the next post on tackling tough life sciences data challenges, we will look at how to prevent data storage and computing costs from overwhelming research IT budgets.
Explore our case study with Cellarity here.
If you’re interested, review part one on how to efficiently process complex data queries at scale.