For multimodal AI to bear fruit, biopharma teams need an unshakeable multimodal data foundation

Data, and biological data especially, has always been multimodal and mind-bogglingly complex. At the cellular level, it spans genomic, proteomic, metabolomic and transcriptomic measurements, as well as spatially resolved data from high-content imaging. At the level of human health, it can include phenotypic information from EHRs, patient-reported outcomes, remote monitoring sensors, real-world data and results from clinical trials.

But in this complexity lies immense possibility. We are now on the cusp of embracing all of that complexity to get specific and precise in our approach to human health.

We're seeing a shift from hypothesis-driven to data-driven discovery. Traditional approaches relied heavily on predetermined hypotheses, while today's most innovative companies are letting patterns in large, multimodal datasets guide their discovery process—often revealing biological relationships that weren't previously suspected.

But even as science and technology have expanded our ability to read and capture more of this data at a granular level, our capabilities to handle that data and mine it for insights haven’t scaled in kind. With the rise of LLMs and agents, the need to focus on multimodal AI is growing. Foundation models in biology train on large quantities of biological data, but before you can train them, it is essential to focus first on a multimodal data foundation.

These trends create both excitement and complexity for drug discovery teams, inspiring key questions:

How does your team effectively harness these massive, multimodal datasets?
Where does all your data live today?
How many copies of the same file are floating around in the abyss of fragmented storage across instruments and applications?
How can you build infrastructure that scales with your needs?
And how do you avoid expensive data engineering rabbit holes that distract from your core mission?
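
One of these questions, how many copies of the same file exist, can be answered fairly directly. As a minimal sketch, assuming your data sits on a file system you can walk from a single root, the Python below groups files by a SHA-256 hash of their contents. The function names are hypothetical, and object stores or instrument-specific formats would need their own handling.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file's bytes in chunks so large files never load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root: Path) -> dict[str, list[Path]]:
    """Group files under root by content hash; keep only hashes seen more than once."""
    by_hash: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            by_hash[sha256_of(path)].append(path)
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}
```

Calling `find_duplicates(Path("/data/instruments"))`, where the path is a placeholder for your own storage root, returns a mapping from each repeated hash to the files that share it. This only catches byte-identical copies; files that were re-exported or re-compressed will hash differently even when they carry the same data.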

From everything we discussed at the panel, here are a few salient takeaways:

Data and AI teams need to be tightly integrated with biologists and scientists to drive goal-oriented data strategies. If you have chosen to invest in an AI strategy, don’t let it be isolated from the business, and from biologists in particular.
In your pursuit of exciting AI projects, you will fail at least once—and that will drive the need for a comprehensive multimodal data strategy where data organization and metadata become critical. In other words, your AI projects will fail at production scale if they’re not rooted in a foundational future-ready data fabric.
We must balance data quality and quantity when training models. Models need a lot of data, which may not always be readily available. In these situations, you will face the question of whether to prioritize quality or quantity. Should you let your models impute missing data, or should you also lean on synthetic data? This will also drive the need to harmonize vast amounts of public data with your own datasets.
A FAIR (findable, accessible, interoperable, reusable) approach to data drives innovation rather than slowing you down. You need thoughtful data management processes to make sure your data can go far. Teams that succeed and avoid rework invest in practices that make data FAIR at the source and keep downstream analyses traceable through logging.
Pick a technology stack that serves your multimodal needs into the future. Data that captures the complexity of biology is changing with every new Nature Methods paper that gets published. ChIP-seq was hot years ago until it was overtaken by single-cell methodologies, which are now being followed by spatial techniques. What makes a real difference in your data strategy is having a future-ready data platform with the foundational capability to store and manage evolving, cutting-edge data in a form that is as close to analysis-ready as possible.

Investing in a lasting data strategy can serve everyone well. For data and AI teams, this can bring easier access to the large-scale data they need to build models, agents and GenAI applications. For bioinformaticians and computational teams, this can mean fewer headaches over tracking where data is, and being able to run end-to-end pipelines. For biologists, this can mean faster time to insights. Ultimately, for biopharma organizations, it can mean getting to market first and faster, with greater efficiency.

So before you ask “Why is the failure rate of our AI projects so high at production scale?” be sure to ask “Is our data strategy designed for our multimodal present and future?”

A heartfelt thank you to our panelists for bringing their insights to the table.

Ready to prepare your organization for the multimodal future? Download the buyer's guide to multimodal data platforms to learn more and unlock the full potential of your multimodal assets.
