Population genomics is a data management problem

Table Of Contents:

Introduction

The problem in population genomics

A solution template

The solution with TileDB

Building Scalable Genome-wide Analyses on TileDB Cloud

Integrative Clinical Genomic Workflows

Work in progress

Q&A

We recently hosted a webinar that delved into the immense challenges involved with managing and analyzing large-scale genomic variant data. The population genomics field continues to generate some of the largest biomedical datasets in the world, fueling new insights into genetic disease, and generating novel therapeutics that can be personalized to each individual.

In this webinar we make a bold claim: population genomics is a data management problem (or, at least, it is today). We argue the volume of data being generated today has reached a point that could significantly impede further scientific progress. Insights need to be generated by analyzing data at extreme scale (on the order of thousands or millions of genomes, which translates to many petabytes). Yet there is currently no solution that manages and analyzes this data in a practical way, in terms of performance, cost, ease of use and accessibility.

At TileDB, we have been working on this problem for a long time (since my days at Intel Labs and MIT, and the collaboration we had with the Broad Institute). The main goal of this webinar is twofold:

1. Carefully describe the roots of the problem, which extends beyond what most organizations are dealing with today.
2. Outline a foundational approach that is powerful enough to overcome today's data management challenges and flexible enough to provide a platform on top of which the next generation of genomic analyses can be built. Also present a solution with TileDB that proves the feasibility of this approach.

I was very fortunate to be joined by my colleague Aaron Wolen, who covered the technical aspects of the TileDB solution and provided several demonstrations of this solution in action. We were also immensely honored to have Dr. Stephen Kingsmore, President and CEO of the Rady Children’s Institute for Genomic Medicine, as a guest speaker. Dr. Kingsmore presented his pioneering work on genome-informed pediatric care and explained how TileDB can facilitate it.

You can find the full webinar recording below. In the following sections I break down the video into smaller clips and provide the gist for easier consumption. Enjoy!

Introduction

We are extremely privileged to have an amazing team that executed this challenging work and built a superb solution. We’re also very fortunate to have an outstanding partner, Helix, with whom we have been working for many years and who has helped us immensely in deeply understanding the use case and optimizing TileDB to the extreme.

The problem in population genomics

Superficially, the data management problem in population genomics stems from a reliance on VCF files for data storage. As has been pointed out many times, large collections of VCF files are extremely difficult to analyze at scale. Various attempts have been made to overcome VCF’s limitations, with mixed success.

However, there is a much deeper, more holistic problem: Data management is nowhere to be found; organizations are stuck dealing with flat files and custom tools. In essence, the whole data economics in Genomics is flawed. That includes:

1. Data production: The data is produced in a format (VCF) that inherently scales poorly and creates significant bottlenecks for basic operations such as filtering, aggregation and updates (the last being the so-called N+1 problem). An immediate consequence is that organizations spend an inordinate amount of time and money building custom solutions that revolve around wrangling VCF data and cobbling together a variety of tools to analyze the data at any kind of scale.
2. Data distribution: As flat files, VCF data is extremely difficult to govern and share. Organizations resort to sharing cloud buckets, or to file management systems that provide secure governance but lack analytics capabilities. Moreover, those who wish to share (or monetize) their data (and code) with other organizations are forced to build custom infrastructure and bear the entire cost of operations and maintenance.
3. Data consumption: Data access and analysis largely rely on domain-specific tools. This prevents the genomics community from fully leveraging the enormous and ever-growing set of tools from the broader Data Science ecosystem. And of course, performance is always an issue due to the format the data originally comes in (back to the production problem).

A solution template

So how do we solve the problems outlined above? Here is a foundational approach, a “solution template” if you will, that includes a set of recommendations for anyone building data systems for population-scale genomics (or any domain for that matter, but I am leaving that for a separate blog post):

1. Data production: The community really needs to depart from the VCF format. Variant data should be stored in a format that is analysis-ready, cloud-optimized, and interoperable with every single data science and analytics tool out there. The format should also be able to model other data types (such as tables, images, etc.) that could be combined with the variant data in large-scale GWAS. In other words, it should be generic and universal, not custom-made just for Genomics.
2. Data distribution: It is high time the community also departed from the notion of file sharing. The problem of data governance (and auditability) has long been solved in the Databases domain. What organizations need is a full-fledged data management platform, which offers governance, auditability and analytics, all at extreme scale. Moreover, the data management platform should shift the cost of access and analysis to the consumer, instead of placing it on the data owner. In other words, it should be able to offer “marketplace” functionality (for both data and code) built-in.
3. Data consumption: The data should be accessible and processable by any programming language and any tool, at any scale. Downloads and data wrangling should be absolutely eliminated. No complicated infrastructure should be necessary, as the data management platform under “data distribution” should take care of it.

Is the above even possible? To prove that it is, in the next section I explain how we solved the problem with TileDB, following the above approach.

The solution with TileDB

We took the following decisions with TileDB:

1. Data production: We built an open-source data format and storage engine around (dense and sparse) multi-dimensional arrays. That software is called TileDB Embedded and it can capture data of any type, from tables, to images, to genomic variants. Specifically for genomic data, we built another open-source library called TileDB-VCF, which effectively reduces querying a huge number of (g)VCF files to querying a 3D sparse array, which can be done very efficiently with TileDB.
2. Data distribution: We built a powerful data management platform called TileDB Cloud. This is responsible for scalable compute, governance at global scale, auditability and monetization. Its SaaS version completely eliminates the need for data owners to build infrastructure, and shifts the cost of operation to the consumers. TileDB Cloud goes way beyond genomic variants and unifies all data management needs, including clinical data and metadata that can now be efficiently fused with the genomic data. We managed to achieve that exactly because we normalized all data types under a single data model and format, the multi-dimensional array.
3. Data consumption: The TileDB data format is interoperable with a variety of programming languages (Python, R, C, C++, C#, Go, Java, JavaScript), databases (MariaDB, Presto/Trino) and computational frameworks (Spark, Dask). Users are now free to use any tool they are comfortable with, without being locked into domain-specific ecosystems. Moreover, accessing huge public datasets has become super easy, without necessitating building and running massive data infrastructures in-house.
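To give a feel for what "querying a 3D sparse array" means, here is a minimal toy sketch in plain Python. It is not the TileDB-VCF implementation or API. The class name, the method names, and the choice of (contig, position, sample) as the three dimensions are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple

# Hypothetical coordinate layout for illustration: (contig, position, sample).
Coord = Tuple[str, int, str]


@dataclass
class ToySparseVariantArray:
    """In-memory stand-in for a sparse 3D array: only non-empty cells are stored."""

    cells: Dict[Coord, dict] = field(default_factory=dict)

    def write(self, contig: str, pos: int, sample: str, **attrs) -> None:
        # Each variant call occupies one cell; empty coordinates cost nothing.
        self.cells[(contig, pos, sample)] = attrs

    def read_region(
        self, contig: str, start: int, end: int, samples: Iterable[str]
    ) -> List[Tuple[Coord, dict]]:
        """Return the cells inside [start, end] on `contig` for the given samples."""
        wanted = set(samples)
        # A linear scan keeps the toy short; a real engine would avoid scanning everything.
        hits = [
            (coord, attrs)
            for coord, attrs in self.cells.items()
            if coord[0] == contig and start <= coord[1] <= end and coord[2] in wanted
        ]
        return sorted(hits, key=lambda hit: hit[0])


array = ToySparseVariantArray()
array.write("chr1", 100, "S1", ref="A", alt="G", gt=(0, 1))
array.write("chr1", 250, "S2", ref="C", alt="T", gt=(1, 1))
array.write("chr2", 100, "S1", ref="G", alt="A", gt=(0, 1))

# One query expresses "these samples, this genomic range" as a slice of the array.
region_hits = array.read_region("chr1", 50, 300, samples=["S1", "S2"])
```

The point of the sketch is only the shape of the query: a single range selection over coordinates replaces opening and filtering many separate files.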

We have battle-tested our solution with our customers on datasets on the order of hundreds of TBs. The live tutorials Aaron presented in the webinar (included below) offer a good taste of the TileDB solution, but we are happy to provide more information and tutorials upon request.

How TileDB Facilitates Rady’s Clinical Newborn Screening Program

We were honored to feature Dr. Kingsmore in our webinar. He presented his great work around using population genomics data technologies to solve incredibly difficult problems for inpatient pediatric care. He also described how TileDB can facilitate the analysis and accessibility of large genomic datasets for his 72 other partner sites. We are super excited about our partnership with Dr. Kingsmore and Rady Children’s Institute for Genomic Medicine.

TileDB-VCF Basics

As mentioned above, TileDB-VCF is our purpose-built library for managing genomic variant data. As you’ll see in the video, it couldn’t be easier to get up and running: simply create a new TileDB-VCF dataset and point to the VCF files you want to ingest. The data is ingested losslessly, so you can always recreate the original VCFs. But storing this data in TileDB allows you to efficiently slice it by sample and/or genomic region, and rapidly add new samples to the dataset (solving the N+1 problem). And importantly, all of these operations can be parallelized and performed in the cloud, providing a powerful foundation for running genome-wide tasks across huge populations.

Notebook used:

https://cloud.tiledb.com/notebooks/details/TileDB-Inc/tutorialtiledbvcfbasics/preview
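Separately from the notebook, the ideas of lossless ingestion and cheap N+1 updates can be illustrated with a small toy in plain Python. This is a sketch under stated assumptions, not the TileDB-VCF API: `ingest_sample` and `export_sample` are hypothetical helpers, the store is an ordinary dictionary, and the input is assumed to be a single-sample VCF whose records are already sorted and have at most one record per position.

```python
from typing import Dict, List, Tuple

Coord = Tuple[str, int, str]  # hypothetical (contig, position, sample) key


def ingest_sample(
    store: Dict[Coord, List[str]],
    headers: Dict[str, List[str]],
    sample: str,
    vcf_lines: List[str],
) -> None:
    """Add one sample's records as new cells; existing cells are never rewritten."""
    headers[sample] = [line for line in vcf_lines if line.startswith("#")]
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        fields = line.split("\t")
        # Keep every original field so the record can be rebuilt exactly.
        store[(fields[0], int(fields[1]), sample)] = fields


def export_sample(
    store: Dict[Coord, List[str]], headers: Dict[str, List[str]], sample: str
) -> List[str]:
    """Rebuild the lines of one sample's VCF from the stored cells."""
    rows = sorted(
        ((coord, fields) for coord, fields in store.items() if coord[2] == sample),
        key=lambda row: row[0],
    )
    return headers[sample] + ["\t".join(fields) for _, fields in rows]


s1_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/1",
    "chr1\t250\t.\tC\tT\t60\tPASS\t.\tGT\t1/1",
]
s2_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS2",
    "chr1\t100\t.\tA\tG\t45\tPASS\t.\tGT\t1/1",
]

store: Dict[Coord, List[str]] = {}
headers: Dict[str, List[str]] = {}
ingest_sample(store, headers, "S1", s1_vcf)
snapshot = dict(store)

# The "N+1" step: adding a sample only appends cells for that sample.
ingest_sample(store, headers, "S2", s2_vcf)

roundtrip_ok = export_sample(store, headers, "S1") == s1_vcf
s1_untouched = all(store[key] == value for key, value in snapshot.items())
```

Contrast this with a merged multi-sample VCF, where every row carries one column per sample, so adding a sample means regenerating every row.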

Building Scalable Genome-wide Analyses on TileDB Cloud

Aaron demonstrated how analyses can be performed genome-wide using TileDB-VCF to partition the genome into discrete bins and TileDB Cloud to process each one independently and in parallel. In this case, a series of user-defined functions (UDFs) were orchestrated in a pipeline to calculate population-level metrics, filter variants using those metrics, and then perform an association analysis. All without having to download any data or fiddle with clusters. Furthermore, the variant data, notebooks, and UDFs were all registered on TileDB Cloud, so they’re easily shareable, allowing others to reproduce or extend your work.

Notebook used:

https://cloud.tiledb.com/notebooks/details/TileDB-Inc/tutorialtiledbvcfgwas/preview
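The bin-and-parallelize pattern described above can be sketched locally with the Python standard library. This is only an analogy: a thread pool stands in for serverless UDFs on TileDB Cloud, the function names (`make_bins`, `process_bin`, `run_genome_wide`) are hypothetical, the variants live in a plain list, and the sketch stops after the metric and filter steps without attempting an association analysis.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

# (contig, position, per-sample alternate-allele counts, each 0, 1 or 2)
Variant = Tuple[str, int, List[int]]
Bin = Tuple[str, int, int]


def make_bins(contig_lengths: Dict[str, int], bin_size: int) -> List[Bin]:
    """Split each contig into consecutive, non-overlapping 1-based bins."""
    bins = []
    for contig, length in contig_lengths.items():
        for start in range(1, length + 1, bin_size):
            bins.append((contig, start, min(start + bin_size - 1, length)))
    return bins


def process_bin(bin_: Bin, variants: List[Variant], min_af: float):
    """Compute the allele frequency of each variant in the bin and keep those >= min_af."""
    contig, start, end = bin_
    kept = []
    for v_contig, pos, dosages in variants:
        if v_contig != contig or not (start <= pos <= end):
            continue
        allele_freq = sum(dosages) / (2 * len(dosages))  # diploid samples assumed
        if allele_freq >= min_af:
            kept.append((v_contig, pos, allele_freq))
    return kept


def run_genome_wide(contig_lengths, variants, bin_size, min_af):
    bins = make_bins(contig_lengths, bin_size)
    # Each bin is independent, so the bins can be processed in parallel.
    with ThreadPoolExecutor() as pool:
        per_bin = list(pool.map(lambda b: process_bin(b, variants, min_af), bins))
    return [hit for hits in per_bin for hit in hits]


contig_lengths = {"chr1": 1000, "chr2": 500}
variants = [
    ("chr1", 100, [0, 1, 1, 0]),
    ("chr1", 450, [0, 0, 0, 0]),
    ("chr1", 900, [2, 2, 1, 1]),
    ("chr2", 420, [0, 0, 0, 1]),
]
bins = make_bins(contig_lengths, bin_size=400)
common_variants = run_genome_wide(contig_lengths, variants, bin_size=400, min_af=0.2)
```

Because no bin depends on another, the same structure scales from a thread pool on a laptop to many independent workers.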

Integrative Clinical Genomic Workflows

We can use this same fundamental approach to quickly find a specific set of variants in a mountain of variant data. For customers like Rady, this means building queries that integrate many different types of data, including patient data, results from internal statistical analyses, and external genomic annotations. The power of TileDB's universality is that all of these data types can be stored in a single format and integrated into a sophisticated query to find and annotate the relevant results. In this final demo, you’ll see how to enrich a parallelized VCF query by incorporating sample metadata from the Thousand Genomes Project and gene/exon annotations from Ensembl, all using serverless UDFs to build a single distributed task graph.

Notebook used:

https://cloud.tiledb.com/notebooks/details/TileDB-Inc/Genomics-Workflow-Example/preview
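As a rough illustration of what "integrating" these data types involves, the toy below joins variant records with sample metadata and gene annotations in plain Python. Everything in it is made up for the example: the `Gene` type, the `annotate` function, the sample IDs, the population labels and the gene names. It does not use TileDB, the Thousand Genomes data or Ensembl.

```python
from typing import Dict, List, NamedTuple


class Gene(NamedTuple):
    name: str
    contig: str
    start: int
    end: int


def annotate(
    variants: List[dict],
    sample_meta: Dict[str, dict],
    genes: List[Gene],
    population: str,
) -> List[dict]:
    """Keep variants from samples in `population` that fall inside a gene, with annotations."""
    rows = []
    for variant in variants:
        meta = sample_meta.get(variant["sample"])
        if meta is None or meta["population"] != population:
            continue  # join against sample metadata
        overlapping = [
            gene.name
            for gene in genes
            if gene.contig == variant["contig"] and gene.start <= variant["pos"] <= gene.end
        ]
        if overlapping:  # join against genomic annotations
            rows.append({**variant, "population": population, "genes": overlapping})
    return rows


variants = [
    {"contig": "chr1", "pos": 150, "sample": "S1", "alt": "G"},
    {"contig": "chr1", "pos": 150, "sample": "S2", "alt": "G"},
    {"contig": "chr1", "pos": 900, "sample": "S1", "alt": "T"},
]
sample_meta = {"S1": {"population": "POP_A"}, "S2": {"population": "POP_B"}}
genes = [Gene("GENE_A", "chr1", 100, 200), Gene("GENE_B", "chr2", 100, 200)]

annotated = annotate(variants, sample_meta, genes, population="POP_A")
```

In the demo these joins run as distributed tasks over data stored in one format; the sketch only shows the logical shape of the query.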

Work in progress

We have a long list of features and optimizations coming up. You can also suggest a feature and tell us about your needs on our feedback site. This raises an additional important point: we argue that the community should depart from domain-specific and inflexible formats, as well as custom-made tools, and transition to generic storage engines and data management platforms that can evolve and encompass future technological advancements, without affecting the user APIs and downstream applications. For example, by adopting an engine like TileDB, which will always keep up-to-date with the latest storage backends, computational hardware and data science tools, you will never see your custom formats and tools being rendered obsolete and requiring a complete rearchitecting due to a particular technology shift. TileDB gets inspired and improved by customers and challenging applications that span way beyond Genomics, and all optimizations are inherited and enjoyed by all users regardless of the application domain. So stay tuned!

Q&A

We were happy to receive a lot of questions from the audience during the live session. We are of course very happy to answer your questions offline, and there are several ways to contact us.
