We recently hosted a webinar that delved into the immense challenges of managing and analyzing large-scale genomic variant data. The population genomics field continues to generate some of the largest biomedical datasets in the world, fueling new insights into genetic disease and enabling novel therapeutics personalized to each individual.
In this webinar we make a bold claim: population genomics is a data management problem (or, at least, it is today). We argue that the volume of data being generated has reached a point where it could significantly impede further scientific progress. Insights must be generated by analyzing data at extreme scale (on the order of thousands or millions of genomes, which translates to many petabytes), and there is currently no solution for managing and analyzing this data in a practical way, in terms of performance, cost, ease of use, and accessibility.
At TileDB, we have been working on this problem for a long time (since my days at Intel Labs and MIT, and our collaboration with the Broad Institute). The goal of this webinar is twofold:
I was very fortunate to be joined by my colleague Aaron Wolen, who covered the technical aspects of the TileDB solution and provided several demonstrations of it in action. We were also immensely honored to have Dr. Stephen Kingsmore, President and CEO of the Rady Children’s Institute for Genomic Medicine, as a guest speaker. Dr. Kingsmore presented his pioneering work on genome-informed pediatric care and explained how TileDB can facilitate it.
You can find the full webinar recording below. In the following sections I break down the video into smaller clips and provide the gist for easier consumption. Enjoy!
We are extremely privileged to have an amazing team that executed this challenging work and built a superb solution. We are also very fortunate to have an outstanding partner in Helix, with whom we have worked for many years and who have helped us immensely in deeply understanding the use case and optimizing TileDB to the extreme.
Superficially, the data management problem in population genomics stems from a reliance on VCF files for data storage. As has been pointed out many times, large collections of VCF files are extremely difficult to analyze at scale, and various attempts to overcome VCF’s limitations have met with mixed success.
However, there is a much deeper, more holistic problem: data management is nowhere to be found; organizations are stuck dealing with flat files and custom tools. In essence, the entire economics of data in genomics is flawed. That includes:
So how do we solve the problems outlined above? Here is a foundational approach, a “solution template” if you will, that includes a set of recommendations for anyone building data systems for population-scale genomics (or any domain for that matter, but I am leaving that for a separate blog post):
Is the above even possible? To prove that it is, in the next section I explain how we solved the problem with TileDB, following the above approach.
We made the following design decisions with TileDB:
We have battle-tested our solution with our customers on datasets on the order of hundreds of terabytes. The live tutorials Aaron presented in the webinar (included below) offer a good taste of the TileDB solution, but we are happy to provide more information and tutorials upon request.
We were honored to feature Dr. Kingsmore in our webinar. He presented his groundbreaking work on using population genomics technologies to solve incredibly difficult problems in inpatient pediatric care, and described how TileDB can facilitate the analysis and accessibility of large genomic datasets for his 72 other partner sites. We are super excited about our partnership with Dr. Kingsmore and Rady Children’s Institute for Genomic Medicine.
As mentioned above, TileDB-VCF is our purpose-built library for managing genomic variant data. As you’ll see in the video, it couldn’t be easier to get up and running: simply create a new TileDB-VCF dataset and point to the VCF files you want to ingest. The data is ingested losslessly, so you can always recreate the original VCFs. But storing this data in TileDB allows you to efficiently slice it by sample and/or genomic region, and rapidly add new samples to the dataset (solving the N+1 problem). And importantly, all of these operations can be parallelized and performed in the cloud, providing a powerful foundation for running genome-wide tasks across huge populations.
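To make the access pattern above concrete, here is a minimal, self-contained sketch of the two properties the paragraph describes: slicing by sample and/or genomic region, and adding a new sample as a pure append (the N+1 problem). This is a toy in-memory model only; the real library is TileDB-VCF, whose API differs, and the `VariantStore` class and sample data below are hypothetical.

```python
# Toy model of the TileDB-VCF access pattern described above.
# NOT the tiledbvcf API: VariantStore and its methods are illustrative only.
from collections import namedtuple

Variant = namedtuple("Variant", ["sample", "chrom", "pos", "ref", "alt"])

class VariantStore:
    """Toy columnar store: ingesting a new sample never rewrites old data."""
    def __init__(self):
        self._rows = []

    def ingest_sample(self, sample, variants):
        # The N+1 property: a new sample is simply appended.
        for chrom, pos, ref, alt in variants:
            self._rows.append(Variant(sample, chrom, pos, ref, alt))

    def read(self, region=None, samples=None):
        """Slice by genomic region and/or sample set."""
        chrom, lo, hi = region if region else (None, None, None)
        return [v for v in self._rows
                if (samples is None or v.sample in samples)
                and (region is None or (v.chrom == chrom and lo <= v.pos <= hi))]

store = VariantStore()
store.ingest_sample("HG00096", [("chr1", 10177, "A", "AC"), ("chr2", 500, "G", "T")])
store.ingest_sample("HG00097", [("chr1", 10352, "T", "TA")])  # the "+1" sample

hits = store.read(region=("chr1", 10000, 11000))
print([(v.sample, v.pos) for v in hits])  # → [('HG00096', 10177), ('HG00097', 10352)]
```

In TileDB-VCF the same idea is realized as a sparse array on disk or in object storage, so these slices translate to efficient range reads rather than full scans.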
Notebook used:
Aaron demonstrated how analyses can be performed genome-wide using TileDB-VCF to partition the genome into discrete bins and TileDB Cloud to process each one independently and in parallel. In this case, a series of user-defined functions (UDFs) were orchestrated in a pipeline to calculate population-level metrics, filter variants using those metrics, and then perform an association analysis. All without having to download any data or fiddle with clusters. Furthermore, the variant data, notebooks, and UDFs were all registered on TileDB Cloud, so they’re easily shareable, allowing others to reproduce or extend your work.
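The partition-and-parallelize pattern from this demo can be sketched in a few lines. In the real demo each bin is processed by a serverless UDF on TileDB Cloud against TileDB-VCF data; here, as a stand-in, a thread pool processes synthetic variants, and the metric, bin size, and filter threshold are all made up for illustration.

```python
# Conceptual sketch of the demo's pipeline: partition the genome into
# bins, compute a per-bin metric in parallel, then filter on that metric.
# ThreadPoolExecutor stands in for TileDB Cloud serverless UDFs.
from concurrent.futures import ThreadPoolExecutor

BIN_SIZE = 1_000_000  # illustrative bin width

def make_bins(chrom_length, bin_size=BIN_SIZE):
    """Split one chromosome into half-open [start, end) bins."""
    return [(s, min(s + bin_size, chrom_length))
            for s in range(0, chrom_length, bin_size)]

def allele_frequency(bin_range, variants, n_samples):
    """Per-bin 'UDF': alt-allele frequency for variants falling in this bin."""
    start, end = bin_range
    return {pos: ac / (2 * n_samples)  # diploid: 2 alleles per sample
            for pos, ac in variants.items() if start <= pos < end}

# Synthetic data: position -> alt allele count, across 10 samples.
variants = {120_000: 4, 950_000: 18, 1_200_000: 2, 2_500_000: 11}

bins = make_bins(3_000_000)
with ThreadPoolExecutor() as pool:
    results = pool.map(lambda b: allele_frequency(b, variants, 10), bins)

# Merge per-bin results, then filter (the pipeline's second stage).
freqs = {pos: af for chunk in results for pos, af in chunk.items()}
common = {pos for pos, af in freqs.items() if af >= 0.5}
print(sorted(common))  # → [950000, 2500000]
```

Because each bin is independent, the same code scales out trivially: on TileDB Cloud the `pool.map` becomes a distributed task graph and no data is ever downloaded to the client.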
Notebook used:
We can use this same fundamental approach to quickly find a specific set of variants in a mountain of variant data. For customers like Rady, this means building queries that integrate many different types of data, including patient data, results from internal statistical analyses, and external genomic annotations. The power of TileDB's universality is that all of these data types can be stored in a single format and integrated into a sophisticated query to find and annotate the relevant results. In this final demo, you’ll see how to enrich a parallelized VCF query by incorporating sample metadata from the Thousand Genomes Project and gene/exon annotations from Ensembl, all using serverless UDFs to build a single distributed task graph.
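The enrichment step in this demo is essentially an interval join between a variant slice, per-sample metadata, and gene annotations. The sketch below is self-contained and the data is made up (the CFTR interval is only approximate); in the real demo the metadata comes from the Thousand Genomes Project, the annotations from Ensembl, and each stage runs as a serverless UDF in one distributed task graph.

```python
# Conceptual sketch: annotate a variant slice with sample metadata and
# overlapping gene intervals. All data below is illustrative.

# Variant slice: (sample, chrom, pos)
variants = [("HG00096", "chr7", 117_559_590),
            ("HG00171", "chr7", 117_480_025),
            ("HG00096", "chr7", 100_000)]

# Sample metadata (population labels, as in the Thousand Genomes Project)
metadata = {"HG00096": "GBR", "HG00171": "FIN"}

# Gene annotations: (chrom, start, end, gene); interval is approximate
genes = [("chr7", 117_479_963, 117_668_665, "CFTR")]

def annotate(sample, chrom, pos):
    """Join one variant against metadata and overlapping gene intervals."""
    hits = [g for (c, start, end, g) in genes
            if c == chrom and start <= pos <= end]
    return {"sample": sample, "pop": metadata.get(sample),
            "pos": pos, "genes": hits}

enriched = [annotate(*v) for v in variants]
for row in enriched:
    print(row)
```

Because all three inputs live in the same array format on TileDB Cloud, this join happens server-side in a single query rather than by downloading and stitching files together.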
Notebook used:
We have a long list of features and optimizations coming up. You can also suggest a feature and tell us about your needs on our feedback site.

This raises an additional important point: we argue that the community should move away from inflexible, domain-specific formats and custom-made tools, and transition to generic storage engines and data management platforms that can evolve to encompass future technological advancements without affecting user APIs and downstream applications. For example, by adopting an engine like TileDB, which will always keep up to date with the latest storage backends, computational hardware, and data science tools, you will never see your formats and tools rendered obsolete, and requiring a complete rearchitecting, because of a particular technology shift. TileDB is inspired and improved by customers and challenging applications well beyond genomics, and every optimization is inherited by all users regardless of application domain. So stay tuned!
We received a lot of questions from the audience during the live session. We are of course glad to answer more of your questions offline, and there are several ways to contact us.
Here are the slides I used in the webinar.
A few final remarks:
Last but not least, a huge thank you to the entire team for all the amazing work. I am merely a representative (and the exclusive recipient of complaints); all the credit always goes to our awesome team!