Would you rather cure one patient of a rare disease, or cure thousands suffering from a similar ailment? The answer is obvious, but scaling the discovery of rare disease treatments has proved challenging because of the sheer complexity involved in population genomics. National biobank initiatives, such as UK Biobank and All of Us, have yielded vast repositories with hundreds of thousands of whole human genomes. According to Illumina, at least 4 million individuals have been fully sequenced, with each individual’s genome yielding 4–5 million single nucleotide polymorphisms (SNPs) as well as many other types of variation.
Despite the complexity, there is immense promise in this frontier data. "Analyzing a diverse set of omics data of thousands of patients can identify novel mechanisms of diseases. Discovering these new paths and eventual targets will allow us to develop innovative interventions to treat diseases," said Dr. Konstantinos Lazaridis, the Executive Director for the Mayo Clinic's Center for Individualized Medicine. However, traditional Variant Call Format (VCF) files have not been able to handle large-scale queries, leaving the promise of population genomics largely unmet.
To master the challenges of managing variant call data at scale, TileDB created an efficient and flexible VCF solution based on its multidimensional array architecture. Let’s examine how variant data enables new discoveries and how TileDB helps rare disease researchers make the most of their genomics data and overcome their data management challenges.
The complexity of population genomics data is both the source of its difficulty and its potential. Beyond biobanks, analogous large-scale sequencing endeavors with tens of millions of individuals have targeted agricultural crops, model organisms, microbes and even companion animals, producing extensive sets of variants. All these variants form the cornerstone of population genomics, which leverages sheer scale and genetic variability to drive breakthroughs in our grasp of biology and medicine.
Variant calling is the process of predicting genomic variants and genotypes from reads aligned to a reference genome. Algorithms weigh various quality metrics of these aggregated sequencing reads, which can be visualized as stacks or “pileups,” to generate calls at each position. These calls may include SNPs (single-base substitutions), small insertions or deletions (indels), and larger structural variants (SVs) and copy number variants (CNVs). To manage this complexity, the 1000 Genomes Project developed the Variant Call Format (VCF) to combine variant calls and genotype calls with technical metadata (depth, quality and confidence metrics) along with extensible annotation at the locus and sample level.
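As a rough illustration of the record layout described above, the following Python sketch parses one tab-delimited VCF data line and labels each alternate allele by comparing the lengths of REF and ALT. The `VariantRecord` class, the `parse_vcf_line` and `classify_allele` helpers, and the sample line are all hypothetical and written for this example; it assumes the FORMAT column contains a `GT` key, and real pipelines would normally rely on an established VCF parser rather than hand-rolled string splitting.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class VariantRecord:
    """Hypothetical container for the fields of one VCF data line."""
    chrom: str
    pos: int                    # 1-based position on the reference
    ref: str                    # reference allele
    alts: List[str]             # one or more alternate alleles
    qual: Optional[float]       # call quality, None when missing (".")
    info: Dict[str, str]        # locus-level annotation (INFO column)
    genotypes: Dict[str, str]   # sample name -> genotype string (GT)


def classify_allele(ref: str, alt: str) -> str:
    """Coarse label for an alternate allele, based on allele lengths only."""
    if alt in {".", "*"}:
        return "none"           # missing or spanning-deletion placeholder
    if alt.startswith("<") or "[" in alt or "]" in alt:
        return "symbolic"       # e.g. <DEL>, <DUP>, breakends (SVs/CNVs)
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) == len(alt):
        return "MNP"            # multi-base substitution
    return "indel"


def parse_vcf_line(line: str, sample_names: List[str]) -> VariantRecord:
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt, qual, _filter, info = fields[:8]

    # INFO is a ';'-separated list of KEY=VALUE pairs or bare flags.
    info_map: Dict[str, str] = {}
    if info != ".":
        for item in info.split(";"):
            key, _, value = item.partition("=")
            info_map[key] = value  # flags end up with an empty value

    # Sample columns follow the FORMAT column; assumes GT is present.
    genotypes: Dict[str, str] = {}
    if len(fields) > 9:
        gt_index = fields[8].split(":").index("GT")
        for name, column in zip(sample_names, fields[9:]):
            genotypes[name] = column.split(":")[gt_index]

    return VariantRecord(
        chrom=chrom, pos=int(pos), ref=ref, alts=alt.split(","),
        qual=None if qual == "." else float(qual),
        info=info_map, genotypes=genotypes,
    )


# Made-up example line with two samples and two alternate alleles.
line = "chr1\t12345\t.\tA\tG,AT\t50\tPASS\tDP=30\tGT:DP\t0/1:14\t1/2:16"
record = parse_vcf_line(line, ["S1", "S2"])
labels = [classify_allele(record.ref, alt) for alt in record.alts]
print(record.pos, record.genotypes, labels)
```

Even this small example shows why a single VCF line is awkward to query: locus-level metadata, per-sample genotypes and multiple alleles are all packed into one delimited string.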
VCF has proven an effective means of transmitting variants and associating annotation at the sample level. However, reviewing the VCF-tagged questions in Biostars, a popular Q&A site for bioinformatics, shows many people are trying to use VCF files as a kind of ad-hoc database instead of a conduit for transmitting variant information. This problematic approach suffers from limitations like:
This led TileDB to architect a population genomics solution around TileDB-VCF, an open-source library for efficient and lossless storage, access and export of variant data. Because TileDB-VCF is built on the TileDB Carrara array engine, it models population VCF data as 3-dimensional sparse arrays. In addition, TileDB Carrara facilitates secure data sharing and collaborative research through its trusted research environment. Some of TileDB-VCF’s most important benefits include:
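To make the idea of a 3-dimensional sparse array more concrete, here is a toy in-memory Python sketch. It is not the TileDB-VCF API and says nothing about how the library is implemented: the `ToySparseVariantArray` class and its methods are hypothetical, and the choice of contig, position and sample as the three dimensions is an assumption made for illustration. The point is only that a sparse array stores the cells where a variant was actually observed, and that a query becomes a slice along the dimensions instead of a scan through per-sample files.

```python
from typing import Any, Dict, Iterable, List, Optional, Tuple

# Assumed dimensions for this illustration: (contig, position, sample).
Coordinate = Tuple[str, int, str]


class ToySparseVariantArray:
    """Toy stand-in for a sparse 3D array; only non-empty cells are stored."""

    def __init__(self) -> None:
        self._cells: Dict[Coordinate, Dict[str, Any]] = {}

    def write(self, contig: str, pos: int, sample: str, **attrs: Any) -> None:
        """Store the attributes (ref, alt, genotype, ...) of one cell."""
        self._cells[(contig, pos, sample)] = attrs

    def read(
        self,
        contig: str,
        start: int,
        end: int,
        samples: Optional[Iterable[str]] = None,
    ) -> List[Dict[str, Any]]:
        """Slice by contig, inclusive position range and optional samples."""
        wanted = None if samples is None else set(samples)
        hits = []
        # Linear scan for clarity; a real array engine would use indexing.
        for (c, p, s), attrs in self._cells.items():
            if c != contig or not (start <= p <= end):
                continue
            if wanted is not None and s not in wanted:
                continue
            hits.append({"contig": c, "pos": p, "sample": s, **attrs})
        return sorted(hits, key=lambda h: (h["pos"], h["sample"]))


# Made-up cells: two samples share a variant at chr1:12345.
array = ToySparseVariantArray()
array.write("chr1", 12345, "S1", ref="A", alt="G", genotype="0/1")
array.write("chr1", 12345, "S2", ref="A", alt="G", genotype="1/1")
array.write("chr1", 99999, "S1", ref="C", alt="T", genotype="0/1")
array.write("chr2", 500, "S2", ref="G", alt="GA", genotype="0/1")

# Region-plus-sample query expressed as a slice over the three dimensions.
for cell in array.read("chr1", 12000, 13000, samples=["S2"]):
    print(cell)
```

Adding a new sample in this model means writing new cells rather than rewriting a merged multi-sample file, which is one reason an array layout can suit growing cohorts.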
These capabilities led Rady Children’s Hospital’s Institute of Genomic Medicine to adopt TileDB as a cost-effective and scalable database solution that could deliver a short turnaround time for diagnostics and efficiently manage large volumes of genomic variant data. They chose TileDB to store their VCF samples in a 3-dimensional array on Amazon S3 using the open-source TileDB-VCF library, making this data analysis-ready on cloud storage.
The results more than achieved the efficiency and cost-effectiveness Rady Children’s Institute of Genomic Medicine needed, with a striking 97% cost reduction compared to their legacy file-based approach. For more on how Rady Children’s Hospital is managing their genomic data at scale, read the full case study.