Feb 20, 2025

Population Genomics Data with TileDB

Genomics
Data Management
6 min read
Devika Garg

Director, Life Sciences Product Marketing

How to scale genomic data management with TileDB

Would you rather cure one patient of a rare disease, or cure thousands suffering from a similar ailment? The answer is obvious, but scaling the discovery of rare disease treatments has proved challenging because of the sheer complexity involved in population genomics. National biobank initiatives, such as UK Biobank and All of Us, have yielded vast repositories with hundreds of thousands of whole human genomes. According to Illumina, at least 4 million individuals have been fully sequenced, with each individual’s genome yielding 4–5 million single nucleotide polymorphisms (SNPs) as well as countless other variations.

Despite the complexity, there is immense promise in this frontier data. "Analyzing a diverse set of omics data of thousands of patients can identify novel mechanisms of diseases. Discovering these new paths and eventual targets will allow us to develop innovative interventions to treat diseases," said Dr. Konstantinos Lazaridis, the Executive Director of the Mayo Clinic's Center for Individualized Medicine. However, the traditional variant call format (VCF) has not been able to handle large-scale queries, leaving the promise of population genomics largely unmet.

To master the challenges of managing variant call data at scale, TileDB created an efficient and flexible VCF solution built on its multidimensional array-based architecture. Let’s examine how variant data enables new discoveries, and how TileDB empowers rare disease researchers to make the most of their genomics data and overcome their data management challenges.

How variant data and VCF files unlocked new possibilities in genomics

The complexity of population genomics data is the source of both its difficulty and its potential. Beyond biobanks, analogous large-scale sequencing efforts spanning tens of millions of samples have targeted agricultural crops, model organisms, microbes and even companion animals, producing extensive sets of variants. All these variants form the cornerstone of population genomics, which leverages sheer scale and genetic variability to drive breakthroughs in our grasp of biology and medicine.

Variant calling is the process of predicting genomic variants and genotypes from sequencing reads aligned to a reference genome. Algorithms weigh various quality metrics of these aggregated reads, which can be visualized as stacks or “pileups,” to generate calls at each position. These calls may include SNPs (single-base substitutions), small insertions or deletions (indels), as well as larger structural variants (SVs) and copy number variants (CNVs). To manage this complexity, the 1000 Genomes Project developed the variant call format (VCF), which combines variant calls and genotype calls with technical metadata (depth, quality and confidence metrics) along with extensible annotation at the locus and sample level.
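To make the record structure concrete, here is a minimal pure-Python sketch of parsing one illustrative VCF data line. The record values below are for illustration only, and production code should use a dedicated parser such as pysam or cyvcf2 rather than hand-rolled splitting:

```python
# Minimal sketch: parsing one illustrative VCF data line into its fields.
# Real VCF files also carry a ##-prefixed header that declares the meaning
# of each INFO and FORMAT key; that part is omitted here.

def parse_vcf_line(line: str) -> dict:
    """Split a tab-delimited VCF record into its fixed fields plus INFO."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info = fields[:8]
    # INFO is a semicolon-delimited bag of KEY=VALUE pairs (or bare flags) --
    # exactly the serialization bottleneck for annotation discussed below.
    info_dict = {}
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        info_dict[key] = value if value else True
    return {
        "chrom": chrom,
        "pos": int(pos),        # 1-based position on the reference
        "id": vid,
        "ref": ref,
        "alt": alt.split(","),  # multiple alternate alleles are comma-separated
        "qual": float(qual),
        "filter": flt,
        "info": info_dict,
    }

record = parse_vcf_line("chr1\t10177\trs367896724\tA\tAC\t100.0\tPASS\tDP=25;AF=0.425")
print(record["pos"], record["alt"], record["info"]["DP"])  # 10177 ['AC'] 25
```

Note that everything beyond the fixed columns (per-sample genotype columns, INFO annotations) must be squeezed into this single flat line, which is where the format starts to strain at scale.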

VCF has proven an effective means of transmitting variants and associating annotation at the sample level. However, reviewing the VCF-tagged questions in Biostars, a popular Q&A site for bioinformatics, shows many people are trying to use VCF files as a kind of ad-hoc database instead of a conduit for transmitting variant information. This problematic approach suffers from limitations like:

  • VCF and its associated command-line tools are not a database system, and they will never support region and sample queries at scale in the era of national biobanks. Even VCF’s usefulness in transmitting variants is unsustainable past a few thousand samples, and annotation can also be difficult since everything needs to be serialized into the INFO field.
  • VCF files are monolithic. Adding a new sample can introduce new variant sites, forcing every existing sample to be re-interrogated at each such position to determine whether it truly matches the reference or simply had insufficient coverage. This also slows performance when storing the new sample alongside the rest of the dataset, an issue known as the “N+1” problem.
  • While indexes provided by Tabix or BCFtools can aid in range queries, these don’t help with joins against phenotypic data or other omic stacks. Bespoke or ad hoc solutions typically try to simplify the underlying data (e.g. storing only genotype calls or a fixed set of loci) but these measures greatly limit the usefulness of the variant store.
  • The ongoing shift away from joint genotyping and toward single-sample gVCFs as the preferred interchange format further muddies the waters.
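The “N+1” problem above can be sketched with a toy merged-matrix model. The sample names and sites here are invented for illustration; a real merged VCF is a text matrix, but the bookkeeping burden is the same:

```python
# Toy illustration of the "N+1" problem with a joint (multi-sample) store:
# a merged table has one row per variant site and one genotype per sample,
# so a new sample that brings a novel site forces a decision ("0/0" vs "./.")
# for every sample already present.

def add_sample(combined, samples, name, calls):
    """Add one sample; return how many existing cells had to be revisited."""
    revisited = 0
    # 1) back-fill the new sample at every existing site
    for site in combined:
        combined[site][name] = calls.get(site, "./.")
    # 2) any novel site the new sample brings forces a decision for ALL
    #    prior samples: hom-ref or no-call? Without per-sample coverage
    #    (gVCF reference blocks) we can only record missing data.
    for site, gt in calls.items():
        if site not in combined:
            combined[site] = {s: "./." for s in samples}
            combined[site][name] = gt
            revisited += len(samples)
    samples.append(name)
    return revisited

combined = {("chr1", 10177): {"s1": "0/1"}}
samples = ["s1"]
n = add_sample(combined, samples, "s2", {("chr1", 99999): "1/1"})
print(n)  # 1: s1 had to be revisited at the novel site
```

Every new sample touches rows belonging to all prior samples, so ingest cost grows with cohort size rather than with the size of the new sample alone.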

TileDB-VCF: An efficient data management solution for variant data

This led TileDB to architect a population genomics solution around TileDB-VCF, an open-source library for efficient, lossless storage, access and export of variant data. Because TileDB-VCF is built on the TileDB Carrara array engine, it models population VCF data as 3-dimensional sparse arrays. In addition, TileDB Carrara facilitates secure data sharing and collaborative research through its trusted research environment. Some of TileDB-VCF’s most important benefits include:

  • Performance: TileDB-VCF is optimized for rapidly slicing variant records by genomic regions across multiple samples, with features implemented in C++ for speed.
  • Compressibility: TileDB-VCF efficiently stores samples in a compressed, lossless manner, using columnar format to apply different compressors based on data types.
  • Optimized for cloud: TileDB-VCF inherits features from the TileDB core array engine, ensuring speed and optimization for cloud storage like Amazon S3 and Google Cloud.
  • Solves the N+1 problem: TileDB-VCF enables you to rapidly add new samples, scaling storage and update time linearly. gVCFs are recommended for better handling of reference/no-call blocks.
  • Cohort level variant stats and allele count data: TileDB-VCF provides allele counts and zygosity for internal allele frequency calculations, facilitating summary transformations.
  • Separation of genomic data and annotation: TileDB supports external annotation tables for efficient queries and updates without revisiting original VCF files.
  • Multiple APIs: TileDB-VCF provides C++, Java, and Python APIs in addition to a command-line interface.
  • Integration with other omics data: TileDB-VCF links genomic data with transcriptomes for GxE or genotype-to-phenotype studies, and is compatible with TileDB SOMA for multimodal experiments.
  • Artificial Intelligence (AI) and Machine Learning (ML) support: TileDB Carrara facilitates AI/ML workflows, saving Tensorflow Keras, PyTorch, and Scikit-Learn models as TileDB arrays and empowering researchers with vector search capabilities and integrations with popular LLMs.
  • Security, governance, and compliance: TileDB Carrara offers user-configurable encryption for all data assets, in addition to any encryption policies defined at the storage level (e.g., for S3 buckets). It also helps you manage your variant data with configurable access policies to enable secure sharing and collaboration within your organization or across different organizations while logging all activity for auditing purposes.
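The 3-dimensional sparse model can be sketched in plain Python. This dict-and-bisect toy is purely conceptual (the real TileDB-VCF engine is a C++ sparse-array implementation and its API differs), but it shows why a region query across many samples becomes a sorted coordinate-range lookup rather than a scan over N separate files:

```python
# Conceptual sketch (NOT the real TileDB API): each variant record lives at a
# (contig, start, sample) coordinate in a sparse space, so slicing a genomic
# region across samples is a range selection over sorted coordinates.

from bisect import bisect_left, bisect_right

class SparseVariantStore:
    """Toy stand-in for a sparse array keyed by (contig, start, sample)."""

    def __init__(self):
        self.keys = []  # sorted (contig, start, sample) coordinates
        self.recs = {}  # coordinate -> variant record

    def ingest(self, sample, records):
        """Add one sample's records; existing cells are never rewritten,
        which is how the array layout sidesteps the N+1 problem."""
        for contig, start, record in records:
            key = (contig, start, sample)
            self.keys.append(key)
            self.recs[key] = record
        self.keys.sort()  # a real engine writes a sorted fragment per ingest

    def read(self, contig, start, end, samples=None):
        """Return all records in [start, end] on `contig`, across samples."""
        lo = bisect_left(self.keys, (contig, start, ""))
        hi = bisect_right(self.keys, (contig, end, "\uffff"))
        hits = self.keys[lo:hi]
        if samples is not None:
            hits = [k for k in hits if k[2] in samples]
        return [(k, self.recs[k]) for k in hits]

store = SparseVariantStore()
store.ingest("s1", [("chr1", 10177, "A>AC"), ("chr2", 500, "G>T")])
store.ingest("s2", [("chr1", 10352, "T>TA")])
# One range query pulls both chr1 records across samples, skipping chr2:
print(store.read("chr1", 10000, 11000))
```

Because each ingest only appends its own sample's cells, adding sample N+1 never rewrites samples 1..N, and region slices stay fast regardless of cohort size.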

These capabilities led Rady Children’s Institute for Genomic Medicine to adopt TileDB as a cost-effective and scalable database solution that could deliver a short turnaround time for diagnostics and efficiently manage volumes of genomic variant data at scale. The institute chose TileDB to store its VCF samples in a 3-dimensional array on Amazon S3 using the open-source TileDB-VCF library, making the data analysis-ready on cloud storage.

The results delivered the efficiency and cost-effectiveness Rady Children’s Institute for Genomic Medicine needed, with a striking 97% cost reduction compared to their legacy file-based approach. For more on how Rady Children’s is managing genomic data at scale, read the full case study.
