Customer Paper
Domain: Genomics
Datatypes: Variant data
Clinical context
For the past 20 years, newborn screening (NBS) has used mass spectrometry (MS) for biochemical analysis of dried blood spot samples, which are collected by pricking a baby's heel within the first few days of life. NBS-MS is a mature and well-understood process, but it detects only 37 core genetic disorders (plus a secondary list of 26, as of February 2023), according to the Recommended Uniform Screening Panel from the Health Resources and Services Administration (HRSA), a U.S. federal agency.
The team at Rady Children's Institute for Genomic Medicine (RCIGM) knew from clinical experience that rapid whole genome sequencing (rWGS) can identify a much wider range of genetic disorders across a much wider population of children. They assembled a group of experts, who identified 388 medically actionable genetic diseases to include in an expanded NBS-rWGS effort. With broader disorder coverage and reasonable clinical testing costs on the horizon, the next step was to evaluate the approach using historical data.
Testing TileDB queries
To gauge the accuracy of the expanded screening panel, the team assembled a test dataset of known genetic diagnoses. RCIGM collaborated with a range of clinical and biotechnology industry experts to evaluate the false-positive rate within a cohort of "4,376 critically ill children and their parents who received rWGS at RCIGM for diagnosis of suspected genetic disorders." The cohort's Variant Call Format (VCF) samples were ingested into a 3-dimensional TileDB array on Amazon S3 using the open-source TileDB-VCF library, making the data analysis-ready on cloud storage. The team refined its TileDB queries until the false-positive rate was acceptable. To ensure statistical significance, they then evaluated false positives against 454,707 whole exome sequences from the UK Biobank, bringing the rate to 0.27%.
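The storage model in this section can be pictured as a sparse array indexed along three dimensions, where only positions that actually carry a record take up space. The sketch below is a minimal, standard-library-only illustration of that idea and does not use the TileDB-VCF API. The dimension choice (contig, position, sample) and the names `Cell`, `SparseVariantStore`, `ingest`, and `query` are assumptions made for illustration.

```python
import bisect
from collections import defaultdict
from typing import NamedTuple


class Cell(NamedTuple):
    """One non-empty cell, addressed by (contig, pos, sample)."""
    contig: str
    pos: int
    sample: str
    ref: str
    alt: str


class SparseVariantStore:
    """Toy stand-in for a sparse 3-D array: empty cells are never stored."""

    def __init__(self) -> None:
        # contig -> list of (pos, sample, ref, alt), kept sorted by position
        self._rows = defaultdict(list)

    def ingest(self, sample, records):
        """Add one sample's records, given as (contig, pos, ref, alt) tuples."""
        for contig, pos, ref, alt in records:
            bisect.insort(self._rows[contig], (pos, sample, ref, alt))

    def query(self, contig, start, end, samples=None):
        """Return cells with start <= pos <= end, optionally limited to samples."""
        rows = self._rows.get(contig, [])
        lo = bisect.bisect_left(rows, (start,))
        hi = bisect.bisect_left(rows, (end + 1,))
        return [
            Cell(contig, pos, sample, ref, alt)
            for pos, sample, ref, alt in rows[lo:hi]
            if samples is None or sample in samples
        ]


# Hypothetical records, for illustration only
store = SparseVariantStore()
store.ingest("S1", [("chr1", 100, "A", "G"), ("chr1", 500, "C", "T")])
store.ingest("S2", [("chr1", 100, "A", "G"), ("chr2", 42, "G", "A")])
hits = store.query("chr1", 50, 200)
```

A region query only touches the slice of rows inside the requested range, which is the property that makes this kind of layout convenient for repeated screening queries.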
Solving n + 1 with TileDB
The team knew that as the number of NBS-rWGS disorders grows over time, so too will the computational complexity of population genomics itself. A central part of that complexity is the n + 1 problem: when a new sample introduces a variant that is new to the cohort, researchers would prefer not to reexamine every genomic VCF (gVCF) to determine whether the other subjects are in fact reference at that position or simply have no read coverage there.
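To make the n + 1 problem concrete, the sketch below keeps each sample's variant sites and reference blocks in a queryable structure, so that a newly observed site can be classified for every existing sample without rereading any gVCF file. This is an illustrative sketch under stated assumptions, not RCIGM's or TileDB's implementation: the names `SampleCoverage`, `add_sample`, and `site_status`, and all coordinates, are hypothetical.

```python
import bisect
from dataclasses import dataclass, field


@dataclass
class SampleCoverage:
    """One sample's gVCF content reduced to variant sites and reference blocks."""
    # (contig, pos) -> variant description
    variants: dict = field(default_factory=dict)
    # contig -> sorted, non-overlapping (start, end) blocks, inclusive
    ref_blocks: dict = field(default_factory=dict)

    def status(self, contig, pos):
        """Classify a position as 'variant', 'reference', or 'no_coverage'."""
        if (contig, pos) in self.variants:
            return "variant"
        blocks = self.ref_blocks.get(contig, [])
        # Index of the last block whose start is <= pos, or -1 if none
        i = bisect.bisect_right(blocks, (pos, float("inf"))) - 1
        if i >= 0 and blocks[i][0] <= pos <= blocks[i][1]:
            return "reference"
        return "no_coverage"


def add_sample(cohort, name, sample):
    """The 'n + 1' step: append the new sample, leave existing ones untouched."""
    cohort[name] = sample


def site_status(cohort, contig, pos):
    """Classify one site for every sample using only the stored ranges."""
    return {name: s.status(contig, pos) for name, s in cohort.items()}


# Hypothetical cohort: S3 arrives last and introduces a new site at chr1:500
cohort = {}
add_sample(cohort, "S1", SampleCoverage(ref_blocks={"chr1": [(1, 1000)]}))
add_sample(cohort, "S2", SampleCoverage(ref_blocks={"chr1": [(1, 400), (600, 1000)]}))
add_sample(cohort, "S3", SampleCoverage(
    variants={("chr1", 500): "C>T"},
    ref_blocks={"chr1": [(1, 499), (501, 1000)]},
))
row = site_status(cohort, "chr1", 500)
```

In this toy cohort, S1 is covered and matches the reference at the new site, S2 has a coverage gap there, and S3 carries the variant. The distinction between the first two cases is exactly what would otherwise require going back to the original gVCFs.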
The typical data engineering solution to this problem would be periodic batch processing jobs. However, each human genome closely overlaps with reference assemblies (99.8% similarity, or ~5 million unique base pairs among ~3 billion genomic positions), and the sparse nature of this data makes batch processing tedious and expensive. Because TileDB natively represents sparse data on cloud object storage, neatly compressed and without bloated filler values, Dr. Kingsmore and his collaborators built the data management system for their NBS-rWGS program around sparse TileDB arrays and evaluated the cost reduction of n + 1 sample ingestion using a c6g.xlarge Amazon EC2 instance. Here are the highlights:
Shortening the diagnostic odyssey
Reducing the diagnostic burden on clinicians was the ultimate goal. Incorporating a wide range of annotation data was a key component, particularly through TileDB's support for Fabric GEM™ (an AI-driven genetic diagnosis tool). Faster data access and more efficient n + 1 computations using TileDB and AWS were significant benefits that contributed to automated and accurate genetic diagnoses, which are critical to clinicians in the NICU who may not have genomics expertise.
In this dynamic environment, TileDB is positioned to allow clinicians to revisit the pediatric disease landscape as four key sources of information grow:
These improvements are currently supporting another effort led by RCIGM, Genome-to-Treatment (GTRx), an automated system for genetic diagnosis and acute management guidance. By speeding up NBS-rWGS data analysis and feeding results into GTRx, RCIGM and its collaborators are working to improve clinical outcomes worldwide.
2023 Update
TileDB for population genomics
A year since publication in spring 2022, here's an update on TileDB's role: