Customer Paper

TileDB supports expanded newborn screening and genetic diagnosis program for Rady Children’s Hospital
Rady Children’s Hospital and Rady Children’s Institute for Genomic Medicine (RCIGM) in San Diego are at the forefront of applying rapid whole-genome sequencing (rWGS) to newborn screening (NBS). In a 2022 paper published in The American Journal of Human Genetics, lead author Dr. Stephen Kingsmore, President and CEO of RCIGM, describes the methods used to design a new, faster and more comprehensive form of NBS built on rWGS technology (NBS-rWGS). The experiments involved retrospective analysis of large amounts of genomic variant data, which was managed by TileDB for fast and cost-efficient access.




Clinical context

For the past 20 years, NBS has relied on mass spectrometry (MS) for biochemical analysis of dried blood spot samples, which are collected by pricking a baby's heel within the first few days of life. While NBS-MS is a mature and well-understood process, it detects only 37 core genetic disorders (plus a secondary list of 26 as of February 2023), according to the Recommended Uniform Screening Panel from the HRSA, a U.S. federal agency.

The team at RCIGM knew from clinical experience that rWGS is capable of identifying a much wider range of genetic disorders across a much wider population of children. They assembled a group of experts to identify 388 medically actionable genetic diseases to include in an expanded NBS-rWGS effort. With a wider range of genetic disorders and reasonable clinical testing costs on the horizon, the next step was to evaluate the approach using historical data.

Testing TileDB queries

To gauge the accuracy of the expanded screening panel, the team assembled a test dataset of known genetic diagnoses. RCIGM collaborated with a range of clinical and biotechnology industry experts to evaluate the false-positive rate within a cohort of "4,376 critically ill children and their parents who received rWGS at RCIGM for diagnosis of suspected genetic disorders". The cohort's Variant Call Format (VCF) samples were ingested into a 3-dimensional TileDB array on Amazon S3 using the open-source TileDB-VCF library, making this data analysis-ready on cloud storage. TileDB queries were refined until they yielded an acceptable false-positive rate. To ensure statistical significance, the team evaluated false positives against 454,707 whole-exome sequences from the UK Biobank, bringing the rate to 0.27%.
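
To make the idea of a 3-dimensional sparse array concrete, here is a minimal pure-Python sketch. It is not the TileDB-VCF API. It assumes the three dimensions are contig, position, and sample, and it stores only the cells where a sample actually has a variant record. The class name SparseVariantStore, its methods, and the sample data are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical illustration only: a toy in-memory stand-in for a sparse
# 3-D array indexed by (contig, position, sample). Not the TileDB-VCF API.
Cell = Tuple[str, int, str]


class SparseVariantStore:
    def __init__(self) -> None:
        # Only non-empty cells are stored; absent coordinates take no space.
        self._cells: Dict[Cell, str] = {}
        # Secondary index so region queries only scan one contig's cells.
        self._by_contig: Dict[str, List[Cell]] = defaultdict(list)

    def ingest(self, sample: str, records: List[Tuple[str, int, str]]) -> None:
        """Add one sample's (contig, position, genotype) records."""
        for contig, pos, genotype in records:
            key = (contig, pos, sample)
            if key not in self._cells:
                self._by_contig[contig].append(key)
            self._cells[key] = genotype

    def query(self, contig: str, start: int, end: int) -> List[Tuple[int, str, str]]:
        """Return sorted (position, sample, genotype) for cells in [start, end]."""
        hits = [
            (pos, sample, self._cells[(c, pos, sample)])
            for (c, pos, sample) in self._by_contig.get(contig, [])
            if start <= pos <= end
        ]
        return sorted(hits)


# Made-up sample data for illustration.
store = SparseVariantStore()
store.ingest("child_01", [("chr1", 1000, "0/1"), ("chr2", 500, "1/1")])
store.ingest("child_02", [("chr1", 1000, "0/1"), ("chr1", 9000, "0/1")])
region_hits = store.query("chr1", 900, 1100)
```

A region query touches only the cells that exist, which is the property that makes slicing by genomic range and by sample practical at cohort scale.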

Solving n + 1 with TileDB

The team knew that as the number of NBS-rWGS disorders grows over time, so too will the computational complexity of population genomics itself. A central part of that complexity is the n + 1 problem: when a new variant is introduced into the cohort, researchers would prefer not to reexamine every Genomic VCF (gVCF) to determine whether the other subjects are in fact reference or simply have no read coverage at that position.
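
The distinction between "reference" and "no read coverage" can be sketched in a few lines. This is a toy illustration, not how TileDB implements it. It assumes each sample's gVCF-style reference blocks are kept as inclusive (start, end) ranges next to its variant calls, so that classifying existing samples at a newly observed position is a lookup rather than a reprocessing job. All names and data below are hypothetical.

```python
from typing import Dict

# Hypothetical toy data: per-sample gVCF-style records on a single contig.
# "ref_blocks" are inclusive (start, end) ranges confidently called as
# reference; "variants" maps position -> genotype.
cohort: Dict[str, dict] = {
    "sample_A": {"ref_blocks": [(1, 4999), (5001, 9000)], "variants": {5000: "0/1"}},
    "sample_B": {"ref_blocks": [(1, 9000)], "variants": {}},
    "sample_C": {"ref_blocks": [(1, 3000)], "variants": {}},  # no reads past 3000
}


def classify(sample: dict, pos: int) -> str:
    """Label a sample at one position as variant, reference, or no coverage."""
    if pos in sample["variants"]:
        return "variant"
    if any(start <= pos <= end for start, end in sample["ref_blocks"]):
        return "reference"
    return "no_coverage"


def status_at(samples: Dict[str, dict], pos: int) -> Dict[str, str]:
    """Classify every sample in the cohort at a single position."""
    return {name: classify(rec, pos) for name, rec in samples.items()}


# Suppose a newly added sample introduces a variant at position 5000:
status = status_at(cohort, 5000)
```

Because the reference ranges are already stored per sample, adding sample n + 1 does not require reopening the other n files.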

The typical data engineering solution to this problem would be periodic batch processing jobs; however, since each human genome closely overlaps with reference assemblies (99.8% similarity, or ~5 million unique base pairs among ~3 billion genomic positions), the sparse nature of this data makes batch processing tedious and expensive. Because TileDB natively represents sparse data on cloud object storage (neatly compressed and without bloated filler values), Dr. Kingsmore and his collaborators built the data management system for their NBS-rWGS program around sparse TileDB arrays and evaluated the cost reduction of n + 1 sample ingestion using a c6g.xlarge Amazon EC2 instance. Here are the highlights:

  • TileDB ingestion from S3 cost $0.06, versus $2.18 with traditional file-based approaches on the same EC2 instance.
  • The time to add a new sample to an existing dataset and compute common variant statistics across the entire population was reduced to ~22 minutes.
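
As a quick sanity check on the figures quoted above, the following sketch recomputes the similarity percentage and the ingestion cost ratio. The inputs are the approximate values stated in the text, so the results are approximate too; the variable names are illustrative.

```python
# Approximate figures quoted in the text above.
unique_base_pairs = 5_000_000        # ~5 million
genomic_positions = 3_000_000_000    # ~3 billion
tiledb_cost_usd = 0.06
file_based_cost_usd = 2.18

# Fraction of positions that differ from the reference assembly.
variant_fraction = unique_base_pairs / genomic_positions
similarity_pct = (1 - variant_fraction) * 100

# How many times cheaper the TileDB ingestion was in this comparison.
cost_ratio = file_based_cost_usd / tiledb_cost_usd

print(f"similarity: {similarity_pct:.1f}%")
print(f"cost ratio: {cost_ratio:.0f}x")
```

The numbers are consistent with one another: fewer than 0.2% of positions carry a difference, and the quoted costs differ by a factor of roughly 36.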

Shortening the diagnostic odyssey

Reducing the diagnostic burden on clinicians was the ultimate goal. Incorporating a wide range of annotation data, particularly through TileDB's support for Fabric GEM™, an AI-driven genetic diagnosis tool, was a key component. Faster data access and more efficient n + 1 computations using TileDB and AWS were significant benefits that contributed to automated and accurate genetic diagnoses, critical to clinicians in the NICU who may not have genomics expertise.

In this dynamic environment, TileDB is positioned to allow clinicians to revisit the pediatric disease landscape as four key sources of information grow:

  • Biobank-scale population frequencies and associated phenotypes
  • Patient genomic variant databases
  • Curated variant annotation databases
  • Interventions

These improvements are currently supporting another effort led by RCIGM, Genome-to-Treatment (GTRx), an automated system for genetic diagnosis and acute management guidance. By speeding up NBS-rWGS data analysis and feeding results into GTRx, RCIGM and its collaborators are working to improve clinical outcomes worldwide.

2023 Update

TileDB for population genomics

A year since publication in spring 2022, here's an update on TileDB's role:

  • TileDB manages a group of arrays totaling ~13 TB.
  • RCIGM achieved a 7-hour clinical turnaround time, within which TileDB loads new samples in minutes and returns queries in seconds.
  • Optimized allele frequency calculations and other TileDB improvements drove query costs down further, to $0.03.
  • Working to move UKBB WGS storage to TileDB for faster iterative analysis.
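
For readers unfamiliar with the allele frequency calculation mentioned in the list above, here is a minimal sketch of the textbook definition: alternate allele count divided by the number of called alleles. It is not TileDB's optimized implementation. It assumes VCF-style genotype strings and treats any non-reference allele as alternate; the function name is hypothetical.

```python
from typing import Iterable


def allele_frequency(genotypes: Iterable[str]) -> float:
    """Alternate allele frequency from VCF-style genotype strings.

    Accepts unphased ("0/1") and phased ("0|1") genotypes. Missing
    alleles (".") are excluded from the denominator. Any allele other
    than "0" is counted as alternate (a simplifying assumption).
    """
    alt_count = 0
    called = 0
    for gt in genotypes:
        for allele in gt.replace("|", "/").split("/"):
            if allele == ".":
                continue  # no call: contributes to neither count
            called += 1
            if allele != "0":
                alt_count += 1
    return alt_count / called if called else 0.0


# Three called diploid samples and one missing genotype (made-up data).
af = allele_frequency(["0/0", "0/1", "1/1", "./."])
```

Note that the missing genotype lowers the denominator rather than counting as reference, which is why distinguishing "reference" from "no coverage" matters for population statistics.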
Watch the video: Dr. Stephen Kingsmore builds a compelling case for TileDB as the only solution that can handle PBs of genomic data and nightly analysis to impact clinical analysis in the NICU.
Dr. Stephen Kingsmore
President/CEO, Rady Children's Institute for Genomic Medicine
