Jun 08, 2023

TileDB 101: Population Genomics

Genomics
17 min read

Jeremy Leipzig
Senior Software Engineer

In this article I will cover the very basics of TileDB’s support for variant (VCF) data. TileDB is a database system based on (dense and sparse) multi-dimensional arrays. This data structure is ideal for handling population-level VCF data: modeling it as a 3D sparse array, indexed on contig, position and sample, enables rapid slicing across those fields. For that reason, we have developed a dedicated open-source library called TileDB-VCF on top of the (also open-source) TileDB Embedded array engine, which allows you to easily and efficiently ingest and query VCF data at large scale.
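To make that data model concrete, here is a toy pure-Python sketch (illustrative only, not TileDB code) of how indexing records by (contig, position, sample) enables rapid slicing; the records and coordinates below are made up for illustration:

```python
# Toy sketch: a 3D sparse "array" as a dict keyed by (contig, pos, sample).
# TileDB's real engine stores compressed, indexed fragments on disk or
# object storage, but the slicing semantics are analogous.
records = {
    ("chr21", 5030025, "HG00096"): {"alleles": ("A", "G"), "gt": (0, 1)},
    ("chr21", 5030025, "HG00097"): {"alleles": ("A", "G"), "gt": (1, 1)},
    ("chr21", 5030087, "HG00096"): {"alleles": ("C", "T"), "gt": (0, 0)},
}

def slice_region(records, contig, start, end, samples=None):
    """Return all records in [start, end] on `contig`, optionally
    restricted to a set of samples."""
    return {
        key: val
        for key, val in records.items()
        if key[0] == contig
        and start <= key[1] <= end
        and (samples is None or key[2] in samples)
    }

hits = slice_region(records, "chr21", 5030000, 5030050, samples={"HG00097"})
```

The point is that a query touching one region and a few samples only needs to visit the matching coordinates, not scan every record.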

This is an introductory article on TileDB-VCF. I will focus on simple ingestion, querying and updating of variant data, both locally and on the cloud. I will leave the internal mechanics and very large-scale experiments for future articles.

You will be able to run all the code examples locally on your laptop. For convenience, you can also download this Jupyter notebook or launch it directly on TileDB Cloud (sign up and we will grant you free credits).

Setup

To run the code you will need the following installations:

# Create a new conda environment with the following settings,
# if you are on an M1 Mac
CONDA_SUBDIR=osx-64 conda create -n tiledb-vcf "python<3.10"
conda activate tiledb-vcf
conda config --env --set subdir osx-64

# Install TileDB-Py and TileDB-VCF, along with other useful libraries
conda install -c conda-forge -c bioconda -c tiledb tiledb-py tiledbvcf-py pandas pyarrow numpy

# Install TileDB Embedded SQL
conda install -c conda-forge libtiledb-sql-py 

# Install the TileDB Cloud client to be able to work with TileDB Cloud
pip install tiledb-cloud

Confirm that the libraries were installed correctly:

import tiledb
import tiledbvcf
import tiledb.cloud
import tiledb.cloud.groups
import tiledb.sql
import pandas as pd
import numpy as np

Ingestion

We will use samples from the latest DRAGEN 3.5.7b re-analysis of the 1000 Genomes dataset.

# Prepare the appropriate URI paths
vcf_bucket = "s3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen"
samples_to_ingest = ["HG00096_chr21.gvcf.gz",
                     "HG00097_chr21.gvcf.gz", 
                     "HG00099_chr21.gvcf.gz", 
                     "HG00100_chr21.gvcf.gz", 
                     "HG00101_chr21.gvcf.gz"]
sample_uris = [f"{vcf_bucket}/{s}" for s in samples_to_ingest]

# Display the URIs of the samples to be ingested
sample_uris

The URIs of the samples to be ingested should look like this:

['s3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen/HG00096_chr21.gvcf.gz',
 's3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen/HG00097_chr21.gvcf.gz',
 's3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen/HG00099_chr21.gvcf.gz',
 's3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen/HG00100_chr21.gvcf.gz',
 's3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kg-dragen/HG00101_chr21.gvcf.gz']

To ingest these samples locally, run the following:

# Choose a dataset name
vcf_uri = "my_vcf_dataset"

# Open a VCF dataset in write mode
ds = tiledbvcf.Dataset(uri=vcf_uri, mode="w")

# Create an empty VCF dataset
# Also enable the generation of some variant stats upon ingestion
ds.create_dataset(enable_variant_stats=True, enable_allele_count=True) 

# Ingest the selected samples
ds.ingest_samples(sample_uris = sample_uris)

This process ingests the VCF data directly from S3 into a local VCF dataset, without needing to download the source VCF files first. The ingestion should take about a minute on a laptop; the time is dominated by S3 throughput, plus the parsing overhead of htslib, which TileDB-VCF uses internally to read the VCF files.

We are all set. Here is what the new VCF dataset looks like on disk:

!tree {vcf_uri}
my_vcf_dataset
├── __group
│   ├── __1686149252374_1686149252374_05a93d0ca2744178b04e1685a560aa9a_2
│   ├── __1686149252383_1686149252383_441be9ec9e914b6a8435c3d9c33e571b_2
│   ├── __1686149252386_1686149252386_f2b3d3664cb4443bb857042b498d50ff_2
│   └── __1686149252410_1686149252410_19ebbef3b28b458eb1a0c340d6e35615_2
├── __meta
│   └── __1686149252365_1686149252365_f635187a902d491081c711ac55955ccb
├── __tiledb_group.tdb
├── allele_count
│   ├── __commits
│   │   └── __1686149313569_1686149313569_8f0cc947b69445b5ac8c01d9454481e4_18.wrt
│   ├── __fragment_meta
│   ├── __fragments
│   │   └── __1686149313569_1686149313569_8f0cc947b69445b5ac8c01d9454481e4_18
│   │       ├── __fragment_metadata.tdb
│   │       ├── a0.tdb
│   │       ├── a0_var.tdb
│   │       ├── a1.tdb
│   │       ├── a1_var.tdb
│   │       ├── a2.tdb
│   │       ├── a2_var.tdb
│   │       ├── a3.tdb
│   │       ├── a3_var.tdb
│   │       ├── a4.tdb
│   │       ├── d0.tdb
│   │       ├── d0_var.tdb
│   │       └── d1.tdb
│   ├── __labels
│   ├── __meta
│   │   ├── __1686149252385_1686149252385_0805633cb821438aadfd9a26af0785fb
│   │   └── __1686149256344_1686149256344_80c9e93a0f174891a256fc620a17e2e0
│   └── __schema
│       └── __1686149252384_1686149252384_e994ef8de61c4df98329182150b59138
├── data
│   ├── __commits
│   │   ├── __1686149313536_1686149313536_56646535e0534e28a2dd375ac38b92e0_18.wrt
│   │   └── __1686149340288_1686149340288_97cc95472c724277828b9396ef9c1e59_18.wrt
│   ├── __fragment_meta
│   ├── __fragments
│   │   ├── __1686149313536_1686149313536_56646535e0534e28a2dd375ac38b92e0_18
│   │   │   ├── __fragment_metadata.tdb
│   │   │   ├── a0.tdb
│   │   │   ├── a1.tdb
│   │   │   ├── a2.tdb
│   │   │   ├── a3.tdb
│   │   │   ├── a3_var.tdb
│   │   │   ├── a4.tdb
│   │   │   ├── a4_var.tdb
│   │   │   ├── a5.tdb
│   │   │   ├── a5_var.tdb
│   │   │   ├── a6.tdb
│   │   │   ├── a6_var.tdb
│   │   │   ├── a7.tdb
│   │   │   ├── a7_var.tdb
│   │   │   ├── a8.tdb
│   │   │   ├── a8_var.tdb
│   │   │   ├── d0.tdb
│   │   │   ├── d0_var.tdb
│   │   │   ├── d1.tdb
│   │   │   ├── d2.tdb
│   │   │   └── d2_var.tdb
│   │   └── __1686149340288_1686149340288_97cc95472c724277828b9396ef9c1e59_18
│   │       ├── __fragment_metadata.tdb
│   │       ├── a0.tdb
│   │       ├── a1.tdb
│   │       ├── a2.tdb
│   │       ├── a3.tdb
│   │       ├── a3_var.tdb
│   │       ├── a4.tdb
│   │       ├── a4_var.tdb
│   │       ├── a5.tdb
│   │       ├── a5_var.tdb
│   │       ├── a6.tdb
│   │       ├── a6_var.tdb
│   │       ├── a7.tdb
│   │       ├── a7_var.tdb
│   │       ├── a8.tdb
│   │       ├── a8_var.tdb
│   │       ├── d0.tdb
│   │       ├── d0_var.tdb
│   │       ├── d1.tdb
│   │       ├── d2.tdb
│   │       └── d2_var.tdb
│   ├── __labels
│   ├── __meta
│   │   └── __1686149252411_1686149252411_2103281736bd42c9820106f6b673fc9b
│   └── __schema
│       └── __1686149252375_1686149252375_f5835ff79aee45aba330483aba26b880
├── metadata
│   ├── __group
│   ├── __meta
│   ├── __tiledb_group.tdb
│   └── vcf_headers
│       ├── __commits
│       │   └── __1686149305684_1686149305684_69b0f1b94b5a4c819b459bb50fa28b3b_18.wrt
│       ├── __fragment_meta
│       ├── __fragments
│       │   └── __1686149305684_1686149305684_69b0f1b94b5a4c819b459bb50fa28b3b_18
│       │       ├── __fragment_metadata.tdb
│       │       ├── a0.tdb
│       │       ├── a0_var.tdb
│       │       ├── d0.tdb
│       │       └── d0_var.tdb
│       ├── __labels
│       ├── __meta
│       └── __schema
│           └── __1686149252373_1686149252373_8e6bd59f23df437fac72577da3592f72
└── variant_stats
    ├── __commits
    │   └── __1686149313580_1686149313580_395e3bac84464eb4a6f519157cbcfb22_18.wrt
    ├── __fragment_meta
    ├── __fragments
    │   └── __1686149313580_1686149313580_395e3bac84464eb4a6f519157cbcfb22_18
    │       ├── __fragment_metadata.tdb
    │       ├── a0.tdb
    │       ├── a0_var.tdb
    │       ├── a1.tdb
    │       ├── a2.tdb
    │       ├── a3.tdb
    │       ├── d0.tdb
    │       ├── d0_var.tdb
    │       └── d1.tdb
    ├── __labels
    ├── __meta
    │   ├── __1686149252388_1686149252388_1e4b31349f43461bae6e228ff9cfd81c
    │   └── __1686149256344_1686149256344_c70a8f5534a04816b2f8fa047d1468a2
    └── __schema
        └── __1686149252387_1686149252387_bc56e1c700744c76b80af34d4d41cd23

This is a TileDB “group” that contains some TileDB “arrays”, but do not worry about that for now. All you need to know is that TileDB ingested the data in a lossless, columnar, compressed, and indexed manner that allows you to rapidly run the following:

  • Queries slicing on samples, contigs and/or positions.
  • Queries subselecting the various VCF fields (e.g., ALT, REF).

As we’ll see later, you can export back to the original VCF format, or even to a combined VCF format!

Next, let’s inspect the data we ingested.

Reading

To access any information from the VCF dataset, you first need to open it in read mode:

ds = tiledbvcf.Dataset(uri=vcf_uri, mode="r")

Let’s see what samples our dataset contains:

ds.samples()
['HG00096', 'HG00097', 'HG00099', 'HG00100', 'HG00101']

These are the same as the ones we ingested.

Next, let’s see what “attributes” (i.e., VCF fields) we can access in later queries.

ds.attributes()
['alleles',
 'contig',
 'filters',
 'fmt',
 'fmt_AD',
 'fmt_AF',
 'fmt_DP',
 'fmt_F1R2',
 'fmt_F2R1',
 'fmt_GP',
 'fmt_GQ',
 'fmt_GT',
 'fmt_ICNT',
 'fmt_MB',
 'fmt_MIN_DP',
 'fmt_PL',
 'fmt_PRI',
 'fmt_PS',
 'fmt_SB',
 'fmt_SPL',
 'fmt_SQ',
 'id',
 'info',
 'info_DB',
 'info_DP',
 'info_END',
 'info_FS',
 'info_FractionInformativeReads',
 'info_LOD',
 'info_MQ',
 'info_MQRankSum',
 'info_QD',
 'info_R2_5P_bias',
 'info_ReadPosRankSum',
 'info_SOR',
 'pos_end',
 'pos_start',
 'qual',
 'query_bed_end',
 'query_bed_line',
 'query_bed_start',
 'sample_name']

We can read data with a very simple command, which allows us to select samples, attributes and genomic regions to slice on:

df = ds.read(
    regions = ["chr21:8220186-8405573"],
    samples = ['HG00096', 'HG00097'],
    attrs   = ["sample_name", "contig", "pos_start", "pos_end", "alleles", "fmt_GT"],
)
df

The output is a pandas dataframe.

Once we have efficiently queried the data of interest, we can manipulate it using pandas in a variety of ways, for example:

# Convert the GT tuples into strings (e.g., (0, 1) -> "0/1"),
# marking missing calls ("-1/-1") as "./."
df["fmt_GT"] = df["fmt_GT"].apply(lambda x: "/".join(map(str, x)))
df["fmt_GT"] = df["fmt_GT"].apply(lambda x: "./." if x == "-1/-1" else x)

# Pivot so that each sample becomes a column displaying its genotype
df.pivot(index="pos_start", columns="sample_name", values="fmt_GT")
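As another example of downstream manipulation, here is a self-contained sketch (with a toy dataframe standing in for the `ds.read()` output; the values are made up) that keeps only positions where at least one sample carries a non-reference allele:

```python
import pandas as pd

# Toy stand-in for the dataframe returned by ds.read(), with GT already
# converted to strings as in the pivot example above
df = pd.DataFrame({
    "sample_name": ["HG00096", "HG00097", "HG00096", "HG00097"],
    "pos_start":   [8220186, 8220186, 8220300, 8220300],
    "fmt_GT":      ["0/0", "0/1", "0/0", "0/0"],
})

# A genotype is non-reference if any called allele index is > 0
def is_variant(gt):
    return any(a not in ("0", ".") for a in gt.split("/"))

# Keep all rows at positions where at least one sample is non-reference
variant_positions = df[df["fmt_GT"].apply(is_variant)]["pos_start"].unique()
df_variants = df[df["pos_start"].isin(variant_positions)]
```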

Variant stats

When we created the empty VCF dataset (before ingesting the samples), we instructed TileDB-VCF to generate some variant statistics when ingesting new samples. Those statistics are found in two arrays (think of them as dataframes for simplicity) stored inside the VCF dataset folder, called allele_count and variant_stats. These provide allele count and internal allele frequency information that can be used for filtering variants or generating summary reports.

The allele_count array can be used to provide summary statistics about variants in a TileDB variant store without having to scan individual sample records. This can be used for generating lists of variants to be annotated, for example.

You can access allele_count and variant_stats as follows:

# Retrieve the array URIs, so that we can access them directly using TileDB-Py.
# Note that those arrays are inside the VCF dataset, which is a TileDB "group".
allele_count_uri = tiledb.Group(vcf_uri)["allele_count"].uri
variant_stats_uri = tiledb.Group(vcf_uri)["variant_stats"].uri

# Query the allele_count array - the results are a pandas dataframe
A = tiledb.open(allele_count_uri)
A.df[:]

# Query the variant_stats array - the results are a pandas dataframe
B = tiledb.open(variant_stats_uri)
B.df[:]

The fields in the variant_stats array power internal allele frequency (IAF) calculations: the relative abundance of specific mutations at each position in a TileDB variant store. In conjunction with global minor allele frequency information, like that provided by gnomAD, IAF can be a valuable tool for variant prioritization.
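To illustrate the idea, here is a hedged sketch of computing IAF as allele count over allele number, using a toy dataframe standing in for the variant_stats output above; the `ac`/`an` column names and all values are assumptions for illustration, so verify them against your dataset's actual schema:

```python
import pandas as pd

# Toy stand-in for the variant_stats dataframe (B.df[:] above)
stats = pd.DataFrame({
    "pos":    [5030025, 5030025, 5030087],
    "allele": ["G", "A", "T"],
    "ac":     [30, 69, 1],    # allele count (assumed column name)
    "an":     [100, 100, 100] # allele number (assumed column name)
})

# Internal allele frequency: relative abundance of each allele at a position
stats["iaf"] = stats["ac"] / stats["an"]

# Flag rare variants, e.g. IAF below 5%, as candidates for prioritization
rare = stats[stats["iaf"] < 0.05]
```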

You can even query these arrays using TileDB’s embedded SQL support:

# Ignore any warnings
db = tiledb.sql.connect()
pd.read_sql(sql=f"select * from `{variant_stats_uri}` where pos >= 5030025 and pos <= 5030087", con=db)

If you have issues with long URIs being passed into the read_sql command, you can instead register the array under a short table name:

db = tiledb.sql.connect(init_command="create table <small_name> engine=mytile uri='<long_URI>'")

pd.read_sql(sql="select * from <small_name> where pos >= 5030025 and pos <= 5030087", con=db)

You can read the TileDB 101: Dataframes blog post to learn more about how to query dataframes using TileDB.

Export back to VCF

TileDB-VCF ingests VCF samples in a lossless manner, and allows you to export the data back into the original VCF format, or into a combined VCF format.

To export a sample into its original VCF file, you can run the following:

ds.export(
    regions = ["chr21:8220186-8405573"],
    samples = ['HG00101'],
    output_format = 'v',
    output_dir = './'
)

Note that you can export the whole file, or any genomic region, which is very convenient.

To export multiple samples into a CombinedVCF file, you can run:

ds.export(
    regions = ['chr21:8220186-8405573'],
    samples = ds.samples()[0:2],
    merge = True,
    output_format = 'v',
    output_path = './combined.vcf'
)

You can confirm that the data has been exported correctly using bcftools:

!bcftools view --no-header HG00101.vcf | head -10
!bcftools view --no-header combined.vcf | head -10

Adding new samples

One of the most important features of TileDB-VCF is that it solves the so-called N+1 problem. Simply put, this means you can add new samples to an existing VCF dataset without suffering an explosion in dataset size or update time. TileDB-VCF stores all sample data in its original “sparse” form, without injecting any redundant information (as happens with the combined VCF format). Moreover, it works excellently on cloud object stores due to its use of immutable writes, which means the cost of ingesting a new sample is independent of the samples already in the dataset. Finally, TileDB-VCF inherits TileDB’s native versioning and time traveling support (but we will cover those features in a future article).

Suppose we’d like to add five more samples to the VCF dataset we have already created.

samples_to_add = ["HG00102_chr21.gvcf.gz", 
                  "HG00103_chr21.gvcf.gz", 
                  "HG00105_chr21.gvcf.gz", 
                  "HG00106_chr21.gvcf.gz", 
                  "HG00107_chr21.gvcf.gz"]
new_sample_uris = [f"{vcf_bucket}/{s}" for s in samples_to_add]
new_sample_uris

All we need to do is open the VCF dataset in write mode and ingest these samples, similar to what we did the first time (but this time we will omit creating an empty dataset, as this already exists).

ds = tiledbvcf.Dataset(uri=vcf_uri, mode="w")
ds.ingest_samples(sample_uris = new_sample_uris)

Like before, we can now open the VCF dataset in read mode and access the data.

ds = tiledbvcf.Dataset(uri=vcf_uri, mode="r")

# Now there are 10 samples
ds.samples()
['HG00096',
 'HG00097',
 'HG00099',
 'HG00100',
 'HG00101',
 'HG00102',
 'HG00103',
 'HG00105',
 'HG00106',
 'HG00107']

# Read like we did before
df = ds.read(
    regions = ["chr21:8220186-8405573"],
    attrs   = ["sample_name", "contig", "pos_start", "pos_end", "alleles", "fmt_GT"],
)
df

Working with cloud object stores

TileDB-VCF inherits TileDB’s cloud-native properties and, hence, works out of the box on AWS S3, Google Cloud Storage, Azure Blob Storage and pretty much any other object store that has S3 API compatibility (e.g., minio).

To use TileDB-VCF with S3 you need only two minor changes:

  • Use an S3 URI for your VCF dataset name, instead of a local one
  • Configure your AWS S3 credentials

Here is an example:

# This is my VCF dataset S3 URI
vcf_s3_uri = "s3://tiledb-stavros/groups/my_vcf_dataset"

# This is my config with my AWS S3 credentials
s3_config = {
     "vfs.s3.aws_access_key_id": "<access_key_id>",
     "vfs.s3.aws_secret_access_key": "<secret_access_key>",
}
s3_cfg = tiledbvcf.ReadConfig(tiledb_config=s3_config)

# Pass the config when opening the VCF dataset
ds = tiledbvcf.Dataset(uri=vcf_s3_uri, mode="w", cfg=s3_cfg)

# Ingest samples as we did locally
ds.create_dataset(enable_variant_stats=True, enable_allele_count=True) 
ds.ingest_samples(sample_uris = sample_uris)

# Open the VCF dataset in read mode and pass the config
ds = tiledbvcf.Dataset(uri=vcf_s3_uri, mode="r", cfg=s3_cfg)

# Read in an identical manner to what we did locally
df = ds.read(
    regions = ["chr21:8220186-8405573"],
    samples = ['HG00096', 'HG00097'],
    attrs   = ["sample_name", "contig", "pos_start", "pos_end", "alleles", "fmt_GT"],
)
df

TileDB-VCF works similarly with other Cloud object stores. See TileDB 101: Cloud Object Storage for more information on that.
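As an illustration, here is a hedged config sketch for pointing TileDB at an S3-compatible store such as MinIO; the endpoint, scheme and addressing keys are standard TileDB config options, but the endpoint value and commented URI below are assumptions for illustration, so check them against the TileDB docs for your version:

```python
# Hedged sketch: TileDB config for an S3-compatible endpoint such as MinIO
minio_config = {
    "vfs.s3.endpoint_override": "localhost:9000",  # your MinIO endpoint
    "vfs.s3.scheme": "http",                       # MinIO often runs without TLS
    "vfs.s3.use_virtual_addressing": "false",      # path-style addressing
    "vfs.s3.aws_access_key_id": "<access_key_id>",
    "vfs.s3.aws_secret_access_key": "<secret_access_key>",
}

# Pass it exactly like the S3 example above:
# cfg = tiledbvcf.ReadConfig(tiledb_config=minio_config)
# ds = tiledbvcf.Dataset(uri="s3://my-bucket/my_vcf_dataset", mode="r", cfg=cfg)
```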

TileDB Cloud

TileDB Cloud is a full-fledged database platform that allows you to catalog, share, and analyze data of multiple modalities (including VCF) at a very large scale. Here I will cover only how you can register an already existing VCF dataset on S3 to TileDB Cloud (sign up and we will grant you free credits).

First you need to login as follows (this step is not needed if you are running a notebook inside TileDB Cloud):

tiledb.cloud.login(username="<username>", password="<password>")

Next, you need to register the VCF dataset as a group (recall that we mentioned that the VCF dataset is in fact a TileDB “group” - you can safely ignore the details for now):

tiledb.cloud.groups.register(vcf_s3_uri, name="my_vcf_dataset", namespace="stavros")

After this command, TileDB Cloud will have access to the VCF dataset and you will be able to see it in the list of your groups:

[Screenshots: the Groups section of TileDB Cloud, listing my_vcf_dataset and nyc_example, and the contents of my_vcf_dataset.]

From the TileDB Cloud console, you will now be able to share this dataset with other collaborators, see all activity, etc. (you can also do everything programmatically using a TileDB Cloud client, which is covered in other articles).

Similar to accessing the VCF dataset from S3, to access it from TileDB Cloud all you need to do is change the VCF dataset URI to point to tiledb:// and pass an appropriate config with your TileDB Cloud credentials.

# This is the TileDB Cloud URI of my VCF dataset
cloud_uri = "tiledb://stavros/my_vcf_dataset"

# This is the config I need to pass
cloud_config = tiledb.Config()
cloud_config["rest.username"] = "<username>"
cloud_config["rest.password"] = "<password>"
cloud_cfg = tiledbvcf.ReadConfig(tiledb_config=cloud_config)

# Open the VCF dataset in read mode passing the config
ds = tiledbvcf.Dataset(uri=cloud_uri, mode="r", cfg=cloud_cfg)

# Read like we did locally
df = ds.read(
    regions = ["chr21:8220186-8405573"],
    samples = ['HG00096', 'HG00097'],
    attrs   = ["sample_name", "contig", "pos_start", "pos_end", "alleles", "fmt_GT"],
)
df

Stay tuned

That’s all you need to know to get jumpstarted on TileDB-VCF! We will follow up with a series of articles covering everything from large-scale ingestion and querying, to detailed benchmarks with popular public datasets, to the TileDB-VCF internal mechanics that make TileDB’s variant data support magical.
