In this article I will cover the basics of TileDB’s support for single-cell data. At its core, TileDB is a database system architected around multi-dimensional arrays. This powerful and flexible data structure lets us build special-purpose database systems for demanding use cases involving complex data, and single-cell sequencing is one of those use cases.
Our single-cell solution is the product of an amazing collaboration with the Chan Zuckerberg Initiative (CZI). Together, the TileDB and CZI teams have developed:
At a high level, TileDB-SOMA is:
This article covers the basic Python and R functionality of TileDB-SOMA. You can run all examples locally on your machine, or launch the following notebooks directly on TileDB Cloud for convenience (sign up and we’ll give you free credits to get started!):
At the end of the article, I briefly cover how you can scale your analysis and securely share your single-cell datasets and work on TileDB Cloud. Enjoy!
The dataset we’ll be using contains RNA data for 2,700 peripheral blood mononuclear cells (PBMC) from a healthy donor. The raw data was generated by 10X Genomics and is available here. The version of the dataset we’ll be using was processed with this scanpy notebook.
Note
We’re using a small and familiar dataset for the purpose of this tutorial. However, TileDB-SOMA has proven performant with massive datasets like the CZ CELLxGENE Discover Census, which comprises over 33 million cells and 60 thousand genes as of this writing.
The tiledbsoma Python package is available from both PyPI and Conda, and can be installed via pip or conda/mamba. Here, we’ll create a new Conda environment and install the packages we’ll need using mamba:
mamba create -n tiledbsoma-101 -c tiledb -c conda-forge tiledbsoma-py
mamba activate tiledbsoma-101
Let’s import the tiledbsoma Python package as well as a few other packages we’ll use in this tutorial:
import tiledbsoma
import tiledbsoma.io
import tiledb
import anndata as ad
import scanpy as sc
tiledbsoma.show_package_versions()
tiledbsoma.__version__ 1.4.0
TileDB-Py tiledb.version() (0, 22, 2)
TileDB core version 2.16.2
libtiledbsoma version() libtiledb=2.16.2
python version 3.10.12.final.0
OS version Darwin 22.5.0
The standard tiledb Python package provides access to TileDB’s virtual filesystem (VFS), which allows you to interact with data on local disk, S3, and TileDB Cloud using the same API. We’ll use it below to read data from an H5AD file stored in S3.
cfg = tiledb.Config({"vfs.s3.no_sign_request": True})
vfs = tiledb.VFS(config=cfg)
Using TileDB’s VFS we can read the H5AD directly from S3 and load it into memory using the AnnData package:
H5AD_URI = "s3://tiledb-inc-demo-data/singlecell/h5ad/pbmc3k_processed.h5ad"
with vfs.open(H5AD_URI) as h5ad:
adata = ad.read_h5ad(h5ad)
adata
AnnData object with n_obs × n_vars = 2638 × 1838
obs: 'n_genes', 'percent_mito', 'n_counts', 'louvain'
var: 'n_cells'
uns: 'draw_graph', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups'
obsm: 'X_draw_graph_fr', 'X_pca', 'X_tsne', 'X_umap'
varm: 'PCs'
obsp: 'connectivities', 'distances'
Inspecting the adata object, we can see that in addition to the expression data, cell-level annotations in obs, and feature-level annotations in var, it also contains analysis results in obsm, varm, obsp, and uns. All of these components can be ingested into a SOMA experiment by passing the anndata object to the tiledbsoma.io.from_anndata() function.
The experiment_uri argument is a URI that points to the location where the SOMA experiment will be created. In this case, we’re using a local directory, but it could just as easily be an S3 bucket or a TileDB Cloud URI.
tiledbsoma.io.from_anndata(
experiment_uri="soma-exp-pbmc3k",
measurement_name="RNA",
anndata=adata
)
'soma-exp-pbmc3k'
The file path "soma-exp-pbmc3k" now points to a local directory containing the SOMA experiment, which itself is a collection of TileDB groups and arrays, each containing a different component of the original dataset, organized according to the SOMA data model.
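In fact, because a SOMA experiment is just a TileDB group, you can peek at its on-disk layout with the generic tiledb package. Here’s a minimal sketch, assuming the experiment was written to the local path used above:
# List the members of the experiment's root group; each member is itself
# a TileDB group or array.
with tiledb.Group("soma-exp-pbmc3k") as root:
    for member in root:
        print(member.uri, member.type)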
We can open the new SOMA experiment in read mode to view its structure:
experiment = tiledbsoma.Experiment.open("soma-exp-pbmc3k")
experiment
<Experiment 'soma-exp-pbmc3k' (open for 'r') (2 items)
'ms': 'soma-exp-pbmc3k/ms' (unopened)
'obs': 'soma-exp-pbmc3k/obs' (unopened)>
Note that opening a SOMAExperiment (or any SOMA object) only returns a pointer to the object on disk. No data is actually loaded into memory until it’s requested.
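Because opening is lazy and cheap, a handy pattern (sketched below) is to open SOMA objects with a context manager so the handle is closed as soon as you’re done with it:
# Nothing is read here; the experiment and its members are just handles
# until data is explicitly requested.
with tiledbsoma.Experiment.open("soma-exp-pbmc3k") as exp:
    print(exp.obs)  # prints the DataFrame handle, not its contents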
The top-level of the experiment contains 2 elements: obs, a SOMA DataFrame containing the cell annotations, and ms, a SOMA Collection of the measurements (e.g., RNA) in the experiment.
We can access the obs array directly with:
experiment.obs
<DataFrame 'soma-exp-pbmc3k/obs' (open for 'r')>
Other elements are nested within the experiment according to the SOMA data model but can be accessed in a similar way. For example, feature-level annotations are stored in the var array, which is located at the top-level of each SOMA Measurement. For this dataset, we have a single measurement, RNA, whose var we can access with:
experiment.ms["RNA"].var
<DataFrame 'soma-exp-pbmc3k/ms/RNA/var' (open for 'r')>
All SOMA objects provide a read method for loading data into memory. Designed with large datasets in mind, these methods always return an iterator, allowing data to be loaded in chunks that TileDB intelligently sizes to fit the available memory; the results are efficiently materialized as Arrow Tables, leveraging zero-copy memory sharing where possible.
In this case, since we’re looking at a smaller dataset, we’re using the concat method to automatically concatenate the results into a single Arrow Table before converting to a Pandas DataFrame:
experiment.obs.read().concat().to_pandas()
soma_joinid obs_id ... n_counts louvain
0 0 AAACATACAACCAC-1 ... 2419.0 CD4 T cells
1 1 AAACATTGAGCTAC-1 ... 4903.0 B cells
2 2 AAACATTGATCAGC-1 ... 3147.0 CD4 T cells
3 3 AAACCGTGCTTCCG-1 ... 2639.0 CD14+ Monocytes
4 4 AAACCGTGTATGCG-1 ... 980.0 NK cells
... ... ... ... ... ...
2633 2633 TTTCGAACTCTCAT-1 ... 3459.0 CD14+ Monocytes
2634 2634 TTTCTACTGAGGCA-1 ... 3443.0 B cells
2635 2635 TTTCTACTTCCTCG-1 ... 1684.0 B cells
2636 2636 TTTGCATGAGAGGC-1 ... 1022.0 B cells
2637 2637 TTTGCATGCCTCAC-1 ... 1984.0 CD4 T cells
[2638 rows x 6 columns]
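For datasets that don’t fit comfortably in memory, you could skip concat() and consume the iterator chunk by chunk instead. A minimal sketch:
# Each chunk arrives as an Arrow Table sized by TileDB to fit in memory;
# process and discard chunks as you go.
total_rows = 0
for chunk in experiment.obs.read():
    total_rows += chunk.num_rows
print(total_rows)  # 2638 for this dataset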
One of the most useful features of SOMA is the ability to efficiently filter and select only the data necessary for your analysis without loading the entire dataset into memory first. The read methods offer several arguments to access a specific subset of data:
- coords to slice by coordinates from the underlying TileDB array’s indexed dimension(s)
- value_filter to filter by values using a query string
- column_names to select the specific columns that should be returned
In this example we’re loading in the first 100 records from obs with more than 2,000 detected reads, and only accessing two columns of interest.
experiment.obs.read(
coords=[slice(0, 99)],
value_filter="n_counts > 2000",
column_names=["obs_id", "n_counts"],
).concat().to_pandas()
obs_id n_counts
0 AAACATACAACCAC-1 2419.0
1 AAACATTGAGCTAC-1 4903.0
2 AAACATTGATCAGC-1 3147.0
3 AAACCGTGCTTCCG-1 2639.0
4 AAACGCACTGGTAC-1 2163.0
.. ... ...
68 AAGATGGAGATAAG-1 4019.0
69 AAGATTACAACCTG-1 2117.0
70 AAGATTACAGATCC-1 4058.0
71 AAGATTACCCGTTC-1 2762.0
72 AAGATTACCGCCTT-1 3062.0
[73 rows x 2 columns]
The real power of the SOMA API comes from the ability to slice and filter measurement data based on the cell- and feature-level annotations stored in the experiment. For datasets containing millions of cells, this means you can easily access expression values for cells within a specific cluster, or that meet a certain quality threshold, etc.
In this example we’re interested in cells assigned to a particular Louvain cluster where mitochondrial genes make up less than 2% of the total reads. We’re also only interested in genes detected in at least 100 cells.
query = experiment.axis_query(
measurement_name="RNA",
obs_query=tiledbsoma.AxisQuery(
value_filter="louvain == 'CD4 T cells' and percent_mito < 0.02",
),
var_query=tiledbsoma.AxisQuery(
value_filter="n_cells > 100",
),
)
The returned query object allows us to inspect the query results and selectively access the data we’re interested in. For example, we can see how many cells and genes were returned by the query:
{"cells": query.n_obs, "genes": query.n_vars}
{'cells': 695, 'genes': 1035}
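The query object can also hand back the filtered annotation tables themselves via its obs() and var() accessors. A short sketch, assuming they accept column_names the same way read() does:
# Cell-level annotations for the 695 selected cells
query.obs(column_names=["obs_id", "louvain"]).concat().to_pandas()
# Feature-level annotations for the 1,035 selected genes
query.var(column_names=["var_id", "n_cells"]).concat().to_pandas()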
We can also load the expression data into memory for the selected cells and genes as an Arrow Sparse Tensor.
query.X(layer_name="data").coos().concat()
<pyarrow.SparseCOOTensor>
type: float
shape: (2147483646, 2147483646)
Note
Note the shape of the returned tensor corresponds to the capacity of the underlying TileDB array. By default, SOMA creates arrays with plenty of room for adding new data down the road.
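If you’d rather work with a compact SciPy matrix shaped by the query itself, one possible approach (a sketch that relies on the query’s indexer to map soma_joinids to 0-based positions) is:
import scipy.sparse as sp
# Read the X layer as an Arrow table of (obs joinid, var joinid, value) triples.
tbl = query.X(layer_name="data").tables().concat()
# Re-index the joinids into 0-based row/column positions for this query.
obs_idx = query.indexer.by_obs(tbl["soma_dim_0"])
var_idx = query.indexer.by_var(tbl["soma_dim_1"])
x = sp.coo_matrix(
    (tbl["soma_data"].to_numpy(), (obs_idx, var_idx)),
    shape=(query.n_obs, query.n_vars),
)
x  # 695 x 1035 sparse matrix of expression values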
Importantly, we can also materialize all of the relevant query results as a new AnnData object:
pbmc3k = query.to_anndata(X_name="data")
pbmc3k
AnnData object with n_obs × n_vars = 695 × 1035
obs: 'soma_joinid', 'obs_id', 'n_genes', 'percent_mito', 'n_counts', 'louvain'
var: 'soma_joinid', 'var_id', 'n_cells'
And with this, we can now leverage the full suite of analysis and visualization methods provided by Scanpy.
sc.pl.violin(pbmc3k, ['n_counts', 'n_genes', 'percent_mito'], jitter=0.4, multi_panel=True)
As mentioned above, the SOMA API is designed to be language-agnostic, which means that the same SOMA objects can be accessed using the Python or R tiledbsoma packages. Now we’ll demonstrate how to access the same SOMA experiment we created above using the R package.
The tiledbsoma R package will be available on CRAN shortly. In the meantime, it’s available from both R-universe and Conda.
For simplicity’s sake we’ll continue using Conda for this tutorial. With the tiledbsoma-101 environment activated, we can install the required packages with the following command:
mamba install \
-c tiledb -c bioconda -c conda-forge \
r-tiledbsoma r-seurat bioconductor-singlecellexperiment
Let’s load the R packages we’ll be using.
library(tiledb)
library(tiledbsoma)
suppressPackageStartupMessages(library(Seurat))
show_package_versions()
tiledbsoma: 1.4.0
tiledb-r: 0.20.3
tiledb core: 2.16.2
libtiledbsoma: libtiledb=2.16.2
R: R version 4.2.3 (2023-03-15)
OS: macOS Big Sur ... 10.16
We can open the SOMA experiment in read mode to view its structure:
experiment <- SOMAExperimentOpen("soma-exp-pbmc3k")
experiment
<SOMAExperiment>
uri: soma-exp-pbmc3k
arrays: obs
groups: ms
We can access the obs array directly with:
experiment$obs
<SOMADataFrame>
uri: soma-exp-pbmc3k/obs
dimensions: soma_joinid
attributes: obs_id, n_genes, percent_mito, n_counts, louvain
Let’s access the var array within the experiment’s single RNA measurement:
experiment$ms$get("RNA")$var
<SOMADataFrame>
uri: soma-exp-pbmc3k/ms/RNA/var
dimensions: soma_joinid
attributes: var_id, n_cells
Just like in Python, all R SOMA objects provide a read method that returns an iterator for loading the data into memory in chunks. Here too we can call the concat() method to automatically concatenate the results into a single Arrow Table before converting to a data frame.
experiment$obs$read()$concat()$to_data_frame()
# A tibble: 2,638 × 6
   soma_joinid obs_id           n_genes percent_mito n_counts louvain
 1           0 AAACATACAACCAC-1     781      0.0302      2419 CD4 T cells
 2           1 AAACATTGAGCTAC-1    1352      0.0379      4903 B cells
 3           2 AAACATTGATCAGC-1    1131      0.00890     3147 CD4 T cells
 4           3 AAACCGTGCTTCCG-1     960      0.0174      2639 CD14+ Monocytes
 5           4 AAACCGTGTATGCG-1     522      0.0122       980 NK cells
 6           5 AAACGCACTGGTAC-1     782      0.0166      2163 CD8 T cells
 7           6 AAACGCTGACCAGT-1     783      0.0382      2175 CD8 T cells
 8           7 AAACGCTGGTTCTT-1     790      0.0310      2260 CD8 T cells
 9           8 AAACGCTGTAGCCA-1     533      0.0118      1275 CD4 T cells
10           9 AAACGCTGTTTCTG-1     550      0.0290      1103 FCGR3A+ Monocytes
# ℹ 2,628 more rows
We also have access to the same filtering capabilities in R:
experiment$obs$read(
coords = 0:99L,
value_filter = "n_counts > 2000",
column_names = c("obs_id", "n_counts")
)$concat()$to_data_frame()
# A tibble: 73 × 2
   obs_id           n_counts
 1 AAACATACAACCAC-1     2419
 2 AAACATTGAGCTAC-1     4903
 3 AAACATTGATCAGC-1     3147
 4 AAACCGTGCTTCCG-1     2639
 5 AAACGCACTGGTAC-1     2163
 6 AAACGCTGACCAGT-1     2175
 7 AAACGCTGGTTCTT-1     2260
 8 AAACTTGAAAAACG-1     3914
 9 AAACTTGATCCAGA-1     2388
10 AAAGAGACGAGATA-1     2410
# ℹ 63 more rows
Now we’ll construct our experiment-level query to filter the cells and genes according to the criteria detailed above:
query <- experiment$axis_query(
measurement_name = "RNA",
obs_query = SOMAAxisQuery$new(
value_filter = "louvain == 'CD4 T cells' & percent_mito < 0.02"
),
var_query = SOMAAxisQuery$new(
value_filter = "n_cells > 100"
)
)
And the returned R6 query object provides access to the same suite of methods for accessing data or inspecting the results:
c("cells" = query$n_obs, "genes" = query$n_vars)
cells genes
695 1035
From here you could use the to_sparse_matrix() method to easily load query results for any matrix-like data as a sparse dgTMatrix (from the Matrix package). At a minimum, you need to pass a collection (e.g., X or obsm) and layer (e.g., data). You can also populate the matrix dimension names by specifying which obs column contains the values to use for row names and which var column contains the values to use for column names.
mat <- query$to_sparse_matrix(
collection = "X",
layer_name = "data",
obs_index = "obs_id",
var_index = "var_id"
)
mat[1:10, 1:5]
10 x 5 sparse Matrix of class "dgTMatrix"
TNFRSF4 CPSF3L C1orf86 RER1 TNFRSF25
AAACATTGATCAGC-1 -0.3768875 -0.2950843 -0.5209721 1.3326473 -0.3093624
AAACGCTGTAGCCA-1 4.8617630 -0.2305487 2.8116059 -0.3710031 -0.2219476
AAACTTGATCCAGA-1 -0.3280211 -0.2693628 -0.4601668 1.7225345 -0.2724454
AAAGAGACGAGATA-1 -0.3297788 -0.2700860 -0.4619023 -0.5168150 -0.2735897
AAAGCCTGTATGCG-1 8.5038023 -0.2917450 -0.5083421 -0.5924059 -0.2851624
AAAGTTTGTAGAGA-1 -0.2356792 -0.2264081 -0.3578959 -0.3534963 -0.2078178
AAATCAACACCAGT-1 -0.2129427 -0.2397138 -0.3861018 -0.3995942 -0.2136028
AAATCAACGGAAGC-1 -0.3199450 -0.3112433 -0.5532420 1.1781110 -0.3082560
AAATCAACTCGCAA-1 -0.2838825 -0.2883788 -0.4996897 -0.5791501 5.4258475
AAATGTTGCCACAA-1 4.1355896 -0.2447071 -0.4057947 -0.4264639 -0.2530961
Using tiledbsoma’s Seurat support, we can materialize the query results as a Seurat object:
pbmc3k <- query$to_seurat(
X_layers = c(data = "data"),
obs_index = "obs_id",
var_index = "var_id"
)
pbmc3k
An object of class Seurat
1035 features across 695 samples within 1 assay
Active assay: RNA (1035 features, 0 variable features)
4 dimensional reductions calculated: umap, tsne, pca, draw_graph_fr
And with this, we can now leverage the full suite of analysis and visualization methods provided by Seurat.
Seurat::VlnPlot(pbmc3k, features = c("n_counts", "n_genes", "percent_mito"))
You can even create a Bioconductor SingleCellExperiment object from the same query:
query$to_single_cell_experiment(
X_layers = c(logcounts = "data"),
obs_index = "obs_id",
var_index = "var_id"
)
class: SingleCellExperiment
dim: 1035 695
metadata(0):
assays(1): logcounts
rownames(1035): TNFRSF4 CPSF3L ... SUMO3 PRMT2
rowData names(1): n_cells
colnames(695): AAACATTGATCAGC-1 AAACGCTGTAGCCA-1 ... TTTCCAGAGGTGAG-1
TTTGCATGCCTCAC-1
colData names(4): n_genes percent_mito n_counts louvain
reducedDimNames(4): UMAP TSNE PCA DRAW_GRAPH_FR
mainExpName: RNA
altExpNames(0):
Everything we’ve shown in this tutorial has used the SOMA APIs. However, TileDB-SOMA’s interoperability goes beyond just R and Python because, as I mentioned earlier, a SOMA experiment is just a collection of TileDB groups and arrays. This means we can access the underlying data using any of TileDB’s APIs (albeit without the convenience methods provided by SOMA). The list of supported APIs is growing, but currently includes: C++, C, Python, R, Java, Go, C#, JavaScript, and even SQL. This is a real boon for developers who want to build applications that leverage SOMA data.
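As a taste of that interoperability, here’s a minimal sketch that reads the obs data frame with the generic TileDB-Py API, bypassing SOMA entirely (it points at the local experiment we created earlier):
# Open the obs array directly and load it as a Pandas DataFrame.
with tiledb.open("soma-exp-pbmc3k/obs") as obs_array:
    df = obs_array.df[:]
print(df[["obs_id", "n_counts", "louvain"]].head())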
By virtue of building on top of TileDB embedded, TileDB-SOMA automatically inherits the ability to work transparently with cloud object stores. What that means for you is SOMA experiments can be stored in S3 buckets and accessed in the same way as local files, without changing anything in your code other than the URI. Taken together, this provides a clear path forward for simpler data management: centralize your datasets in the cloud and access them from any client, allowing R and Python researchers to collaborate on the same shared datasets.
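Concretely, pointing the same code at an S3 copy of the experiment is only a URI change. A sketch (the bucket below is hypothetical, and credentials are picked up from your environment):
# Same API, different storage backend; nothing else in the code changes.
with tiledbsoma.Experiment.open("s3://my-bucket/soma-exp-pbmc3k") as exp:
    obs = exp.obs.read().concat().to_pandas()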
Of course, properly managing cloud infrastructure brings its own set of challenges, especially when it comes to data governance, security, and compute. To address this, we built TileDB Cloud, a SaaS product that offers critical data management and scalable compute features.
With TileDB Cloud, single-cell datasets can be ingested directly to an S3 bucket you own and automatically registered in TileDB Cloud’s data catalog, which provides powerful search and discovery capabilities. You can easily find your SOMA experiments by visiting the Life Sciences tab and then opening the SOMA view:
From there, you can narrow your search by filtering on the dataset name, description, metadata, or tags. But TileDB Cloud’s capabilities go far beyond data cataloging. It also provides:
Let’s take a look at some of these features in action. One of the single-cell datasets available in TileDB Cloud is the Tabula Sapiens atlas, which contains nearly half a million cells from 24 different tissues. SOMA experiments for each of the tissue-specific datasets as well as the four functional compartments are publicly available on TileDB Cloud. Below we’re showing the TileDB Cloud overview page for a SOMA experiment that contains the Tabula Sapiens immune compartment:
Every asset in TileDB Cloud has a similar overview page, which provides a host of useful information about the asset. In this case we can see the UUID-based tiledb:// URI that can be used to securely access the dataset, any tags that have been added for organization purposes, as well as a README-like description field that provides full Markdown support for documentation. From here we can also explore the contents of this dataset by clicking on the Contents button, which reveals that the SOMA experiment contains two root-level assets: a group called ms (the collection of SOMA measurements) and an array called obs (the data frame containing cell-level annotations). Clicking on obs brings us to the overview page for this TileDB array. Here I’ve clicked on the Schema button to inspect the schema used to store the data:
You can also click Activity to see a detailed history of who accessed this array and when, which is useful for auditing purposes.
If you’d like to learn more about TileDB Cloud please see our previous post: Getting Started with TileDB Cloud in < 4 Minutes.
The incredible progress of single-cell sequencing technologies has led to an explosion of data. This has created a growing need for scalable, interoperable, and accessible data management solutions. TileDB-SOMA is our proposed solution to these challenges, and we’re excited to share it with the community.
I hope you enjoyed this article covering the basics of the open-source Python and R TileDB-SOMA packages, as well as the benefits of upgrading your work with TileDB Cloud, including secure governance and sharing, easy discoverability, and a powerful serverless compute platform.
We’d love to hear what you think! Feel free to contact us, join our Slack community, or let us know on Twitter and LinkedIn.