In this article we provide a quickstart tutorial on the vector search capabilities of TileDB. I strongly recommend you read the blog "Why TileDB for Vector Search" before digging into this tutorial, especially if you are not familiar with the TileDB array database and how it naturally morphs into a vector database (vectors are 1D arrays after all).
TileDB’s core array technology lies in the open-source (MIT License) library TileDB-Embedded, but we developed the vector-search-specific components in the TileDB-Vector-Search library, which is also open-source under the MIT License. Similar to the core library, TileDB-Vector-Search is built in C++, but it also offers a Python API.
In the majority of this article, we will use the Python API of TileDB-Vector-Search, and the examples are reproducible on your local machine. However, TileDB also develops the commercial TileDB Cloud product, which provides you with additional governance and scalability features, which we cover briefly in one section. All the code in this article is summarized in a TileDB Cloud notebook, which you can either download and run locally, or launch a Jupyter server directly in TileDB Cloud. Sign up to do so (no credit card required), and we’ll give you free credits so that you can evaluate without a hassle.
To install TileDB-Vector-Search, run:
# Using pip
pip install tiledb-vector-search
# Or, using conda (requires tiledb-cloud and scikit-learn)
pip install tiledb-cloud
conda install -c conda-forge -c tiledb tiledb-vector-search scikit-learn
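To quickly verify the installation, you can check that the package imports (a trivial sanity check; the printed path will differ on your machine):
python -c "import tiledb.vector_search as vs; print(vs.__file__)"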
We’ll first work with the small (10k) SIFT dataset from the Datasets for approximate nearest neighbor search site. Download the dataset from here (mirrored, from original source here) and run:
tar xf siftsmall.tgz
Later we’ll play with the 1B SIFT dataset on TileDB Cloud, which we have pre-ingested for you, so there is nothing to download for that one.
In this example I will show you how to ingest the small dataset you just downloaded, and run your first similarity search query. For more information, see the TileDB-Vector-Search API Reference.
We start by importing the necessary libraries:
import numpy as np
import sklearn  # scikit-learn must be installed for IVF_FLAT ingestion
import tiledb
import tiledb.vector_search as vs
from tiledb.vector_search.utils import load_fvecs, load_ivecs
You can ingest vectors with a single command as follows.
flat_index = vs.ingest(
    index_type="FLAT",
    array_uri="sift10k_flat",
    source_uri="siftsmall_base.fvecs",
    source_type="FVEC",
    partitions=100,
)
That will create a “vector search asset”, which is a TileDB group called sift10k_flat, in your working directory. If we list its contents, we’ll find a single 2D dense array called shuffled_vectors, which stores all the vectors ingested from the file siftsmall_base.fvecs.
%%bash
ls -al sift10k_flat # List the group contents
total 0
drwxr-xr-x 6 stavrospapadopoulos staff 192 Jul 31 23:29 .
drwxr-xr-x 11 stavrospapadopoulos staff 352 Jul 31 23:30 ..
drwxr-xr-x 3 stavrospapadopoulos staff 96 Jul 31 23:29 __group
drwxr-xr-x 3 stavrospapadopoulos staff 96 Jul 31 23:29 __meta
-rw-r--r-- 1 stavrospapadopoulos staff 0 Jul 31 23:29 __tiledb_group.tdb
drwxr-xr-x 8 stavrospapadopoulos staff 256 Jul 31 23:29 shuffled_vectors
%%bash
ls -l sift10k_flat/shuffled_vectors # List the array contents
total 0
drwxr-xr-x 3 stavrospapadopoulos staff 96 Jul 31 23:29 __commits
drwxr-xr-x 2 stavrospapadopoulos staff 64 Jul 31 23:29 __fragment_meta
drwxr-xr-x 3 stavrospapadopoulos staff 96 Jul 31 23:29 __fragments
drwxr-xr-x 2 stavrospapadopoulos staff 64 Jul 31 23:29 __labels
drwxr-xr-x 2 stavrospapadopoulos staff 64 Jul 31 23:29 __meta
drwxr-xr-x 3 stavrospapadopoulos staff 96 Jul 31 23:29 __schema
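You can also inspect the asset programmatically with TileDB-Py’s Group API instead of listing directories (a quick sketch; the exact member names may vary across TileDB-Vector-Search versions):
# Open the group read-only and print its members
g = tiledb.Group("sift10k_flat", mode="r")
for member in g:
    print(member.uri, member.type)
g.close()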
# Open the array
A = tiledb.open("sift10k_flat/shuffled_vectors")
# Print the schema - 2D dense array
print(A.schema)
ArraySchema(
domain=Domain(*[
Dim(name='rows', domain=(0, 127), tile=128, dtype='int32'),
Dim(name='cols', domain=(0, 9999), tile=100, dtype='int32'),
]),
attrs=[
Attr(name='values', dtype='float32', var=False, nullable=False),
],
cell_order='col-major',
tile_order='col-major',
capacity=12800,
sparse=False,
)
# Print the first vector
print(A[:,0]["values"])
[ 0. 16. 35. … 8. 19. 25. 23. 1.]
The partitions parameter dictates the tiling of this array, but you can ignore it for now (we’ll discuss it at length in a separate blog). This vector search asset has no indexing (index_type = "FLAT"). Therefore, ingestion is fast, but similarity search is brute-force and hence slow for large datasets, while recall (i.e., accuracy) is always 100%.
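To build some intuition about what FLAT does, here is a minimal NumPy sketch of brute-force search. This only illustrates the idea and is not TileDB’s actual implementation (which is vectorized C++); it assumes one vector per row:
def brute_force_knn(vectors, q, k):
    # Euclidean distance from the query to every stored vector
    dists = np.linalg.norm(vectors - q, axis=1)
    # Ids of the k closest vectors, nearest first
    return np.argsort(dists)[:k]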
To ingest the dataset building an IVF_FLAT index, all you need to do is specify index_type = "IVF_FLAT":
ivf_flat_index = vs.ingest(
    index_type="IVF_FLAT",
    array_uri="sift10k_ivf_flat",
    source_uri="siftsmall_base.fvecs",
    source_type="FVEC",
    partitions=100,
)
Ingestion takes longer in this case, but similarity search becomes much faster even for enormous datasets, though recall may drop below 100%.
Looking into the contents of the created group sift10k_ivf_flat, we can now see that more arrays were created (partition_centroids, shuffled_vector_ids and partition_indexes), which collectively comprise the IVF_FLAT index.
%%bash
ls -l sift10k_ivf_flat
total 0
drwxr-xr-x 3 stavrospapadopoulos staff 96 Jul 31 23:28 __group
drwxr-xr-x 3 stavrospapadopoulos staff 96 Jul 31 23:28 __meta
-rw-r--r-- 1 stavrospapadopoulos staff 0 Jul 31 23:28 __tiledb_group.tdb
drwxr-xr-x 8 stavrospapadopoulos staff 256 Jul 31 23:28 partition_centroids
drwxr-xr-x 8 stavrospapadopoulos staff 256 Jul 31 23:28 shuffled_vector_ids
drwxr-xr-x 8 stavrospapadopoulos staff 256 Jul 31 23:28 partition_indexes
drwxr-xr-x 8 stavrospapadopoulos staff 256 Jul 31 23:28 shuffled_vectors
You can explore the schema of those arrays and read their contents as you’d do with any other TileDB array. We will cover the TileDB-Vector-Search internals in detail in future blogs. Our blog post TileDB 101: Arrays can familiarize you with TileDB arrays.
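Conceptually, IVF_FLAT clusters the vectors with K-means and, at query time, scans only the nprobe partitions whose centroids lie closest to the query. The scikit-learn sketch below illustrates the idea only; the function names are hypothetical and this is not how TileDB-Vector-Search implements it internally:
from sklearn.cluster import KMeans

def ivf_build(vectors, partitions):
    # Partition the vectors into clusters (inverted lists)
    return KMeans(n_clusters=partitions, n_init=10).fit(vectors)

def ivf_query(km, vectors, q, k, nprobe):
    # Probe only the nprobe partitions with the closest centroids
    cdists = np.linalg.norm(km.cluster_centers_ - q, axis=1)
    probed = np.argsort(cdists)[:nprobe]
    # Brute-force search restricted to the probed partitions
    ids = np.where(np.isin(km.labels_, probed))[0]
    dists = np.linalg.norm(vectors[ids] - q, axis=1)
    return ids[np.argsort(dists)[:k]]
Searching fewer partitions is what makes IVF_FLAT fast, and also why recall can fall below 100% when a true neighbor lives in an unprobed partition.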
To run similarity search on the ingested vectors, we’ll load the query and ground truth vectors from the siftsmall dataset we downloaded (though note that you can use any vector to query this dataset). We provide load_fvecs and load_ivecs as auxiliary functions in the tiledb.vector_search.utils module.
# Get query vectors with ground truth
query_vectors = load_fvecs("siftsmall_query.fvecs")
ground_truth = load_ivecs("siftsmall_groundtruth.ivecs")
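As a quick sanity check, you can inspect the shapes; for siftsmall there are 100 query vectors of dimension 128, each with 100 ground-truth neighbor ids:
print(query_vectors.shape)  # (100, 128)
print(ground_truth.shape)   # (100, 100)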
To return the most similar vectors to a query vector, simply run:
# Select a query vector
query_id = 77
qv = np.array([query_vectors[query_id]])
# Return the 100 most similar vectors to the query vector with FLAT
result = flat_index.query(qv, k=100)
# Return the 100 most similar vectors to the query vector with IVF_FLAT
# (you can set the nprobe parameter)
#result = ivf_flat_index.query(qv, nprobe=10, k=100)
To check the result against the ground truth, run:
# For FLAT, the following will always be true
np.all(result == ground_truth[query_id])
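For IVF_FLAT, you can measure recall as the fraction of returned ids that appear in the ground truth. A minimal sketch, assuming the query returns one array of ids per query vector as shown above:
result = ivf_flat_index.query(qv, nprobe=10, k=100)
recall = len(np.intersect1d(result[0], ground_truth[query_id])) / 100
print(f"Recall@100: {recall:.2f}")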
You can even run batches of searches, which are very efficiently implemented in TileDB-Vector-Search:
# Simply provide more than one query vector
result = ivf_flat_index.query(np.array([query_vectors[5], query_vectors[6]]), nprobe=2, k=100)
result
array([[1097, 1239, 3227, 804, …, 849, 9262],
[3013, 1682, 8581, 2774, …, 9694, 9704]], dtype=uint64)
To query a vector search asset in a later session, you simply need to initialize the index from its URI:
uri = "sift10k_ivf_flat"
index = vs.IVFFlatIndex(uri)
query_id = 77
result = index.query(np.array([query_vectors[query_id]]), k=10)
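Similarly, a FLAT asset can be reopened with vs.FlatIndex (the class name mirrors IVFFlatIndex):
# Reopen the FLAT asset created earlier
flat_index = vs.FlatIndex("sift10k_flat")
result = flat_index.query(np.array([query_vectors[query_id]]), k=10)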
TileDB natively supports several of the most widely used cloud object stores with no additional dependencies. For example, if you have configured your AWS S3 account with default credentials in ~/.aws, then TileDB can read and write directly from S3 with no additional configuration, as follows:
import os

data_dir = "<where your source vector file resides>"
output_uri = "s3://tiledb-isaiah2/vector_search/sift10k_flat"

index = vs.ingest(
    index_type="IVF_FLAT",
    array_uri=output_uri,
    source_uri=os.path.join(data_dir, "siftsmall_base.fvecs"),
    source_type="FVEC",
    partitions=100,
)
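If you prefer not to rely on default credentials, you can instead pass explicit S3 settings through TileDB’s configuration options; here is a sketch using standard TileDB VFS config keys (the values below are placeholders):
config = {
    "vfs.s3.aws_access_key_id": "<your-access-key-id>",
    "vfs.s3.aws_secret_access_key": "<your-secret-access-key>",
    "vfs.s3.region": "<your-region>",
}
index = vs.ingest(
    index_type="IVF_FLAT",
    array_uri=output_uri,
    source_uri=os.path.join(data_dir, "siftsmall_base.fvecs"),
    source_type="FVEC",
    partitions=100,
    config=config,
)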
For more information on TileDB’s cloud object storage support, see blog TileDB 101: Cloud Object Storage. For additional configuration and authentication options for AWS S3, Azure, and GCS, see the following documentation:
If you wish to boost the vector search performance and enjoy some important data management features, you can use TileDB-Vector-Search on TileDB Cloud (sign up and we will give you free credits for your trial). In this section I will describe how to perform serverless, distributed ingestion and vector search using TileDB Cloud’s task graphs. I will also cover the exciting data governance functionality of TileDB Cloud, which allows you to securely share your assets with other users, discover other users’ public work and log every action for auditing purposes.
If you are interested in delving into the general functionality of TileDB Cloud, you can read the following blog posts:
Distributed ingestion can greatly speed up and horizontally scale out the ingestion of vectors. In this example, we've ingested the SIFT 1 billion vector dataset using the IVF_FLAT index (which involves computing K-means, a computationally intensive operation) in 46 minutes, for a total cost of $11.385 on TileDB Cloud.
To ingest in a distributed, parallel fashion, simply set the mode parameter to vs.Mode.BATCH for batch ingestion, and pass a tiledb.cloud.Config() object carrying your TileDB credentials:
import numpy as np
import tiledb
import tiledb.cloud
import tiledb.vector_search as vs

output_uri = "tiledb://TileDB-Inc/s3://tiledb-example/vector-search/ann_sift1b"
source_uri = "tiledb://TileDB-Inc/6a9a8e97-d99c-4ddb-829a-8455c794906e"

vs.ingest(
    index_type="IVF_FLAT",
    array_uri=output_uri,
    source_uri=source_uri,
    source_type="TILEDB_ARRAY",
    partitions=10_000,
    config=tiledb.cloud.Config(),
    mode=vs.Mode.BATCH,
)
Note that ingest automatically calculates the number of workers that will work in parallel to perform the ingestion, and there is no need to spin up any cluster - everything is serverless!
TileDB Cloud records all the details about the task graph, and lets you monitor it in real time.
Distributed queries deliver higher throughput at lower latency, yielding a much higher QPS. For example, in the example below we are able to submit a batch of 1,000 query vectors to the 1 billion vector dataset in 23 seconds, for a cost of $0.10.
Performing a distributed query is similar to the ingestion described above, but now you can set mode to REALTIME, as it will be faster:
# If running locally, log in to TileDB Cloud
# tiledb.cloud.login(token="your-api-key")

uri = "tiledb://TileDB-Inc/ann_sift1b"
query_vector_uri = "tiledb://TileDB-Inc/bigann_1b_ground_truth_query"
n_vectors = 10000

index = vs.IVFFlatIndex(uri, config=tiledb.cloud.Config(), memory_budget=10)
query_vectors = np.transpose(
    vs.load_as_array(query_vector_uri, config=tiledb.cloud.Config()).astype(np.float32)
)[0:n_vectors]
results = index.query(
    query_vectors, k=10, nprobe=1, mode=tiledb.cloud.dag.Mode.REALTIME, num_partitions=30
)
Vector search asset tiledb://TileDB-Inc/ann_sift1b is public (see the “Governance” section below) and, thus, the above code will “just work” if you run it with your own credentials.
Here is the task graph output of the above query.
TileDB Cloud allows you to quickly explore any vector search dataset right in the UI. We have several public datasets available, including the original ANN SIFT 1 Billion Raw Vectors. This array lets you query the original vectors and get the data back as NumPy arrays without having to parse the ivecs file. Simply slice directly by vector id!
uri = "tiledb://TileDB-Inc/ann_sift1b_raw_vectors"
with tiledb.open(uri, 'r', config=tiledb.cloud.Config()) as A:
vector_id = 1
print(A[vector_id])
Additional datasets include the pre-ingested BigANN dataset (with FLAT and IVF_FLAT indexes), as well as the TensorFlow flowers and open drone map datasets.
Let's take a look at the BigANN SIFT 1 Billion dataset. When you open the dataset, you can see its components, as well as the overview, sharing, and settings pages.
You can quickly share the dataset with any other user or organization on the sharing page.
Any dataset can also easily be made public right on the settings page.
Logging access and providing a full audit trail is built directly into TileDB Cloud. When you are sharing data with third parties or making data public, it is important to be able to capture access and understand what people are doing with the data. TileDB provides logs for all access including specific details about the data access and code used to perform that access.
This gives you just a small taste of the power and flexibility of TileDB Cloud. Start your own exploration today and do not hesitate to send us your feedback!
This article covered the very basics of TileDB’s vector search capabilities. Our team is working hard on multiple exciting new algorithms and features, so look out for upcoming blogs on the TileDB-Vector-Search internals, benchmarks, integrations with LLMs, and more. You can also keep up with all other TileDB news by reading our blog.
We'd love to hear what you think of this article. Feel free to contact us, join our Slack community, or let us know on Twitter and LinkedIn.