In this article we provide a quickstart tutorial on the vector search capabilities of TileDB. I strongly recommend you read the blog "Why TileDB for Vector Search" before digging into this tutorial, especially if you are not familiar with the TileDB array database and how it naturally morphs into a vector database (vectors are 1D arrays after all).
TileDB’s core array technology lives in the open-source (MIT License) library TileDB-Embedded, but we developed the vector-search-specific components in the TileDB-Vector-Search library, which is also open-source under the MIT License. Like the core library, TileDB-Vector-Search is built in C++, but it also offers a Python API.
In the majority of this article, we will use the Python API of TileDB-Vector-Search, and the examples are reproducible on your local machine. However, TileDB also develops the commercial TileDB Cloud product, which provides you with additional governance and scalability features, which we cover briefly in one section. All the code in this article is summarized in a TileDB Cloud notebook, which you can either download and run locally, or launch a Jupyter server directly in TileDB Cloud. Sign up to do so (no credit card required), and we’ll give you free credits so that you can evaluate without a hassle.
To install TileDB-Vector-Search, run:
# Using pip
pip install tiledb-vector-search

# Or, using conda (requires tiledb-cloud and scikit-learn)
pip install tiledb-cloud
conda install -c conda-forge -c tiledb tiledb-vector-search scikit-learn
Next, download the siftsmall dataset and extract it:

tar xf siftsmall.tgz
Later we’ll play with the 1B SIFT dataset on TileDB Cloud, which we have pre-ingested for you, so nothing to do here.
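The .fvecs files in this dataset follow the standard TEXMEX layout: each vector is stored as a little-endian int32 dimension, followed by that many float32 components. As a sketch of what the library's loaders do under the hood (this is an illustrative stand-in, not the library's implementation), you can parse the format with plain NumPy:

```python
import numpy as np

def read_fvecs(path):
    """Read an .fvecs file: each record is an int32 dimension
    followed by that many little-endian float32 components."""
    raw = np.fromfile(path, dtype=np.int32)
    dim = raw[0]
    # Each record occupies dim + 1 int32-sized slots; drop the
    # leading dimension column and reinterpret the rest as float32.
    return raw.reshape(-1, dim + 1)[:, 1:].view(np.float32)

# Round-trip a tiny synthetic file to demonstrate the layout
vecs = np.arange(6, dtype=np.float32).reshape(2, 3)
with open("demo.fvecs", "wb") as f:
    for v in vecs:
        np.int32(v.size).tofile(f)
        v.tofile(f)
print(read_fvecs("demo.fvecs"))  # two 3-dimensional vectors
```

In practice you won't need this, since TileDB-Vector-Search ships its own loaders (shown below), but it is useful to know what the file format contains.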
In this example I will show you how to ingest the small dataset you just downloaded, and run your first similarity search query. For more information, see the TileDB-Vector-Search API Reference.
We start by importing the necessary libraries:
import numpy as np
import sklearn
import tiledb
import tiledb.vector_search as vs
from tiledb.vector_search.utils import *
You can ingest vectors with a single command as follows.
flat_index = vs.ingest(
    index_type = "FLAT",
    array_uri = "sift10k_flat",
    source_uri = "siftsmall_base.fvecs",
    source_type = "FVEC",
    partitions = 100
)
This creates a “vector search asset” (equivalent to a “TileDB group”) called sift10k_flat in your working directory. If we list its contents, we’ll find a single 2D dense array called shuffled_vectors, which stores all the vectors ingested from file siftsmall_base.fvecs:
%%bash
ls -al sift10k_flat # List the group contents
total 0
drwxr-xr-x   6 stavrospapadopoulos  staff  192 Jul 31 23:29 .
drwxr-xr-x  11 stavrospapadopoulos  staff  352 Jul 31 23:30 ..
drwxr-xr-x   3 stavrospapadopoulos  staff   96 Jul 31 23:29 __group
drwxr-xr-x   3 stavrospapadopoulos  staff   96 Jul 31 23:29 __meta
-rw-r--r--   1 stavrospapadopoulos  staff    0 Jul 31 23:29 __tiledb_group.tdb
drwxr-xr-x   8 stavrospapadopoulos  staff  256 Jul 31 23:29 shuffled_vectors
%%bash
ls -l sift10k_flat/shuffled_vectors # List the array contents
total 0
drwxr-xr-x  3 stavrospapadopoulos  staff  96 Jul 31 23:29 __commits
drwxr-xr-x  2 stavrospapadopoulos  staff  64 Jul 31 23:29 __fragment_meta
drwxr-xr-x  3 stavrospapadopoulos  staff  96 Jul 31 23:29 __fragments
drwxr-xr-x  2 stavrospapadopoulos  staff  64 Jul 31 23:29 __labels
drwxr-xr-x  2 stavrospapadopoulos  staff  64 Jul 31 23:29 __meta
drwxr-xr-x  3 stavrospapadopoulos  staff  96 Jul 31 23:29 __schema
# Open the array
A = tiledb.open("sift10k_flat/shuffled_vectors")

# Print the schema - 2D dense array
print(A.schema)
ArraySchema(
  domain=Domain(*[
    Dim(name='rows', domain=(0, 127), tile=128, dtype='int32'),
    Dim(name='cols', domain=(0, 9999), tile=100, dtype='int32'),
  ]),
  attrs=[
    Attr(name='values', dtype='float32', var=False, nullable=False),
  ],
  cell_order='col-major',
  tile_order='col-major',
  capacity=12800,
  sparse=False,
)
# Print the first vector
print(A[:, 0]["values"])
[ 0. 16. 35. … 8. 19. 25. 23. 1.]
The partitions parameter dictates the tiling of this array, but you can ignore it for now (we’ll discuss it at length in a separate blog). This vector search asset has no indexing (index_type = "FLAT"). Therefore, the ingestion function runs fast, but similarity search (for large datasets) is slow because it is brute-force; on the upside, recall (i.e., accuracy) is always 100%.
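Conceptually, a FLAT query is just an exhaustive distance computation over all stored vectors. Setting aside TileDB's optimized C++ implementation, the idea can be sketched in a few lines of NumPy on synthetic data (all names below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
db = rng.random((10_000, 128), dtype=np.float32)  # stand-in for the ingested vectors
query = db[77] + 0.001                            # a query vector very close to vector 77

# Brute-force search: squared Euclidean distance to every vector, then top-k
dists = np.sum((db - query) ** 2, axis=1)
k = 5
topk = np.argsort(dists)[:k]
print(topk)  # vector 77 ranks first, since the query sits right next to it
```

Because every vector is scanned, recall is perfect, but the cost grows linearly with the dataset size, which is exactly why indexed variants like IVF_FLAT exist.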
To ingest the dataset building an IVF_FLAT index, all you need to do is specify index_type = "IVF_FLAT":
ivf_flat_index = vs.ingest(
    index_type = "IVF_FLAT",
    source_uri = "siftsmall_base.fvecs",
    array_uri = "sift10k_ivf_flat",
    source_type = "FVEC",
    partitions = 100
)
This takes longer to run, but similarity search becomes much faster even for enormous datasets, although recall may be lower than 100%.
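The intuition behind IVF_FLAT: cluster the vectors into partitions (TileDB uses k-means), and at query time scan only the nprobe partitions whose centroids lie closest to the query. Here is a toy NumPy sketch of that idea (illustrative only, not the library's implementation; the "k-means" step is replaced by random sampling for brevity):

```python
import numpy as np

rng = np.random.default_rng(1)
db = rng.random((2_000, 16), dtype=np.float32)

# Toy partitioning: use a random sample of vectors as centroids
# (real IVF_FLAT runs k-means here)
n_partitions = 20
centroids = db[rng.choice(len(db), n_partitions, replace=False)]
assignments = np.argmin(
    ((db[:, None, :] - centroids[None, :, :]) ** 2).sum(-1), axis=1)

def ivf_query(q, nprobe=3, k=5):
    # Rank partitions by centroid distance, scan only the nprobe closest
    probe = np.argsort(((centroids - q) ** 2).sum(-1))[:nprobe]
    cand = np.flatnonzero(np.isin(assignments, probe))
    d = ((db[cand] - q) ** 2).sum(-1)
    return cand[np.argsort(d)[:k]]

print(ivf_query(db[0]))  # the query's own partition is probed, so vector 0 is found
```

Only a fraction of the dataset is scanned per query, which is where the speedup comes from; neighbors that happen to live in unprobed partitions are missed, which is why recall can drop below 100% for small nprobe.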
Looking into the contents of the created group sift10k_ivf_flat, we can now see that more arrays were created (partition_centroids, shuffled_vector_ids and partition_indexes), which collectively comprise the IVF_FLAT index:
%%bash
ls -l sift10k_ivf_flat
total 0
drwxr-xr-x  3 stavrospapadopoulos  staff   96 Jul 31 23:28 __group
drwxr-xr-x  3 stavrospapadopoulos  staff   96 Jul 31 23:28 __meta
-rw-r--r--  1 stavrospapadopoulos  staff    0 Jul 31 23:28 __tiledb_group.tdb
drwxr-xr-x  8 stavrospapadopoulos  staff  256 Jul 31 23:28 partition_centroids
drwxr-xr-x  8 stavrospapadopoulos  staff  256 Jul 31 23:28 shuffled_vector_ids
drwxr-xr-x  8 stavrospapadopoulos  staff  256 Jul 31 23:28 partition_indexes
drwxr-xr-x  8 stavrospapadopoulos  staff  256 Jul 31 23:28 shuffled_vectors
You can explore the schemas of these arrays and read their contents as you would with any other TileDB array (our blog post TileDB 101: Arrays is a good refresher on TileDB arrays). We will cover the TileDB-Vector-Search internals in detail in future blogs.
To run similarity search on the ingested vectors, we’ll load the query and ground truth vectors from the siftsmall dataset we downloaded (note, though, that you can use any vector to query this dataset). We provide load_fvecs and load_ivecs as auxiliary functions in the tiledb.vector_search.utils module:
# Get query vectors with ground truth
query_vectors = load_fvecs("siftsmall_query.fvecs")
ground_truth = load_ivecs("siftsmall_groundtruth.ivecs")
To return the most similar vectors to a query vector, simply run:
# Select a query vector
query_id = 77
qv = np.array([query_vectors[query_id]])

# Return the 100 most similar vectors to the query vector with FLAT
result = flat_index.query(qv, k=100)

# Return the 100 most similar vectors to the query vector with IVF_FLAT
# (you can set the nprobe parameter)
#result = ivf_flat_index.query(qv, nprobe = 10, k=100)
To check the result against the ground truth, run:
# For FLAT, the following will always be true
np.alltrue(result == ground_truth[query_id])
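For IVF_FLAT, where results may deviate from the ground truth, the standard way to quantify accuracy is recall: the fraction of the true top-k neighbors that the index actually returned. A tiny helper (the function name and the id lists below are made up for illustration):

```python
import numpy as np

def recall_at_k(found, truth, k):
    """Fraction of the true top-k neighbor ids present in the returned ids."""
    return len(set(found[:k]) & set(truth[:k])) / k

# Example with made-up id lists: 3 of the 5 true neighbors were found
truth = np.array([5, 9, 2, 7, 1])
found = np.array([5, 2, 8, 7, 3])
print(recall_at_k(found, truth, 5))  # 0.6
```

Computing this over all query vectors in the ground truth file gives you the dataset-level recall for a given nprobe setting.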
You can even run batches of queries, which are implemented very efficiently in TileDB-Vector-Search:
# Simply provide more than one query vector
result = ivf_flat_index.query(query_vectors[0:2], nprobe=2, k=100)
result
array([[1097, 1239, 3227,  804, …,  849, 9262],
       [3013, 1682, 8581, 2774, …, 9694, 9704]], dtype=uint64)
To query a vector search asset in a later session, you simply need to run the following command to initiate the index for queries:
# URI of the vector search asset created earlier
uri = "sift10k_ivf_flat"

index = vs.IVFFlatIndex(uri)
query_id = 77
result = index.query(np.array([query_vectors[query_id]]), k=10)
TileDB natively supports several of the most widely used cloud object stores with no additional dependencies. For example, if you have configured your AWS S3 account with default credentials in $HOME/.aws, then TileDB can read and write directly from S3 with no additional configuration, as follows:
data_dir = # <where your source vector file resides>
output_uri = "s3://tiledb-isaiah2/vector_search/sift10k_flat"

index = vs.ingest(
    index_type = "IVF_FLAT",
    array_uri = output_uri,
    source_uri = os.path.join(data_dir, "siftsmall_base.fvecs"),
    source_type = "FVEC",
    partitions = 100
)
For more information on TileDB’s cloud object storage support, see blog TileDB 101: Cloud Object Storage. For additional configuration and authentication options for AWS S3, Azure, and GCS, see the following documentation:
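If you prefer not to rely on the default credentials in $HOME/.aws, you can also pass S3 settings explicitly through a TileDB config object using the standard vfs.s3.* options (the values below are placeholders; see the documentation linked above for the full option list):

```python
import tiledb

cfg = tiledb.Config({
    "vfs.s3.aws_access_key_id": "<your-access-key-id>",
    "vfs.s3.aws_secret_access_key": "<your-secret-key>",
    "vfs.s3.region": "us-east-1",
})
# Pass the config to a TileDB context for operations that touch S3
ctx = tiledb.Ctx(cfg)
```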
If you wish to boost the vector search performance and enjoy some important data management features, you can use TileDB-Vector-Search on TileDB Cloud (sign up and we will give you free credits for your trial). In this section I will describe how to perform serverless, distributed ingestion and vector search using TileDB Cloud’s task graphs. I will also cover the exciting data governance functionality of TileDB Cloud, which allows you to securely share your assets with other users, discover other users’ public work and log every action for auditing purposes.
If you are interested in delving into the general functionality of TileDB Cloud, you can read the following blog posts:
Distributed ingestion can greatly speed up and horizontally scale out the ingestion of vectors. In this example, we ingested the SIFT 1 billion vector dataset using the IVF_FLAT index (which involves computing k-means, a computationally intensive operation) in 46 minutes, for a total cost of $11.385 on TileDB Cloud.
To ingest in a distributed, parallel fashion, simply set mode to BATCH for batch ingestion, and pass a tiledb.cloud.Config() parameter with your TileDB Cloud credentials:
import tiledb
import tiledb.cloud
import tiledb.vector_search as vs

output_uri = "tiledb://TileDB-Inc/s3://tiledb-example/vector-search/ann_sift1b"
source_uri = "tiledb://TileDB-Inc/6a9a8e97-d99c-4ddb-829a-8455c794906e"

vs.ingest(
    index_type = "IVF_FLAT",
    array_uri = output_uri,
    source_uri = source_uri,
    source_type = "TILEDB_ARRAY",
    partitions = 10_000,
    config = tiledb.cloud.Config(),
    mode = vs.Mode.BATCH
)
The ingest function automatically calculates the number of workers that will perform the ingestion in parallel, and there is no need to spin up any cluster - everything is serverless!
TileDB Cloud records all the details about the task graph, and lets you monitor it in real time.
Distributed queries let you run higher-throughput batches of queries with lower latency, yielding a much higher QPS. For example, in the example below we submit a batch of query vectors against the 1 billion vector dataset in 23 seconds, for a cost of $0.10.
Performing a distributed query is similar to the ingestion described above, but now you can set mode to REALTIME, which is faster for interactive workloads:
# If running locally, log in to TileDB Cloud
#tiledb.cloud.login(token='your-api-key')

uri = "tiledb://TileDB-Inc/ann_sift1b"
query_vector_uri = "tiledb://TileDB-Inc/bigann_1b_ground_truth_query"
n_vectors = 10000

index = vs.IVFFlatIndex(uri, config=tiledb.cloud.Config(), memory_budget=10)
query_vectors = np.transpose(
    vs.load_as_array(query_vector_uri, config=tiledb.cloud.Config())
    .astype(np.float32))[0:n_vectors]
results = index.query(
    query_vectors,
    k=10,
    nprobe=1,
    mode=tiledb.cloud.dag.Mode.REALTIME,
    num_partitions=30)
The vector search asset tiledb://TileDB-Inc/ann_sift1b is public (see the “Governance” section below) and, thus, the above code will “just work” if you run it with your own credentials.
Here is the task graph output of the above query.
TileDB Cloud allows you to quickly explore any vector search dataset right in the UI. We have several public datasets available, including the original ANN SIFT 1 Billion Raw Vectors. This array lets you query the original vectors and get the data back as NumPy arrays without having to parse the ivecs file. Simply slice directly by vector id!
uri = "tiledb://TileDB-Inc/ann_sift1b_raw_vectors"

with tiledb.open(uri, 'r', config=tiledb.cloud.Config()) as A:
    vector_id = 1
    print(A[vector_id])
Let's take a look at the BigANN SIFT 1 Billion dataset. When you open up the dataset you can see the components, as well as the overview, sharing, and settings.
You can quickly share the dataset with any other user or organization on the sharing page.
Any dataset can also easily be made public right on the settings page.
Logging access and providing a full audit trail is built directly into TileDB Cloud. When you are sharing data with third parties or making data public, it is important to be able to capture access and understand what people are doing with the data. TileDB provides logs for all access including specific details about the data access and code used to perform that access.
This gives you just a small taste of the power and flexibility of TileDB Cloud. Start your own exploration today, and do not hesitate to send us your feedback!
This article covered the very basics on TileDB’s vector search capabilities. Our team is working hard on multiple new exciting algorithms and features, so look out for the upcoming blogs on the TileDB-Vector-Search internals, benchmarks, integrations with LLMs, and more. You can also keep up with all other TileDB news by reading our blog.