Why TileDB as a Vector Database

Table Of Contents:

What is TileDB?

Why TileDB?

Technology

The Product

Specialized Solutions

Extensibility & Future-proofing

A Little History

Why "DB" & the Vision

Is TileDB for You?

What’s Next?

I was writing the preface of our new docs (yes, I am writing a lot of our docs - it helps me test every corner of our product), and I realized a couple of things. First, we have come a long way as a company. We are finally at a point where we see our audacious vision being realized. In retrospect, it was very difficult to communicate this vision a few years ago, and to explain how earth-shattering it would come to be for data management. But now our customers are seeing the benefits and embracing our vision. Second, we have built a lot of product, with most of the capabilities and performance feats still not fully released to the entire community. We have been heads down building a powerful team and working on extremely challenging use cases with very high-profile customers. Naturally, our docs and tutorials fell behind. Therefore, I decided to take this opportunity to give you a refresher on what TileDB has become, and why we built it in the first place.

If you don’t already know about TileDB, you are about to embark on an exciting new journey in data management. You are going to learn about a very disruptive technology that departs from all traditional approaches that you might be familiar with, and our vision to tidy up the data mess once and for all. At the same time, this blog post and our upcoming docs will reveal that embracing this disruption does not need to happen overnight; it can instead easily manifest in a gradual way across your organization. You can start with your simplest or most challenging data pain point, and gracefully expand as your data needs grow and your company data culture evolves. Buckle up and enjoy!

What is TileDB?

"TileDB is a universal database (or universal data platform), which aims at unifying all types of data and associated code, along with the often complex infrastructure surrounding those assets, into a single solution."

The long-forgotten past: Life used to be much simpler when “data” just meant “tables”. Organizations were modeling and storing their data in tabular form, and they were using SQL for access and analysis. In the majority of cases, organizations chose an enterprise-grade database management system (from Oracle, IBM or Microsoft), and the typical users of this system were the database administrators (DBAs). Over the past five decades, a tremendous amount of sophistication has been built around SQL engines and other dataframe solutions, and the problem of tabular data management has been more or less solved by a lot of very smart people.

The advent of cloud and the “modern” data stack: Today, the landscape is very different. Organizations have extremely diverse data, which cannot be naturally and efficiently modeled as tables (such as images, genomics, point clouds, weather, flat files, and more). In addition, the needs for advanced analytics have evolved a lot. Machine Learning and AI proliferated, open-source tools and new data formats exploded, and hence numerous special-purpose data solutions emerged, along with a plethora of DevOps, MLOps and visualization software. And of course, the cloud happened, which introduced yet another layer of peculiarity: storage and compute are now separated for scaling and economic reasons, and object stores became prevalent, bringing new constraints to heed.

Database management systems (both transactional / OLTP and analytical / warehouses) are now just one piece of the data infrastructure puzzle in organizations, serving a very small aspect of the challenging data problems organizations are facing. The end result is that organizations today are spending an inordinate amount of money and effort to build large data engineering teams in-house, who are trying to put together absolutely disparate tools and data types, harness the overwhelming data volumes, and derive valuable insights from the data. And although all the infrastructures we have seen share a stunning amount of similarity, they are all ad hoc, serving a very specific and immediate need, and crashing to the ground when this need changes (e.g., a new type of data or analysis emerges).

Time to redefine the “database”: TileDB aims at solving this problem at its root. Instead of trying to tame the numerous data types and computational needs with a thousand different variations of a data system, we are developing a single platform, the one data infrastructure. And we do so by going back to the roots of database management systems and identifying what was inadequate for handling the new needs of the users (now called data scientists) in the first place.

TileDB unifies all types of data with a powerful, universal data structure, called the multi-dimensional array. Arrays can shape-shift to efficiently store and process any kind of data, from tables, to images, genomics, weather, flat files and anything else you can think of. And as a true database, it does so while offering governance (authentication, access control, logging), scalable compute, and modern functionality such as ML, Jupyter notebooks, dashboards, global-scale code and data sharing, and more.

This sounds like a crazy, almost impossible, mission. You'll be surprised with what is truly possible and what our team has accomplished over the past couple of years. Read on!

Why TileDB?

This universality and "one data infrastructure" vision sounds cool. But why should I care?

The value proposition of TileDB can be summarized as follows:

No data silos: When different groups within an organization work on different data types and purchase or build completely disparate tools, data silos are created. This means that it is very difficult to gain access to data across an organization, and gain insights into user activity (think cyber attacks) and costs (think heart attacks when AWS sends you an invoice). TileDB eliminates silos, by offering a unified solution, where users from different groups (even different organizations around the globe) can autonomously own, yet globally discover and share data and code, securely collaborating and exchanging insights. In addition, IT and leadership in the organization can have a holistic view over all the assets, monitor user activity and properly manage all storage and compute costs. Finally, TileDB's universality enables an extensible infrastructure, preventing future silos from being formed.
Lower cost: The unified TileDB solution obviates the need for buying and maintaining numerous others (typically requiring large data engineering teams with hard-to-find skills and expertise), dramatically reducing the total cost of operation. And of course, the elephant in the room: performance. TileDB is efficient (in fact, very efficient and for a reason), especially for challenging specialized use cases where the data is massive, coming in obscure non-tabular formats, and requiring domain expertise to model and prepare it appropriately for efficient use. A more efficient solution means much lower consumption of storage and compute resources on the cloud.
Higher productivity: The lives of data analysts or scientists become very difficult in three cases: (1) performance is bad and they wait too long to gain insights from their data, (2) they don't get to use the tools they are familiar and comfortable with, thus they don't maximize their daily work output, and (3) it is hard to collaborate with other colleagues or external parties, as they have less information at their disposal. TileDB eliminates all these issues and delivers high productivity, through its performance, extreme interoperability and easy data sharing.

TileDB delivers the above benefits with cutting-edge open-source and commercial technology, all built around the concept of multi-dimensional arrays.

Technology

TileDB starts with multi-dimensional arrays, which are first-class citizens in the entire system. A table is an array, an image is an array, a file is an array, even a Jupyter notebook and a user-defined function are arrays in TileDB. By modeling all data and code as arrays, we are able to build a single storage engine (offering all the good stuff you'd expect, such as optimized compression, IO, versioning, time traveling and a lot more), a single compute layer, a single catalog layer, and a single secure governance layer (including authentication, access control and logging). And we were able to do all of this while leveraging the impeccable performance of arrays. Arrays are performant because they can shape-shift to fit each of the various data types, in the way that maximizes performance for that type.
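To make the "a table is an array, an image is an array" idea a bit more tangible, here is a minimal sketch in plain Python. It is a toy model of the data model only, not the TileDB API or storage format: the `ToyArray` class, its method names, and the example `trades` table are all hypothetical and exist purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyArray:
    """Toy multi-dimensional array: each cell is addressed by its coordinates.

    Illustration only; this is not the TileDB API or its on-disk format.
    """
    dims: tuple                                # dimension names, e.g. ("row", "col")
    cells: dict = field(default_factory=dict)  # coordinates -> attribute values

    def write(self, coords, **attrs):
        self.cells[tuple(coords)] = attrs

    def read_range(self, **ranges):
        """Return cells whose coordinates fall in inclusive per-dimension ranges."""
        def inside(coords):
            for name, value in zip(self.dims, coords):
                lo, hi = ranges.get(name, (float("-inf"), float("inf")))
                if not lo <= value <= hi:
                    return False
            return True
        return {c: a for c, a in sorted(self.cells.items()) if inside(c)}


# An image as a dense 2D array: every (row, col) cell holds a pixel value.
image = ToyArray(dims=("row", "col"))
for r in range(3):
    for c in range(3):
        image.write((r, c), pixel=r * 3 + c)

# A table as a sparse array: the columns you filter on become dimensions,
# and the remaining columns become attributes of each cell.
trades = ToyArray(dims=("day", "account_id"))
trades.write((1, 100), price=9.5)
trades.write((2, 100), price=9.7)
trades.write((2, 205), price=3.1)

top_left = image.read_range(row=(0, 1), col=(0, 1))  # a 2x2 crop of the image
day_two = trades.read_range(day=(2, 2))              # all trades on day 2
```

The same range query serves both the image crop and the table filter, which is the point of the sketch: one access pattern, very different data. A real engine would add tiling, compression and efficient indexing on top of this model.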

The TileDB technology is divided into:

Open-source: We are building a growing set of open-source software. The array storage engine (TileDB Embedded), which is the core of TileDB, along with its associated data format, is open-source. On top of the engine, we built numerous language API wrappers, and integrations with SQL engines and a wide range of popular data science, machine learning and application-specific tools.
Commercial: In addition to our open-source work, a ton of amazing engineering feats and capabilities are now in production as part of TileDB Cloud, the universal data platform. This is a full-fledged multi-tenant data management product which allows you to catalog your data and code, spin up Jupyter notebooks and dashboards, and perform scalable compute in an economical serverless manner, with security and governance built-in.

Based on this technology, our team managed to build a powerful general product, as well as a suite of specialized solutions.

The Product

What is TileDB selling?

TileDB aims to offer the one data infrastructure to both individual users and organizations. The main, “general” product that lets you achieve this is the TileDB Cloud universal data platform, along with the entire open-source ecosystem that plugs into it, and the application domain expertise of our team. The product seamlessly combines secure management of all data types, data science environments, ML workloads, visualizations, and pretty much everything any organization would need in order to tame the diversity and volume of their data, and extract valuable insights to reach their scientific and/or business goals. TileDB Cloud and its open-source packages are available either as a single SaaS offering, or as a deployment in any distributed environment of your choice, entirely under your control.

Moving your entire data infrastructure to TileDB may sound bold and risky, but we have good news: you do not need to migrate to TileDB overnight. You can always start with one data type (perhaps the one that causes you the biggest pain at the moment), and gradually add more data types and utilize more compute resources as your needs scale. In addition, due to its universality, the TileDB solution is extendible, in the sense that it will be able to accommodate your future data and compute needs, even those you currently cannot anticipate.

In addition to this general product, our team is building specialized solutions (always utilizing the universal TileDB technology), focusing on particularly important application domains with very challenging data problems.

Specialized Solutions

Why build and sell specialized solutions?

TileDB as the one data infrastructure is powerful and our early customers from the most challenging of data verticals understood our secret sauce - arrays, solid engineering and domain expertise. However, the challenge when you try to build a solution for your needs on a general platform is time, effort, skills and resources in data engineering. Business analysts and data scientists are looking for solutions to immediate pains, and prefer battle tested software that works with minimal effort. Therefore, at TileDB, it was natural that we would take on developing a suite of specialized solutions ourselves, with two main reasons in mind:

We love to build solutions with the TileDB technology, contributing our own little stone to important application domains and initiatives (such as saving babies’ lives). And since, naturally, we are experts at using TileDB, it is much easier and more economical for us to build those solutions that can serve an entire vertical.
We wish to validate the universality of TileDB, and lead the way by demonstrating how TileDB can be leveraged to solve all data problems, from the simplest to the most sophisticated and challenging.

Currently, we focus a lot on Life Sciences (population genomics, single-cell and biomedical imaging) and Geospatial (point clouds and geospatial imaging), but we are also actively working on use cases in Telecommunications, Finance and Weather (coming up in the docs soon).

We have a very specific recipe for working on these specialized domains, which has worked for us extremely well over the years, summarized as follows:

We hire domain experts with advanced software engineering skills. These experts have a deep understanding of the specialized problems we use TileDB to solve, and they are natural users of our solutions.
We collaborate directly with luminaries in the domains we focus on. We have been extremely fortunate to work with amazing customers and partners that helped us shape and mature our product.
We take all the valuable insights from working with high-profile customers on challenging data use cases, and translate them into features in the general TileDB offering (such as performance optimizations, new visualizers for productivity, advanced access control, and many more), which can readily serve all other application domains.

Extensibility & Future-proofing

"One of the most appealing features of any universal solution is the fact that, in addition to all your current data needs, it can accommodate also your future needs, even those that you cannot anticipate at the moment. A universal database is extendible and future-proof."

Our team put a lot of effort into building a powerful, universal foundation in terms of storage (via multi-dimensional arrays), compute (via serverless task graphs) and interoperability (via numerous APIs and integrations). We also proved that such a foundation can be used to efficiently capture extremely challenging data use cases from Life Sciences, Geospatial and more. There is absolutely nothing preventing you from utilizing the TileDB general foundation and insights from the specialized solutions we have built, in order to easily and efficiently build your own specialized solutions and extend your one data infrastructure.
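If the term "task graph" is new to you, the following sketch may help. It is a minimal, sequential illustration in plain Python of the general idea (tasks as nodes, dependencies as edges, each task receiving the results of the tasks it depends on). It is not the TileDB Cloud API: the `run_task_graph` helper and the task names are hypothetical, and a serverless system would dispatch independent tasks to separate workers in parallel rather than run them one by one.

```python
from graphlib import TopologicalSorter


def run_task_graph(tasks, deps):
    """Run every task after its dependencies, passing their results as arguments.

    tasks: task name -> callable
    deps:  task name -> list of task names it depends on (argument order)
    """
    results = {}
    # static_order() yields each task only after all of its dependencies.
    for name in TopologicalSorter(deps).static_order():
        inputs = [results[d] for d in deps.get(name, ())]
        results[name] = tasks[name](*inputs)
    return results


# Hypothetical pipeline: load two partitions, reduce each, then combine.
tasks = {
    "load_a": lambda: [1, 2, 3],
    "load_b": lambda: [4, 5, 6],
    "sum_a": lambda xs: sum(xs),
    "sum_b": lambda xs: sum(xs),
    "combine": lambda a, b: a + b,
}
deps = {
    "sum_a": ["load_a"],
    "sum_b": ["load_b"],
    "combine": ["sum_a", "sum_b"],
}

results = run_task_graph(tasks, deps)
```

Here `load_a`/`sum_a` and `load_b`/`sum_b` are independent branches, so a distributed runner would be free to execute them at the same time; only `combine` has to wait for both.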

You can build your own data models using TileDB arrays, you can curate and publish your data on TileDB Cloud, you can create and share Jupyter notebooks, user-defined functions and sophisticated distributed algorithms via task graphs, you can design your own dashboards or you can integrate with existing visualization tools, you can train and share your own ML models, and much more. The sky is the limit when it comes to what you can build on top of the unified TileDB data platform. You don’t have to think about or reinvent features like fast IO and compute, versioning, time traveling, access control, logging and other parts that should be the sole responsibility of a database system.
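For readers wondering what "versioning" and "time traveling" mean in practice, here is a toy sketch in plain Python. It assumes a simple append-only design in which every write is kept with a timestamp and a read can ask for the state "as of" a given time. The `VersionedStore` class is hypothetical and is not how TileDB implements these features; it only illustrates the behavior you would otherwise have to build yourself.

```python
class VersionedStore:
    """Toy versioned store: writes are appended with a timestamp, never overwritten."""

    def __init__(self):
        self._writes = []  # append-only list of (timestamp, key, value)

    def write(self, timestamp, key, value):
        self._writes.append((timestamp, key, value))

    def read(self, key, at=None):
        """Return the latest value of `key` written at or before `at` (None = now)."""
        candidates = [
            (t, i, v)
            for i, (t, k, v) in enumerate(self._writes)
            if k == key and (at is None or t <= at)
        ]
        if not candidates:
            return None
        # Latest timestamp wins; insertion order breaks ties.
        return max(candidates, key=lambda c: (c[0], c[1]))[2]


store = VersionedStore()
store.write(1, "cell", "v1")
store.write(5, "cell", "v2")

latest = store.read("cell")           # current state
as_of_3 = store.read("cell", at=3)    # "time travel" to timestamp 3
before_any = store.read("cell", at=0) # nothing had been written yet
```

Because nothing is ever overwritten, older versions stay readable, which is what makes reproducing a past analysis possible.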

We are always eager to hear what you have built on top of TileDB. Please drop us a note at [email protected] with your use case and brilliant solution.

A Little History

I started TileDB at MIT and Intel Labs in late 2014 as a research project that led to a VLDB 2017 paper. In May 2017 I spun it out into TileDB, Inc., a company that has since raised over $20M to further develop and maintain the project. Our investors include Two Bear Capital, Nexus Venture Partners, Intel Capital, Uncorrelated, Big Pi Ventures, Verizon Ventures, Lockheed Martin Ventures, Amgen Ventures and NTT DOCOMO Ventures.

The main motivation behind the TileDB research project was to investigate the hypothesis that we do not need to build a new database system every single time there is a new data type or compute environment twist. The question we tried to answer was: "is there a universal data model, along with a universal storage and compute engine, which can manage and analyze any type of data with a common, modular, and easily extendible code base?". Because, if there is, then we can build a universal database. After several years of hard work, we now have the confidence to claim that this universal data model is the multi-dimensional array, and that we have built the first truly universal database: TileDB!

Why "DB" & the Vision

Since the inception of the name "TileDB" in 2014 when the project was still in a research stage, I have been receiving the same recurring question: "TileDB does not sound or feel like a traditional database, so why do you call it a 'DB'?".

Indeed, the first open-source package I wrote, now the TileDB core called TileDB Embedded, was just a C++ storage library built on multi-dimensional arrays. It was not a typical “DB” system (as it was lacking SQL support and modules such as a parser, optimizer, executor, etc.). It was practically just one of the "DB" modules: the storage layer.

As our team began to grow, we started building numerous APIs on top of the C/C++ APIs of TileDB Embedded. That allowed us to develop integrations with data science tools in the Python and R ecosystems, ML tools (e.g., TensorFlow), SQL engines (e.g., MariaDB), and more. TileDB Embedded, along with its integrations, started to be able to do a lot of the things you would do with a database system, but it was still not a "DB" yet.

Subsequently, we built TileDB Cloud, which would offer some important "DB" functionality, such as authentication, access control, logging, and distributed computing. This started feeling a lot more like a "DB". However, several folks were noticing some exciting features (such as Jupyter notebooks, dashboards, machine learning capabilities, global-scale sharing and monetization), as well as add-ons (such as specialized capabilities for Life Sciences and Geospatial), all built into this single data system. That was not what a typical "DB" offers in most people's minds, which are still stuck on transactions, warehouses and SQL.

As a result, the market is getting further confused as they are seeing similarities with new data infrastructure trends (like data mesh, data fabric, and the modern data stack), which are promoted by database analysts and marketeers as a combination of solutions "by definition". It is unfathomable for them that a single solution could deliver on what the organizations need for their data infrastructure.

So, why did I insist on the “DB” in TileDB? Because, since day one, my vision has been to absolutely disrupt the way people think about the “DB”. I don't believe that a “DB” should be just a small piece in your data infrastructure (like a warehouse, a file manager, or a domain-specific solution - all culprits of data silos, inordinate costs and hassle). I believe that the "DB" should be “the” data infrastructure. All other solutions (visualizers, observability tools, advanced AI tools, sophisticated distributed algorithms, etc.) can be important pieces plugging seamlessly into the "DB", and not disparate, ad hoc software with obscure APIs and formats that make data impossible to discover, govern, monitor and audit holistically and sanely.

If I could summarize the above in a single sentence, it would be this: I kept "DB" in TileDB because I think of TileDB as the evolution of the "DB", built on the shoulders of "DB" giants who merit tribute, and not as yet another incremental twist with a random name that amplifies today's data noise.

Is TileDB for You?

In a nutshell, TileDB is for you if:

You are familiar with arrays (or “tensors” if you come from the ML world) and you need a fast array storage engine, either for your local machine, or for cloud object stores.
You have dataframes and are looking for a simple, scalable, serverless way to run SQL or to access your data directly and efficiently from various languages (such as Python, R and many more).
You are looking to build a powerful data infrastructure in your organization that eliminates silos and consolidates all data assets/products in a unified way.
You experience performance and scalability challenges in certain application domains, such as Life Sciences (population genomics, single-cell, biomedical imaging) or Geospatial (point clouds, geospatial imaging).
You wish to curate and publish your datasets and/or reproducible code (Jupyter notebooks, user-defined functions, dashboards, distributed algorithms via task graphs) in the cloud with specific users, or with everyone globally.
You wish to monetize your data (without the need for building and maintaining any infrastructure) and code (by building powerful software suites on top of the TileDB Cloud serverless, distributed computing infrastructure).

On the other hand, TileDB is not for you if:

You are looking just for a transactional database for your website or similar needs (there are plenty of amazing ones out there).
You are looking just for a data warehouse (TileDB is pretty awesome at that, but there are plenty of amazing ones out there too).
You are not cool (just kidding, TileDB is for uncool kids as well).

Are you in doubt? Contact us and we'll help you figure it out.

What’s Next?

Our team is shipping new features at an incredible pace, while feverishly developing a ton of docs and tutorials that we will be publishing over the next few weeks. I will publish one more blog post before the end of the year on TileDB vs. the data mess (or is it “mesh”? Easy to confuse when you are Greek!). And then a series of announcements in early 2023. So, stay tuned!

If you liked what you read and wish to join an absolutely amazing team that made all the above possible, consider applying here or simply follow us on LinkedIn and Twitter.

Meet the authors

Stavros Papadopoulos

Founder and CEO, TileDB