There Is No Such Thing As Unstructured Data

Table Of Contents:

Why drug discovery requires a bigger perspective

How data science went wrong on “unstructured” data

How multidimensional arrays can embrace all types of data

Why drug discovery requires a bigger perspective

When you think of the biggest breakthroughs with machine learning in the last decade, what comes to mind? Large-language models generating answers, images and videos in seconds based on data scraped from the internet? Self-driving cars navigating the complex and chaotic streets of a big city? Google DeepMind’s Alphafold 3 predicting the structure of all of life’s molecules? As varied as all these remarkable breakthroughs are, they have one critical thing in common: They were all built on the back of unstructured data, which is data not placed in tabular databases.

Unstructured data includes everything from images, videos, music, voice recordings, satellite visualizations and all kinds of biology data—and it's the workhorse of machine learning and generative AI. Because this kind of data includes such a wide range, it’s little surprise nearly 90% of all data is unstructured.

But considering the immense value of this unstructured data, why are nearly 90% of data management tools only built to handle data that has been structured into the neat rows and columns of relational models?

In short, because it’s easier.

The 40+ year history of database technology was founded on relational tables, which made it too easy to label complex data as “unstructured” and essentially ignore it. But as the value of unstructured data became too clear to deny, attempts to manage it inside traditional databases are proving as tedious and frustrating as trying to fit a round peg in a square hole. This is a clear sign we need to change our approach and recognize there is a way to structure all “unstructured” data, making it highly performant and analysis-ready from the start.

At TileDB, we believe there is no such thing as unstructured data because multi-dimensional arrays can unlock its value in ways that traditional databases cannot.

How data science went wrong on “unstructured” data

If data hadn’t been structured in a relational model, then it fell in the catch-all category of “unstructured” because, if data experts didn’t know how data connected to other data or how to query it, they couldn’t use it. But just because they couldn’t use it didn’t mean that data lacked value. Data like PDFs, images and even complex multimodal data were labeled “unstructured” and stored in various clouds and drives until someone took the time to fit them into relational databases—and until then, this data was essentially invisible.

This was a huge mistake. All data complies with an algorithm, and therefore it has structure. Images have width and height and pixels and numbers for every color. Even documents like PDFs contain sentences, which follow rules of grammar and syntax. There is immense value in these data types, even if a relational table cannot understand them.

To call this data “unstructured” gives us permission to largely ignore it, leaving this data’s immense value underutilized. The sooner we put an end to the concept of “unstructured data,” the sooner we can take a more effective approach to using it. It is time to adopt a data structure that is more flexible than relational tables and can easily organize any data type you add to it.

How multidimensional arrays can embrace all types of data

We first recognized the potential of multidimensional data structures when our founder and CEO, Stavros Papadopoulous was studying geospatial and spatial temporal databases. Geometric and LIDAR data was far too complex for traditional databases, and he realized that this "frontier data” could only be modeled as multidimensional objects. Experimenting with multidimensional data structures, Papadopoulos found similarities in LIDAR data and genomics—in these completely separate fields, the data had problems that were algorithmically equivalent.

This discovery inspired our company’s work in using multidimensional arrays to unify data from all kinds of domains and types. Our vision is one of data universality, which holds that no data is unstructured when you have the technology to model it freely. Instead of trying to handle frontier data like omics data with a million different databases, we can build a multi-dimensional array that can flexibly expand to the scope of the data from clinical trials, biobanks and even data types no one has yet created. This approach enables significantly better performance with this complex data as well as with data managed in relational tables, delivering up to one hundred times better performance than traditional databases.

And this technology could not have arrived at a more important time. If we are going to solve paradigm-shifting data challenges like finding a cure for cancer in single-cell data, the drug discovery and precision medicine required will likely be found in frontier data, not relational tables. While it is possible relational tables will eventually stumble on a cure, the data engineering will be more expensive and take far longer—and countless patients should not have to pay that high price or wait endlessly.

To enable the healthier and more empowered future that frontier data offers, our industry cannot afford to waste more time clinging to the misconception of “unstructured data” and trying to solve it with dated tools. No matter how much we have invested in relational tables as our primary data science tool, we must recognize these tables will not deliver the innovation we need at the scale, speed, and cost efficiency we require. Instead, we need to expand our understanding through technology that recognizes the value of all data, no matter its complexity or source, and unlocks the insights that drive breakthroughs.

Meet the authors