TileDB vector search is now available in LangChain, one of the most popular large language model (LLM) application development frameworks.
In this blog, we explore how to use LangChain and TileDB to implement powerful LLM applications. We present two examples that allow LLMs to:
- Answer questions about data missing from their training set, using retrieval augmented generation (RAG).
- Remember past conversations and user preferences, using conversation memory.
The methodology behind these simple examples can easily be used to build much more sophisticated LLM applications.
The examples in this article are summarized in a TileDB Cloud notebook, which you can either download and run locally, or run directly in TileDB Cloud. Sign up to do so (no credit card required), and we’ll give you free credits so that you can evaluate without hassle.
If you are not familiar with TileDB vector search and how it differs from other vector databases, I strongly recommend reading the blogs “Why TileDB for Vector Search” and “TileDB 101: Vector Search”.
LangChain is a powerful software development framework that simplifies the process of building advanced large language model (LLM) applications. Some of the main concepts of LangChain are:
- Models: LLMs and chat models (e.g., OpenAI’s gpt-3.5-turbo) that generate text.
- Prompts: templates that turn user input into model inputs.
- Chains: compositions of models, prompts and other components into a single application workflow.
- Retrieval: document loaders, text splitters, embeddings and vector stores that connect a model to external data (the basis of retrieval augmented generation, or RAG).
- Memory: state that persists information across the turns of a conversation.
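As a tiny illustration of how models, prompts and chains fit together, here is a minimal sketch (assuming the dependencies and OpenAI API key set up later in this post; the summarization prompt is purely illustrative):
from langchain.chains import LLMChain
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate

# A prompt template turns user input into a model prompt,
# and a chain ties the prompt and the model together.
prompt = PromptTemplate.from_template("Summarize the following in one sentence: {text}")
chain = LLMChain(llm=ChatOpenAI(model_name="gpt-3.5-turbo"), prompt=prompt)
print(chain.run(text="TileDB vector search is now integrated with LangChain."))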
Let’s explore how you can leverage TileDB vector search to implement two of the main LangChain use cases, RAG and Memory.
In this blog we use Python. To install TileDB-Vector-Search, LangChain and the other required dependencies, run:
pip install "tiledb-vector-search>=0.0.18" "langchain>=0.0.331" openai tiktoken
We start by importing the necessary libraries:
import numpy as np
import os
import shutil
import time
from langchain.chains import ConversationalRetrievalChain, ConversationChain
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers.txt import TextParser
from langchain.memory import VectorStoreRetrieverMemory
from langchain.prompts import PromptTemplate
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter
from langchain.vectorstores.tiledb import TileDB
# Remember to set your OpenAI API key here
os.environ["OPENAI_API_KEY"] = 'sk-...'
One of the limitations of LLMs is that their knowledge extends only to the data used during their training. Public training datasets are missing the private and proprietary information required for enterprise applications. They are also missing information about the world and events that occurred after the dataset was created. This problem affects all types of LLMs, public or proprietary, even those deployed and used locally (e.g., in sensitive enterprise applications).
In this example, we use TileDB vector search to allow the gpt-3.5-turbo
model to answer questions about LangChain. Most ChatGPT models have limited knowledge of world events after 2021, their training data cutoff date. LangChain was created and became popular after 2021. Although we use ChatGPT 3.5, this example can easily be extended to other LLMs.
Let’s start by asking ChatGPT some questions about LangChain.
llm = ChatOpenAI(
model_name="gpt-3.5-turbo",
temperature=0
)
public_chatgpt = ConversationChain(llm=llm)
question = "What is langchain?"
print(f"User: {question}")
print(f"AI: {public_chatgpt.run(question)}")
User: What is langchain?
AI: I'm sorry, but I don't have any information about "langchain." Could you provide more context or clarify what you're referring to?
We now use LangChain’s documentation to allow ChatGPT to answer questions about the project.
We first need to download the documentation from the project repo.
git clone https://github.com/langchain-ai/langchain.git
The next step is to index the documentation using TileDB’s vector indexing.
# Parse markdown documents and split them into text chunks
documentation_path = "./langchain/docs"
loader = GenericLoader.from_filesystem(
documentation_path,
glob="**/*",
suffixes=[".mdx"],
parser=TextParser()
)
splitter = RecursiveCharacterTextSplitter.from_language(
language=Language.MARKDOWN,
chunk_size=1000,
chunk_overlap=100
)
documents = loader.load()
print(f"Number of raw documents loaded: {len(documents)}")
documents = splitter.split_documents(documents)
documents = [d for d in documents if len(d.page_content) > 5]
texts = [d.page_content for d in documents]
metadatas = [d.metadata for d in documents]
print(f"Number of document chunks: {len(texts)}")
# Generate embeddings for each document chunk
print("Generating embeddings...")
t1 = time.time()
embedding = OpenAIEmbeddings()
text_embeddings = embedding.embed_documents(texts)
text_embedding_pairs = list(zip(texts, text_embeddings))
t2 = time.time()
print(f"Embeddings generated. Total time: {(t2-t1)}s")
# Index document chunks using a TileDB IVF_FLAT index
print("Indexing...")
tiledb_index_uri = "./tiledb_langchain_documentation_index"
if os.path.isdir(tiledb_index_uri):
shutil.rmtree(tiledb_index_uri)
db = TileDB.from_embeddings(
text_embedding_pairs,
embedding,
index_uri=tiledb_index_uri,
index_type="IVF_FLAT",
metadatas=metadatas)
t3 = time.time()
print(f"Indexing completed. Total time: {(t3-t2)}s")
print(f"Number of vector embeddings stored in TileDB-Vector-Search: {len(text_embeddings)}")
Number of raw documents loaded: 255
Number of document chunks: 1196
Generating embeddings...
Embeddings generated. Total time: 5.3482959270477295s
Indexing...
Indexing completed. Total time: 1.5287487506866455s
Number of vector embeddings stored in TileDB-Vector-Search: 1196
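Before wiring the index into a chat chain, we can sanity-check retrieval by querying the index directly. Here is a minimal sketch (the query string and number of results are arbitrary):
# Retrieve the three document chunks most similar to a sample query and preview them
docs = db.similarity_search("What is LangChain?", k=3)
for doc in docs:
    print(doc.metadata.get("source"), "->", doc.page_content[:80])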
Let’s now query our augmented ChatGPT version. Observe that ChatGPT now returns relevant information about LangChain!
db = TileDB.load(index_uri=tiledb_index_uri, embedding=embedding)
llm = ChatOpenAI(
model_name="gpt-3.5-turbo",
temperature=0
)
retriever = db.as_retriever(
search_type="similarity",
search_kwargs={"k": 5},
)
private_chatgpt = ConversationalRetrievalChain.from_llm(llm, retriever=retriever)
question = "What is langchain?"
print(f"User: {question}")
print(f"AI: {private_chatgpt.run({'question': question, 'chat_history': ''})}\n")
User: What is langchain?
AI: LangChain is a framework for developing applications powered by language models. It allows developers to build context-aware and reasoning applications by connecting a language model to various sources of context. LangChain consists of several parts, including LangChain Packages (Python and JavaScript packages), LangChain Templates (a collection of reference architectures), LangServe (a library for deploying LangChain chains as a REST API), and LangSmith (a developer platform for debugging and monitoring chains).
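ConversationalRetrievalChain can also take the running conversation into account. As a minimal sketch (the previous answer is abbreviated and the follow-up question is purely illustrative), a follow-up could be asked like this:
# Pass the previous exchange as chat history so the follow-up question is answered in context
chat_history = [(question, "LangChain is a framework for developing applications powered by language models.")]
followup = "How do I install its Python package?"
print(f"User: {followup}")
print(f"AI: {private_chatgpt.run({'question': followup, 'chat_history': chat_history})}")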
We are now going to use TileDB vector search to store the interaction history of a user with an LLM. This allows an LLM to remember past conversations and user preferences.
First, we need to create a TileDB vector index that will store the conversation history, and add some data to it.
# Create a TileDB vector index to store the conversation history
tiledb_index_uri = "./tiledb_chat_history_index"
if os.path.isdir(tiledb_index_uri):
shutil.rmtree(tiledb_index_uri)
embedding_size = 1536 # Dimensions of the OpenAIEmbeddings
TileDB.create(
index_uri=tiledb_index_uri,
index_type="IVF_FLAT",
dimensions=embedding_size,
vector_type=np.float32
)
vectorstore = TileDB.load(
index_uri=tiledb_index_uri,
embedding=OpenAIEmbeddings()
)
retriever = vectorstore.as_retriever(search_kwargs=dict(k=2))
memory = VectorStoreRetrieverMemory(retriever=retriever)
# Add some conversation history
memory.save_context(
{"input": "My name is Nikos"}, {"output": "Hello Nikos"})
memory.save_context(
{"input": "My favorite food is pizza"}, {"output": "This is a classic choice"})
memory.save_context(
{"input": "Blue is the best color"}, {"output": "Green is also nice"})
Let’s now use the TileDB vector index during a conversation with the user.
llm = ChatOpenAI(
model="gpt-3.5-turbo",
)
qa = ConversationChain(llm=llm, memory=memory)
question = "What is my name?"
print(f"User: {question}")
print(f"AI: {qa.predict(input=question)}\n")
question = "Are there any football teams with my favorite color in England?"
print(f"User: {question}")
print(f"AI: {qa.predict(input=question)}\n")
question = "Please suggest a recipe for my favorite food"
print(f"User: {question}")
print(f"AI: {qa.predict(input=question)}\n")
User: What is my name?
AI: Your name is Nikos.
User: Are there any football teams with my favorite color in England?
AI: Yes, there are several football teams in England with blue as their primary color. One example is Chelsea Football Club, which is based in London and has blue as their main color. Another example is Manchester City Football Club, also known as Man City, which is based in Manchester and also has blue as their primary color. Additionally, Everton Football Club, based in Liverpool, also has blue as their primary color. These are just a few examples, but there are more football teams in England with blue as their primary color.
User: Please suggest a recipe for my favorite food
AI: Sure! Since your favorite food is pizza, I can suggest a delicious homemade pizza recipe for you. Here's one you might enjoy:
Ingredients:
- Pizza dough
- Tomato sauce
- Mozzarella cheese
- Toppings of your choice (e.g., pepperoni, mushrooms, bell peppers)
Instructions:
1. Preheat your oven to 450°F (230°C).
2. Roll out the pizza dough on a floured surface to your desired thickness.
3. Transfer the dough to a pizza stone or baking sheet.
4. Spread a layer of tomato sauce evenly over the dough, leaving a small border around the edges.
5. Sprinkle a generous amount of mozzarella cheese over the sauce.
6. Add your favorite toppings, distributing them evenly across the pizza.
7. Bake the pizza in the preheated oven for about 12-15 minutes, or until the crust is golden brown and the cheese is bubbly.
8. Remove the pizza from the oven and let it cool for a few minutes before slicing and serving.
Enjoy your homemade pizza! Let me know if you need any more recipe suggestions or cooking tips.
This article covered the basic use cases enabled by the integration of TileDB and LangChain. However, we strongly believe in combining emerging multi-modal AI capabilities with the power of TileDB to host large collections of complex multi-modal data. Stay tuned for updates, and keep up with all other TileDB news by reading our blog.
We'd love to hear what you think of this article. Feel free to contact us, join our Slack community, or let us know on Twitter and LinkedIn.