TileDB vector search is now available in LangChain, one of the most popular large language model (LLM) application development frameworks.
In this blog, we explore how to use LangChain and TileDB to implement powerful LLM applications. We present two examples that allow LLMs to:
- Answer questions about data missing from their training set, using retrieval augmented generation (RAG).
- Remember past conversations and user preferences, using conversation memory.
The methodology behind these simple examples can easily be used to build much more sophisticated LLM applications.
The examples in this article are summarized in a TileDB Cloud notebook, which you can either download and run locally, or run directly in TileDB Cloud. Sign up to do so (no credit card required), and we’ll give you free credits so that you can evaluate without hassle.
If you are not familiar with TileDB vector search and how it differs from other vector databases, I strongly recommend reading the blogs “Why TileDB for Vector Search” and “TileDB 101: Vector Search”.
LangChain is a powerful software development framework that simplifies the process of building advanced large language model (LLM) applications. Some of the main concepts of LangChain are:
- Models: LLMs and chat models (e.g., OpenAI’s gpt-3.5-turbo) that generate text.
- Prompts: templates that turn user input into model inputs.
- Chains: compositions of models, prompts and other components into a single application workflow.
- Retrieval: document loaders, text splitters, embeddings and vector stores that connect a model to external data (the basis of retrieval augmented generation, or RAG).
- Memory: state that persists information across the turns of a conversation.
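As a tiny illustration of how models, prompts and chains fit together, here is a minimal sketch (assuming the dependencies and OpenAI API key set up later in this post; the summarization prompt is purely illustrative):
from langchain.chains import LLMChain
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate

# A prompt template turns user input into a model prompt,
# and a chain ties the prompt and the model together.
prompt = PromptTemplate.from_template("Summarize the following in one sentence: {text}")
chain = LLMChain(llm=ChatOpenAI(model_name="gpt-3.5-turbo"), prompt=prompt)
print(chain.run(text="TileDB vector search is now integrated with LangChain."))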
Let’s explore how you can leverage TileDB vector search to implement two of the main LangChain use cases, RAG and Memory.
In this blog we use Python. To install TileDB-Vector-Search, LangChain and the other required dependencies, run:
pip install "tiledb-vector-search>=0.0.18" "langchain>=0.0.331" openai tiktoken
We start by importing the necessary libraries:
import numpy as np
import os
import shutil
import time
from langchain.chains import ConversationalRetrievalChain, ConversationChain
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers.txt import TextParser
from langchain.memory import VectorStoreRetrieverMemory
from langchain.prompts import PromptTemplate
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter
from langchain.vectorstores.tiledb import TileDB
# Remember to set your OpenAI API key here
os.environ["OPENAI_API_KEY"] = 'sk-...'
One of the limitations of LLMs is that their knowledge extends only to the data used during their training. Public training datasets are missing the private and proprietary information required for enterprise applications. They are also missing information about the world and events that occurred after the dataset was created. This problem affects all types of LLMs, public or proprietary, even those deployed and used locally (e.g., in sensitive enterprise applications).
In this example, we use TileDB vector search to allow the gpt-3.5-turbo
model to answer questions about LangChain. Most ChatGPT models have limited knowledge of world events after 2021, their training data cutoff date. LangChain was created and became popular after 2021. Although we use ChatGPT 3.5, this example can easily be extended to other LLMs.
Let’s start by asking ChatGPT some questions about LangChain.
llm = ChatOpenAI(
model_name="gpt-3.5-turbo",
temperature=0
)
public_chatgpt = ConversationChain(llm=llm)
question = "What is langchain?"
print(f"User: {question}")
print(f"AI: {public_chatgpt.run(question)}")
User: What is langchain?
AI: I'm sorry, but I don't have any information about "langchain." Could you provide more context or clarify what you're referring to?
We now use LangChain’s documentation to allow ChatGPT to answer questions about the project.
We first need to download the documentation from the project repo.
git clone https://github.com/langchain-ai/langchain.git
The next step is to index the documentation using TileDB’s vector indexing.
# Parse markdown documents and split them into text chunks
documentation_path = "./langchain/docs"
loader = GenericLoader.from_filesystem(
documentation_path,
glob="**/*",
suffixes=[".mdx"],
parser=TextParser()
)
splitter = RecursiveCharacterTextSplitter.from_language(
language=Language.MARKDOWN,
chunk_size=1000,
chunk_overlap=100
)
documents = loader.load()
print(f"Number of raw documents loaded: {len(documents)}")
documents = splitter.split_documents(documents)
documents = [d for d in documents if len(d.page_content) > 5]
texts = [d.page_content for d in documents]
metadatas = [d.metadata for d in documents]
print(f"Number of document chunks: {len(texts)}")
# Generate embeddings for each document chunk
print("Generating embeddings...")
t1 = time.time()
embedding = OpenAIEmbeddings()
text_embeddings = embedding.embed_documents(texts)
text_embedding_pairs = list(zip(texts, text_embeddings))
t2 = time.time()
print(f"Embeddings generated. Total time: {(t2-t1)}s")
# Index document chunks using a TileDB IVF_FLAT index
print("Indexing...")
tiledb_index_uri = "./tiledb_langchain_documentation_index"
if os.path.isdir(tiledb_index_uri):
shutil.rmtree(tiledb_index_uri)
db = TileDB.from_embeddings(
text_embedding_pairs,
embedding,
index_uri=tiledb_index_uri,
index_type="IVF_FLAT",
metadatas=metadatas)
t3 = time.time()
print(f"Indexing completed. Total time: {(t3-t2)}s")
print(f"Number of vector embeddings stored in TileDB-Vector-Search: {len(text_embeddings)}")
Number of raw documents loaded: 255
Number of document chunks: 1196
Generating embeddings...
Embeddings generated. Total time: 5.3482959270477295s
Indexing...
Indexing completed. Total time: 1.5287487506866455s
Number of vector embeddings stored in TileDB-Vector-Search: 1196
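Before wiring the index into a chat chain, we can sanity-check retrieval by querying the index directly. Here is a minimal sketch (the query string and number of results are arbitrary):
# Retrieve the three document chunks most similar to a sample query and preview them
docs = db.similarity_search("What is LangChain?", k=3)
for doc in docs:
    print(doc.metadata.get("source"), "->", doc.page_content[:80])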
Let’s now query our augmented ChatGPT version. Observe that ChatGPT now returns relevant information about LangChain!
db = TileDB.load(index_uri=tiledb_index_uri, embedding=embedding)
llm = ChatOpenAI(
model_name="gpt-3.5-turbo",
temperature=0
)
retriever = db.as_retriever(
search_type="similarity",
search_kwargs={"k": 5},
)
private_chatgpt = ConversationalRetrievalChain.from_llm(llm, retriever=retriever)
question = "What is langchain?"
print(f"User: {question}")
print(f"AI: {private_chatgpt.run({'question': question, 'chat_history': ''})}\n")
User: What is langchain?
AI: LangChain is a framework for developing applications powered by language models. It allows developers to build context-aware and reasoning applications by connecting a language model to various sources of context. LangChain consists of several parts, including LangChain Packages (Python and JavaScript packages), LangChain Templates (a collection of reference architectures), LangServe (a library for deploying LangChain chains as a REST API), and LangSmith (a developer platform for debugging and monitoring chains).
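ConversationalRetrievalChain can also take the running conversation into account. As a minimal sketch (the previous answer is abbreviated and the follow-up question is purely illustrative), a follow-up could be asked like this:
# Pass the previous exchange as chat history so the follow-up question is answered in context
chat_history = [(question, "LangChain is a framework for developing applications powered by language models.")]
followup = "How do I install its Python package?"
print(f"User: {followup}")
print(f"AI: {private_chatgpt.run({'question': followup, 'chat_history': chat_history})}")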
We are now going to use TileDB vector search to store the interaction history of a user with an LLM. This allows an LLM to remember past conversations and user preferences.
First, we need to create a TileDB vector index that will store the conversation history, and add some data to it.
# Create a TileDB vector index to store the conversation history
tiledb_index_uri = "./tiledb_chat_history_index"
if os.path.isdir(tiledb_index_uri):
shutil.rmtree(tiledb_index_uri)
embedding_size = 1536 # Dimensions of the OpenAIEmbeddings
TileDB.create(
index_uri=tiledb_index_uri,
index_type="IVF_FLAT",
dimensions=embedding_size,
vector_type=np.float32
)
vectorstore = TileDB.load(
index_uri=tiledb_index_uri,
embedding=OpenAIEmbeddings()
)
retriever = vectorstore.as_retriever(search_kwargs=dict(k=2))
memory = VectorStoreRetrieverMemory(retriever=retriever)
# Add some conversation history
memory.save_context(
{"input": "My name is Nikos"}, {"output": "Hello Nikos"})
memory.save_context(
{"input": "My favorite food is pizza"}, {"output": "This is a classic choice"})
memory.save_context(
{"input": "Blue is the best color"}, {"output": "Green is also nice"})
Let’s now use the TileDB vector index during a conversation with the user.
llm = ChatOpenAI(
model="gpt-3.5-turbo",
)
qa = ConversationChain(llm=llm, memory=memory)
question = "What is my name?"
print(f"User: {question}")
print(f"AI: {qa.predict(input=question)}\n")
question = "Are there any football teams with my favorite color in England?"
print(f"User: {question}")
print(f"AI: {qa.predict(input=question)}\n")
question = "Please suggest a recipe for my favorite food"
print(f"User: {question}")
print(f"AI: {qa.predict(input=question)}\n")
User: What is my name?
AI: Your name is Nikos.
User: Are there any football teams with my favorite color in England?
AI: Yes, there are several football teams in England with blue as their primary color. One example is Chelsea Football Club, which is based in London and has blue as their main color. Another example is Manchester City Football Club, also known as Man City, which is based in Manchester and also has blue as their primary color. Additionally, Everton Football Club, based in Liverpool, also has blue as their primary color. These are just a few examples, but there are more football teams in England with blue as their primary color.
User: Please suggest a recipe for my favorite food
AI: Sure! Since your favorite food is pizza, I can suggest a delicious homemade pizza recipe for you. Here's one you might enjoy:
Ingredients:
- Pizza dough
- Tomato sauce
- Mozzarella cheese
- Toppings of your choice (e.g., pepperoni, mushrooms, bell peppers)
Instructions:
1. Preheat your oven to 450°F (230°C).
2. Roll out the pizza dough on a floured surface to your desired thickness.
3. Transfer the dough to a pizza stone or baking sheet.
4. Spread a layer of tomato sauce evenly over the dough, leaving a small border around the edges.
5. Sprinkle a generous amount of mozzarella cheese over the sauce.
6. Add your favorite toppings, distributing them evenly across the pizza.
7. Bake the pizza in the preheated oven for about 12-15 minutes, or until the crust is golden brown and the cheese is bubbly.
8. Remove the pizza from the oven and let it cool for a few minutes before slicing and serving.
Enjoy your homemade pizza! Let me know if you need any more recipe suggestions or cooking tips.
This article covered the basic use cases enabled by the integration of TileDB and LangChain. However, we strongly believe in combining emerging multi-modal AI capabilities with the power of TileDB to host large collections of complex multi-modal data. Stay tuned for updates, and keep up with all other TileDB news by reading our blog.
We'd love to hear what you think of this article. Feel free to contact us, join our Slack community, or let us know on Twitter and LinkedIn.