Retrieval Augmented Generation (RAG) with Local Embeddings

Retrieval Augmented Generation (RAG) may well be the hottest topic in generative AI right now. It has been clear since ChatGPT took the world by storm that RAG is one of the best enterprise use cases for GenAI. RAG changes the game for knowledge management: instead of searching through knowledge bases and original sources with “keywords”, users can converse with a chatbot in “natural language” and receive answers in natural language as well. This can greatly improve productivity and customer satisfaction, and it enables use cases that were simply not possible before. For example, a Large Language Model could parse embedding data arriving through a data pipeline from a CRM and give an account executive insight into which customers are most likely to purchase a complementary product or service, based on their purchasing history. An LLM is not simply a replacement for search; it is a “copilot” that helps employees make smarter, data-driven decisions by parsing context that would be too time-consuming, or outright impossible, for a human to consider in their workflow.

A major concern for enterprises considering RAG is the need to share their proprietary, corporate data with the LLM in order to enable these AI use cases. This is where locally hosted LLMs and embedding models, such as LLaMA 2 and Unstructured, and vector databases, such as Postgres with the pgvector extension, come in.

For the uninitiated, what is an embedding? An embedding is a numerical representation of unstructured or semi-structured data, stored in a vector or graph data type that AI models such as LLMs can easily consume. Many well-known databases, including PostgreSQL, MongoDB, Redis, and Elasticsearch, have added support for vector data types as they contend with newer graph databases like Neo4j that came of age as the current AI renaissance began.
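To make this concrete, here is a minimal sketch of what storing and searching an embedding in Postgres with the pgvector extension can look like, using a small local embedding model. The model name, connection string, and table schema are assumptions for illustration only, not part of any specific product setup.

```python
# A minimal sketch, assuming a local sentence-transformers model and a Postgres
# database with the pgvector extension enabled. Model, table, and connection
# details are illustrative assumptions.
import psycopg
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # small local model, 384-dimensional vectors
text = "Quarterly sales notes for the ACME account."
vec = "[" + ",".join(str(x) for x in model.encode(text)) + "]"   # pgvector literal format

with psycopg.connect("dbname=rag user=postgres") as conn:
    conn.execute("CREATE EXTENSION IF NOT EXISTS vector;")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents "
        "(id bigserial PRIMARY KEY, body text, embedding vector(384));"
    )
    conn.execute(
        "INSERT INTO documents (body, embedding) VALUES (%s, %s::vector);",
        (text, vec),
    )
    # Similarity search: '<->' is pgvector's Euclidean-distance operator.
    rows = conn.execute(
        "SELECT body FROM documents ORDER BY embedding <-> %s::vector LIMIT 3;",
        (vec,),
    ).fetchall()
    print(rows)
```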

If you use a service like ChatGPT to upload and chat with your documents, you must trust OpenAI to run your confidential data through its embedding model and safeguard it in its vector database. This limits the potential use cases for many organizations, which cannot take the chance of having a third party like OpenAI store and process their most sensitive data.

The alternative is to build a custom AI infrastructure stack using components such as the Bionic-GPT “ChatGPT-style” frontend, the LLaMA 2 large language model, and the Unstructured embedding model. This stack can be hosted in the environment of your choice, from your on-premises datacenter to a virtual private cloud (VPC) with one of the major providers such as Google Cloud or Microsoft Azure. With a RAG-ready GenAI stack, you don’t have to make any privacy compromises to adopt AI in your enterprise: no confidential, business-sensitive data leaves your datacenter or cloud VPC for an external provider, because the vector database resides squarely within the boundaries you define.
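As a rough illustration of what “no data leaves your datacenter” looks like in practice, the sketch below queries a locally hosted model through an OpenAI-compatible endpoint, as exposed by middleware such as LiteLLM Proxy or Hugging Face TGI. The URL, model name, and prompt are assumptions for illustration, not Bionic-GPT’s actual internals.

```python
# A minimal sketch of chatting with a locally hosted LLM over an OpenAI-compatible
# API. The base_url and model name are assumptions; nothing here calls an external
# provider, so requests stay inside your own network.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

response = client.chat.completions.create(
    model="llama-2-13b-chat",
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": "Summarise our refund policy for enterprise customers."},
    ],
)
print(response.choices[0].message.content)
```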

Of course, with a fully private AI infrastructure where you operate your own AI models, you must provision an adequate number of NVIDIA GPU-accelerated servers to act as AI worker nodes for both inference and embedding. GPUs are currently in relatively short supply as organizations rush to adopt AI, which can make it difficult to secure GPU virtual machines on the major clouds. If you are inclined to go down this path, our AI infrastructure architects are well versed in tools, including reverse proxies, load balancers, and container orchestrators, that can help with scaling, securing, and managing a locally hosted AI endpoint.

It is also possible to take a hybrid approach, where embeddings are generated with a local embedding model and stored in a local vector database, but inference is handled by an AI-as-a-Service (AIaaS) platform such as Google Vertex AI, Amazon Bedrock, or the OctoAI Text and Image Inferencing Solution. This limits your exposure by eliminating the need to store your entire embedding dataset with an external provider, while letting you benefit from “pay as you go” per-token pricing instead of committing to the monthly cost of a fleet of GPU compute nodes.
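Here is a rough sketch of that hybrid pattern: embedding and retrieval stay local, and only the question plus a handful of retrieved snippets are sent to the hosted inference endpoint. The endpoint URL, model name, environment variable, and table schema are illustrative assumptions.

```python
# A minimal sketch of hybrid RAG: local embeddings and pgvector retrieval, remote
# inference. Only the question and the retrieved context leave your environment,
# never the full embedding dataset.
import os
import psycopg
from openai import OpenAI
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")     # runs entirely on your own hardware
question = "Which customers renewed support contracts last quarter?"
q_vec = "[" + ",".join(str(x) for x in embedder.encode(question)) + "]"

# Retrieve the closest chunks from the local vector database.
with psycopg.connect("dbname=rag user=postgres") as conn:
    rows = conn.execute(
        "SELECT body FROM documents ORDER BY embedding <-> %s::vector LIMIT 4;",
        (q_vec,),
    ).fetchall()
context = "\n\n".join(r[0] for r in rows)

# Send only the question and retrieved snippets to the AIaaS endpoint (hypothetical URL).
client = OpenAI(base_url="https://api.example-aiaas.com/v1", api_key=os.environ["AIAAS_API_KEY"])
answer = client.chat.completions.create(
    model="llama-2-70b-chat",
    messages=[{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}],
)
print(answer.choices[0].message.content)
```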

Generally speaking, AIaaS providers charge a per-token fee for tokens sent to and generated by the model. These models include commercial offerings like Google Gemini and Anthropic Claude as well as open models such as LLaMA 2, Mistral, and Gemma. Because you pay for the privilege of using the AIaaS endpoints, the Terms of Service typically exclude your data from being used to train the model, so you avoid the unenviable situation where “if you are not the customer, you are the product.” It is also computationally difficult to convert the vector data of an embedding back into plain text, so it is unlikely that an AIaaS provider would violate its customers’ privacy by attempting to do so, beyond serving the responses to your applications’ requests. The provider can, however, read the inputs and outputs of the model in clear text, so we recommend AIaaS for general-purpose applications and fully private AI for the most sensitive use cases.
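For a back-of-the-envelope feel for per-token pricing, the snippet below works through the arithmetic with purely hypothetical rates and traffic figures; they are not any provider’s actual price list.

```python
# Hypothetical "pay as you go" maths: rates and volumes below are assumptions
# for illustration only.
input_rate = 0.50 / 1_000_000    # assumed $ per input token
output_rate = 1.50 / 1_000_000   # assumed $ per output token

requests_per_day = 5_000
input_tokens, output_tokens = 1_200, 300   # assumed average tokens per request

daily_cost = requests_per_day * (input_tokens * input_rate + output_tokens * output_rate)
print(f"~${daily_cost:.2f}/day")   # about $5.25/day at these assumed rates and volumes
```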

Aside from the “big boys” like Bedrock and Vertex AI, we can recommend OctoAI as a competitive solution for running open-access AI models without needing to worry about provisioning GPUs. With Bionic-GPT, you can upload text documents (.docx and .pdf), spreadsheet data (.xlsx and .csv), and decks (.pptx), which are processed with a local embedding model (Unstructured) and stored in a Postgres database with the pgvector extension. Fine-grained access control over the embedding data can be integrated with an identity provider such as Keycloak, Microsoft Entra ID, or another OpenID-compliant IdP. Users within a Team and Company in Bionic-GPT can create and share prompts that draw on one or more embedding datasets, using any Large Language Model.
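For a sense of what that document-ingestion step can look like under the hood, here is a minimal sketch that uses the open-source unstructured library to split a file into text elements before embedding. The file name, filtering rule, and model are illustrative assumptions, not Bionic-GPT’s actual pipeline.

```python
# A minimal sketch of document ingestion: partition a file into text elements with
# the unstructured library, then embed the resulting chunks locally.
from unstructured.partition.auto import partition
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

elements = partition(filename="employee-handbook.pdf")   # also handles .docx, .pptx, .xlsx, .csv
chunks = [el.text for el in elements if el.text and len(el.text) > 40]   # drop trivially short fragments

vectors = embedder.encode(chunks)
print(f"{len(chunks)} chunks embedded, each as a {vectors.shape[1]}-dimensional vector")
```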

Contact our team for AI & Data solutions about setting up Bionic-GPT Community or Enterprise Edition, in conjunction with AI middleware such as Hugging Face Text Generation Inference (TGI), LiteLLM Proxy, and Docker. The sky’s the limit for the AI-enabled applications you can roll out for text and image generation, knowledge dissemination, customer support, and digital assistants, all while meeting your data governance requirements.
