Local Embeddings with Hugging Face Text Embedding Inference

Embeddings models convert data into numerical representations that can be stored in a vector database and retrieved through a framework such as LangChain to provide context to a large language model for retrieval augmented generation (RAG). A common use case is embedding unstructured data, such as PDFs or presentations, into vector format so that users can ask questions about them through a chat model. By pairing a text embeddings model with a vision model, images in a PDF can be converted into text by the vision model, then handed off to the embeddings model to generate vectors.

Semi-structured or complex structured data, for example, sales/customer data from a CRM system, can also be embedded into a vector database through the use of an embeddings model. With that data embedded into a vector DB, a sales department can readily use a large language model as an AI copilot or assistant for tasks such as prospecting & lead generation. They could ask questions about the dataset in natural language, such as “Which customers in Anytown are the most likely to purchase widgets, based on their previous purchasing history?” For this use case, you would likely want to set up an ETL (Extract-Transform-Load) data pipeline using a tool like Airbyte or n8n, which would regularly extract fresh data from your CRM database, transform it into vector format using an embeddings model, then load said vectors into a vector DB. 

Vectorizing the data allows the AI model to work with it far more efficiently than if the same data were referenced by a simple integer index. It also enables techniques such as semantic search — finding the information within a dataset most relevant to a user’s prompt, to serve as the context for RAG inference.

Like a large language model, an embeddings model can either be served by an external API or be locally hosted. For the former option, an external API is provided by the model developers (OpenAI, Anthropic) or a cloud platform (Azure AI Studio, Amazon Bedrock, Vertex AI) where you pay based on the number of tokens processed. For the latter option, you run the model on your own machines, which may need to have accelerators such as GPUs or neural processing units (NPUs) for performance. You would serve up your own API by exposing a REST endpoint directly from the container running the model itself, or through an AI API proxy like LiteLLM.
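
To make this concrete, here is a sketch of what a request to a self-hosted, OpenAI-compatible embeddings endpoint might look like. The hostname, port, and model name below are placeholders, and the exact route depends on the serving layer you choose (both TEI and LiteLLM offer OpenAI-style /v1/embeddings routes):

curl http://localhost:3000/v1/embeddings \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"model": "nomic-ai/nomic-embed-text-v1.5", "input": "What is Deep Learning?"}'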

Why is it important to consider running an embeddings model locally? If the data you wish to generate embeddings from contains proprietary or personally identifiable information, it might not be permissible to process it through an external API. Data protection regulations might also require you to consider data residency & sovereignty, where you must ensure data processing happens in the same jurisdiction where the subject of the data resides. With a local embeddings model, you have the utmost control over the boundaries within which your data is processed into vector format for storage in a vector database.

Text Embedding Inference (TEI), the counterpart to Text Generation Inference (TGI) for generative models, is a toolkit maintained by Hugging Face for serving embeddings models. It can be spun up as a container within your environment, whether in the cloud or on-premises. TEI supports generating text embeddings on the CPU, or with CUDA acceleration on NVIDIA GPUs.
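
As a minimal standalone sketch (outside of a larger Docker Compose stack), assuming a CPU-only machine, TEI could be started with a single docker run command; the port mapping, cache directory, and model choice below are illustrative:

# the /data volume caches downloaded model weights between restarts
docker run -d \
    -p 3000:3000 \
    -e PORT=3000 \
    -v $PWD/embedding:/data \
    ghcr.io/huggingface/text-embeddings-inference:cpu-1.2 \
    --model-id nomic-ai/nomic-embed-text-v1.5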

In April 2024, LibreChat, an open source AI chat frontend, introduced its RAG API, an additional microservice in the stack that can be called to embed document data into a Postgres pgVector database. The application itself supports retrieval augmented generation for chatting with uploaded files. There are two flavors of the RAG API container, with support for:

  • librechat-rag-api-dev-lite – remote embeddings through OpenAI, Azure OpenAI, Hugging Face (an embeddings-as-a-service, not to be confused with Hugging Face TEI)
  • librechat-rag-api-dev – local embedding with Hugging Face TEI or Ollama

Ollama only recently added support for embeddings models alongside large language models, so Hugging Face TEI stands as the more mature option for serving embeddings models, with support for a wider range of models.

If you have the latest LibreChat GitHub repository pulled onto your system, you can change from the default api-dev-lite container for remote embeddings to the api-dev container for local embeddings, using the docker-compose.override.yml file at the root of the project directory.

Uncomment the following lines to override the container image:

services:
  rag_api:
    image: ghcr.io/danny-avila/librechat-rag-api-dev:latest

By default, the RAG API service uses a Postgres database that is spun up in a container called vectordb, which uses the ankane/pgvector:latest image shipping with the pgVector extension to support vector data types. The RAG API service can be reconfigured to use an external Postgres database, such as Google Cloud SQL or AlloyDB, by passing different environment variables to the rag_api container:

    environment:
      - DB_HOST=
      - DB_PORT=
      - POSTGRES_DB=
      - POSTGRES_USER=
      - POSTGRES_PASSWORD=
      - CHUNK_SIZE=512
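
As a purely hypothetical example (every value below is a placeholder, not a recommendation), a connection to an external pgVector-enabled instance might look like this:

    environment:
      - DB_HOST=10.0.0.15
      - DB_PORT=5432
      - POSTGRES_DB=librechat_vectors
      - POSTGRES_USER=rag_api
      - POSTGRES_PASSWORD=changeme
      - CHUNK_SIZE=512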

Next, in the same file, you will need to define the Text Embedding Inference container. We assume the development machine is CPU-only, so we choose the image tag text-embeddings-inference:cpu-1.2 and comment out the NVIDIA-specific lines of the configuration, which expose the NVIDIA drivers through the NVIDIA Container Toolkit. To use a GPU, use the container image without the “cpu-” prefix before the version number, for instance, text-embeddings-inference:1.2.

As long as the TEI container is running as part of the same Docker stack as the RAG API, and you do not require access to TEI from the host, you should comment out the lines that publish the container port as well. Docker’s internal DNS will automatically resolve the hostname huggingfacetei to the internal IP of the TEI container on the container network.

  huggingfacetei:
    image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.2
    platform: linux/amd64
#    deploy:
#      resources:
#        reservations:
#          devices:
#            - driver: nvidia
#              count: 1
#              capabilities: [gpu]
    command: --model-id nomic-ai/nomic-embed-text-v1.5
#    ports:
#      - "127.0.0.1:3000:3000"
    environment:
      - CORS_ALLOW_ORIGIN=http://0.0.0.0:3000
      - PORT=3000
      - MAX_CLIENT_BATCH_SIZE=512
    volumes:
      - ./embedding:/data

The MAX_CLIENT_BATCH_SIZE environment variable is needed to avoid errors where the batches of chunks sent by the RAG API exceed the maximum batch size the TEI container allows. This value should also match the CHUNK_SIZE value set for the rag_api container.

ERROR embed: text_embeddings_router::http::server: router/src/http/server.rs:525: batch size 195 > maximum allowed batch size 32.

Finally, you will need to add the following lines to the .env file in the LibreChat project directory to enable vector embeddings through the RAG API using Hugging Face TEI. 

EMBEDDINGS_PROVIDER=huggingfacetei
EMBEDDINGS_MODEL=http://huggingfacetei:3000
# DEBUG_RAG_API=true
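
After saving both files, recreate the stack with the standard Docker Compose commands so that the new rag_api and huggingfacetei containers are pulled and started; on first start, TEI should download the model weights into the ./embedding volume.

docker compose up -d
docker compose logs -f rag_api huggingfacetei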

If you configured the RAG API and TEI containers correctly, any documents uploaded by users to the LibreChat instance will be passed locally to the embeddings model running in the TEI container. The model is defined through the “command:” YAML key of that service; in the example above, nomic-ai/nomic-embed-text-v1.5.

From the right sidebar pane in the LibreChat interface, users will be able to select from files they have previously uploaded & embedded, and use them with any configured large language model for a conversation. 

To test that the Hugging Face TEI container is functioning correctly, you can invoke the following command from the Docker host. It executes cURL inside the RAG API container, makes a REST call to the embeddings endpoint, and returns the vector values.

docker compose exec rag_api curl -v http://huggingfacetei:3000/embed \
    -X POST \
    -d '{"inputs":"What is Deep Learning?"}' \
    -H 'Content-Type: application/json'

If everything is configured correctly, the response body is a JSON array containing the embedding vector for the input text.
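
A truncated, illustrative response is shown below; the actual floating-point values and vector length depend on the embeddings model:

[[0.0066, -0.0183, 0.0143, ..., -0.0213]]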
