Proxies & Load Balancers for AI LLM Models (AI Middleware)

The Cambrianesque explosion of capable, open Large Language Models (LLMs) represents an opportunity to extend virtually any application with AI capabilities, but it also calls for a strategy to manage multiple AI endpoints. Hosting open models in your own environment requires careful consideration of how to expose the API in a scalable & secure way, while controlling costs and meeting governance requirements.

A proxy or load balancer for LLM endpoints is an AI infrastructure component that can help address common challenges, which we will discuss throughout this article. Our AI infrastructure architects have incorporated the following load balancing technologies into the stack for clients using both self-hosted open models like LLaMA 2, and Models as a Service running on Amazon Bedrock and Google Vertex AI.

  • NGINX
  • HAProxy
  • LiteLLM Proxy (AI Middleware)
  • Docker Swarm Load Balancing

Calling multiple models from different providers – An LLM proxy like LiteLLM can reduce the number of endpoints you need to manage by unifying 100+ models, such as OpenAI’s GPT-3.5/4, Anthropic Claude, and Google Gemini, behind a single, consistent endpoint. In addition to these AI Models as a Service residing on platforms such as Azure AI Service, Amazon Bedrock, and Google Vertex AI, LiteLLM also supports a broad range of self-hosted, open models.

The open models include Meta LLaMA 2, Mistral AI’s Mistral and Mixtral, Google Gemma, and any of their variants on Hugging Face with different parameter counts and quantization levels. You can use a tool like Ollama, Hugging Face Text Generation Inference (TGI), or LocalAI to simplify managing the lifecycle of these models, combined with the LiteLLM proxy for its load balancing and logging features.
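As a minimal sketch of that unified interface, the same litellm call shape can reach a hosted provider or a locally served Ollama model; the model identifiers and the Ollama endpoint below are illustrative assumptions, not values from a specific deployment:

```python
# pip install litellm
# Minimal sketch: one calling convention for a hosted model and a local open model.
# The model names and Ollama api_base are illustrative assumptions.
from litellm import completion

messages = [{"role": "user", "content": "Summarize retrieval augmented generation in one sentence."}]

# Hosted model, authenticated via a provider key in the environment (e.g. OPENAI_API_KEY)
hosted = completion(model="gpt-3.5-turbo", messages=messages)

# Self-hosted LLaMA 2 served by Ollama on an internal node
local = completion(
    model="ollama/llama2",
    api_base="http://localhost:11434",  # assumed Ollama endpoint
    messages=messages,
)

print(hosted.choices[0].message.content)
print(local.choices[0].message.content)
```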

Translating OpenAI API calls to vendor-specific calls – The OpenAI API library is practically the standard for developers integrating applications with LLMs, and most AI plugins for applications were written with OpenAI’s GPT in mind. The LiteLLM proxy lets other models like Claude, Gemini, or LLaMA, whether deployed to Bedrock or Vertex AI or hosted internally, act as a “drop-in replacement” for the OpenAI API, without rewriting any code.
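For example, an application that already uses the OpenAI Python client can be pointed at the LiteLLM proxy by changing only its base URL and key; the endpoint, key, and model name in this sketch are placeholders rather than values from a real deployment:

```python
# pip install openai
# Sketch: an unmodified OpenAI-client integration talking to a LiteLLM proxy instead.
# The base_url, API key, and model name are placeholder assumptions.
from openai import OpenAI

client = OpenAI(
    base_url="http://litellm.internal:4000",  # the LiteLLM proxy, not api.openai.com
    api_key="sk-litellm-app-key",             # a key issued by the proxy, not a provider key
)

response = client.chat.completions.create(
    model="claude-3-sonnet",  # LiteLLM routes this to Anthropic, Bedrock, Vertex AI, etc.
    messages=[{"role": "user", "content": "Hello from an unchanged OpenAI integration"}],
)
print(response.choices[0].message.content)
```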

Load Balancing & Health Checks – For deploying open models in production at any degree of scale, a load balancing strategy is indispensable. An application can quickly outgrow a single node serving a model such as LLaMA 2 or Mistral, even if the node is accelerated with multiple NVIDIA CUDA GPUs. As long as conversation context is stored by the application and those tokens are sent to the model with each request, it should not matter which worker, for example one running a TGI container, provides the response.

As a purpose-built proxy for AI backends, LiteLLM supports specifying multiple deployments (targets) for a given model and load balancing between them. It distributes traffic with configurable cooldowns, fallbacks, timeouts, and retries, which can minimize queuing as an AI application takes on more users. Unresponsive AI worker nodes can also be automatically taken out of rotation if they exceed the unhealthy threshold in health checks.
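The sketch below shows this pattern with LiteLLM’s Python router: two TGI workers serve the same public model name, and requests are spread across them with retries and a cooldown for failing deployments. Hostnames, ports, and the router parameter names reflect our reading of recent LiteLLM releases and should be treated as assumptions:

```python
# pip install litellm
# Sketch: balancing one model name across two self-hosted TGI workers.
# Hostnames, ports, and router parameters are illustrative assumptions.
from litellm import Router

model_list = [
    {
        "model_name": "llama-2-13b",  # the single name applications call
        "litellm_params": {
            "model": "huggingface/meta-llama/Llama-2-13b-chat-hf",
            "api_base": "http://tgi-worker-1:8080",
        },
    },
    {
        "model_name": "llama-2-13b",  # second deployment of the same model
        "litellm_params": {
            "model": "huggingface/meta-llama/Llama-2-13b-chat-hf",
            "api_base": "http://tgi-worker-2:8080",
        },
    },
]

router = Router(
    model_list=model_list,
    num_retries=2,       # retry on another deployment if one fails
    timeout=30,          # seconds before a request is considered failed
    cooldown_time=60,    # keep an unhealthy deployment out of rotation for a while
)

reply = router.completion(
    model="llama-2-13b",
    messages=[{"role": "user", "content": "Hello from whichever worker answers"}],
)
print(reply.choices[0].message.content)
```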

General-purpose proxies such as NGINX or HAProxy are also useful for scaling and securing access to locally hosted models. They can provide an additional layer of protection against common OWASP API attacks through pattern matching, and can restrict access by country using a GeoIP database. Also, because the LiteLLM proxy container itself should be load balanced in a scale-out scenario, NGINX, HAProxy, or Docker Swarm’s routing mesh can fulfill this function.

Since the LiteLLM proxy, as well as the AI models themselves wrapped by tools such as Ollama or TGI, are deployed as containers, Docker Swarm is a composable way to orchestrate and schedule these services across your self-hosted AI infrastructure. As an added benefit, Swarm’s routing mesh provides built-in load balancing across the replicas of your LiteLLM, Ollama, and TGI containers. A request hitting the ingress on any Docker Engine node in the cluster is routed to a container publishing that port.

If this sounds interesting for your project so far, talk with our expert AI infrastructure consultants about designing an architecture that leverages containers, AI middleware, and load balancing technologies to help you operate AI models in your environment successfully. 

Authentication and Authorization – The LiteLLM proxy supports creating users and proxy API keys, which enables numerous good use cases. If multiple applications share a single LiteLLM endpoint, you can create application-specific keys to track usage and set quotas separately. It is also possible to create a temporary key that is automatically revoked after a set period of time, which is particularly helpful for external users collaborating with your team on specific projects.
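As a sketch of how this can look against the proxy’s key-management endpoint (the URL, master key, and request fields below are assumptions; verify the exact fields against your LiteLLM version):

```python
# pip install requests
# Sketch: issuing an application-specific, time-limited key from the LiteLLM proxy.
# Endpoint path and field names are assumptions based on recent LiteLLM versions.
import requests

PROXY = "http://litellm.internal:4000"
MASTER_KEY = "sk-master-key-from-config"  # kept server-side, never handed to applications

resp = requests.post(
    f"{PROXY}/key/generate",
    headers={"Authorization": f"Bearer {MASTER_KEY}"},
    json={
        "models": ["llama-2-13b"],              # restrict the key to specific models
        "duration": "30d",                      # temporary key for an external collaborator
        "metadata": {"app": "partner-project"}, # used later to attribute usage
    },
)
app_key = resp.json()["key"]
print(app_key)
```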

Centralizing the provider API keys (OpenAI, Bedrock, Vertex AI, OctoAI) in one place can also improve an organization’s security posture, by ensuring the master API keys are never leaked. 

In the event that an application-specific LiteLLM API key is accidentally pushed to a public GitHub repo, it can simply be revoked and rotated in LiteLLM, since the provider API key itself is not what leaked. In fact, as long as your LiteLLM proxy sits behind a sensibly configured firewall, the leaked key would be useless to an outside attacker, making it far less dangerous than compromising the key for a public API.

Rate Limiting – For the models configured in the LiteLLM proxy, you can set rate limits per user or per key, defined as tokens per minute (TPM), requests per minute (RPM), or number of parallel requests. This function relies on a Redis database to measure usage in near real time. Rate limits facilitate fairer sharing of the available resources on your AI worker nodes, and help you avoid exceeding the rate limits enforced by external providers.
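A short, hedged sketch of attaching such limits when a key is generated; the field names (tpm_limit, rpm_limit, max_parallel_requests) are assumptions based on recent LiteLLM releases:

```python
# pip install requests
# Sketch: generating a proxy key with TPM/RPM and parallel-request limits.
# Field names (tpm_limit, rpm_limit, max_parallel_requests) are assumptions.
import requests

PROXY = "http://litellm.internal:4000"     # placeholder proxy endpoint
MASTER_KEY = "sk-master-key-from-config"   # placeholder master key

resp = requests.post(
    f"{PROXY}/key/generate",
    headers={"Authorization": f"Bearer {MASTER_KEY}"},
    json={
        "models": ["llama-2-13b"],
        "tpm_limit": 20000,            # tokens per minute
        "rpm_limit": 60,               # requests per minute
        "max_parallel_requests": 5,    # concurrent in-flight requests
    },
)
print(resp.json()["key"])
```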

Cost Control – The LiteLLM proxy can automatically calculate the cost of requests made to the external APIs it supports, allowing you to set budgets by user, key, or model. After a limit is reached, access is cut off until the accounting period elapses.

For custom models, such as those on other providers like OctoAI or models that you host yourself, you can set a custom per-token cost for budgeting and reporting purposes. A Postgres database, such as one hosted on Supabase or Neon, must be available to LiteLLM in order to make use of this feature.
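As a sketch of assigning a custom per-token cost to a self-hosted model (the parameter names, cost values, and OpenAI-compatible api_base are assumptions; check the custom-pricing options for your LiteLLM version):

```python
# pip install litellm
# Sketch: custom per-token pricing for a self-hosted, OpenAI-compatible model,
# so LiteLLM's spend tracking can budget and report on it.
# Cost values, model name, and api_base are illustrative assumptions.
from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "mixtral-self-hosted",
            "litellm_params": {
                "model": "openai/mixtral",                 # served by a local OpenAI-compatible endpoint
                "api_base": "http://inference-worker:8000/v1",
                "input_cost_per_token": 0.0000002,         # assumed internal cost estimate
                "output_cost_per_token": 0.0000006,
            },
        }
    ]
)
```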

Logging & Observability – For troubleshooting or audit purposes, the LiteLLM proxy can log inputs, outputs, and errors to external log providers, including (but not limited to) OpenTelemetry, S3, DynamoDB, and Sentry. This can be particularly helpful for moderating abuse of large language models against your organization’s “Responsible AI” or “Acceptable Use” policy. For sensitive industries like healthcare, government, or financial services, these logs can help prove, for example, the extent to which protected data has been processed and by which endpoints.

Also, developers can use log data in their workflow to see how the configured models are responding to user prompts, and to diagnose performance issues, such as timeouts and throttling by external providers. 
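For instance, here is a hedged sketch of a custom success callback that pushes model, latency, and token usage into your own logging pipeline; the callback signature reflects our understanding of LiteLLM’s simple callback interface and may differ between versions:

```python
# pip install litellm
# Sketch: a custom success callback for logging model, latency, and token usage.
# The (kwargs, completion_response, start_time, end_time) signature is our
# understanding of LiteLLM's callback interface; verify for your version.
import litellm
from litellm import completion

def log_to_observability(kwargs, completion_response, start_time, end_time):
    record = {
        "model": kwargs.get("model"),
        "latency_s": (end_time - start_time).total_seconds(),
        "usage": getattr(completion_response, "usage", None),
    }
    print(record)  # replace with a push to your log pipeline of choice

litellm.success_callback = [log_to_observability]

completion(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello"}],
)
```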

Caching – Currently, LiteLLM supports using a Redis store to cache responses to “exact match” prompts, so that a repeated prompt does not need to be regenerated by the AI backend. This can reduce the number of API calls to the models, improving performance and reducing costs. We expect this to be an area of continued development, where “near matches” can be cached as well to improve the cache hit rate.
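A minimal sketch of switching on the Redis-backed exact-match cache in the litellm SDK (the proxy exposes the same behavior through its configuration); the import path, Redis host, and credentials are assumptions that may vary by version:

```python
# pip install litellm redis
# Sketch: exact-match response caching backed by Redis.
# Import path, Redis host, and credentials are placeholder assumptions.
import litellm
from litellm import completion
from litellm.caching import Cache

litellm.cache = Cache(type="redis", host="redis.internal", port=6379, password="changeme")

msgs = [{"role": "user", "content": "What does a reverse proxy do?"}]
first = completion(model="gpt-3.5-turbo", messages=msgs)   # generated by the backend
second = completion(model="gpt-3.5-turbo", messages=msgs)  # identical prompt, served from cache
```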

As LiteLLM proxy is such a versatile AI middleware tool, we have used it extensively for many projects involving both open and commercial AI models, within on-prem and cloud environments. Don’t hesitate to get in touch with our team about LiteLLM support and consulting.
