Run Open AI Models for Enterprise with a ChatGPT “Clone”

Are you looking for an alternative to ChatGPT Enterprise with no minimum seat count or annual contract, but with enterprise features such as single sign-on (SSO) through Google Workspace, OpenID, or Microsoft Entra ID? It is worth considering running an open LLM such as Llama 2 behind a ChatGPT-style web application deployed at your own domain. There is no per-seat charge for the users you onboard, and the open source ChatGPT clones available are compatible with both local generation and external APIs such as OctoAI's Text Gen service. You also have full visibility into how your users are using GenAI by logging conversations to an ELK stack, and you can grant or revoke access as needed.
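As a sketch of the supervision piece, the snippet below shows how a conversation could be shipped to Elasticsearch (the "E" in ELK) using the official Python client. The host, index name, and document schema are illustrative assumptions, not part of any particular chat frontend.

    # A minimal sketch of conversation logging to Elasticsearch, using the
    # official Python client. Host, index name, and schema are assumptions.
    from datetime import datetime, timezone
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")  # assumed ELK endpoint

    def log_conversation(user_id: str, prompt: str, response: str) -> None:
        es.index(
            index="chat-logs",  # hypothetical index for GenAI audit logs
            document={
                "user_id": user_id,
                "prompt": prompt,
                "response": response,
                "timestamp": datetime.now(timezone.utc).isoformat(),
            },
        )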

Rolling out a ChatGPT-style interface also enables your organization to benefit from future advancements in any open access foundation model, as new models can be integrated easily so long as they are exposed via an OpenAI-compatible API. Open access models such as Llama 2 and Mixtral can readily compete with the reasoning performance of GPT-3.5, despite having far fewer parameters (7-70B compared to 175B). As a result, open models can be deployed with much less GPU VRAM, or even on a CPU for an individual user or proof of concept. You also have the option of fine-tuning an open model for your specific use case, but in most cases, using embeddings with a stock model is less time-consuming than fine-tuning.

There are two main deployment methods for running a custom AI model behind a ChatGPT-style frontend. The first option is deploying your own AI worker nodes: GPU virtual machines running a container such as Hugging Face TGI or Ollama with your AI model. These nodes are orchestrated by a container orchestrator such as Docker Swarm or Kubernetes, with traffic distributed between them by an AI proxy and/or load balancer such as LiteLLM, NGINX, or HAProxy. Our AI infrastructure architects can design such an architecture for your LLM-powered application and deploy it in an on-prem or cloud environment.
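To illustrate how an application reaches such a cluster, here is a minimal, hypothetical example: because TGI, Ollama, and LiteLLM can all expose an OpenAI-compatible endpoint, the standard OpenAI Python client works once its base URL points at your proxy. The hostname, API key, and model name below are placeholders for your own deployment.

    # A minimal sketch (not a turnkey config): the OpenAI Python client pointed
    # at a self-hosted, OpenAI-compatible endpoint such as a LiteLLM proxy in
    # front of TGI or Ollama workers.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://litellm.internal:4000/v1",  # assumed proxy address
        api_key="sk-anything",  # whatever your proxy accepts; no OpenAI account needed
    )

    reply = client.chat.completions.create(
        model="llama-2-70b-chat",  # whichever model your worker nodes serve
        messages=[{"role": "user", "content": "Draft a welcome email for new hires."}],
    )
    print(reply.choices[0].message.content)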

The second option is AI Models-as-a-Service (AI MaaS), where you outsource running the AI models themselves to a provider such as Amazon Bedrock, Google Vertex AI, or OctoAI (which debuted its Image and Text Gen solution at DockerCon 2023). Using an LLM proxy such as LiteLLM, we can adapt the Bedrock or Vertex AI APIs into an OpenAI-compatible endpoint by translating the API calls from your AI application. OctoAI provides an OpenAI-compatible API by default, so it can be directly integrated into any app that supports changing the base URL. However, it is often still worthwhile to place a proxy container that supports streaming text generation in front, for logging, rate limiting, and cost control.
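As a rough sketch of this translation layer, the LiteLLM Python SDK routes the same OpenAI-style request to different backends based solely on the model string. The model IDs and the OctoAI base URL below are examples to verify against each provider's documentation.

    # A rough sketch with the LiteLLM SDK: the same OpenAI-style messages list
    # is routed to different backends by changing only the model string.
    import litellm

    # Amazon Bedrock (assumes AWS credentials are set in the environment);
    # the model ID is an example to check against Bedrock's catalog.
    bedrock_reply = litellm.completion(
        model="bedrock/meta.llama2-70b-chat-v1",
        messages=[{"role": "user", "content": "Hello!"}],
    )

    # An OpenAI-compatible provider such as OctoAI: only the base URL changes.
    octoai_reply = litellm.completion(
        model="openai/mixtral-8x7b-instruct",   # example model name
        api_base="https://text.octoai.run/v1",  # assumed endpoint; check provider docs
        api_key="octoai-token-here",
        messages=[{"role": "user", "content": "Hello!"}],
    )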

Self-Hosted AI Models vs. LLMs-as-a-Service

The main difference between self-hosting AI models and Models-as-a-Service is how the cost is incurred. Self-hosting a model requires provisioning GPU VMs, which incur a monthly cost whether or not users are accessing the GPT application, or purchasing the hardware upfront for your on-prem datacenter. The main advantage is a high degree of privacy and data sovereignty, as you directly control the machines performing inference on your users' prompts. You can select the region and zone where these machines reside, which assists with compliance with data protection regulations and disaster recovery planning.

With Models-as-a-Service, you do not directly pay for the compute instances or the hardware running your model. Instead, usage is metered per token, based on the number of tokens input to and output from the model. A token may be part of a word or an entire word, so the token count will somewhat exceed the word count of a prompt and response. For reference, with OctoAI's Text Gen solution, $10 buys 500K tokens on the most powerful Llama 2 70B model, or 1M tokens on Mixtral 8x7B.
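Using those cited rates, a quick back-of-the-envelope estimate in Python (with assumed usage figures) shows how per-token pricing scales with headcount:

    # A back-of-the-envelope cost model using the rates cited above:
    # $10 per 500K tokens on Llama 2 70B, $10 per 1M tokens on Mixtral 8x7B.
    PRICE_PER_TOKEN = {
        "llama-2-70b": 10 / 500_000,     # $0.00002 per token
        "mixtral-8x7b": 10 / 1_000_000,  # $0.00001 per token
    }

    def monthly_cost(model: str, tokens_per_user: int, users: int) -> float:
        return PRICE_PER_TOKEN[model] * tokens_per_user * users

    # Assumed usage: 50 users, each generating ~200K tokens per month
    print(f"${monthly_cost('llama-2-70b', 200_000, 50):,.2f}")  # -> $200.00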

Integrating an LLM application with AI MaaS is also relatively straightforward to set up, as you outsource the scaling of the model to the AIaaS provider. You only need to operate the frontend of the application itself in a VM or container, while the heavy lifting on the GPUs is handled on your behalf. As a "pay as you go" option without the cost of renting or purchasing GPUs directly, it is an excellent choice for an organization's first foray into AI.

Best Open Source ChatGPT “Clones” for Enterprise

Once you decide between hosting your own AI model and AI Models-as-a-Service, you should select a frontend for your ChatGPT-style user experience. There are a number of open source projects out there, but among the most robust ChatGPT clones are:

  • Bionic-GPT – Best for enterprises focused on Retrieval Augmented Generation (RAG) that need to share custom embeddings between users and teams. Offers role-based user management. Some features are freemium, split between a Community and an Enterprise Edition.
  • LibreChat – Best all-around interface for most teams. SSO with Google, OpenID, and Entra ID (formerly Azure AD) is well supported. Saved conversations and users are managed in MongoDB, which is easily integrated with Atlas.
  • Chatbot-UI – Best for individual users or small teams. Relies on a backend called Supabase, an open source alternative to Google Firebase.
[Image: Open source ChatGPT-style chat interface integrated with OctoAI's Text Gen solution models]

Autoize LLC is not an affiliate or partner of OpenAI, the developer of ChatGPT. These open source ChatGPT-style chat interfaces use chat models provided not by OpenAI, but by the open access AI model community.

LibreChat and Chatbot-UI are mobile responsive and can be pinned to a user's smartphone home screen so that the app behaves similarly to the ChatGPT app. Chats are saved server-side, so users can continue previous conversations at a later time without losing context. Unlike the ChatGPT app, where OpenAI uses chats to train its models (at least in the free, consumer version), the data remains private on your servers, and AI-as-a-Service providers do not use your chats to train their models.

We also implement a reverse proxy such as NGINX with SSL/TLS in front of the ChatGPT-style chat interface to ensure end-to-end encryption of chats between the user, your servers, and the external API (if using AIaaS). The solution can even run at the edge behind a firewall, if your use case requires it.

We can assist with all aspects of deploying Bionic-GPT, LibreChat, Chatbot-UI, and other "ChatGPT clones" with open LLMs, whether locally hosted or via AIaaS. If you need to customize or brand the experience for your internal users, integrate SSO, or add audit logging to meet the highest standards of security and compliance, contact us to discuss your project. For the most relevant deployment recommendations and estimates, please include the:

  • Number of users
  • Estimated tokens generated per month (if available)
  • Preferred LLMs to use. We are able to integrate large language models from the Hugging Face, GPT4All, or LocalAI catalogs, particularly the most popular models such as Llama 2, Falcon, Mistral (and Mixtral), and Google Gemma.