Llama 3 AI Model Serving with Ollama & LiteLLM

In mid-April 2024, Meta debuted Llama 3, the latest iteration of its open source large language model. Compared to its predecessor, Llama 3 was trained on a dataset of 15 trillion tokens, about 7X more training data than Llama 2. Owing to this, Llama 3 is more knowledgeable about a broader range of topics and able to generate richer responses to user prompts. Additionally, about 5% of Llama 3's training data is multilingual, spanning over 30 languages other than English, which allows the model to support localized applications and AI translation use cases. The context length of Llama 3 is also extended to 8K tokens, up from 4K for Llama 2, so AI applications can support longer conversations with a greater number of turns, as well as retrieval augmented generation with lengthier data sources.

Llama 3 will be available as a foundation model on models-as-a-service platforms including Amazon Bedrock, Azure AI Studio, Hugging Face, and Vertex AI. It will also support hardware acceleration on AMD, Intel, Nvidia, and Qualcomm GPUs and AI processors, broadening the choice of commodity hardware where the model can be deployed. 

You don't need to wait for the rollout of Llama 3 on your preferred platform to give it a try. Today, Llama-3-8B and Llama-3-70B are generally available and can be pulled from the Ollama library. Ollama is a tool that simplifies model serving and management: you can download it for your Linux, Mac, or Windows workstation, or run the Docker container to serve the Llama 3 model to any AI application that plugs into an OpenAI-compatible endpoint.
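
As an illustration of that last point, here is a minimal sketch of standing up the Ollama container on its own and querying its OpenAI-compatible endpoint with curl. The container name and prompt are arbitrary; the LibreChat-integrated setup follows below.

$ docker run -d --name ollama -p 11434:11434 -v ollama:/root/.ollama ollama/ollama
$ docker exec -it ollama ollama pull llama3
$ curl http://localhost:11434/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "llama3", "messages": [{"role": "user", "content": "Say hello in three languages."}]}'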

LibreChat, the most rapidly trending chat interface for AI models, supports integration with text LLMs served using Ollama. Ollama accepts either text or image (for image-to-text) inputs, and generates a textual response. There are two ways to integrate LibreChat with Ollama — 1) directly using the Ollama provider or 2) through the LiteLLM provider which proxies requests to Ollama. 

Example 1 – Serve Llama 3 using Ollama & integrate with LibreChat directly

The simplest way to get up and running with Llama 3 on LibreChat is to integrate directly with a running Ollama container. If you haven’t launched an Ollama container yet, add the following block to your docker-compose.override.yml file. It is recommended to use the override file so that your custom config will not conflict with the repo when you update LibreChat to the latest version by running a git pull. 

If you have the Nvidia Container Toolkit and Nvidia drivers configured on a host with a GPU, uncomment the lines in the “deploy” block. The downloaded models will be stored in the ollama subfolder of your project directory and bind mounted into the container, so ensure you have sufficient disk space. The 8B model (default) is 4.7GB and the 70B model is 40GB in size.

services:
  ollama:
    image: ollama/ollama:latest
#    deploy:
#      resources:
#        reservations:
#          devices:
#            - driver: nvidia
#              capabilities: [compute, utility]
    ports:
      - "11434:11434"
    volumes:
      - ./ollama:/root/.ollama
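
If you would like to verify the Ollama service on its own before wiring up LibreChat, you can start just that service and query its HTTP port:

$ docker compose up -d ollama
$ curl http://localhost:11434/
# Ollama replies with a short status message ("Ollama is running")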

Next, before starting up the LibreChat AI stack, you will want to configure Ollama as a provider of the Llama 3 model for api, the application container. To accomplish this, add the following block under the “services” YAML key to bind mount librechat.yaml in your project directory into the LibreChat container. 

  api:
    volumes:
      - type: bind
        source: ./librechat.yaml
        target: /app/librechat.yaml

If you cloned your project directory from the LibreChat GitHub repository, you have a file named librechat.example.yaml that you can copy and adapt to your own needs. 

$ cp librechat.example.yaml librechat.yaml
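
With the override file and librechat.yaml both in place, a quick sanity check is to render the merged Compose configuration and confirm that the ollama service and the api bind mount appear as expected:

$ docker compose config | less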

Using your preferred text editor, remove all the other blocks under the “endpoints” YAML key, and add in the following block for Ollama. You can even allow users to select from Llama 3 or Llama 2 for their conversation by specifying the models as an array for the “models” key. 

“addParams” specifies the list of stop sequences that Llama 3 uses to indicate the end of a conversation turn, preventing the model from generating endlessly and outputting these special tokens to the user.

endpoints:
  custom:
    - name: "Ollama"
      apiKey: "ollama"
      baseURL: "http://ollama:11434/v1/chat/completions"
      models:
        default: [
          "llama3",
          "llama2"
          ]
        fetch: false # fetching list of models is not supported
      titleConvo: true
      titleModel: "llama3"
      summarize: false
      summaryModel: "llama3"
      forcePrompt: false
      modelDisplayLabel: "Ollama"
      addParams:
        "stop": [
          "<|start_header_id|>",
          "<|end_header_id|>",
          "<|eot_id|>",
          "<|reserved_special_token"
        ]
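
For context, these strings are special tokens from the Llama 3 instruct chat template. A single user turn is rendered roughly as shown below; the <|eot_id|> token marks the end of each message, which is why these tokens belong in the stop list:

<|begin_of_text|><|start_header_id|>user<|end_header_id|>

Why is the sky blue?<|eot_id|><|start_header_id|>assistant<|end_header_id|>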

After saving and closing librechat.yaml, start the LibreChat AI stack from the project directory. It is a good idea to pull the latest version of the GitHub repository, as well as the Docker images, to ensure you are running the latest builds.

$ git pull
$ docker compose pull
$ docker compose up -d
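
Once the stack is up, confirm the containers are running; the exact list of services depends on your Compose files, but it should include at least api and ollama:

$ docker compose ps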

To download the Llama 3 (and, if desired, Llama 2) models, exec into the ollama container with the following commands. Each ollama run command downloads the model on first use and then opens an interactive prompt; once the download completes, type /bye (or press Ctrl-D) to exit back to the host terminal.

$ docker compose exec -it ollama ollama run llama3
$ docker compose exec -it ollama ollama run llama2
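
To confirm the downloads succeeded, you can list the models stored by the Ollama container:

$ docker compose exec ollama ollama list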

If you perform all of the above steps correctly, you should be able to select Ollama from the list of providers in the LibreChat dropdown, and have conversations with Llama 3 through the chat interface. 

Llama 3 model serving with Ollama and (optional) LiteLLM proxy

Example 2 – Proxy requests to Llama 3 through LiteLLM to Ollama

There are several reasons why you might wish to route requests to Ollama through the LiteLLM proxy as an AI middleware layer, instead of calling Ollama directly. LiteLLM supports features such as load balancing, virtual keys, and cost management, which are particularly useful when deploying an AI model on a cluster of GPU inference machines.

If your team is interested in locally hosting Llama 3 on GPU instances in the cloud or on-prem, get in touch with Autoize’s AI infrastructure architects. Our AI consultants are regularly involved in engagements designing & operationalizing AI models for a variety of different environments. Autoize can assist you in right-sizing the number & specs of GPU instances for the inference throughput you require, while adhering to security & data protection requirements. As an integrator of AI applications with large language models, Autoize can accelerate your organization’s journey with generative AI technologies, such as Llama.

For this example, we will assume that the Llama 3 (and Llama 2) model is still running in a container that is on the same Docker network as the rest of the LibreChat stack. This will be the case, so long as all the services are defined within the same Compose file, and you haven’t defined any custom Docker networks.

Follow the steps in Example 1 to configure the Ollama container and enable the LibreChat config file, but instead of configuring Ollama as a model provider directly, proceed with the Example 2 steps below to configure LiteLLM as the model provider instead. 

In the docker-compose.override.yml file of the project directory, additionally define the litellm container under the “services” YAML key:

  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    volumes:
      - ./litellm/config.yaml:/app/config.yaml
    ports:
      - "127.0.0.1:3000:3000"
    command:
      - /bin/sh
      - -c
      - |
        pip install async_generator
        litellm --config '/app/config.yaml' --debug --host 0.0.0.0 --port 3000 --num_workers 8
    entrypoint: []

Then, create a litellm subfolder in your project directory and an empty LiteLLM configuration file, config.yaml.

$ mkdir ./litellm
$ touch ./litellm/config.yaml

Add the following contents to litellm/config.yaml:

model_list:
  - model_name: llama3
    litellm_params:
      model: openai/llama3
      api_base: "http://ollama:11434/v1"
      api_key: sk-1234
  - model_name: llama2
    litellm_params:
      model: openai/llama2
      api_base: "http://ollama:11434/v1"
      api_key: sk-1234

litellm_settings: # module level litellm settings - https://github.com/BerriAI/litellm/blob/main/litellm/__init__.py
  drop_params: True
  set_verbose: True

general_settings:
  master_key: sk-1234
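
As an aside, the load balancing mentioned earlier is configured in this same file: listing multiple deployments under the same model_name causes LiteLLM to distribute requests among them. Below is a minimal sketch, assuming a hypothetical second Ollama container named ollama-2 that is not part of this tutorial's Compose file:

model_list:
  - model_name: llama3
    litellm_params:
      model: openai/llama3
      api_base: "http://ollama:11434/v1"   # first Ollama instance
      api_key: sk-1234
  - model_name: llama3
    litellm_params:
      model: openai/llama3
      api_base: "http://ollama-2:11434/v1" # hypothetical second Ollama instance
      api_key: sk-1234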

Finally, in the LibreChat configuration file librechat.yaml, add the following block to specify a custom AI provider for LiteLLM under the “endpoints” YAML key:

endpoints:
  custom:
    - name: "LiteLLM"
      # A placeholder - otherwise it becomes the default (OpenAI) key
      # The real keys are provided per model in "litellm/config.yaml"
      apiKey: "sk-1234"
      # The litellm service defined in docker-compose.override.yml above
      baseURL: "http://litellm:3000"
      # The "default" models to start new users with. With "fetch" enabled,
      # LibreChat pulls the rest of the available models from LiteLLM, so any
      # model defined in litellm/config.yaml can be selected.
      models:
        default: ["llama3", "llama2"]
        fetch: true
      titleConvo: true
      titleModel: "llama3"
      summarize: false
      summaryModel: "llama3"
      forcePrompt: false
      modelDisplayLabel: "LiteLLM"
      addParams:
        "stop": [
          "<|start_header_id|>",
          "<|end_header_id|>",
          "<|eot_id|>",
          "<|reserved_special_token"
        ]

Remember to bring your LibreChat AI stack down and up again before testing the integration of Llama 3 with LibreChat. If everything was configured correctly, LiteLLM will be available as a provider in the dropdown menu of the chat interface, with Llama 3 and Llama 2 as the possible models. If you encounter any errors, check whether you have run the docker compose exec commands described in Example 1 to download the models.

$ docker compose down 
$ docker compose up -d
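
Because the LiteLLM port is published on the host's loopback interface, you can also query the proxy directly to confirm that it is reachable and that the models from config.yaml are registered, passing the master_key as the bearer token:

$ curl http://127.0.0.1:3000/v1/models \
    -H "Authorization: Bearer sk-1234"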