Nextcloud Assistant 2.0 – AI Text Gen with Phi-3 & Transcription with Whisper

Nextcloud Assistant 2.0 is the self-hosted file sync & share and groupware suite’s answer to Microsoft 365 Copilot and Gemini for Google Workspace. Compared to those Big Tech offerings, Nextcloud Assistant has a number of unmatched benefits for privacy-conscious users for whom data security and sovereignty are of paramount importance. Assistant is the leading AI assistant that can be integrated into your document workflows and hosted on your own network, whether in a virtual private cloud or completely on-premises.

Nextcloud Assistant resides in your own environment, but it’s equally important that the AI models powering its natural language capabilities are also under your control. Microsoft 365’s and Google Workspace’s AI assistants rely on OpenAI GPT-4 and Gemini models respectively, requiring you to let Microsoft’s and Google’s models trawl through all your private documents on a deployed instance of the model that they control. Nextcloud Assistant puts the power back into your hands by allowing you to integrate Assistant with any LLM or SLM of your choice, served on your own hardware for private inference, or by an AI-as-a-Service provider that you trust.

The Microsoft Phi-3 small language model (SLM) is a particularly interesting candidate for integration with Nextcloud Assistant. This lightweight 3.8B-parameter model can operate on CPU machines with AVX support and as little as 4 GB of RAM, yet its reasoning capabilities are competitive with GPT-3.5, Mixtral 8x7B, or even Llama 3 8B. Most users have Nextcloud deployed on servers with CPUs but not GPUs, so Phi-3 is a capable option for hosting a local AI model without necessarily needing to invest in GPU instances or hardware. Unlike an LLM, an SLM is optimized for resource-constrained machines, especially for local processing at the edge. Enthusiasts have even demonstrated the Phi-3 Mini model running on a Raspberry Pi 5 at over 4 tokens per second.
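
To check whether a Linux host’s CPU has the AVX support that llama.cpp-based runtimes such as Ollama generally rely on, you can inspect the CPU flags. This is a generic check, not specific to Phi-3:

$ grep -o 'avx[0-9a-z_]*' /proc/cpuinfo | sort -u
# any output (avx, avx2, avx512f, ...) indicates AVX support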

Out of the box, the Nextcloud local large language model app supports the previous-generation models LLaMA 2 and Falcon 7B. You may want to try a newer, more capable model than that for local inference. Fortunately, Nextcloud Assistant was designed to be extensible with any model served through an OpenAI API-compatible endpoint, so in many cases you will want to use that instead of the LLM app. Using a combination of Ollama for model serving and LiteLLM as AI middleware to translate Nextcloud’s OpenAI API calls into Ollama requests, it is possible to use Nextcloud Assistant with any model in Ollama’s extensive model library. Likewise, if you want to host a community model from Hugging Face, or your own custom-trained or fine-tuned checkpoint, using the Text Generation Inference (TGI) container, Nextcloud Assistant can also integrate with those models through the LiteLLM proxy.
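
To make that middleware arrangement concrete: Nextcloud issues OpenAI-style chat completion requests, and LiteLLM forwards them to Ollama. Here is a minimal sketch of such a request against the LiteLLM container configured in the requirements below, assuming it listens on 127.0.0.1:4000 (the Bearer value is a placeholder, since no master key is configured):

$ curl http://127.0.0.1:4000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer ollama" \
    -d '{"model": "ollama/phi3", "messages": [{"role": "user", "content": "Say hello in five words."}]}'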

A more sophisticated on-premises setup, as you scale out, might involve a load-balanced cluster of GPU machines behind LiteLLM’s built-in load balancing feature, or another HTTP proxy such as NGINX or HAProxy (see the sketch after this list). Nextcloud Assistant is extensible, and can support AI models:

  • running on the same machine as your Nextcloud server
  • off-loaded to a separate CPU or GPU machine from your Nextcloud server
  • behind a load-balanced endpoint of GPU inference servers
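
As a sketch of the load-balanced variant: LiteLLM distributes requests across config entries that share the same model_name. The gpu-1 and gpu-2 hostnames below are hypothetical stand-ins for two Ollama inference servers on your network:

$ cat > litellm_config.yaml <<'EOF'
model_list:
  - model_name: phi3               # one logical model exposed to Nextcloud
    litellm_params:
      model: ollama/phi3
      api_base: http://gpu-1:11434 # first Ollama inference server
  - model_name: phi3               # same name, so LiteLLM load balances across both
    litellm_params:
      model: ollama/phi3
      api_base: http://gpu-2:11434 # second Ollama inference server
EOF
$ docker run --name litellm --restart unless-stopped --net <network name> \
    -p 127.0.0.1:4000:4000 -v $(pwd)/litellm_config.yaml:/app/config.yaml \
    -d ghcr.io/berriai/litellm:main-latest --config /app/config.yaml --host 0.0.0.0 --port 4000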

The Autoize AI infrastructure and Nextcloud Implementation consulting teams have had reliable success integrating Nextcloud Assistant 2.0 with virtually any model exposed through LiteLLM, whether the underlying model was served locally (using Ollama or TGI) or by an external provider such as Azure OpenAI, Anthropic, Vertex AI, or Bedrock. The LiteLLM proxy provides a consistent OpenAI-compatible interface for Nextcloud’s AI features to integrate with, even where certain models and endpoints have not implemented all (or any) of the features of the OpenAI API.

Text Generation using Phi-3 Model with Nextcloud Assistant 2.0

Besides generating text like a GPT-style AI chatbot, Nextcloud Assistant 2.0 introduced the “Context Write”, “Summarize”, “Generate Headline”, and “Reformulate” features, all powered by the language model that you integrate.

  • Context Write – Write a passage of text based on a prompt and a writing style that you describe or an example document.
  • Summarize – Condense a lengthy document into a summary that maintains its key points.
  • Generate Headline – Generate a title based on the language model’s understanding of a text.
  • Reformulate – Rewrite text in a different way by paraphrasing its content.

Audio Transcription using OpenAI Whisper Model

Nextcloud Assistant’s “Transcribe” feature works in conjunction with an app called Whisper Speech-to-Text that uses the local OpenAI Whisper model to accurately transcribe a recording or uploaded audio file to text. Even though the model was developed by OpenAI, it is freely available and can run on any sufficiently powerful CPU or GPU server. There are three model sizes available for Whisper in Nextcloud Assistant (small, medium, large), with a trade-off between speed and error rate: the smaller the model, the faster it runs; the larger the model, the more accurate it is.

The OpenAI Whisper model currently supports 99 languages, but 65% of its training data was in English. For this reason, the small model is usually capable enough for English use cases, but medium can significantly reduce the word error rate (WER) and character error rate (CER) for multilingual usage with audio in other languages.
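
The model size is chosen when downloading it with the occ command listed in the requirements below. For example, to fetch the medium model for multilingual audio (www-data and /var/www/nextcloud are assumptions; substitute your own server user and install path):

$ sudo -u www-data php /var/www/nextcloud/occ stt_whisper:download-models medium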

Because an ordinary language model like Phi-3 or Llama 3 cannot understand audio directly, first using an automatic speech recognition (ASR) model to transcribe speech to text unlocks a plethora of possibilities in your users’ workflows.

Example AI Workflow with Nextcloud Assistant 2.0

In the remainder of this article, we will demonstrate an example workflow of how one can:

  1. extract audio from any YouTube video or meeting recording
  2. upload it to their Nextcloud share
  3. transcribe it to text using the Whisper model
  4. summarize or reformulate the text using Assistant, powered by the Phi-3 model
  5. save the transcription to Nextcloud Office and chat with the document in LibreChat (optional)

Additionally, one can save the transcribed text as a document using Nextcloud Office and upload it to a ChatGPT clone like LibreChat for AI-powered conversations with it using retrieval-augmented generation. LibreChat will use a configured embeddings model to vectorize the uploaded document into a PostgreSQL vector database, so the document can be called up in future conversations as well.

Even though Nextcloud Assistant does have a feature called Context Chat, it relies on the new AppAPI, whose deploy daemon requires exposing the Docker API socket from the host machine. This introduces the complexity of running the AppAPI Docker Socket Proxy container, which essentially gives Nextcloud root access to your machine through a TCP port. If you go this route, correct firewall configuration and authentication are essential to maintain the security of your environment.

We also found LibreChat’s interface better optimized for multi-turn conversations, with streaming text and a clear drop-down menu to switch easily between different language models. In Nextcloud Assistant’s Context Chat, the model priority is determined by the administrator and applies to all users across the instance. As it is very common for users to prefer different LLMs or SLMs for different tasks, in our view the additional flexibility of LibreChat’s UI is welcome for many use cases.

Requirements

  • Whisper Speech-to-Text app
      • Install from the app store, and run sudo -u <server user> php /path/to/nextcloud/occ stt_whisper:download-models <model name> on the Nextcloud server to download the Whisper model.
  • OpenAI and LocalAI integration app
      • Install from the app store (requires further configuration in Administration > Connected Accounts)
        • Choose endpoint: Chat completions
        • Default completion model to use: ollama/phi3 (openai)
        • Authentication method: API key
        • API key: ollama
        • Max new tokens per request: The Nextcloud system prompt is 804 tokens long, so the value to enter here is the supported context length of the model minus 804. Phi-3 Mini has a 4K (4,096-token) context window, so the appropriate value to use is 3292. If the system prompt plus the user prompt exceeds the context window of the model, users will get an “Assistant error” message.
        • Select enabled features: Text generation with the Smart Picker, Speech-to-text provider (to transcribe Talk recordings, for example)
  • Nextcloud Assistant app
      • Install from the app store
  • Custom Docker network
      • Create with: docker network create <network name>
  • Ollama container & Microsoft Phi-3 model
      • Run with: docker run --name ollama --restart unless-stopped --volume /root/ollama:/root/.ollama --net <network name> -p 127.0.0.1:11434:11434 -d ollama/ollama
      • Download the Phi-3 model in the container: docker exec -it ollama ollama run phi3
  • LiteLLM container
      • Run with: docker run --name litellm --restart unless-stopped --net <network name> -p 127.0.0.1:4000:4000 -d ghcr.io/berriai/litellm:main-latest --model=ollama/phi3 --api_base=http://ollama:11434 --drop_params --host 0.0.0.0 --port 4000 --num_workers 4
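
Before wiring Nextcloud to the proxy, it is worth a quick sanity check that the stack responds. Listing the models LiteLLM exposes should return ollama/phi3 (the Bearer value is a placeholder, as above):

$ curl http://127.0.0.1:4000/v1/models -H "Authorization: Bearer ollama"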

Extract audio from any YouTube video

We will start by using yt-dlp, a command-line utility, to download the audio from a YouTube video. The utility can be installed in a variety of ways: by downloading the binary and moving it into your PATH, or with a package manager such as pip or Homebrew, depending on your operating system.

To download only the audio from a YouTube link, pass the -x flag and the YouTube URL to the yt-dlp command at the terminal or command prompt.

$ yt-dlp -x <youtube url>
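
If you would rather work with an .mp3 file than the default .opus, yt-dlp can transcode the audio as it extracts it (this assumes ffmpeg is installed on your machine):

$ yt-dlp -x --audio-format mp3 <youtube url>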

Next, you will need to upload the .opus audio file (other audio formats, like .mp3, are also supported) to your Nextcloud share either through the desktop client or the web interface. For large uploads, the desktop client is faster & more reliable. 

Then, click the Assistant icon (three stars) in the navigation bar of the Nextcloud UI to launch Nextcloud Assistant 2.0 and select the “Transcribe” feature. Instead of making a new recording using the microphone in the browser, toggle to “Choose audio file” and use the smart picker to browse for the recording you previously uploaded.


Click the “Transcribe” button to schedule the job in the background. The transcription will begin at the next run of the Nextcloud system cron job and proceed in the background. As soon as the transcription is complete, you will be notified by a Nextcloud notification, a push notification (if enabled), and through the Nextcloud mobile app if you have it installed.
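
If you prefer not to wait for the next scheduled run, you can also trigger the cron job by hand on the server (again, substitute your own server user and install path):

$ sudo -u www-data php -f /var/www/nextcloud/cron.php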


Click “View results” in the notification to open the Assistant dialog and view the transcribed text. 

To copy the transcribed text with one click, click the “Copy” button. Then, switch the Assistant to the “Summarize” or “Reformulate” modes to use the Phi-3 language model to generate a summary of the transcription or paraphrase the text.

If you encounter an error like “Summarize task for Nextcloud assistant has failed”, try breaking the text into smaller chunks so that it doesn’t exceed the context length supported by the model or time out.

For the next part of this workflow, create a new Nextcloud Office document and copy & paste the complete text transcribed by the Whisper model into the body. You can also go back and access previously completed tasks in Nextcloud Assistant. Simply launch Assistant from the navigation menu, select the Assistant feature whose previous tasks you want to see, then click the “Previous tasks” button in the bottom-left corner of the dialog.

Save & close the document when finished, then download it to your local workstation.

In an existing LibreChat instance set up with the Azure OpenAI text-embedding-3-small model to embed documents into a PostgreSQL vector database, we start a new conversation and upload the transcription for retrieval-augmented generation.
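
For orientation, here is a hedged sketch of the kind of settings involved, loosely following the environment variable names from LibreChat’s rag_api documentation; treat every name and value as a placeholder to verify against your LibreChat version:

# excerpt from LibreChat's .env (names per rag_api conventions; values are placeholders)
RAG_API_URL=http://rag_api:8000           # where LibreChat reaches the RAG API
EMBEDDINGS_PROVIDER=azure                 # embed with Azure OpenAI
EMBEDDINGS_MODEL=text-embedding-3-small   # model used to vectorize uploads
POSTGRES_DB=librechat                     # pgvector database storing the embeddings
POSTGRES_USER=librechat
POSTGRES_PASSWORD=<password>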
