In this guide, we will walk through updating an existing Spring Boot project (using the Spring AI library) from relying on Ollama for local LLMs to using Docker’s new Model Runner feature. The focus is practical and developer-oriented – expect step-by-step setup instructions, configuration snippets, and tips to ensure everything runs smoothly on Apple Silicon with optimal performance.

Prerequisites and Context

  • Apple Silicon Mac (M1/M2 or newer) – Docker Model Runner is currently supported only on Mac with Apple Silicon (M1–M4 series). Windows support (for NVIDIA GPUs) is planned later in 2025.
  • Docker Desktop 4.40+ – Model Runner was introduced as a Beta feature in Docker Desktop 4.40.
  • Spring AI 1.0+ – The Spring AI library supports multiple AI backends (OpenAI, Ollama, etc.). We will leverage its OpenAI compatibility to integrate Docker’s Model Runner.
  • Existing Ollama setup – We assume your project currently uses Ollama (likely via the "spring-ai-starter-model-ollama" dependency), in the same way as the Alfresco AI Framework project. Ollama is an open-source tool that runs local LLMs and provides an API (OpenAI-compatible) on your machine.

Why Docker Model Runner?

Docker’s Model Runner provides a “Docker-native” way to run LLMs locally with GPU acceleration on Apple Silicon Macs. Unlike Ollama (which runs as a separate service or container), Model Runner is tightly integrated with Docker Desktop and uses a host-based inference engine (currently powered by "llama.cpp") for maximum performance. In practice, this means you can expect faster inference by leveraging the Apple GPU, plus a seamless experience managing models via the Docker CLI. Model Runner supports a range of open models out-of-the-box (packaged as OCI artifacts on Docker Hub) – including popular ones like Mistral 7B, LLaMA-based models, and Alibaba’s Qwen 2.5. Here is the full list of supported models. We’ll see how to pull and run these models next.

Setting Up Docker Model Runner

First, set up Docker Model Runner on your Mac and verify you can run models locally:

1. Update & Enable Docker Model Runner: Make sure Docker Desktop is updated to v4.40 or above. In Docker Desktop Settings, under Experimental Features, ensure “Model Runner” is toggled on (in 4.40+ it should be on by default).

After that, check the status:

docker model status

This should report that the Model Runner is running.

2. Familiarize Yourself with the Model CLI: Docker introduces a new "docker model" CLI for managing models. For example:

  • "docker model list" – List available models you can pull/run
  • "docker model pull <model>" – Download a model from Docker Hub
  • "docker model ls" – List models you have downloaded locally
  • docker model run <model> "<prompt>" – Run a model with a given prompt (non-interactive)

These commands behave like other Docker CLI commands (pulling and listing images, for example).

3. Pull an LLM Model: Next, download a model to use. Docker’s GenAI model registry is under the **ai** namespace on Docker Hub. For example, to pull the Mistral 7B model:

docker model pull ai/mistral

The CLI will fetch the model (the first pull can be large and slow, but it’s cached afterwards). You can browse the "ai" namespace on Docker Hub to see other available models – e.g. "ai/gemma3" (Google’s Gemma 3), "ai/llama3.2" (Meta LLaMA-based), "ai/qwen2.5" (Qwen 2.5), etc.

4. Test the Model Locally: Once pulled, test it via CLI:

docker model run ai/mistral "Hello, how are you?"

You should see the model produce a completion for the prompt (e.g. a greeting response). You can also start an interactive session with "docker model run ai/mistral" (no prompt argument) – type queries and see responses in a REPL-like chat. This confirms the model runs correctly on your machine.

5. Enable HTTP Access (from outside containers): Docker Model Runner runs as a background service, accessible via Docker’s Unix socket by default. To call it from our Spring Boot app (running on the host), the easiest option is to enable a TCP port for the Model Runner API. Use the Docker Desktop CLI or GUI to expose it. For example, in a terminal run:

docker desktop enable model-runner --tcp 12434

This will enable the API on a localhost port (12434 is the default if not specified). After enabling, the Model Runner’s OpenAI-compatible endpoints will be available at "http://localhost:12434/...". We will use this URL in our Spring configuration.
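
Before touching the Spring configuration, you can sanity-check the API from plain Java (17+). The sketch below assumes the TCP port enabled above and the OpenAI-style path that Spring AI will also use ("/engines" base plus "/v1/chat/completions"), with the "ai/mistral" model pulled earlier; the class name is just for illustration.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ModelRunnerSmokeTest {
    public static void main(String[] args) throws Exception {
        // OpenAI-style chat completion request sent to the local Model Runner endpoint
        String body = """
            {"model": "ai/mistral",
             "messages": [{"role": "user", "content": "Hello, how are you?"}]}
            """;
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:12434/engines/v1/chat/completions"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        // Expect HTTP 200 and an OpenAI-style JSON payload with a "choices" array
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}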

With Docker Model Runner up and running, and a model downloaded and tested, we’re ready to integrate it into the Spring Boot application.

Migrating from Ollama to Docker Model Runner

In this approach, we will remove the Ollama integration entirely and make Docker Model Runner the sole local LLM backend.

1. Remove Ollama Dependencies and Config

  • Remove Ollama Starter: If your project includes the Ollama Spring AI starter ("spring-ai-starter-model-ollama", or "spring-ai-ollama-spring-boot-starter" in older milestone releases), remove it from your build. In its place, make sure the OpenAI starter ("spring-ai-starter-model-openai") is on the classpath – see the dependency sketch after this list.
  • Disable Ollama in Config: In your "application.yml", remove or disable any Ollama-specific settings. For example, properties like "spring.ai.ollama.*" should be removed. We will introduce new properties for Docker Model Runner (via the OpenAI config) next.
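
A minimal "pom.xml" sketch of the dependency to keep, assuming Spring AI 1.0 GA artifact names with the version managed by the Spring AI BOM (remove the corresponding "spring-ai-starter-model-ollama" entry):

<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-starter-model-openai</artifactId>
</dependency>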

2. Configure Spring AI to Use Docker Model Runner (OpenAI API Mode)

Docker Model Runner exposes an OpenAI-compatible REST API. This means we can treat it much like OpenAI’s service, but point it to our local endpoint (with no API key needed). Spring AI’s OpenAI integration can be repurposed for this.

Update your Spring Boot configuration in "application.yml":

spring:
  ai:
    model:
      chat: openai
    openai:
      base-url: http://localhost:12434/engines
      api-key: nokeyrequired
      init:
        pull-model-strategy: when_missing
      chat:
        options:
          model: ai/qwen2.5
          temperature: 0.0
      embedding:
        options:
          model: ai/mxbai-embed-large

Explanation: We point the Spring AI OpenAI client to "http://localhost:12434/engines" as the base URL. The "api-key" is set to a placeholder value, as Docker’s local LLM service does not require any authentication (just as Ollama didn’t require a token). We also set the default chat model name to "ai/qwen2.5", which ensures the request payloads specify that model. Additionally, we specify an embedding model, "ai/mxbai-embed-large", to populate the vector database.
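
On the embedding side, the existing Spring AI code stays the same. The following is a minimal sketch assuming the autoconfigured EmbeddingModel bean; the class name is illustrative and your actual ingestion code will differ.

import org.springframework.ai.embedding.EmbeddingModel;
import org.springframework.stereotype.Component;

// Hypothetical ingestion helper: the autoconfigured EmbeddingModel now calls
// Docker Model Runner's ai/mxbai-embed-large through the OpenAI-compatible API.
@Component
public class DocumentEmbedder {

    private final EmbeddingModel embeddingModel;

    public DocumentEmbedder(EmbeddingModel embeddingModel) {
        this.embeddingModel = embeddingModel;
    }

    public float[] embed(String text) {
        // Returns the embedding vector used to populate the vector database
        return embeddingModel.embed(text);
    }
}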

Endpoint differences: Under Ollama, the default API base was "http://localhost:11434". Now it’s "http://localhost:12434/engines". Fortunately, Spring AI abstracts these details – as long as the base URL and model name are configured, your existing service layer (e.g. making a chat completion request) should work the same. The JSON responses from Model Runner are designed to mimic OpenAI’s format, so parsing logic remains unchanged.
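
For reference, here is a minimal, hypothetical controller built on Spring AI’s fluent ChatClient – the kind of service-layer code that does not need to change during this migration (class and endpoint names are illustrative):

import org.springframework.ai.chat.client.ChatClient;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

// The autoconfigured ChatClient.Builder now talks to Docker Model Runner
// through the OpenAI-compatible API configured in application.yml.
@RestController
public class ChatController {

    private final ChatClient chatClient;

    public ChatController(ChatClient.Builder builder) {
        this.chatClient = builder.build();
    }

    @GetMapping("/chat")
    public String chat(@RequestParam String question) {
        // Sends a chat completion request to ai/qwen2.5 (the configured default model)
        return chatClient.prompt()
                .user(question)
                .call()
                .content();
    }
}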

Changes required to support Docker Model Runner for the Alfresco AI Framework are available in https://github.com/aborroy/alfresco-ai-framework/compare/feature/docker-model-runner?expand=1

3. Verify the Integration

Build and run your Spring Boot application. Test the portions of your app that call the LLM:

  • If you have a REST endpoint in Spring that triggers a chat completion, call it and observe the response. The response content should now be coming from the local model (Qwen, Mistral, or whichever model you configured) via Docker – see the test sketch after this list for an automated check.
  • Check the application logs for any errors. If the Model Runner API is unreachable or misconfigured, you’ll likely see connection errors – double-check the base URL (and that the Docker Desktop Model Runner TCP port is enabled). If you get an OpenAI authentication error, ensure that Spring isn’t expecting a real API key – for local use, "api-key" just needs any non-empty placeholder.
  • Performance test: Try a few queries and note the latency. You should find it comparable to or better than Ollama’s performance, thanks to GPU acceleration. For instance, generating a few hundred tokens of text with a 7B model on an M1 should be noticeably faster under Docker Model Runner’s optimized engine.
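
A minimal, hypothetical smoke test for this step – it requires Docker Model Runner to be running with the configured models pulled, so treat it as an integration check rather than a unit test:

import static org.junit.jupiter.api.Assertions.assertFalse;

import org.junit.jupiter.api.Test;
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;

@SpringBootTest
class ModelRunnerIntegrationTest {

    @Autowired
    ChatClient.Builder builder;

    @Test
    void chatCompletionReturnsContent() {
        // A round trip through Spring AI -> Model Runner should produce a non-empty answer
        String answer = builder.build()
                .prompt()
                .user("Reply with the single word: pong")
                .call()
                .content();
        assertFalse(answer == null || answer.isBlank());
    }
}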

If everything looks good, congratulations – your Spring application is now fully running local LLMs through Docker Model Runner, with no Ollama dependency! You can manage models via Docker (pull new versions, remove unused ones with "docker model rm", etc.), and your app remains unchanged in how it interacts with the AI.

Performance Considerations on Apple Silicon

One of the motivations to use Docker Model Runner is performance. Internally, it uses "llama.cpp" with Apple’s Metal acceleration to run models on the GPU. This can significantly speed up inference for supported models, especially compared to CPU-bound execution. If your Mac has a capable GPU (M1 Pro/Max, M2, or newer), Model Runner should make use of it. You can monitor CPU/GPU usage via Activity Monitor to verify that the workload shifts to the GPU when using Model Runner.

Additionally, the models provided via Docker Hub are often quantized (e.g., 4-bit or 8-bit) to balance speed and memory usage. For instance, a model tag like `1B-Q8_0` denotes 8-bit quantization. These quantizations slightly reduce precision but run faster and use less RAM – important for local environments. Ollama also supports quantized models (GGUF/GGML format), so performance might be similar if both are using llama.cpp under the hood. However, Docker Model Runner’s tight integration may give it an edge, and it simplifies leveraging the GPU without extra configuration.

If you need maximum performance, use the smallest model that meets your needs. For example, if a small Qwen 2.5 variant or a distilled 3B model suffices, it will be much faster than a 7B LLaMA. Docker Model Runner’s model catalog includes some “small-but-efficient” models (like the 1B LLaMA 3.2 variant) you can experiment with.
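
Swapping models is then a one-line configuration change. For example, to try the smaller LLaMA 3.2 variant (assuming you have pulled "ai/llama3.2"), only the chat model option in "application.yml" changes:

spring:
  ai:
    openai:
      chat:
        options:
          model: ai/llama3.2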

Conclusion

By following this guide, you’ve modernized your Spring Boot AI application to use Docker’s Model Runner for local LLM inference. You’ve shed the extra Ollama service and now manage models through Docker – enjoying benefits like integrated CLI management and potentially faster generation using Apple’s GPUs.

Recap of key steps

  • Upgraded Docker and pulled LLM models via the new "docker model" CLI (supported models include Mistral, LLaMA-family models, Qwen 2.5, etc.)
  • Enabled the Model Runner’s OpenAI-compatible API endpoint on "localhost:12434"
  • Reconfigured Spring AI to point to the local Model Runner instead of Ollama, using the OpenAI integration with a custom base URL (no API key needed).
  • Verified that the application works with the new backend and observed improvements in speed on Apple Silicon.

Going forward, you can keep an eye on Docker’s updates – the Model Runner feature is still evolving. For example, more models or engines (beyond llama.cpp) might be supported. The good news is your Spring AI abstraction will likely continue to work with minor config tweaks for new versions.

By embracing Docker Model Runner, you maintain the benefits of local AI (data privacy, offline capability) while using Docker’s robust tooling to manage your models.

Happy coding with your locally hosted LLMs!