angelborroy, Community Manager

As developers building AI-powered applications, one of the critical decisions we face is how to host and serve large language models (LLMs). Whether you're using Spring AI or any other framework, response latency and developer ergonomics matter, especially in real-time scenarios like dynamic content enrichment.

In this post, we’ll walk through a project that compares two popular self-hosted LLM backends:

  • Ollama: an all-in-one tool for running models such as Llama 2 or Mistral locally with minimal setup
  • Docker Model Runner: a Docker-integrated runner that serves LLMs behind an OpenAI-compatible API, aimed at containerized environments (such as Kubernetes or CI/CD setups)

We'll show you how to configure both in Spring AI, how they perform under different prompt complexities, and when it might be worth switching from Ollama to Docker Model Runner, or sticking with what you have.

The Project Setup

The code is a simple Spring Boot app configured with two REST-based LLM clients using Spring AI.
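One possible way to wire the two clients is through Spring AI starter properties, using the Ollama starter for the first backend and Spring AI's OpenAI-compatible client for Docker Model Runner. The snippet below is only a sketch; the base URLs, ports, and model names are assumptions for a local setup and will vary in your environment:

  # Ollama backend (default local Ollama port)
  spring.ai.ollama.base-url=http://localhost:11434
  spring.ai.ollama.chat.options.model=mistral

  # Docker Model Runner backend, reached through the OpenAI-compatible client;
  # the URL, placeholder API key and model name below are assumptions
  spring.ai.openai.base-url=http://localhost:12434/engines
  spring.ai.openai.api-key=dummy
  spring.ai.openai.chat.options.model=ai/mistral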

Both are wired through a "ChatService" and exposed via a REST endpoint where you can choose which backend to query (a minimal sketch follows the list):

  • POST /chat/ollama for Ollama
  • POST /chat/runner for Docker Model Runner
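The project's actual classes may look different, but a minimal sketch of this routing idea, with illustrative names such as ChatController and exact Spring AI calls depending on the version in use, is roughly:

  import org.springframework.ai.chat.client.ChatClient;
  import org.springframework.ai.ollama.OllamaChatModel;
  import org.springframework.ai.openai.OpenAiChatModel;
  import org.springframework.stereotype.Service;
  import org.springframework.web.bind.annotation.PostMapping;
  import org.springframework.web.bind.annotation.RequestBody;
  import org.springframework.web.bind.annotation.RequestMapping;
  import org.springframework.web.bind.annotation.RestController;

  @RestController
  @RequestMapping("/chat")
  class ChatController {

      private final ChatService chatService;

      ChatController(ChatService chatService) {
          this.chatService = chatService;
      }

      @PostMapping("/ollama")
      String ollama(@RequestBody String prompt) {
          return chatService.chat("ollama", prompt);
      }

      @PostMapping("/runner")
      String runner(@RequestBody String prompt) {
          return chatService.chat("runner", prompt);
      }
  }

  @Service
  class ChatService {

      private final ChatClient ollamaClient;
      private final ChatClient runnerClient;

      ChatService(OllamaChatModel ollamaModel, OpenAiChatModel runnerModel) {
          // One ChatClient per backend, both sharing the same fluent API
          this.ollamaClient = ChatClient.create(ollamaModel);
          this.runnerClient = ChatClient.create(runnerModel);
      }

      String chat(String backend, String prompt) {
          ChatClient client = "runner".equals(backend) ? runnerClient : ollamaClient;
          return client.prompt().user(prompt).call().content();
      }
  }

Because both backends sit behind the same ChatClient abstraction, the controller never needs to know which engine is answering; switching backends becomes a routing decision rather than a code change.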

Benchmarking Script

The script model_comparison.py runs a series of prompts, ranging from simple fact retrieval to complex creative writing, against both backends and logs the response time and token output rate for each.
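The real script lives in the project repository; a simplified sketch of the same idea, in which the application URL, the JSON payload shape, and the prompts are assumptions, would look like this:

  import time

  import requests

  BASE_URL = "http://localhost:8080"
  BACKENDS = {"Ollama": "/chat/ollama", "Runner": "/chat/runner"}
  PROMPTS = [
      "What is the capital of France?",                        # simple fact retrieval
      "Write a short story about a robot learning to paint.",  # complex creative writing
  ]

  for prompt in PROMPTS:
      for name, path in BACKENDS.items():
          start = time.time()
          response = requests.post(BASE_URL + path, json={"prompt": prompt})
          elapsed_ms = (time.time() - start) * 1000
          # Rough token count from whitespace splitting of the returned text
          tokens = len(response.text.split())
          print(f"{name}: {elapsed_ms:.0f} ms, ~{tokens / (elapsed_ms / 1000):.1f} tokens/sec")

Estimating tokens per second from the raw response text is only an approximation; the actual script may rely on the token counts reported by the backends themselves.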

Benchmark Results

Here’s the output we got when running the script on an Apple Silicon M1 Pro machine:

Overall Model Performance:
Time (ms):
Model     count      mean   median    min     max
Ollama       50  11982.18   4137.5    330   34216
Runner       50  12872.06   3548.5    295   39886

Tokens/sec:
Model      mean  median    min    max   std
Ollama    23.65   24.31  18.52  27.82  2.55
Runner    24.53   24.68  16.28  28.47  2.13

Per-Prompt Performance (Token Output Rate, tokens/sec):

                 mean             median             min              max
Prompt       Ollama   Runner   Ollama   Runner   Ollama   Runner   Ollama   Runner
Prompt 1      19.98    22.31    19.83    22.66    18.52    16.28    21.21    23.73
Prompt 2      23.03    23.25    23.51    22.68    20.95    20.60    24.60    25.09
Prompt 3      23.82    25.25    24.17    25.34    21.11    24.13    26.27    26.24
Prompt 4      26.95    27.29    27.19    27.33    25.63    25.85    27.82    28.47
Prompt 5      24.47    24.54    24.65    24.62    23.51    23.93    25.17    25.50

Speedup Factors (Runner vs Ollama, computed as mean Runner tokens/sec divided by mean Ollama tokens/sec):

Prompt 1    1.12
Prompt 2    1.01
Prompt 3    1.06
Prompt 4    1.01
Prompt 5    1.00

Interpretation

Consider both Ollama and Docker Model Runner as viable options, since:

  • Both systems demonstrate equivalent performance across most prompt types (token generation rates differ by only 0-12%)
  • Performance differences fall within ranges that would be imperceptible to most users in real-world applications
  • Each shows small advantages in specific contexts, as outlined below

Ollama might be preferable if:

  • You're in local development or early prototyping
  • You need slightly more efficient token usage for complex comparative tasks
  • You're working with minimal configuration requirements

Consider Docker Model Runner if:

  • You're moving toward cloud-native or Kubernetes deployments
  • You need API compatibility in environments that don't support Ollama's tooling directly
  • You want integration with systems like Prometheus, CI/CD, or service mesh
  • You prefer slightly faster response times for simple queries

Final Thoughts

The real power of this project is not just comparing LLMs; it’s showing how easy it is to plug in different backends with Spring AI while keeping a common application contract. This is a powerful pattern for Alfresco developers experimenting with local vs. cloud LLMs, or evaluating hosted APIs vs. self-managed models.

Resources