

As developers building AI-powered applications, we face a critical decision: how to host and serve large language models (LLMs). Whether you're using Spring AI or another framework, response latency and developer ergonomics matter, especially in real-time scenarios like dynamic content enrichment.
In this post, we’ll walk through a project that compares two popular self-hosted LLM backends:
- Ollama: an all-in-one tool for running models like Llama 2 or Mistral locally with blazing speed
- Docker Model Runner: a wrapper that serves LLMs behind an OpenAI-compatible API, aimed at containerized environments such as Kubernetes or CI/CD setups
We'll show you how to configure both in Spring AI, how they perform under different prompt complexities, and when it might be worth switching from Ollama to Docker Model Runner, or sticking with what you have.
The Project Setup
The code is a simple Spring Boot app configured with two REST-based LLM clients using Spring AI (a configuration sketch follows this list):
- ollamaClient pointing to the Ollama local endpoint (default: "http://localhost:11434")
- dockerClient pointing to the Docker Model Runner endpoint (default: "http://localhost:12434/engines")
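For illustration, here is a minimal configuration sketch. It assumes Docker Model Runner is reached through Spring AI's OpenAI-compatible client, that both starters are on the classpath, and that the base URLs above are supplied via the spring.ai.ollama.base-url and spring.ai.openai.base-url properties; the bean names and class layout are illustrative rather than the project's exact code:

```java
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.ollama.OllamaChatModel;
import org.springframework.ai.openai.OpenAiChatModel;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

// Sketch only: assumes the auto-configured OllamaChatModel and OpenAiChatModel beans
// point at http://localhost:11434 and http://localhost:12434/engines respectively
// (configured via spring.ai.ollama.base-url and spring.ai.openai.base-url).
@Configuration
class LlmClientConfig {

    @Bean
    ChatClient ollamaClient(OllamaChatModel ollamaChatModel) {
        // Wrap the Ollama chat model in the fluent ChatClient API
        return ChatClient.create(ollamaChatModel);
    }

    @Bean
    ChatClient dockerClient(OpenAiChatModel openAiChatModel) {
        // Docker Model Runner is used through the OpenAI-compatible client
        return ChatClient.create(openAiChatModel);
    }
}
```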
Both are wired through a ChatService and exposed via a REST endpoint where you can choose which backend to query (see the controller sketch after this list):
- POST /chat/ollama for Ollama
- POST /chat/runner for Docker Model Runner
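Here is a sketch of what the service and controller layer might look like with Spring AI's ChatClient fluent API; apart from the endpoint paths above, the names are illustrative:

```java
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.stereotype.Service;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

// Sketch only: delegates to whichever ChatClient matches the requested backend.
@Service
class ChatService {

    private final ChatClient ollamaClient;
    private final ChatClient dockerClient;

    ChatService(@Qualifier("ollamaClient") ChatClient ollamaClient,
                @Qualifier("dockerClient") ChatClient dockerClient) {
        this.ollamaClient = ollamaClient;
        this.dockerClient = dockerClient;
    }

    String ask(String backend, String prompt) {
        ChatClient client = "runner".equals(backend) ? dockerClient : ollamaClient;
        // The fluent call is identical for both backends; only the endpoint differs.
        return client.prompt().user(prompt).call().content();
    }
}

@RestController
@RequestMapping("/chat")
class ChatController {

    private final ChatService chatService;

    ChatController(ChatService chatService) {
        this.chatService = chatService;
    }

    @PostMapping("/ollama")
    String ollama(@RequestBody String prompt) {
        return chatService.ask("ollama", prompt);
    }

    @PostMapping("/runner")
    String runner(@RequestBody String prompt) {
        return chatService.ask("runner", prompt);
    }
}
```

The application contract stays the same regardless of which backend answers, which is what makes swapping backends painless later on.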
Benchmarking Script
The script model_comparison.py runs a series of prompts ranging from simple fact retrieval to complex creative writing and, for each prompt, logs the response times of both backends.
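The real script is Python; to keep the snippets in this post in one language, here is a hypothetical Java probe showing what the measurement boils down to. It assumes the Spring Boot app is running locally on port 8080, the prompts are illustrative, and it logs response length in characters as a rough proxy rather than the tokens/sec figures the real script reports:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.Map;

public class LatencyProbe {

    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();

        // Illustrative prompts, from simple fact retrieval to creative writing
        List<String> prompts = List.of(
                "What is the capital of France?",
                "Write a short story about a lighthouse keeper.");

        // Assumes the Spring Boot app from above is listening on port 8080
        Map<String, String> backends = Map.of(
                "ollama", "http://localhost:8080/chat/ollama",
                "runner", "http://localhost:8080/chat/runner");

        for (String prompt : prompts) {
            for (var backend : backends.entrySet()) {
                HttpRequest request = HttpRequest.newBuilder(URI.create(backend.getValue()))
                        .header("Content-Type", "text/plain")
                        .POST(HttpRequest.BodyPublishers.ofString(prompt))
                        .build();

                // Wall-clock latency of the full request/response round trip
                long start = System.nanoTime();
                HttpResponse<String> response =
                        http.send(request, HttpResponse.BodyHandlers.ofString());
                long elapsedMs = (System.nanoTime() - start) / 1_000_000;

                System.out.printf("%s | %d ms | %d chars%n",
                        backend.getKey(), elapsedMs, response.body().length());
            }
        }
    }
}
```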
Benchmark Results
Here’s the output we got when running the script on an Apple Silicon M1 Pro machine:
Overall Model Performance:

Response time (ms):

| Model  | Runs | Mean     | Median | Min | Max   |
|--------|------|----------|--------|-----|-------|
| Ollama | 50   | 11982.18 | 4137.5 | 330 | 34216 |
| Runner | 50   | 12872.06 | 3548.5 | 295 | 39886 |

Token output rate (tokens/sec):

| Model  | Mean  | Median | Min   | Max   | Std  |
|--------|-------|--------|-------|-------|------|
| Ollama | 23.65 | 24.31  | 18.52 | 27.82 | 2.55 |
| Runner | 24.53 | 24.68  | 16.28 | 28.47 | 2.13 |
Per-Prompt Performance (token output rate, tokens/sec):

| Prompt   | Ollama mean | Runner mean | Ollama median | Runner median | Ollama min | Runner min | Ollama max | Runner max |
|----------|-------------|-------------|---------------|---------------|------------|------------|------------|------------|
| Prompt 1 | 19.98       | 22.31       | 19.83         | 22.66         | 18.52      | 16.28      | 21.21      | 23.73      |
| Prompt 2 | 23.03       | 23.25       | 23.51         | 22.68         | 20.95      | 20.60      | 24.60      | 25.09      |
| Prompt 3 | 23.82       | 25.25       | 24.17         | 25.34         | 21.11      | 24.13      | 26.27      | 26.24      |
| Prompt 4 | 26.95       | 27.29       | 27.19         | 27.33         | 25.63      | 25.85      | 27.82      | 28.47      |
| Prompt 5 | 24.47       | 24.54       | 24.65         | 24.62         | 23.51      | 23.93      | 25.17      | 25.50      |
Speedup Factors (Runner vs Ollama, based on mean tokens/sec). Each factor is the Runner's mean token rate divided by Ollama's for the same prompt (e.g. Prompt 1: 22.31 / 19.98 ≈ 1.12):

| Prompt   | Speedup |
|----------|---------|
| Prompt 1 | 1.12    |
| Prompt 2 | 1.01    |
| Prompt 3 | 1.06    |
| Prompt 4 | 1.01    |
| Prompt 5 | 1.00    |
Interpretation
Both Ollama and Docker Model Runner are viable options, since:
- Both backends deliver comparable performance across most prompt types (token generation rates differ by only 0-12%)
- Performance differences fall within ranges that would be imperceptible to most users in real-world applications
- Each shows balanced advantages in specific contexts
Ollama might be preferable if:
- You're in local development or early prototyping
- You need slightly more efficient token usage for complex comparative tasks
- You're working with minimal configuration requirements
Consider Docker Model Runner if:
- You're moving toward cloud-native or Kubernetes deployments
- You need API compatibility in environments that don't support Ollama's tooling directly
- You want integration with systems like Prometheus, CI/CD, or service mesh
- You prefer slightly faster response times for simple queries
Final Thoughts
The real power of this project is not just in comparing LLMs; it's in showing how easy Spring AI makes it to plug in different backends while keeping a common application contract. This is a powerful pattern for Alfresco developers experimenting with local versus cloud LLMs, or evaluating hosted APIs against self-managed models.