As developers building AI-powered applications, one of the critical decisions we face is how to host and serve large language models (LLMs). Whether you're using Spring AI or any other framework, response latency and developer ergonomics matter, especially in real-time scenarios like dynamic content enrichment.
In this post, we’ll walk through a project that compares two popular self-hosted LLM backends: Ollama and Docker Model Runner, Docker Desktop’s built-in engine for serving models locally.
We'll show you how to configure both in Spring AI, how they perform under different prompt complexities, and when it might be worth switching from Ollama to Docker Model Runner, or sticking with what you have.
The code is a simple Spring Boot app configured with two REST-based LLM clients using Spring AI, one for each backend.
Both are wired through a ChatService and exposed via a REST endpoint where you can choose which backend to query:
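The project’s exact wiring isn’t reproduced here, so what follows is a minimal sketch of how such a setup can look with recent Spring AI versions: the Ollama starter auto-configures an OllamaChatModel, while the OpenAI starter provides an OpenAiChatModel that can be pointed at Docker Model Runner’s OpenAI-compatible endpoint. The class names, the /chat path, the backend request parameter, and the concrete URLs are illustrative assumptions, not the project’s actual code.

// Hypothetical sketch. Assumes the Spring AI Ollama and OpenAI starters are on the classpath and
// auto-configured via properties along these lines (exact port/path for Docker Model Runner
// depends on your Docker Desktop setup; the OpenAI api-key can be a dummy value locally):
//   spring.ai.ollama.base-url=http://localhost:11434
//   spring.ai.openai.base-url=http://localhost:12434/engines
//   spring.ai.openai.api-key=not-needed
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.ollama.OllamaChatModel;
import org.springframework.ai.openai.OpenAiChatModel;
import org.springframework.stereotype.Service;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

@Service
class ChatService {

    private final ChatClient ollamaClient;
    private final ChatClient runnerClient;

    ChatService(OllamaChatModel ollamaModel, OpenAiChatModel runnerModel) {
        // One ChatClient per backend; both expose the same fluent API.
        this.ollamaClient = ChatClient.create(ollamaModel);
        this.runnerClient = ChatClient.create(runnerModel);
    }

    String chat(String backend, String prompt) {
        ChatClient client = "runner".equalsIgnoreCase(backend) ? runnerClient : ollamaClient;
        return client.prompt().user(prompt).call().content();
    }
}

@RestController
class ChatController {

    private final ChatService chatService;

    ChatController(ChatService chatService) {
        this.chatService = chatService;
    }

    // GET /chat?backend=ollama|runner&prompt=...
    @GetMapping("/chat")
    String chat(@RequestParam(defaultValue = "ollama") String backend,
                @RequestParam String prompt) {
        return chatService.chat(backend, prompt);
    }
}

Because both backends sit behind the same ChatClient abstraction, the REST contract stays identical no matter which model answers, which is what makes side-by-side benchmarking straightforward.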
The script model_comparison.py runs a series of prompts from simple fact retrieval to complex creative writing. For each, it logs response times for both backends.
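The Python script itself isn’t reproduced here; conceptually it does something like the loop below, sketched in Java against the hypothetical /chat endpoint from the previous snippet. The prompt texts, port, and output format are illustrative assumptions, not the actual benchmark.

// Illustrative only; the real benchmark is the Python script model_comparison.py.
// Assumes the Spring Boot app sketched above is running on localhost:8080.
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class BenchmarkSketch {
    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();
        List<String> prompts = List.of(
                "What is the capital of France?",                        // simple fact retrieval
                "Write a short story about a robot learning to paint."); // complex creative writing
        for (String backend : List.of("ollama", "runner")) {
            for (String prompt : prompts) {
                String url = "http://localhost:8080/chat?backend=" + backend
                        + "&prompt=" + URLEncoder.encode(prompt, StandardCharsets.UTF_8);
                long start = System.nanoTime();
                // Time the full round trip for this backend and prompt.
                HttpResponse<String> response = http.send(
                        HttpRequest.newBuilder(URI.create(url)).build(),
                        HttpResponse.BodyHandlers.ofString());
                long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                System.out.printf("%-7s %-55s %6d ms (%d chars)%n",
                        backend, prompt, elapsedMs, response.body().length());
            }
        }
    }
}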
Here’s the output we got when running the script on an Apple Silicon M1 Pro machine:
Overall Model Performance:

           Time (ms)                                 ...  Tokens/sec
           count      mean  median  min    max       ...    mean  median    min    max   std
Model                                                ...
Ollama        50  11982.18  4137.5  330  34216       ...   23.65   24.31  18.52  27.82  2.55
Runner        50  12872.06  3548.5  295  39886       ...   24.53   24.68  16.28  28.47  2.13
Per-Prompt Performance (Token Output Rate - tokens/sec):

             mean           median             min             max
Model      Ollama Runner  Ollama Runner   Ollama Runner   Ollama Runner
Prompt
Prompt 1    19.98  22.31   19.83  22.66    18.52  16.28    21.21  23.73
Prompt 2    23.03  23.25   23.51  22.68    20.95  20.60    24.60  25.09
Prompt 3    23.82  25.25   24.17  25.34    21.11  24.13    26.27  26.24
Prompt 4    26.95  27.29   27.19  27.33    25.63  25.85    27.82  28.47
Prompt 5    24.47  24.54   24.65  24.62    23.51  23.93    25.17  25.50
Speedup Factors (Runner vs Ollama - based on tokens/sec):

Prompt
Prompt 1    1.12
Prompt 2    1.01
Prompt 3    1.06
Prompt 4    1.01
Prompt 5    1.00
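Each factor is simply the ratio of mean token rates (Runner ÷ Ollama): for Prompt 1, 22.31 / 19.98 ≈ 1.12, the largest gap in the run, while the remaining prompts sit within a few percent of parity.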
Based on these numbers, both Ollama and Docker Model Runner are viable options: on this hardware they deliver essentially the same throughput (roughly 24 tokens/sec on average, with at most a 12% gap on any single prompt) and comparable end-to-end response times.
Ollama might be preferable if you want the simplest standalone setup, its large catalogue of ready-to-pull models, or a local LLM without running Docker Desktop at all. Consider Docker Model Runner if your workflow is already Docker-centric, you want models pulled and managed like other Docker artifacts, or you want an OpenAI-compatible endpoint out of the box.
The real power of this project is not just comparing LLMs; it’s showing how easy it is to plug in different backends with Spring AI while keeping a common application contract. This is a powerful pattern for Alfresco developers experimenting with local vs. cloud LLMs, or evaluating hosted APIs against self-managed models.