Enterprise content is scattered. Your Alfresco repository holds thousands of contracts, reports, and policies. Your Nuxeo instance holds another set. Neither talks to the other, and neither can answer a question like "summarize all the supplier agreements we reviewed last quarter". The content is not missing. The problem is that traditional keyword search doesn't understand meaning, and RAG systems built for the cloud weren't designed with your ACLs in mind.
Content Lake is a PoC that shows a different path: a fully on-premises, permission-aware, multi-source semantic search and RAG platform built on hxpr -- the Hyland Content Lake platform. This post walks through what hxpr is, what the PoC demonstrates, and how you can run it yourself in under 10 minutes.
hxpr is the storage and query engine at the heart of Hyland's Content Intelligence Cloud (CIC). From a developer's perspective, it does three things:
Within the broader Hyland CIC ecosystem, hxpr sits between Knowledge Enrichment (content processing) and Knowledge Discovery (RAG and agentic Q&A). The Content Lake PoC shows how to use hxpr directly with Alfresco and Nuxeo as sources, either fully on-premises or in a hybrid configuration.
You could point a generic RAG platform at your Alfresco repository. Several exist. But they weren't built for ECM. Three features drive that gap:
1. Live event-driven ingestion. When a document is uploaded, moved, or has its permissions changed in Alfresco, an ActiveMQ Event2 message fires. The Content Lake live ingester consumes it and re-indexes the document in hxpr within seconds. No cron jobs, no polling lag, no scheduled resyncs. A generic RAG connector typically cannot replicate this because it has no awareness of the Alfresco event bus.
2. Permission-aware search at the ACL level. hxpr stores the full Alfresco ACL for every document -- individual users, groups, GROUP_EVERYONE, deny entries, all of it -- with each principal namespaced to its source instance (for example, john_#_alfresco-prod). At query time, hxpr evaluates those ACLs against the authenticated user's actual group memberships, server-side, before returning a single result. Because the filtering happens before any result leaves the server, the search API does not return documents the caller is not authorized to read. Generic RAG systems either ignore permissions entirely or approximate them with broad role-based index scoping.
3. Scope control via the content model. Business users decide what gets indexed by applying the cl:indexed aspect to a folder directly from the Alfresco UI (or the Content Lake sidebar extension). No configuration files, no API calls, no developer intervention. Apply cl:excludeFromLake to opt out any file or subtree. The ingester discovers scope from the content model, not from hardcoded paths.
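The permission check from point 2 can be pictured with a minimal sketch. Everything in it is hypothetical: the `AclEntry` record, the `principalKey` helper, and the deny-wins rule are assumptions made for illustration, not hxpr's actual implementation. Only the `name_#_sourceId` namespacing format comes from the example above.

```java
import java.util.List;
import java.util.Set;

// Hypothetical ACL entry: a namespaced principal plus an allow/deny flag.
record AclEntry(String principal, boolean allow) {}

final class AclSketch {
    // Namespaces a user or group to its source instance, e.g. "john_#_alfresco-prod".
    static String principalKey(String name, String sourceId) {
        return name + "_#_" + sourceId;
    }

    // Assumption: an explicit deny on any of the caller's principals wins;
    // otherwise at least one allow entry must match.
    static boolean canRead(List<AclEntry> acl, Set<String> callerPrincipals) {
        boolean allowed = false;
        for (AclEntry entry : acl) {
            if (!callerPrincipals.contains(entry.principal())) {
                continue; // entry does not concern this caller
            }
            if (!entry.allow()) {
                return false; // explicit deny short-circuits
            }
            allowed = true;
        }
        return allowed;
    }
}
```

In this sketch a caller with no matching entry is refused by default, which is the conservative choice for a search filter.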
The PoC lives across five GitHub repositories, each with its own role:
| Repo | Role |
|---|---|
| content-lake-app | Java ingestion pipeline and RAG service (the core logic) |
| content-lake-app-deployment | Docker Compose stack that wires everything together |
| alfresco-content-lake-ui | ACA/ADW extension: semantic search panel + RAG chat sidebar |
| content-lake-app-ui | Standalone demo UI with Alfresco + Nuxeo dual auth |
| nuxeo-deployment | Local Nuxeo + PostgreSQL stack for multi-source testing |
The deployment repo is your entry point for running the full stack. It builds all Java services directly from GitHub via Docker BuildKit, so no local checkouts are required.
Understanding the pipeline is key to tuning and extending it. Content Lake uses a two-phase sync model that balances speed with completeness.
In the first phase, when a document enters scope (via batch discovery or a live event), the ingester immediately writes its metadata and ACLs to hxpr and marks the document as PENDING.
Users can query by metadata and see the document appear in hxpr almost instantly, even before the content is extracted.
In the second phase, an async worker picks up the queued document and processes it:
- Text extraction (… ConversionService) to get plain text from PDFs, Word docs, and other formats
- Status update: PENDING → INDEXED
Every write is guarded by a source_modifiedAt timestamp comparison. If hxpr already has a version at least as new as the source, the write is skipped. This means batch and live ingesters can run simultaneously without producing duplicates, and re-running a batch sync is always safe.
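The timestamp guard described above can be sketched in a few lines. This is a minimal illustration, not the project's code: the `WriteGuard` class and `shouldWrite` method are hypothetical names, and a stored value of `null` is assumed to mean that hxpr has no copy of the document yet.

```java
import java.time.Instant;

final class WriteGuard {
    // Returns true only when the incoming source version is strictly newer
    // than the stored one, or when nothing is stored yet (null).
    static boolean shouldWrite(Instant storedModifiedAt, Instant incomingModifiedAt) {
        if (storedModifiedAt == null) {
            return true;
        }
        return incomingModifiedAt.isAfter(storedModifiedAt);
    }
}
```

Because an equal timestamp is skipped, replaying the same event or re-running a batch over unchanged content becomes a no-op. A real implementation would presumably also need the comparison and the write to happen atomically on the storage side.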
The four types in content-lake-spi (one record and three interfaces) are the entire contract between a content source and the pipeline:
import java.io.InputStream;
import java.time.Instant;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Universal document representation
public record SourceNode(
        String nodeId, String sourceId, String sourceType,
        String name, String path, String mimeType,
        Instant modifiedAt, boolean folder,
        Set<String> readPrincipals,
        Map<String, Object> sourceProperties) {}

// Read content from the source
public interface ContentSourceClient {
    SourceNode getNode(String nodeId);
    List<SourceNode> getChildren(String nodeId);
    InputStream downloadContent(String nodeId);
}

// Extract plain text from a binary
public interface TextExtractor {
    boolean supports(String mimeType);
    String extractText(InputStream content, String mimeType);
}

// Decide what gets ingested
public interface ScopeResolver {
    boolean isInScope(SourceNode node);
    boolean shouldTraverse(SourceNode node);
}
Implement the three interfaces, mapping your source's documents onto SourceNode, and you have a new content source. The chunking, embedding, ACL storage, and RAG query layers are untouched. This is how Alfresco and Nuxeo coexist in the same pipeline today, and how OnBase, SharePoint, or any other ECM could be added.
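To give a feel for how small an implementation can be, here is a hedged sketch of two of the SPI interfaces. The SPI types are repeated (without `public`) so the example compiles on its own. `PlainTextExtractor` and `AspectScopeResolver` are hypothetical classes, and the `"aspects"` key inside `sourceProperties` is an assumption about how a source might expose aspect names, not something the SPI defines.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.time.Instant;
import java.util.Collection;
import java.util.Map;
import java.util.Set;

// Types repeated from the SPI listing above so the sketch is self-contained.
record SourceNode(
        String nodeId, String sourceId, String sourceType,
        String name, String path, String mimeType,
        Instant modifiedAt, boolean folder,
        Set<String> readPrincipals,
        Map<String, Object> sourceProperties) {}

interface TextExtractor {
    boolean supports(String mimeType);
    String extractText(InputStream content, String mimeType);
}

interface ScopeResolver {
    boolean isInScope(SourceNode node);
    boolean shouldTraverse(SourceNode node);
}

// Handles plain text only; assumes the content is UTF-8 encoded.
final class PlainTextExtractor implements TextExtractor {
    @Override
    public boolean supports(String mimeType) {
        return "text/plain".equals(mimeType);
    }

    @Override
    public String extractText(InputStream content, String mimeType) {
        try {
            return new String(content.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

// Hypothetical resolver: assumes traversal starts at a folder already marked
// cl:indexed, and honours cl:excludeFromLake on files and subtrees.
final class AspectScopeResolver implements ScopeResolver {
    private static boolean hasAspect(SourceNode node, String aspect) {
        Object aspects = node.sourceProperties().get("aspects"); // assumed key
        return aspects instanceof Collection<?> names && names.contains(aspect);
    }

    @Override
    public boolean isInScope(SourceNode node) {
        return !node.folder() && !hasAspect(node, "cl:excludeFromLake");
    }

    @Override
    public boolean shouldTraverse(SourceNode node) {
        return node.folder() && !hasAspect(node, "cl:excludeFromLake");
    }
}
```

A real source would pair these with a `ContentSourceClient` that talks to the repository's REST API; that part is omitted here because it depends entirely on the target system.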
Once content is indexed in hxpr, the RAG service exposes three query modes. All are permission-filtered -- the results you get back are the results you're authorized to see.
Semantic search finds documents by meaning, not by keywords. It is useful when you don't know the exact terms.
curl -s -u alice:password \
  -H "Content-Type: application/json" \
  -X POST http://localhost/api/rag/search/semantic \
  -d '{
    "query": "supplier agreement renewal terms",
    "topK": 5,
    "minScore": 0.6
  }'
The response includes ranked chunks with source metadata and a deep link back to the document in Alfresco or Nuxeo:
{
  "resultCount": 2,
  "results": [
    {
      "rank": 1,
      "score": 0.87,
      "chunkText": "The renewal clause on page 3 states that either party may...",
      "sourceDocument": {
        "name": "Vendor-Contract-2025.pdf",
        "sourceType": "alfresco",
        "path": "/Company Home/Sites/legal/documentLibrary",
        "openInSourceUrl": "http://localhost/share/page/document-details?nodeRef=..."
      }
    }
  ]
}
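The same call can be issued from Java with the JDK's built-in HTTP client. The sketch below only builds the request so that it can be inspected without a running stack; the `SemanticSearchRequest` class is a hypothetical helper, and the endpoint path and Basic authentication simply mirror the curl example above.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class SemanticSearchRequest {
    // Builds the same POST as the curl example; nothing is sent here.
    static HttpRequest build(String baseUrl, String user, String password, String jsonBody) {
        String credentials = Base64.getEncoder()
                .encodeToString((user + ":" + password).getBytes(StandardCharsets.UTF_8));
        return HttpRequest.newBuilder(URI.create(baseUrl + "/api/rag/search/semantic"))
                .header("Content-Type", "application/json")
                .header("Authorization", "Basic " + credentials)
                .POST(HttpRequest.BodyPublishers.ofString(jsonBody))
                .build();
    }
}
```

To execute it, pass the request to `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())` and parse the returned JSON with the library of your choice.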
Hybrid search fuses vector similarity with BM25 keyword matching using Reciprocal Rank Fusion (RRF). It tends to work better than pure semantic search for queries that mix concepts with precise terminology (product codes, legal clauses, part numbers).
curl -s -u alice:password \
  -H "Content-Type: application/json" \
  -X POST http://localhost/api/rag/search/hybrid \
  -d '{
    "query": "ISO 9001 quality audit findings 2025",
    "strategy": "rrf",
    "candidateCount": 20,
    "maxResults": 5
  }'
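For readers unfamiliar with the "rrf" strategy, the following is a minimal sketch of textbook Reciprocal Rank Fusion over two ranked lists of document ids. It is not the service's implementation: the `RrfSketch` class is hypothetical, and the constant k = 60 used in the usage note is the value commonly cited for RRF, assumed here rather than taken from the project.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class RrfSketch {
    // Fuses ranked lists of document ids: score(d) = sum over lists of 1 / (k + rank),
    // with rank starting at 1. A document absent from a list adds nothing for that list.
    static List<String> fuse(List<List<String>> rankings, int k, int maxResults) {
        Map<String, Double> scores = new HashMap<>();
        for (List<String> ranking : rankings) {
            for (int i = 0; i < ranking.size(); i++) {
                scores.merge(ranking.get(i), 1.0 / (k + i + 1), Double::sum);
            }
        }
        List<String> ids = new ArrayList<>(scores.keySet());
        // Highest fused score first; ties are broken by id for a stable order.
        ids.sort((a, b) -> {
            int byScore = Double.compare(scores.get(b), scores.get(a));
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        return ids.subList(0, Math.min(maxResults, ids.size()));
    }
}
```

With a vector ranking of A, B, C and a BM25 ranking of B, D, A, calling `fuse` with k = 60 puts B first, then A, then D: B and A appear in both lists, and B's combined ranks are slightly better. RRF only needs ranks, so the two scoring scales never have to be normalized against each other.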
The RAG prompt endpoint answers a natural language question. The service retrieves the most relevant chunks across all indexed sources the user can access, passes them to the LLM as context, and returns a grounded answer with citations.
curl -s -u alice:password \
  -H "Content-Type: application/json" \
  -X POST http://localhost/api/rag/prompt \
  -d '{
    "question": "What were the key findings from the Q4 2025 financial audit?",
    "sessionId": "session-alice-01",
    "topK": 10,
    "minScore": 0.5
  }'
{
  "answer": "The Q4 2025 audit identified three significant findings: a 12% variance in the EMEA cost centre, an unreconciled balance in accounts payable dating to September, and...",
  "sessionId": "session-alice-01",
  "sources": [
    {
      "name": "Q4-Audit-Report.pdf",
      "sourceType": "alfresco",
      "score": 0.91,
      "chunkText": "The EMEA cost centre showed a 12% variance against forecast...",
      "openInSourceUrl": "http://localhost/share/page/document-details?nodeRef=..."
    }
  ]
}
Use the same sessionId across turns for multi-turn conversation. The service reformulates each follow-up question using the conversation history, so "expand on the second finding" works naturally.
For chat UIs, stream tokens progressively using Server-Sent Events:
curl -N -u alice:password \
  "http://localhost/api/rag/chat/stream?question=What+changed+in+Q4&sessionId=session-alice-01"
The stream emits event: token messages with incremental text, followed by a final event: metadata message with the full response including sources.
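On the client side, the token and metadata events have to be separated out of the raw stream. The sketch below is a minimal parser for the two Server-Sent Events fields mentioned above (`event:` and `data:`). The `SseEvent` record and `SseSketch` class are hypothetical, and the payload format of each event (plain text for tokens, JSON for metadata) is an assumption, since the post does not show the raw stream.

```java
import java.util.ArrayList;
import java.util.List;

// One parsed Server-Sent Event: the event name and its data payload.
record SseEvent(String name, String data) {}

final class SseSketch {
    // Minimal parser for "event:" and "data:" fields; a blank line ends an event.
    // Other SSE fields (id, retry, comments) are ignored in this sketch.
    static List<SseEvent> parse(String stream) {
        List<SseEvent> events = new ArrayList<>();
        String name = "message"; // default event name in the SSE format
        StringBuilder data = new StringBuilder();
        boolean hasData = false;
        for (String line : stream.split("\r?\n", -1)) {
            if (line.isEmpty()) {
                if (hasData) {
                    events.add(new SseEvent(name, data.toString()));
                }
                name = "message";
                data.setLength(0);
                hasData = false;
            } else if (line.startsWith("event:")) {
                name = stripLeadingSpace(line.substring(6));
            } else if (line.startsWith("data:")) {
                if (hasData) {
                    data.append('\n'); // multiple data lines are joined with newlines
                }
                data.append(stripLeadingSpace(line.substring(5)));
                hasData = true;
            }
        }
        return events;
    }

    // The SSE format drops exactly one space after the colon, if present.
    private static String stripLeadingSpace(String s) {
        return s.startsWith(" ") ? s.substring(1) : s;
    }
}
```

A chat UI would append the data of each `token` event to the visible answer and use the final `metadata` event to render the source list.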
The deployment repo handles everything. You need Docker Desktop with Docker Model Runner enabled (for local AI inference) and credentials for the hxpr build (GitHub Packages + Hyland Nexus).
git clone https://github.com/aborroy/content-lake-app-deployment.git
cd content-lake-app-deployment
# Pull the AI models (once, ~3 GB total)
docker model pull ai/mxbai-embed-large
docker model pull ai/qwen2.5
# Export hxpr build credentials
export MAVEN_USERNAME=<your-github-username>
export MAVEN_PASSWORD=<github-token-with-read:packages>
export NEXUS_USERNAME=<your-nexus-username>
export NEXUS_PASSWORD=<your-nexus-password>
# Start the Alfresco + hxpr + RAG stack
make up-alfresco
Once the stack is healthy (3-5 minutes), open http://localhost/aca/ to access Alfresco Content App with the Content Lake extension loaded. Apply the cl:indexed aspect to any folder from the sidebar to start ingestion. After about 30 seconds, the documents in that folder are available for semantic search and RAG.
For a full Alfresco + Nuxeo stack:
git clone https://github.com/aborroy/nuxeo-deployment.git ../nuxeo-deployment
(cd ../nuxeo-deployment && docker compose up -d)
make up-full
To build from local source instead of pulling from GitHub (useful during active development):
make up-demo local
Content Lake is a PoC, and some edges are deliberately left rough:
- … ConversationMemoryStore is on the roadmap for production scenarios.
- … cl:indexed content model approach is cleaner; a Nuxeo facet-based equivalent is planned.
- If alice exists in both Alfresco and Nuxeo, she is treated as two separate principals. OAuth2/OIDC federation is a follow-up.
Upcoming improvements include: hybrid retrieval for RAG chat, real reranking (replacing the current no-op), multi-query retrieval planning, production hardening (Resilience4j, OpenTelemetry, structured logging), and the open-source release of hxpr itself later in 2026.
Questions, issues, or contributions are welcome via GitHub Issues on any of the repositories above. We'll be discussing this work at Community Live 2026. Come find us.