angelborroy
Community Manager

Enterprise content is scattered. Your Alfresco repository holds thousands of contracts, reports, and policies. Your Nuxeo instance has another set. Neither talks to the other, and neither can answer a question like "summarize all the supplier agreements we reviewed last quarter". The content is not missing. The problem is that traditional keyword search doesn't understand meaning, and RAG systems built for the cloud weren't designed with your ACLs in mind.

Content Lake is a PoC that shows a different path: a fully on-premises, permission-aware, multi-source semantic search and RAG platform built on hxpr -- the Hyland Content Lake platform. This post walks through what hxpr is, what the PoC demonstrates, and how you can run it yourself in under 10 minutes.

Note: Content Lake App is demonstration and educational code, not an official Hyland product. It serves as a reference implementation showcasing architectural patterns for building on hxpr. The hxpr platform will be released as open source on GitHub later in 2026.

What is hxpr?

hxpr is the storage and query engine at the heart of Hyland's Content Intelligence Cloud (CIC). From a developer's perspective, it does three things:

  1. Stores documents with inline vector embeddings. Each document lives in MongoDB with its extracted text, chunked content, and 1024-dimensional vector embeddings stored alongside structured metadata. OpenSearch indexes the vectors for fast similarity queries.
  2. Exposes a unified REST API for semantic, hybrid, and RAG queries. You POST a natural language question, hxpr runs a vector similarity search (optionally fused with BM25 keyword matching), and returns ranked document chunks with metadata and deep links back to the source.
  3. Enforces permissions server-side at query time. Every document ingested into hxpr carries the ACLs it had in the source system. When a user queries, hxpr evaluates their identity and group memberships against those stored ACLs and returns only the documents they can access -- across all sources simultaneously. No application-level security code required.
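The query-time check in point 3 can be pictured with a minimal sketch. This is not hxpr's code: the AclCheck class, its method names, and the rule that a matching deny entry overrides any allow entry are assumptions made for illustration. The "_#_" separator follows the namespaced principal form that appears later in this post (john_#_alfresco-prod).

```java
import java.util.Set;

// Hypothetical sketch of query-time ACL evaluation; not hxpr's actual code.
final class AclCheck {

    // Namespaces a principal with its source instance, e.g. "john_#_alfresco-prod".
    static String namespaced(String principal, String sourceId) {
        return principal + "_#_" + sourceId;
    }

    // userPrincipals: the user's own id plus every group the user belongs to.
    // Assumption: a matching deny entry overrides any allow entry.
    static boolean canRead(Set<String> userPrincipals, Set<String> allow, Set<String> deny) {
        for (String principal : userPrincipals) {
            if (deny.contains(principal)) {
                return false;
            }
        }
        for (String principal : userPrincipals) {
            if (allow.contains(principal)) {
                return true;
            }
        }
        return false; // no matching allow entry: the document stays hidden
    }
}
```

Because principals are namespaced per source, the same check can run over Alfresco and Nuxeo documents in a single query without user names from different systems colliding.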

Within the broader Hyland CIC ecosystem, hxpr sits between Knowledge Enrichment (content processing) and Knowledge Discovery (RAG and agentic Q&A). The Content Lake PoC shows how to use hxpr directly with Alfresco and Nuxeo as sources, either fully on-premises or in a hybrid configuration.

Why ECM-native RAG beats generic RAG for Alfresco developers

You could point a generic RAG platform at your Alfresco repository. Several exist. But they weren't built for ECM. Three features drive that gap:

1. Live event-driven ingestion. When a document is uploaded, moved, or has its permissions changed in Alfresco, an ActiveMQ Event2 message fires. The Content Lake live ingester consumes it and re-indexes the affected document in hxpr within seconds. No cron jobs, no polling lag, no scheduled resyncs. A generic RAG connector cannot replicate this because it has no awareness of the Alfresco event bus.

2. Permission-aware search at the ACL level. hxpr stores the full Alfresco ACL for every document -- individual users, groups, GROUP_EVERYONE, deny entries, all of it -- with each principal namespaced to its source instance (for example, john_#_alfresco-prod). At query time, hxpr evaluates those ACLs against the authenticated user's actual group memberships, server-side, before returning a single result. There is no way to leak a document to an unauthorized user through the search API. Generic RAG systems either ignore permissions entirely or approximate them with broad role-based index scoping.

3. Scope control via the content model. Business users decide what gets indexed by applying the cl:indexed aspect to a folder directly from the Alfresco UI (or the Content Lake sidebar extension). No configuration files, no API calls, no developer intervention. Apply cl:excludeFromLake to opt out any file or subtree. The ingester discovers scope from the content model, not from hardcoded paths.
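A minimal sketch of how scope could be derived from aspects, assuming the ingester knows the aspect names of every node on the path from the root to the candidate node. The AspectScope class and the rule that cl:excludeFromLake always wins over cl:indexed are assumptions for illustration; only the two aspect names come from the PoC.

```java
import java.util.List;
import java.util.Set;

// Hypothetical scope check driven by content-model aspects.
final class AspectScope {
    static final String INDEXED = "cl:indexed";
    static final String EXCLUDED = "cl:excludeFromLake";

    // aspectsRootToNode: aspect names of each node on the path, root first, the node itself last.
    // Assumption: an exclusion anywhere on the path overrides cl:indexed.
    static boolean isInScope(List<Set<String>> aspectsRootToNode) {
        boolean indexed = false;
        for (Set<String> aspects : aspectsRootToNode) {
            if (aspects.contains(EXCLUDED)) {
                return false; // the node or one of its ancestors opted out
            }
            if (aspects.contains(INDEXED)) {
                indexed = true; // the node or one of its ancestors opted in
            }
        }
        return indexed;
    }
}
```

With this rule, a file under an indexed folder is in scope until the file itself or any folder above it carries cl:excludeFromLake.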

Key benefits at a glance

  • Permission-aware by design: ACL enforcement is built into the hxpr query pipeline, not your application code
  • On-premises AI: Complete stack runs in your data center. No content leaves your infrastructure. Docker Model Runner for dev, vLLM + TEI on GPU for production
  • Multi-source without glue code: Add any content source by implementing four Java interfaces. The chunking, embedding, and RAG layers stay unchanged
  • Incremental onboarding: Two-phase ingestion makes metadata appear in hxpr in milliseconds, while text extraction and embeddings complete asynchronously in the background
  • Open and extensible: Apache 2.0 license across all five repos. hxpr itself will be open-sourced in 2026
  • Fast turnaround: About 30 seconds from document to searchable, 2-5 seconds from question to answer (typical business documents on standard hardware)

The Content Lake PoC

The PoC lives across five GitHub repositories, each with its own role:

Repo                         Role
content-lake-app             Java ingestion pipeline and RAG service (the core logic)
content-lake-app-deployment  Docker Compose stack that wires everything together
alfresco-content-lake-ui     ACA/ADW extension: semantic search panel + RAG chat sidebar
content-lake-app-ui          Standalone demo UI with Alfresco + Nuxeo dual auth
nuxeo-deployment             Local Nuxeo + PostgreSQL stack for multi-source testing

The deployment repo is your entry point for running the full stack. It builds all Java services directly from GitHub via Docker BuildKit, so no local checkouts are required.

The ingestion pipeline

Understanding the pipeline is key to tuning and extending it. Content Lake uses a two-phase sync model that balances speed with completeness.

Phase 1. Metadata (milliseconds)

When a document enters scope (via batch discovery or a live event), the ingester immediately writes its metadata and ACLs to hxpr:

  • Document name, path, MIME type, modified timestamp
  • Full ACL expansion: user principals, group memberships, deny entries, all namespaced by source instance
  • Sync status set to PENDING

Users can query by metadata and see the document appear in hxpr almost instantly, even before the content is extracted.
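For the live-event path into Phase 1, the routing step might look like the sketch below. The event type strings are the standard Alfresco repository event types; the IngestAction enum and EventRouter class are hypothetical names rather than classes from content-lake-app, and the JMS consumption from ActiveMQ is left out.

```java
import java.util.Map;

// Hypothetical actions the live ingester could take for a repository event.
enum IngestAction { UPSERT, REFRESH_ACL, DELETE, IGNORE }

// Hypothetical router from Alfresco Event2 type strings to ingestion actions.
final class EventRouter {
    private static final Map<String, IngestAction> ROUTES = Map.of(
        "org.alfresco.event.node.Created", IngestAction.UPSERT,
        "org.alfresco.event.node.Updated", IngestAction.UPSERT,
        "org.alfresco.event.node.Deleted", IngestAction.DELETE,
        "org.alfresco.event.permission.Updated", IngestAction.REFRESH_ACL);

    // Returns the action for an event type, or IGNORE for types this sketch does not handle.
    static IngestAction route(String eventType) {
        if (eventType == null) {
            return IngestAction.IGNORE;
        }
        return ROUTES.getOrDefault(eventType, IngestAction.IGNORE);
    }
}
```

An UPSERT or REFRESH_ACL action would then trigger the Phase 1 write of metadata and ACLs described above.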

Phase 2. Content extraction and embedding (2-10 seconds/doc)

An async worker picks up the queued document and processes it:

  1. Text extraction -- Calls Alfresco Transform Core AIO (or Nuxeo's ConversionService) to get plain text from PDFs, Word docs, and other formats
  2. Noise reduction -- Normalizes whitespace, removes OCR artifacts, preserves semantic structure
  3. Chunking -- Splits into ~500-token chunks with 50-token overlap, respecting paragraph boundaries
  4. Asymmetric embedding -- Documents embedded as-is; queries prefixed with a task instruction to improve retrieval quality
  5. Storage -- 1024-dimensional vectors (mxbai-embed-large) written inline into the hxpr document; OpenSearch indexes them for fast kNN search
  6. Status update -- PENDING → INDEXED
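The chunking step can be illustrated with a simplified sliding window. This sketch approximates tokens with whitespace-separated words and ignores paragraph boundaries, so it is not the PoC's chunker; the Chunker class and its parameters are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified chunker: fixed-size word windows with overlap.
// Assumption: words stand in for model tokens; paragraph boundaries are not considered.
final class Chunker {

    // Splits text into windows of at most `size` words; consecutive windows share `overlap` words.
    static List<String> chunk(String text, int size, int overlap) {
        if (size <= 0 || overlap < 0 || overlap >= size) {
            throw new IllegalArgumentException("require 0 <= overlap < size");
        }
        List<String> chunks = new ArrayList<>();
        String trimmed = text.strip();
        if (trimmed.isEmpty()) {
            return chunks;
        }
        String[] words = trimmed.split("\\s+");
        int step = size - overlap;
        for (int start = 0; start < words.length; start += step) {
            int end = Math.min(start + size, words.length);
            chunks.add(String.join(" ", Arrays.copyOfRange(words, start, end)));
            if (end == words.length) {
                break; // the last window reached the end of the text
            }
        }
        return chunks;
    }
}
```

Calling Chunker.chunk(text, 500, 50) mirrors the sizes quoted in step 3. The overlap means a sentence that straddles a window boundary still appears whole in at least one chunk, as long as it is shorter than the overlap.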

Idempotency

Every write is guarded by a source_modifiedAt timestamp comparison. If hxpr already has a version at least as new as the source's, the write is skipped. This means batch and live ingesters can run simultaneously without producing duplicates, and re-running a batch sync is always safe.
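A minimal sketch of that guard, with an in-memory map standing in for hxpr. The IdempotentWriter class and writeIfNewer method are hypothetical names; only the source_modifiedAt comparison comes from the description above.

```java
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical write guard; the map stands in for hxpr's stored documents.
final class IdempotentWriter {
    // nodeId -> source_modifiedAt of the version already stored
    private final Map<String, Instant> stored = new ConcurrentHashMap<>();

    // Writes only when the incoming version is strictly newer; returns true if a write happened.
    boolean writeIfNewer(String nodeId, Instant sourceModifiedAt) {
        boolean[] written = {false};
        stored.compute(nodeId, (id, current) -> {
            if (current != null && !current.isBefore(sourceModifiedAt)) {
                return current; // stored version is as new or newer: skip the write
            }
            written[0] = true;
            return sourceModifiedAt;
        });
        return written[0];
    }
}
```

Because the comparison and the update happen in one atomic compute call, a batch worker and a live worker racing on the same node cannot both win with the same timestamp.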

SPI: the extensibility pattern

The types in content-lake-spi are the entire contract between a content source and the pipeline. The listing shows the SourceNode record and three interfaces:

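import java.io.InputStream;
import java.time.Instant;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Universal document representation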
public record SourceNode(
    String nodeId, String sourceId, String sourceType,
    String name, String path, String mimeType,
    Instant modifiedAt, boolean folder,
    Set<String> readPrincipals,
    Map<String, Object> sourceProperties) {}

// Read content from the source
public interface ContentSourceClient {
    SourceNode getNode(String nodeId);
    List<SourceNode> getChildren(String nodeId);
    InputStream downloadContent(String nodeId);
}

// Extract plain text from a binary
public interface TextExtractor {
    boolean supports(String mimeType);
    String extractText(InputStream content, String mimeType);
}

// Decide what gets ingested
public interface ScopeResolver {
    boolean isInScope(SourceNode node);
    boolean shouldTraverse(SourceNode node);
}

Implement these interfaces and you have a new content source. The chunking, embedding, ACL storage, and RAG query layers are untouched. This is how Alfresco and Nuxeo coexist in the same pipeline today, and how OnBase, SharePoint, or any other ECM could be added.
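To give a feel for the effort involved, here is a sketch of the smallest of these implementations: a TextExtractor for sources that already hold plain text. PlainTextExtractor is a hypothetical class, not part of the PoC, and it assumes UTF-8 content. The interface is repeated so the sketch compiles on its own.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Repeated from the SPI listing above so the sketch compiles on its own.
interface TextExtractor {
    boolean supports(String mimeType);
    String extractText(InputStream content, String mimeType);
}

// Hypothetical extractor for content that is already plain text.
final class PlainTextExtractor implements TextExtractor {

    @Override
    public boolean supports(String mimeType) {
        // Also accepts parameterised types such as "text/plain; charset=utf-8".
        return mimeType != null && mimeType.startsWith("text/plain");
    }

    @Override
    public String extractText(InputStream content, String mimeType) {
        if (!supports(mimeType)) {
            throw new IllegalArgumentException("Unsupported MIME type: " + mimeType);
        }
        try {
            // Assumption: the source stores text as UTF-8.
            return new String(content.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            // The SPI method declares no checked exception, so wrap it.
            throw new UncheckedIOException(e);
        }
    }
}
```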

What you can build: API walkthrough

Once content is indexed in hxpr, the RAG service exposes three query modes. All are permission-filtered -- the results you get back are the results you're authorized to see.

Semantic search

Find documents by meaning, not by keywords. Useful when you don't know the exact terms.

curl -s -u alice:password \
  -H "Content-Type: application/json" \
  -X POST http://localhost/api/rag/search/semantic \
  -d '{
    "query": "supplier agreement renewal terms",
    "topK": 5,
    "minScore": 0.6
  }'

Response includes ranked chunks with source metadata and a deep link back to the document in Alfresco or Nuxeo:

{
  "resultCount": 2,
  "results": [
    {
      "rank": 1,
      "score": 0.87,
      "chunkText": "The renewal clause on page 3 states that either party may...",
      "sourceDocument": {
        "name": "Vendor-Contract-2025.pdf",
        "sourceType": "alfresco",
        "path": "/Company Home/Sites/legal/documentLibrary",
        "openInSourceUrl": "http://localhost/share/page/document-details?nodeRef=..."
      }
    }
  ]
}
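The same call can be made from Java with the JDK's built-in HTTP client. The sketch below only builds the request; the endpoint path, JSON fields, and Basic Auth scheme are taken from the curl example, while the SemanticSearchRequest class and its minimal JSON escaping are illustrative assumptions.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper that builds the semantic search request shown in the curl example.
final class SemanticSearchRequest {

    static HttpRequest build(String baseUrl, String user, String password,
                             String query, int topK, double minScore) {
        String credentials = Base64.getEncoder()
            .encodeToString((user + ":" + password).getBytes(StandardCharsets.UTF_8));
        String body = "{\"query\": \"" + escape(query) + "\", \"topK\": " + topK
            + ", \"minScore\": " + minScore + "}";
        return HttpRequest.newBuilder(URI.create(baseUrl + "/api/rag/search/semantic"))
            .header("Content-Type", "application/json")
            .header("Authorization", "Basic " + credentials)
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();
    }

    // Minimal escaping for backslashes and double quotes only; use a JSON library in real code.
    static String escape(String value) {
        return value.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}
```

To send it, pass the request to HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()) and parse the JSON body of the response.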

Hybrid search (vector + keyword)

Fuses vector similarity with BM25 keyword matching using Reciprocal Rank Fusion (RRF). Better than pure semantic search for queries that mix concepts with precise terminology (product codes, legal clauses, part numbers).

curl -s -u alice:password \
  -H "Content-Type: application/json" \
  -X POST http://localhost/api/rag/search/hybrid \
  -d '{
    "query": "ISO 9001 quality audit findings 2025",
    "strategy": "rrf",
    "candidateCount": 20,
    "maxResults": 5
  }'
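Reciprocal Rank Fusion itself is compact enough to sketch. Each document scores the sum of 1 / (k + rank) over every ranked list it appears in, with ranks starting at 1. The Rrf class below is an illustration of the general technique, not hxpr's implementation, and the constant k = 60 used in the example is a commonly used default, not a documented hxpr setting.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative Reciprocal Rank Fusion over lists of document ids.
final class Rrf {

    // rankings: each inner list is ordered best-first (e.g. one from kNN, one from BM25).
    static List<String> fuse(List<List<String>> rankings, int k, int maxResults) {
        Map<String, Double> scores = new LinkedHashMap<>();
        for (List<String> ranking : rankings) {
            for (int i = 0; i < ranking.size(); i++) {
                // rank is 1-based, hence i + 1
                scores.merge(ranking.get(i), 1.0 / (k + i + 1), Double::sum);
            }
        }
        return scores.entrySet().stream()
            .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
            .limit(maxResults)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }
}
```

For a vector ranking [A, B, C] and a keyword ranking [C, A, D] with k = 60, the fused order is A, C, B, D: documents found by both retrievers rise above those found by only one, without having to compare raw cosine and BM25 scores.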

RAG with multi-turn conversation

Ask a natural language question. The service retrieves the most relevant chunks across all indexed sources the user can access, passes them to the LLM as context, and returns a grounded answer with citations.

curl -s -u alice:password \
  -H "Content-Type: application/json" \
  -X POST http://localhost/api/rag/prompt \
  -d '{
    "question": "What were the key findings from the Q4 2025 financial audit?",
    "sessionId": "session-alice-01",
    "topK": 10,
    "minScore": 0.5
  }'

The response carries the grounded answer and the sources it cites:

{
  "answer": "The Q4 2025 audit identified three significant findings: a 12% variance in the EMEA cost centre, an unreconciled balance in accounts payable dating to September, and...",
  "sessionId": "session-alice-01",
  "sources": [
    {
      "name": "Q4-Audit-Report.pdf",
      "sourceType": "alfresco",
      "score": 0.91,
      "chunkText": "The EMEA cost centre showed a 12% variance against forecast...",
      "openInSourceUrl": "http://localhost/share/page/document-details?nodeRef=..."
    }
  ]
}

Use the same sessionId across turns for multi-turn conversation. The service reformulates each follow-up question using the conversation history, so "expand on the second finding" works naturally.
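A sketch of the kind of per-session history this implies, kept in memory and bounded to the most recent turns. The InMemoryConversationStore class, its Turn record, and the turn limit are hypothetical and are not the PoC's actual store.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory conversation history keyed by sessionId.
final class InMemoryConversationStore {

    record Turn(String question, String answer) {}

    private final Map<String, Deque<Turn>> sessions = new HashMap<>();
    private final int maxTurns;

    InMemoryConversationStore(int maxTurns) {
        if (maxTurns <= 0) {
            throw new IllegalArgumentException("maxTurns must be positive");
        }
        this.maxTurns = maxTurns;
    }

    // Appends a turn and drops the oldest ones beyond the limit.
    synchronized void append(String sessionId, String question, String answer) {
        Deque<Turn> turns = sessions.computeIfAbsent(sessionId, id -> new ArrayDeque<>());
        turns.addLast(new Turn(question, answer));
        while (turns.size() > maxTurns) {
            turns.removeFirst();
        }
    }

    // Returns an immutable snapshot of the session's turns, oldest first.
    synchronized List<Turn> history(String sessionId) {
        Deque<Turn> turns = sessions.get(sessionId);
        return turns == null ? List.of() : List.copyOf(turns);
    }
}
```

The history returned for a sessionId is what a reformulation prompt would draw on to turn "expand on the second finding" into a standalone question.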

Streaming responses (SSE)

For chat UIs, stream tokens progressively using Server-Sent Events:

curl -N -u alice:password \
  "http://localhost/api/rag/chat/stream?question=What+changed+in+Q4&sessionId=session-alice-01"

The stream emits event: token messages with incremental text, followed by a final event: metadata message with the full response including sources.
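On the client side, the stream can be consumed with a small Server-Sent Events parser. The sketch below follows the general SSE wire format (event: and data: fields, a blank line ends an event); the SseParser class is hypothetical, and treating the data of token events as plain text fragments is an assumption about the payload.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal Server-Sent Events parser for an already-received chunk of the stream.
final class SseParser {

    record Event(String name, String data) {}

    static List<Event> parse(String stream) {
        List<Event> events = new ArrayList<>();
        String name = "message"; // default event name in the SSE format
        StringBuilder data = new StringBuilder();
        boolean hasData = false;
        for (String line : stream.split("\\r?\\n", -1)) {
            if (line.isEmpty()) {
                // A blank line dispatches the pending event, if any.
                if (hasData) {
                    events.add(new Event(name, data.toString()));
                }
                name = "message";
                data.setLength(0);
                hasData = false;
            } else if (line.startsWith("event:")) {
                name = fieldValue(line, "event:".length());
            } else if (line.startsWith("data:")) {
                if (hasData) {
                    data.append('\n'); // multiple data lines are joined with newlines
                }
                data.append(fieldValue(line, "data:".length()));
                hasData = true;
            }
            // Comment lines (starting with ':') and other fields are ignored.
        }
        return events;
    }

    // Strips the field name and a single leading space from the value.
    private static String fieldValue(String line, int prefixLength) {
        String value = line.substring(prefixLength);
        return value.startsWith(" ") ? value.substring(1) : value;
    }
}
```

A chat UI would append the data of each token event to the visible answer and use the final metadata event to render the source list.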

Getting started in under 10 minutes

The deployment repo handles everything. You need Docker Desktop with Docker Model Runner enabled (for local AI inference) and credentials for the hxpr build (GitHub Packages + Hyland Nexus).

git clone https://github.com/aborroy/content-lake-app-deployment.git
cd content-lake-app-deployment

# Pull the AI models (once, ~3 GB total)
docker model pull ai/mxbai-embed-large
docker model pull ai/qwen2.5

# Export hxpr build credentials
export MAVEN_USERNAME=<your-github-username>
export MAVEN_PASSWORD=<github-token-with-read:packages>
export NEXUS_USERNAME=<your-nexus-username>
export NEXUS_PASSWORD=<your-nexus-password>

# Start the Alfresco + hxpr + RAG stack
make up-alfresco

Once the stack is healthy (3-5 minutes), open http://localhost/aca/ to access Alfresco Content App with the Content Lake extension loaded. Apply the cl:indexed aspect to any folder from the sidebar to start ingestion. After 30 seconds, the documents in that folder are available for semantic search and RAG.

For a full Alfresco + Nuxeo stack:

git clone https://github.com/aborroy/nuxeo-deployment.git ../nuxeo-deployment
(cd ../nuxeo-deployment && docker compose up -d)
make up-full

To build from local source instead of pulling from GitHub (useful during active development):

make up-demo local

Limitations and what's next

Content Lake is a PoC, and some edges are deliberately left rough:

  • Conversation memory is in-memory only. Restarting the RAG service loses session history. A Redis or database-backed ConversationMemoryStore is on the roadmap for production scenarios.
  • Nuxeo scope is configuration-driven, not facet-driven. The Alfresco cl:indexed content model approach is cleaner; a Nuxeo facet-based equivalent is planned.
  • No cross-source identity unification. If alice exists in both Alfresco and Nuxeo, she is treated as two separate principals. OAuth2/OIDC federation is a follow-up.
  • RAG chat uses semantic-only retrieval. The hybrid search endpoint exists but the chat path hasn't been switched to it yet. That's the highest-priority quality improvement on the roadmap.
  • No OAuth2 yet. Authentication is Basic Auth (username/password or Alfresco tickets). OAuth2/OIDC is planned.

Upcoming improvements include: hybrid retrieval for RAG chat, real reranking (replacing the current no-op), multi-query retrieval planning, production hardening (Resilience4j, OpenTelemetry, structured logging), and the open-source release of hxpr itself later in 2026.

Resources

Questions, issues, or contributions are welcome via GitHub Issues on any of the repositories above. We'll be discussing this work at Community Live 2026; come find us.