angelborroy
Community Manager

OpenSearchCon Europe 2026 was relevant for teams building secure enterprise search and AI features on top of Alfresco, Nuxeo, and a shared Content Lake. The event was not just about generic vector search. Several sessions went directly into the problems we actually have in content platforms: permission-aware retrieval, multi-tenant collaboration, document parsing, chunking, embeddings, and operating a self-managed search stack without handing core data flows to external services.

This post summarizes the sessions that matter most for Alfresco and Nuxeo developers and maps them to practical design decisions. The emphasis is on what these talks mean for repository-backed content systems rather than on recapping the conference.

Why These Talks Matter to Our Stack

The Hyland deck used at the event framed the platform around a few concrete capabilities: AFTS, CMIS, content metadata, path-aware search, permissions, permission-aware retrieval, indexing, chunking, multi-tenancy, and AI-ready search. That is exactly the boundary where OpenSearch becomes more than a search engine:

  • It becomes the retrieval layer for repository content
  • It becomes the execution layer for semantic and hybrid search
  • It must preserve repository security semantics
  • It becomes part of the AI pipeline, not just the query pipeline

1. Index Authorization Is Not an Edge Case

The most important operational session for repository-backed systems was "Upcoming Changes for the OpenSearch Index Authorization Mechanisms" by Nils Bandener.

The core problem described in the talk is familiar: users rarely think in terms of physical OpenSearch indexes, but security still has to be enforced at index resolution time. Historically, the do_not_fail_on_forbidden behavior tried to make this easier by dropping unauthorized indexes from requests. That made dashboards and broad search requests more usable, but it also introduced ambiguity and inconsistent semantics.

The talk outlined a replacement model that extends existing request options such as ignore_unavailable and allow_no_indices into security-aware index resolution. In practice, this means:

  • Wildcard requests can resolve only to authorized indexes
  • Explicit requests for unauthorized indexes can still fail fast with 403 Forbidden
  • Callers get more control over whether unauthorized targets are ignored or treated as errors
  • The behavior is intended to be more consistent across APIs
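To make these rules concrete, here is a minimal sketch in plain Python. It is not the OpenSearch implementation and does not reflect a final API: the function name, the ignore_unauthorized flag, and the tenant index names are all assumptions made for illustration. It only models the behavior listed above, where wildcards expand to authorized indexes and explicit unauthorized names either fail or are dropped on request.

```python
from fnmatch import fnmatchcase


class ForbiddenError(Exception):
    """Stands in for an HTTP 403 Forbidden response in this sketch."""


def resolve_indices(requested, existing, authorized, ignore_unauthorized=False):
    """Resolve requested index names or wildcard patterns for one caller.

    `ignore_unauthorized` is a hypothetical flag, used here only to model
    "ignore unauthorized targets" versus "treat them as errors".
    """
    resolved = []
    for name in requested:
        if "*" in name:
            # Wildcards expand only to indexes the caller may access.
            resolved.extend(
                idx for idx in sorted(existing)
                if fnmatchcase(idx, name) and idx in authorized
            )
        elif name in authorized:
            resolved.append(name)
        elif not ignore_unauthorized:
            # An explicit request for an unauthorized index fails fast.
            raise ForbiddenError(f"403 Forbidden: no permission for index {name}")
        # Otherwise the unauthorized explicit target is dropped silently.
    return resolved


# Hypothetical tenant-partitioned indexes.
existing = {"tenant-a-docs", "tenant-a-chunks", "tenant-b-docs"}
authorized = {"tenant-a-docs", "tenant-a-chunks"}
visible = resolve_indices(["tenant-*"], existing, authorized)
```

With this layout, a tenant-a user who searches tenant-* only ever reaches tenant-a indexes, which is why the naming scheme itself ends up being part of the security design.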

For Alfresco and Nuxeo developers, the main takeaway is simple: if your content is partitioned across tenants, repositories, or derived indexes, index naming and index resolution are part of your security model. They are not just operational details.

The other major change in the talk was alias semantics. Under the new model, aliases must be granted privileges directly. OpenSearch should no longer silently break an alias into its backing indexes during evaluation. That matters for enterprise search designs where aliases are used to represent business collections, tenant views, or lifecycle-driven index groups. If we rely on aliases to express repository-facing abstractions, privilege assignment has to follow that abstraction as well.
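The practical effect of the alias change can be illustrated with a small model. This is only a sketch of the semantics as summarized above, not the security plugin's evaluation code; the alias name, the index names, and the grant structure are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical alias and backing indexes. Note that the check below never
# consults this mapping: the alias is evaluated under its own name.
ALIASES = {"contracts": ["tenant-a-contracts-2025", "tenant-a-contracts-2026"]}


def is_authorized(target, action, role_grants):
    """Return True if any grant pattern matches the requested name itself.

    `role_grants` is a list of (name_pattern, allowed_actions) pairs.
    """
    return any(
        fnmatchcase(target, pattern) and action in actions
        for pattern, actions in role_grants
    )


# A role that only covers the backing indexes by pattern.
grants_on_indexes_only = [("tenant-a-contracts-*", {"read"})]

# The same role with the alias granted directly.
grants_including_alias = grants_on_indexes_only + [("contracts", {"read"})]

denied = is_authorized("contracts", "read", grants_on_indexes_only)
allowed = is_authorized("contracts", "read", grants_including_alias)
```

In this model the first check is denied even though every backing index is readable, and the second succeeds once the alias itself is granted. Roles written against index patterns would need to be reviewed with that in mind.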

2. Resource Sharing Is Different from Document ACLs, but Still Important

Two sources covered this theme: the conference session "Enabling Secure, Fine-Grained Resource Sharing for Team Collaboration in OpenSearch" and the OpenSearch blog post "Introducing resource sharing: A new access control model for OpenSearch".

This topic is easy to misunderstand. Resource sharing is not a replacement for document-level security on Alfresco or Nuxeo content. It is a security model for higher-level OpenSearch resources created by plugins, such as ML models, anomaly detectors, reports, and other platform objects.

The model introduced in OpenSearch 3.3 adds:

  • Resource ownership
  • Sharing with specific users or roles
  • Access levels such as read-only or read-write
  • Auditable sharing operations
  • Migration away from coarse backend-role visibility

For repository and Content Lake work, this matters in two places.

First, it gives a cleaner control-plane model for shared AI and analytics assets. If multiple teams reuse the same embedding model, search workflow, or monitoring resource, that sharing should not depend on broad backend-role leakage. It should be explicit.

Second, it draws a useful boundary: content authorization and platform-resource authorization are different problems. Repository ACLs determine who can see a document. Resource sharing determines who can use the OpenSearch-side artifacts that process, analyze, or present that content.

That separation is healthy. It avoids the common mistake of collapsing every access rule into one oversized role model.
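A compact way to picture the two layers is to keep them as two unrelated checks. The sketch below is purely illustrative: the class, the access-level names, and the principal prefixes are assumptions and do not mirror the OpenSearch resource-sharing API or any repository ACL implementation.

```python
from dataclasses import dataclass, field

# Hypothetical access levels, ordered from weakest to strongest.
ACCESS_LEVELS = {"read_only": 1, "read_write": 2}


@dataclass
class SharedResource:
    """Platform-side artifact such as a model or a detector (illustrative)."""
    resource_id: str
    owner: str
    shared_with: dict = field(default_factory=dict)  # principal -> access level

    def share(self, principal, level):
        if level not in ACCESS_LEVELS:
            raise ValueError(f"unknown access level: {level}")
        self.shared_with[principal] = level

    def allows(self, user, roles, needed):
        """Owners always pass; others need an explicit share of enough strength."""
        if user == self.owner:
            return True
        principals = [f"user:{user}"] + [f"role:{role}" for role in roles]
        best = max(
            (ACCESS_LEVELS[self.shared_with[p]] for p in principals
             if p in self.shared_with),
            default=0,
        )
        return best >= ACCESS_LEVELS[needed]


def can_read_document(authorities, doc_readers):
    """Content-side check: does any user authority appear in the document ACL?"""
    return not set(authorities).isdisjoint(doc_readers)


model = SharedResource("embedding-model", owner="alice")
model.share("role:search-team", "read_only")
```

Here a member of search-team can use the shared model read-only, yet that says nothing about which documents the same person may read. The two answers come from different data and can change independently.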

3. Docling Is the Strongest Answer to "Why Not Just Extract Text?"

The most directly useful AI-ingestion session was "Integrating Docling With OpenSearch for Advanced RAG and Agentic Applications" by Cesar Berrospi Ramis.

The talk made the case that enterprise RAG fails early when document conversion is weak. PDFs are especially problematic because plain text extraction loses structure, page reading order, table meaning, image meaning, and section boundaries. Once that structure is lost, chunking quality drops, citations degrade, and hallucinations become more likely because the retrieval layer is feeding the model corrupted context.

Docling's contribution is a richer document-conversion pipeline:

  • Advanced PDF understanding
  • Layout analysis
  • Table structure extraction
  • OCR and document assembly
  • Export to structured outputs such as JSON, Markdown, and HTML
  • A unified DoclingDocument representation
  • Local execution for sensitive data

This is highly relevant to Alfresco and Nuxeo because a large share of enterprise content is not clean HTML or normalized JSON. It is scanned PDF, office documents, compound documents, and records with structure hidden in layout. If the Content Lake wants to be AI-ready, the ingestion layer must preserve that structure before chunking starts.

The talk also showed a practical target architecture:

  • Parse the document into a structured representation
  • Enrich and assemble it
  • Chunk for RAG
  • Index the chunks in OpenSearch
  • Use the same assets for search and downstream AI applications
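The indexing step of this architecture can be sketched without any client library. The snippet below assumes the parsing and chunking stages have already produced a list of chunk dictionaries; the field names (chunk_id, node_id, section, text) and the index name are hypothetical. The only OpenSearch-specific part is the _bulk request format: one action line followed by one source line per document, newline-delimited, with a trailing newline.

```python
import json


def to_bulk_body(index_name, chunks):
    """Build a newline-delimited body for the OpenSearch _bulk endpoint.

    Each chunk is expected to carry a `chunk_id` key (an assumption of this
    sketch), which becomes the document _id so re-indexing is idempotent.
    """
    lines = []
    for chunk in chunks:
        action = {"index": {"_index": index_name, "_id": chunk["chunk_id"]}}
        source = {key: value for key, value in chunk.items() if key != "chunk_id"}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n" if lines else ""


# Hypothetical chunks produced by an upstream parsing and chunking stage.
chunks = [
    {"chunk_id": "node-42#0", "node_id": "node-42", "section": "Scope", "text": "..."},
    {"chunk_id": "node-42#1", "node_id": "node-42", "section": "Terms", "text": "..."},
]
body = to_bulk_body("content-lake-chunks", chunks)
```

Deriving the _id from the source node and chunk position is one possible design choice: it keeps every indexed chunk traceable to its repository node and lets a re-ingested document overwrite its earlier chunks.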

One especially important point for enterprise deployments is locality. Docling can run locally, which is much easier to justify for regulated content than shipping raw documents to external parsing services. That aligns well with Alfresco and Nuxeo installations where data residency and controlled processing paths are hard requirements.

Within the Hyland ecosystem, Alfresco Transform Services and DocFilters provide comparable document-conversion features and are already tightly integrated with Hyland products.

4. ML Commons Turns OpenSearch into a Self-Contained RAG Subsystem

The session "Building RAG With OpenSearch ML Plugins: From PDFs To Voice-Enabled Search" by Kushagra Sharma covered a more end-to-end implementation path.

The material was straightforward and practical:

  • Deploy embedding models directly in OpenSearch via ML Commons
  • Extract text from uploaded PDFs
  • Chunk documents
  • Generate embeddings inside the cluster
  • Store vectors in a knn_vector index
  • Run semantic retrieval
  • Assemble context for an LLM
  • Generate answers with citations
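For readers who have not used this part of OpenSearch, the request bodies behind the middle of that list look roughly like the following. This is a sketch under assumptions, not the speaker's code: the field names, the pipeline name, and the 384-dimension value are placeholders, and the model ID would come from whatever model has been registered and deployed through ML Commons. The bodies use the text_embedding ingest processor, the knn_vector field type, and the neural query from the neural-search plugin.

```python
def embedding_pipeline(model_id):
    """Ingest pipeline body: embed the `text` field into `embedding` at index time."""
    return {
        "description": "Generate chunk embeddings inside the cluster",
        "processors": [
            {"text_embedding": {"model_id": model_id, "field_map": {"text": "embedding"}}}
        ],
    }


def chunk_index_body(dimension, pipeline_name):
    """Index body with a knn_vector field; `dimension` must match the model output."""
    return {
        "settings": {"index": {"knn": True, "default_pipeline": pipeline_name}},
        "mappings": {
            "properties": {
                "text": {"type": "text"},
                "node_id": {"type": "keyword"},  # hypothetical provenance field
                "embedding": {"type": "knn_vector", "dimension": dimension},
            }
        },
    }


def neural_query(question, model_id, k=5):
    """Search body: the cluster embeds the question and runs a k-NN search."""
    return {
        "size": k,
        "_source": ["text", "node_id"],
        "query": {
            "neural": {"embedding": {"query_text": question, "model_id": model_id, "k": k}}
        },
    }


MODEL_ID = "<model-id>"  # placeholder for a model deployed via ML Commons
pipeline = embedding_pipeline(MODEL_ID)
index_body = chunk_index_body(dimension=384, pipeline_name="chunk-embeddings")
search_body = neural_query("What is the termination notice period?", MODEL_ID)
```

These bodies would be sent to PUT /_ingest/pipeline/chunk-embeddings, PUT /<index>, and GET /<index>/_search respectively. Neither the documents nor the query text leave the cluster to be embedded, which is the containment property the talk emphasized.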

The strongest point for enterprise developers was not voice search. It was containment. The talk argued for a self-contained system where embedding generation and retrieval stay inside the OpenSearch environment rather than depending on an external embedding API for every request.

The talk used simple chunking values such as 500-word chunks with 50-word overlap. That is a good starting point, but repository content usually needs smarter boundaries than plain word counts. For real content systems, chunking should respect:

  • Headings and sections
  • Tables and captions
  • Page references
  • Repository metadata boundaries
  • Permission inheritance boundaries when they exist

That is where the Docling talk and the ML Commons talk fit together: better parsing should drive better chunking, and better chunking should drive better retrieval.
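As a small illustration of the difference, the sketch below applies the talk's word-window idea (500 words with a 50-word overlap by default) but never lets a window cross a heading. It assumes the parser has exported Markdown with # headings; the function names and the output shape are made up for this example, and a production chunker would also need to handle tables, captions, and page references.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_sections(markdown):
    """Split Markdown into (heading, body) pairs, one per heading."""
    sections, title, lines = [], "", []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            if lines or title:
                sections.append((title, " ".join(lines)))
            title, lines = match.group(2).strip(), []
        elif line.strip():
            lines.append(line.strip())
    if lines or title:
        sections.append((title, " ".join(lines)))
    return sections


def window_words(text, size=500, overlap=50):
    """Cut text into word windows of `size` words sharing `overlap` words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    windows = []
    for start in range(0, len(words), size - overlap):
        windows.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return windows


def chunk_markdown(markdown, size=500, overlap=50):
    """Chunk each section separately so no chunk spans two headings."""
    chunks = []
    for title, body in split_sections(markdown):
        for part, text in enumerate(window_words(body, size, overlap)):
            chunks.append({"section": title, "part": part, "text": text})
    return chunks
```

Because every chunk keeps its section title, a citation can point to "Termination, part 2" instead of an anonymous slice of text.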

5. Domain-Adapted Neural Search Is the Missing Layer Between Vectors and Usable Enterprise Retrieval

The session "From Embeddings To Index: A Practitioner's Guide To Domain-Adapted Neural Search" by Samuel Herman was one of the most relevant titles on the schedule, but its slide deck was not part of the local downloaded materials. The summary in this section is therefore an informed interpretation based on the session title, the official event schedule, OpenSearch neural-search documentation, and Samuel Herman's public repositories and publication profile.

The important idea is that enterprise retrieval does not fail because vectors exist. It fails because generic embeddings are often too generic for the domain being searched.

In other words, producing embeddings is only the start. The hard part is how those embeddings are aligned with the domain, indexed, retrieved, and combined with metadata and keyword signals.

For teams building a Content Lake, this usually points toward:

  • Hybrid retrieval instead of vector-only retrieval
  • Metadata-aware filtering before or alongside semantic ranking
  • Evaluation on real repository queries, not only synthetic demos
  • Careful model selection and possible fine-tuning or sparse/dense tuning for the corpus
  • Explicit measurement of retrieval quality after ACL filtering, because security filters change relevance outcomes

If I had to compress the lesson for repository developers into one sentence, it would be this: vector search is not a feature you "switch on"; it is a retrieval design exercise that starts with your domain model and security model.

What This Means for Alfresco, Nuxeo, and the Content Lake

Taken together, these talks describe a coherent direction for enterprise content search.

Ingestion Should Become Structure-Aware

Do not reduce ingestion to binary extraction plus plain text. Repository documents should go through a conversion stage that preserves layout, tables, headings, and provenance whenever possible. That is the baseline for better chunking and better citations.

Chunking Should Become Repository-Aware

Chunk boundaries should reflect document structure and repository semantics. A chunk should remain attributable to a source node, page or section, tenant, and ACL context.
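One way to keep that attribution honest is to make it part of the chunk's type rather than an optional extra. The record below is a sketch: every field name is an assumption, and `readers` stands in for whatever flattened ACL representation the repository integration actually produces.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ChunkRecord:
    """A chunk that always carries its provenance and access context."""
    node_id: str    # repository node the chunk was extracted from
    ordinal: int    # position of the chunk within that node
    tenant: str
    section: str
    page: int
    readers: tuple  # authorities allowed to read the source node
    text: str

    @property
    def chunk_id(self):
        """Stable identifier derived from the source node and position."""
        return f"{self.node_id}#{self.ordinal}"

    def to_document(self):
        """Flatten to a JSON-friendly dict suitable for indexing."""
        doc = asdict(self)
        doc["readers"] = list(self.readers)
        doc["chunk_id"] = self.chunk_id
        return doc


record = ChunkRecord(
    node_id="node-42", ordinal=3, tenant="tenant-a",
    section="Termination", page=7,
    readers=("GROUP_LEGAL",), text="Either party may terminate...",
)
document = record.to_document()
```

Because the dataclass has no default values, a chunk cannot be created without a tenant or an ACL context, which turns a missing-permission bug into an immediate construction error rather than a silent leak at query time.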

Retrieval Should Become Hybrid and Policy-Aware

Dense retrieval is useful, but enterprise search still benefits from metadata filters, keyword matching, exact identifiers, path constraints, and permissions. The practical target is hybrid retrieval, not a vector-only stack.
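In OpenSearch terms, this could look like a hybrid query whose keyword and vector branches both carry the same tenant and ACL filter, combined by a search pipeline with a normalization processor. The sketch below only builds the request bodies. The field names (text, embedding, tenant, readers), the weights, and the authority values are assumptions, and tenant and readers are assumed to be keyword fields.

```python
def acl_filter(tenant, authorities):
    """Filter clauses shared by both branches (hypothetical field names)."""
    return [
        {"term": {"tenant": tenant}},
        {"terms": {"readers": sorted(authorities)}},
    ]


def hybrid_search_body(text, vector, tenant, authorities, k=10):
    """Hybrid query: keyword match plus k-NN, each restricted by the same filter."""
    filters = acl_filter(tenant, authorities)
    keyword = {"bool": {"must": [{"match": {"text": text}}], "filter": filters}}
    semantic = {
        "knn": {
            "embedding": {
                "vector": vector,
                "k": k,
                # Filtering inside the k-NN clause, not after the search.
                "filter": {"bool": {"filter": filters}},
            }
        }
    }
    return {"size": k, "query": {"hybrid": {"queries": [keyword, semantic]}}}


def normalization_pipeline(keyword_weight=0.3, vector_weight=0.7):
    """Search pipeline body that normalizes and combines the two score sets."""
    return {
        "description": "Combine keyword and vector scores",
        "phase_results_processors": [
            {
                "normalization-processor": {
                    "normalization": {"technique": "min_max"},
                    "combination": {
                        "technique": "arithmetic_mean",
                        "parameters": {"weights": [keyword_weight, vector_weight]},
                    },
                }
            }
        ],
    }


body = hybrid_search_body(
    "termination notice period", [0.1, 0.2, 0.3],
    tenant="tenant-a", authorities={"GROUP_LEGAL", "jdoe"},
)
```

The pipeline body would be registered under /_search/pipeline/<name> and referenced with the search_pipeline request parameter. Applying the filter inside each branch, rather than trimming results afterwards, keeps unauthorized chunks out of both candidate sets before scores are combined.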

Security Should Be Modeled at Two Layers

There is document access and there is platform resource access. Repository ACLs, index authorization, and alias semantics belong to the first category. Sharing dashboards, models, and detectors belongs to the second. Mixing them carelessly will produce operational pain.

AI Readiness Starts Before the LLM

Most of the quality gains described in these talks happen before answer generation: parsing, enrichment, chunking, indexing, filtering, and retrieval. That is good news for repository teams because these are controllable engineering concerns.

Closing

Thanks to the OpenSearchCon Europe organizers, the OpenSearch Software Foundation, and the Linux Foundation events team for putting together a conference that was genuinely useful for practitioners. It was a strong event for teams working on real enterprise search and AI problems, and we appreciated the chance to participate, learn, and represent Hyland in that conversation.