ragauss

05-31-2013
09:09 AM
What is Metadata Embedding?
Extraction of metadata from binary files is a critical task for enterprise content and digital asset management systems. The information contained in those files can aid in searching, workflows, and user interface visualizations.
Alfresco does a fantastic job of handling metadata extraction through its concept of MetadataExtracters registering themselves in the MetadataExtracterRegistry. Using the Apache Tika project to power many of those extractors enables a huge number of file formats and metadata standards to be supported.
We ingest a binary file, metadata is extracted and mapped to Alfresco data model properties, and we can view and edit those properties in an interface like Alfresco Share.
In some cases it's important to get those property changes or other required fields back into the binary file as metadata. You might, for example, want to set the author metadata in a document or set copyright info in images before sending them outside of your organization.
In 4.2.c we introduced the concept of metadata embedders, which are essentially the inverse of MetadataExtracters and are responsible for writing properties into content.
How Does it Work?
The MetadataEmbedder interface has just two methods, isEmbeddingSupported and embed.
Rather than create an entirely separate registry for embedders, the MetadataExtracterRegistry was extended with a getEmbedder(String sourceMimetype) method. Note that currently only embedders which are also extractors can be registered, but in the future support may be added for explicitly registering embedders. You'd usually implement both in the same class anyway.
Speaking of which, AbstractMappingMetadataExtracter now implements the MetadataEmbedder interface and contains:
- A supportedEmbedMimetypes collection that's used in the isEmbeddingSupported call
- An embedMapping that defines the mapping from Alfresco properties to metadata fields
- An embedInternal method to be overridden by extending classes
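To make the pieces above a little more concrete, here is a minimal, self-contained sketch of how they might fit together. It is not the Alfresco source: the Embedder, MappingEmbedder, Registry, and ToyTextEmbedder types are simplified stand-ins invented for illustration, and properties are plain strings and content is a byte array here, which is an assumption made to keep the example runnable on its own.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Simplified stand-ins for illustration only (hypothetical types, not Alfresco's API).
class EmbedderSketch {

    /** Stand-in for the two-method embedder contract. */
    interface Embedder {
        boolean isEmbeddingSupported(String mimetype);
        byte[] embed(Map<String, String> properties, byte[] content);
    }

    /** Stand-in for a mapping base class that also knows how to embed. */
    abstract static class MappingEmbedder implements Embedder {
        private final Set<String> supportedEmbedMimetypes;
        private final Map<String, String> embedMapping; // property name -> metadata field

        MappingEmbedder(Set<String> supportedEmbedMimetypes, Map<String, String> embedMapping) {
            this.supportedEmbedMimetypes = supportedEmbedMimetypes;
            this.embedMapping = embedMapping;
        }

        @Override
        public boolean isEmbeddingSupported(String mimetype) {
            return supportedEmbedMimetypes.contains(mimetype);
        }

        @Override
        public byte[] embed(Map<String, String> properties, byte[] content) {
            // Translate property names to metadata field names, skipping unmapped ones.
            Map<String, String> metadata = new LinkedHashMap<>();
            for (Map.Entry<String, String> entry : properties.entrySet()) {
                String field = embedMapping.get(entry.getKey());
                if (field != null) {
                    metadata.put(field, entry.getValue());
                }
            }
            return embedInternal(metadata, content);
        }

        /** Format-specific writing is left to subclasses. */
        protected abstract byte[] embedInternal(Map<String, String> metadata, byte[] content);
    }

    /** Stand-in registry: returns the first embedder supporting the mimetype, or null. */
    static class Registry {
        private final List<Embedder> embedders = new ArrayList<>();

        void register(Embedder embedder) {
            embedders.add(embedder);
        }

        Embedder getEmbedder(String sourceMimetype) {
            for (Embedder embedder : embedders) {
                if (embedder.isEmbeddingSupported(sourceMimetype)) {
                    return embedder;
                }
            }
            return null;
        }
    }

    /** Toy subclass: "embeds" by appending field=value lines to plain text. */
    static class ToyTextEmbedder extends MappingEmbedder {
        ToyTextEmbedder() {
            super(Set.of("text/plain"), Map.of("author", "Author"));
        }

        @Override
        protected byte[] embedInternal(Map<String, String> metadata, byte[] content) {
            StringBuilder out = new StringBuilder(new String(content, StandardCharsets.UTF_8));
            metadata.forEach((field, value) -> out.append('\n').append(field).append('=').append(value));
            return out.toString().getBytes(StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) {
        Registry registry = new Registry();
        registry.register(new ToyTextEmbedder());
        Embedder embedder = registry.getEmbedder("text/plain");
        byte[] result = embedder.embed(Map.of("author", "Jane Doe"),
                "hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(result, StandardCharsets.UTF_8));
    }
}
```

The point of the shape is that the base class owns the mimetype check and the property-to-field translation, so an extending class only has to supply the format-specific write in embedInternal.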
For classes extending AbstractMappingMetadataExtracter, the embed mapping can be defined in a properties file in the same location as the extract mapping properties but with an embed suffix, e.g. classpath:/x/y/z/MyExtracter.embed.properties (note that the preferred location for mapping files for extractors and embedders has changed after 4.2.c, see ALF-17891). If no embed properties are found, a reverse mapping of the extract mapping is used by default. Cool, right?
What About Tika?
'But that's still sooooo... abstract. How are we going to leverage Tika? It doesn't support embedding, does it?'
Well as a matter of fact it does, as of version 1.3 (TIKA-775).
The same notion of writing metadata into a binary has been outlined with an interface and basic implementation in Tika, so of course our TikaPoweredMetadataExtracter builds on that. It overrides the embedInternal method defined in its parent AbstractMappingMetadataExtracter to convert Alfresco properties to Tika metadata fields and passes them on to a Tika Embedder's embed method, which then passes back the new binary with the metadata embedded.
How Can we Use Embedding?
Our shiny new Alfresco metadata embedder's embed method isn't very useful if we don't have an easy way to call it, so we've added a ContentMetadataEmbedder action executor which shows up as a standard 'Embed properties as metadata in content' action that can be used in a rule on a folder or executed in a workflow. (After 4.2.c you can find this in alfresco/extension/metadata-embedding-context.xml.sample.)
So what kinds of files and metadata does Tika have embed support for? Truth be told, not many at the moment, but the tika-exiftool project does!
tika-exiftool is a wrapper for calls to the ExifTool command-line tool, and it contains a Tika Parser and Embedder for image files.
The Media Management module contains an example which brings all of this together with an extension of TikaPoweredMetadataExtracter that uses the Tika Embedder defined in the tika-exiftool project to enable IPTC embedding in image files.
We can add an embed rule to a folder that fires on content update such that when we edit our caption field through Share, the new value is embedded in the file and can be seen using standard image metadata tools, like Photoshop's file info.
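As a rough illustration of that flow, the sketch below walks an edited property through an embed mapping to the metadata field an embedder would write, using the default described earlier where the embed mapping is the reverse of the extract mapping. Everything here is an assumption for illustration: the EmbedMappingSketch class, the simplified "field=property" file format, and the caption and creator field names are hypothetical, and the real mapping files and IPTC field names in Alfresco and the Media Management module may differ.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.Properties;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch: not Alfresco code, and the mapping format is simplified.
class EmbedMappingSketch {

    /** Parses "field=prop1,prop2" lines into a field -> properties map. */
    static Map<String, Set<String>> readExtractMapping(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<String, Set<String>> mapping = new TreeMap<>();
        for (String field : props.stringPropertyNames()) {
            Set<String> targets = new TreeSet<>();
            for (String target : props.getProperty(field).split(",")) {
                if (!target.trim().isEmpty()) {
                    targets.add(target.trim());
                }
            }
            mapping.put(field, targets);
        }
        return mapping;
    }

    /** Reverses field -> properties into property -> fields (the default embed mapping). */
    static Map<String, Set<String>> reverse(Map<String, Set<String>> extractMapping) {
        Map<String, Set<String>> embedMapping = new TreeMap<>();
        extractMapping.forEach((field, properties) -> {
            for (String property : properties) {
                embedMapping.computeIfAbsent(property, k -> new TreeSet<>()).add(field);
            }
        });
        return embedMapping;
    }

    /** Converts node properties into the metadata fields an embedder would write. */
    static Map<String, String> toMetadata(Map<String, String> properties,
                                          Map<String, Set<String>> embedMapping) {
        Map<String, String> metadata = new TreeMap<>();
        properties.forEach((property, value) -> {
            // Properties with no mapping are simply not embedded.
            for (String field : embedMapping.getOrDefault(property, Set.of())) {
                metadata.put(field, value);
            }
        });
        return metadata;
    }

    public static void main(String[] args) {
        // Hypothetical extract mapping: metadata field on the left, property on the right.
        String extractProps = "caption=cm:description\ncreator=cm:author\n";
        Map<String, Set<String>> embedMapping = reverse(readExtractMapping(extractProps));

        // Simulates editing one property through the UI.
        Map<String, String> edited = Map.of("cm:description", "Sunset over the bay");
        System.out.println(toMetadata(edited, embedMapping));
    }
}
```

In the real system the resulting field map would be handed to the embedder, which writes it into the binary; here the sketch stops at the mapping step.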
Sit down and stop clapping, everyone is staring at you. Aw, who cares, go ahead.
What's Next?
We'll be adding embed support for more file and metadata types to Tika and Alfresco in the future including, of course, documents, but in the meantime, what other formats are you anxious to start embedding?