Help extracting HTML metadata in WCM

ofwr
Champ in-the-making
We're attempting to import an existing web site which contains metadata into Alfresco WCM.

I have imported a single .html file into a work space, extracted the metadata using the default 'Extract Common Metadata' action, and moved it into a WCM folder.  Because of the number of files involved, this method is not practical for the entire site: bulk import crashes, and I can't see how to easily automate the process.

I read the 'XML Metadata Extractor Configuration for WCM' example on the wiki and have managed to import multiple .xml files and extract their metadata.  This appears to be the best method; however, I can't find clear documentation or an example for doing the same with .html.  Judging by the example, a different method is required, and the Javadocs indicate that all other file types sit in a different class hierarchy.

I'm assuming that the .html implementation should:
* register the HTML extractor in the avmMetadataExtracterRegistry
* construct a HtmlDocumentMetadataExtracter mapping the properties

I have created an example and imported .html files, but I have had no success getting anything out of the HTML extractor.

Help please.  Are there any examples of importing .html into WCM?
1 REPLY

pmonks
Star Contributor
How large was your bulk import, both in number of assets and in total size?  What happens when you import the content via CIFS or FTP?  If you're bulk loading into a DM space (not a Web Project), have you tried importing an ACP file instead (see http://wiki.alfresco.com/wiki/Export_and_Import#Alfresco_Content_Package_.28ACP.29_File_Format)?

In terms of metadata, web content in Alfresco is (with obvious exceptions such as static assets and application code) intended to be XML based, which doesn't require metadata extraction since XML is already a highly structured data format.  The intention is that HTML files are derived from the XML via renditioning templates, rather than the reverse (where metadata is somehow coaxed out of a poorly structured format such as HTML).

With this in mind the content migration exercise becomes:
    1. import existing web site as-is into an Alfresco web project, to create a baseline version of the site
    2. go through the content inventorying process, identifying categories of pages and (as a result) building a list of potential content types
    3. drill down into each of the identified content types and enumerate the properties that make up each of those types (e.g. a press release might have a title, publish date, source, summary blurb, body and zero or more associated images)
    4. implement each type as an Alfresco web form (XML schema) - see http://wiki.alfresco.com/wiki/Forms_Developer_Guide
    5. reverse-engineer XML files from your original HTML assets (imported in step 1) and upload them into the repository
    6. implement renditioning templates for each content type that generate most (or all) of the HTML files imported in step 1
While Alfresco provides value as soon as step 1 is complete (sandboxed content development, workflow, versioning, snapshots, deployment, etc.), some of its most valuable features (web forms, dynamic content querying, etc.) are not readily usable until the content is stored and managed as XML.
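
As a rough illustration of step 5, here is a minimal sketch of how the reverse-engineering might be scripted outside Alfresco, assuming the original pages carry their metadata in `<title>` and `<meta name="..." content="...">` tags. It uses only the Python standard library and no Alfresco API. The names `MetaCollector` and `html_to_xml`, and the `page`, `title` and `meta` element names, are hypothetical placeholders; the real output would have to match whatever XML schema you define for each web form in step 4.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class MetaCollector(HTMLParser):
    """Collects the <title> text and <meta name=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content") is not None:
            # Lower-case the name so "Description" and "description" collapse.
            self.meta[attrs["name"].lower()] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def html_to_xml(html_text, root_name="page"):
    """Return an XML string holding the title and meta tags of one HTML page.

    The element names are illustrative only (an assumption); adapt them to
    the schema of the web form that the page belongs to.
    """
    collector = MetaCollector()
    collector.feed(html_text)
    collector.close()

    root = ET.Element(root_name)
    ET.SubElement(root, "title").text = collector.title.strip()
    for name, content in sorted(collector.meta.items()):
        field = ET.SubElement(root, "meta", {"name": name})
        field.text = content
    return ET.tostring(root, encoding="unicode")
```

A script along these lines could be run over the files imported in step 1, with the resulting XML files then uploaded into the repository. Body content and images would need additional, site-specific handling that this sketch does not attempt.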