Hyland Connect

ofwr · 01-14-2008

We're attempting to import an existing web site which contains metadata into Alfresco WCM.

I have imported a single .html file into a workspace, extracted the metadata using the default action 'Extract Common Metadata', and moved it into a WCM folder. Because of the number of files involved, this method is not practical for the entire site: bulk import crashes, and I can't see how to easily automate the process.

I read the 'XML Metadata Extractor Configuration for WCM' example on the wiki and have managed to import multiple .xml files and extract their metadata. This appears to be the best method; however, I can't find clear documentation or an example for doing the same with .html. Looking at the example, a different method is required, and the Javadocs indicate that all other file types are within a different class hierarchy.

I'm assuming that the .html implementation should:
* register the HTML extractor in the avmMetadataExtracterRegistry
* construct a HtmlDocumentMetadataExtracter mapping the properties

I have created an example and imported .html files, but I have had no success getting anything out of the HTML extractor.
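As a sanity check outside Alfresco, it may help to confirm what the pages actually expose in their `<meta>` tags before debugging the extractor configuration. The following is a minimal sketch in Python using only the standard library. It is not Alfresco's extractor, and the names `MetaTagParser` and `extract_meta` are hypothetical, chosen for illustration.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collects <title> text and <meta name="..." content="..."> pairs."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attributes = dict(attrs)
            name = attributes.get("name")
            content = attributes.get("content")
            # Ignore <meta> tags without a name/content pair (e.g. charset).
            if name and content is not None:
                self.meta[name.lower()] = content

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_meta(html_text):
    """Return a dict of lower-cased meta names to content, plus the title."""
    parser = MetaTagParser()
    parser.feed(html_text)
    parser.close()
    result = dict(parser.meta)
    if parser.title.strip():
        # An explicit <meta name="title"> takes precedence over <title>.
        result.setdefault("title", parser.title.strip())
    return result
```

Running `extract_meta` over a handful of the site's pages shows which property names exist to be mapped. If the dict comes back empty, the problem is likely in the source HTML rather than in the extractor registration.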

Help please. Are there any examples of importing .html into WCM?

pmonks · 01-15-2008

How large, both in terms of # of assets and total size, was your bulk import? What happens when you import the content via CIFS or FTP? If you're bulk loading into a DM space (not a Web Project), have you tried importing an ACP file instead (see http://wiki.alfresco.com/wiki/Export_and_Import#Alfresco_Content_Package_.28ACP.29_File_Format)?

In terms of metadata, web content in Alfresco is (with obvious exceptions such as static assets and application code) intended to be XML based, which doesn't require metadata extraction since XML is already a highly structured data format. The intention is that HTML files are derived from the XML via renditioning templates, rather than the reverse, where metadata is somehow coaxed out of a poorly structured format such as HTML.

With this in mind the content migration exercise becomes:

http://wiki.alfresco.com/wiki/Forms_Developer_Guide

While Alfresco provides value as soon as step 1 is complete (sandboxed content development, workflow, versioning, snapshots, deployment, etc.), some of its most valuable features (web forms, dynamic content querying, etc.) are not readily usable until the content is stored and managed as XML.
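To illustrate what storing the content as XML might look like during a migration, here is a rough sketch in Python that wraps already-extracted metadata and body text in an XML document. The element names (`page`, `metadata`, `field`, `body`) and the function `build_content_xml` are assumptions made for illustration only. In a real project the structure would have to match the XSD of the web form defined for that content type.

```python
import xml.etree.ElementTree as ET


def build_content_xml(metadata, body_text):
    """Serialize a metadata dict and body text as a small XML document.

    The schema used here is hypothetical, not one defined by Alfresco.
    """
    root = ET.Element("page")
    meta_el = ET.SubElement(root, "metadata")
    # Sort the names so the output is stable from run to run.
    for name in sorted(metadata):
        field = ET.SubElement(meta_el, "field", {"name": name})
        field.text = metadata[name]
    body = ET.SubElement(root, "body")
    body.text = body_text
    # ElementTree escapes characters such as & and < on serialization.
    return ET.tostring(root, encoding="unicode")
```

Once content is in a form like this, the HTML becomes an output generated from the XML rather than the source of the metadata.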


Help extracting HTML metadata in WCM