Hyland Connect

jochen · 08-14-2006

Hello

Assuming that someone stores and indexes MS Office documents in Alfresco, I'd like to know how good the resulting index is. Some DMS products do not handle this very well.

Thanks for your help!

Regards,
Jochen

kevinr · 08-15-2006

Text is extracted from MS Office documents using an OpenOffice server. It successfully extracts text from Word, PowerPoint and Excel. PDFBox is used to extract text from PDF documents. Text is extracted from HTML documents using the built-in HTML->text support in the Java Swing library.

So the "quality" of extraction is directly related to the quality of those third-party libraries and services.
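
To give a rough idea of what the HTML case looks like, here is a minimal sketch of HTML->text extraction using the Swing HTML parser from the JDK. This is not Alfresco's actual code: the class name HtmlTextSketch and the method extractText are made up for illustration, and joining text runs with a single space is an assumption of the sketch.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, for illustration only (not part of Alfresco).
final class HtmlTextSketch {

    // Returns the text content of an HTML string with the markup removed.
    static String extractText(String html) {
        final StringBuilder out = new StringBuilder();

        // The Swing parser invokes handleText once per run of character data.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                if (out.length() > 0) {
                    out.append(' '); // assumption: separate runs with one space
                }
                out.append(data);
            }
        };

        try (StringReader reader = new StringReader(html)) {
            // 'true' tells the parser to ignore any charset declared in the document.
            new ParserDelegator().parse(reader, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

Whatever such a callback does not receive never reaches the index, which is why the extractor largely determines the index quality.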

Thanks,

Kevin

Quality of Filters for MSOffice