Does Alfresco Index E-mail Content?

dgildeh
Champ in-the-making
Hi,

I need an answer to this question before I see a client this week, so any help is much appreciated. The client saves their e-mails in Outlook MSG format. I put a test MSG file into Alfresco and tried searching on the e-mail's content, but the content doesn't seem to be indexed. I also tried a plain text file and couldn't search on its content either.

Does Alfresco index Outlook MSG file contents? If not, what do I need to do to index the content of these files? Why didn't the plain text file's contents get indexed? Do I need to set something up, or do I have to wait a while for the content to be indexed because it's a background task?

Thanks,

David
8 REPLIES

rivetlogic
Champ on-the-rise
Alfresco will index any content that it can transform to UTF-8 plain text. If you want a content item to be full-text searchable, a transformation from that item's MIME type to plain text must be registered with Alfresco. I suspect that MSG files are already transformable and that your problem is somewhere else (since plain text should work as well).

Regarding indexing: if you're using the default content model, Alfresco indexes content items within the same transaction that adds them to the repo. So content indexing is not done in the background by default.

If you'd like to learn more about when properties (content is a property) get indexed, refer to this article in the Wiki:
http://wiki.alfresco.com/wiki/Full-Text_Search_Configuration
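
As a rough illustration of the per-property index settings that page covers, here is a minimal sketch of how they might look on the content property in a model file. The values shown are assumptions for illustration, not copied from a shipped model, so check the model in your own version:

```xml
<property name="cm:content">
   <type>d:content</type>
   <index enabled="true">
      <!-- true: index within the transaction that writes the property;
           false: leave the indexing to a background process -->
      <atomic>true</atomic>
      <!-- whether the original value is also kept in the index -->
      <stored>false</stored>
      <!-- whether the value is tokenised before it is indexed -->
      <tokenised>true</tokenised>
   </index>
</property>
```

With `atomic` set to true, the content should be searchable once the transaction that added it commits, provided a text transformation exists for its MIME type.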

Cheers.

–Sumer

paulhh
Champ in-the-making
Hi

I recall someone saying that the MS message format is actually RTF, so adding the RTF mimetype for documents with .msg might actually do what you want.  Let us know if it works!

Cheers
Paul.

dgildeh
Champ in-the-making
Thanks, I will give it a go and let you know the results.

johnsona
Champ in-the-making
Is there any update on this? I read in an RM document that v1.4 of Alfresco now extracts some meta-data from Outlook MSG files (to, from, subject). Is Alfresco now able to unpack the MSG OLE file to extract additional information?

For example, being able to full-text index the body of the email would be incredibly useful.  Extending this to indexing the attachments as well would be useful, but not as critical.

Cheers,

Al.

kevinr
Star Contributor
Yes, Alfresco uses the Apache POI library to deconstruct the Outlook email message OLE file format. Currently we only have a meta-data extractor class, as you mention. We could (and should!) add a full-text extraction transformer based on the same code. It would not be hard to do, but there wasn't time for it in 1.4.

If you fancy trying it, the code to look at is:
org.alfresco.repo.content.metadata.MailMetadataExtracter

That class shows how to extract fields (including the text body of the email message) from the email file. There are plenty of examples of text extractor classes in Alfresco (for Word, PDF etc.) that give a good starting point for adding your own.
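
Once such a transformer class exists, it would also need to be registered so that the repository can find it. A minimal sketch of that registration as a Spring bean follows. The class name `org.example.transform.MsgTextContentTransformer` is hypothetical (it stands for the class you would write yourself), and the `baseContentTransformer` parent bean is an assumption based on how the stock transformers appear to be declared, so compare against the transformer beans in your own installation before relying on it:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE beans PUBLIC "-//SPRING//DTD BEAN//EN" "http://www.springframework.org/dtd/spring-beans.dtd">
<beans>
   <!-- Hypothetical transformer from the Outlook .msg MIME type to plain text.
        The class is not part of Alfresco; it is the one you would write,
        modelled on the existing Word/PDF text extractors. -->
   <bean id="transformer.OutlookMsg"
         class="org.example.transform.MsgTextContentTransformer"
         parent="baseContentTransformer" />
</beans>
```

The .msg extension would also need to map to a MIME type that the transformer declares as its source, otherwise the transformer will never be selected for these files.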

Thanks,

Kevin

johnsona
Champ in-the-making
Hi Kevin,

I'll take a look at it.

I've been trying to test the metadata extractor before I start, but can't see / search on the extracted metadata (or for that matter confirm that anything has been indexed at all apart from the .msg file name).  I'm running the 1.4 community release & have also tried building alfresco.war from svn HEAD.  Do I need to turn anything on to get the metadata extractor working?

Cheers,

al.

kevinr
Star Contributor
The email meta-data fields are part of the standard content model but are not displayed by default. If the extraction occurred correctly, then the following aspect will have been populated:

      <aspect name="cm:emailed">
         <title>Emailed</title>
         <properties>
            <property name="cm:originator">
               <title>Originator</title>
               <type>d:text</type>
            </property>
            <property name="cm:addressee">
               <title>Addressee</title>
               <type>d:text</type>
            </property>
            <property name="cm:addressees">
               <title>Addressees</title>
               <type>d:text</type>
               <multiple>true</multiple>
            </property>
            <property name="cm:subjectline">
               <title>Subject</title>
               <type>d:text</type>
            </property>
            <property name="cm:sentdate">
               <title>Sent Date</title>
               <type>d:datetime</type>
            </property>
         </properties>
      </aspect>
So you need to add the fields you require to your overridden client config to display them in the appropriate screens.
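
A minimal sketch of such an override, assuming it goes into a web client custom config file (typically `web-client-config-custom.xml` in the extension directory; the file name and location are assumptions to verify for your version). The property names are the ones from the aspect definition above:

```xml
<alfresco-config>
   <!-- Applies to any node that carries the cm:emailed aspect -->
   <config evaluator="aspect-name" condition="cm:emailed">
      <property-sheet>
         <show-property name="cm:originator" />
         <show-property name="cm:addressee" />
         <show-property name="cm:addressees" />
         <show-property name="cm:subjectline" />
         <show-property name="cm:sentdate" />
      </property-sheet>
   </config>
</alfresco-config>
```

If the aspect was populated, these fields should then appear on the properties screen for the .msg document; if they show up empty, the extraction most likely did not run against that file.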

The meta-data extractor will only work on Outlook OLE2-format .msg documents.

Thanks,

Kevin

tajensen72
Champ in-the-making
Keep in mind that e-mail volume for any but the smallest organizations will be very large.  Keeping up with the ingestion rate on that can be quite challenging.  I'm not familiar yet with Alfresco's architecture, but it is the kind of thing that caused Documentum to have to rethink their meta-data model and change search engine vendor.

Travis