How does Alfresco's content indexing work?

siquser
Champ in-the-making
We uploaded MS Word and MS Excel documents. When I search for text contained in these documents, search does not return any results.

We are not saying that search does not work: we have tested search on plain-text data and it worked, and searching for text from MS Word / MS Excel files has also worked for us in the past.

What we are unsure about is how long it takes for the indexing server to kick in once a file is uploaded. In our case the file was very small with minimal data in it, yet search still returns nothing 20-30 minutes after the upload. We copied the content of this MS Word file into a text file, uploaded that, and searched: the result was instantaneous.

Question: Is there any configuration we can tweak that says to index the file right away, or to index every <n> minutes?
30 REPLIES

javauser007
Champ in-the-making
Hi Mike,
I did the same as you specified above,
but still no luck.
Is this a bug in Alfresco?

fselendic
Champ in-the-making
I tried several docx files with Labs 3.0 Stable, OO 3.0 portable, and both config changes mentioned here.
It works like a charm: files are indexed and can be searched, thumbnails are created in the document library, and files can be previewed in the Flash document preview component.

javauser007
Champ in-the-making
Were you able to search the content of docx files (not just the file name)?
If yes, please post the changes you made to the configuration files.

Thanks!!!

fselendic
Champ in-the-making
Yes, I can search content of docx file.

I downloaded 3.0-Stable on Windows, and in the default configuration it didn't work: no thumbnails were created, and docx files couldn't be previewed in the Share document previewer.

Then I basically did what is proposed in this thread.

In openoffice-document-formats.xml I added:

<document-format>
   <name>Microsoft Word 2007</name>
   <family>Text</family>
   <mime-type>application/vnd.openxmlformats-officedocument.wordprocessingml.document</mime-type>
   <file-extension>docx</file-extension>
   <export-filters>
      <entry><family>Text</family><string>MS Word 2007</string></entry>
   </export-filters>
</document-format>

And, as MikeH suggested, in content-services-context.xml I added:

<bean id="extracter.OpenOffice" class="org.alfresco.repo.content.metadata.OpenOfficeMetadataExtracter" parent="baseMetadataExtracter">
   <property name="connection">
      <ref bean="openOfficeConnection" />
   </property>
   <property name="supportedMimetypes">
      <list>
         <value>application/msword</value>
         <value>application/vnd.excel</value>
         <value>application/vnd.powerpoint</value>
         <value>application/vnd.openxmlformats-officedocument.wordprocessingml.document</value>
         <value>application/vnd.openxmlformats-officedocument.spreadsheetml.sheet</value>
         <value>application/vnd.openxmlformats-officedocument.presentationml.presentation</value>
      </list>
   </property>
</bean>

In content-services-context.xml, only the part from <property name="supportedMimetypes"> to </property> has to be inserted; the bean definition is already there.
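
To make that concrete, here is just the fragment that would be pasted inside the existing extracter.OpenOffice bean. This is a sketch that assumes the stock bean looks like the one shown above and has no supportedMimetypes property of its own yet; the values are the ones from my config, nothing more.

```xml
<!-- Goes inside the existing <bean id="extracter.OpenOffice" ...> element
     in content-services-context.xml; do not duplicate the bean itself. -->
<property name="supportedMimetypes">
   <list>
      <!-- legacy MS Office formats -->
      <value>application/msword</value>
      <value>application/vnd.excel</value>
      <value>application/vnd.powerpoint</value>
      <!-- Office 2007 formats (docx, xlsx, pptx) -->
      <value>application/vnd.openxmlformats-officedocument.wordprocessingml.document</value>
      <value>application/vnd.openxmlformats-officedocument.spreadsheetml.sheet</value>
      <value>application/vnd.openxmlformats-officedocument.presentationml.presentation</value>
   </list>
</property>
```

If your copy of the bean already has a supportedMimetypes property, adding the three Office 2007 values to its list is probably enough.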

After restarting Alfresco and adding several more docx files, everything works great. I don't know which change actually fixed it, or whether updating just one of the two files, openoffice-document-formats.xml or content-services-context.xml, would be enough.

mikeh
Star Contributor
The first one is for the thumbnails and preview, the second is for the metadata.

Thanks,
Mike

fselendic
Champ in-the-making
mikeh wrote: "The first one is for the thumbnails and preview, the second is for the metadata."

Hi Mike

Which one actually fixes content indexing? :D
And is there any particular reason why the configs for the new MS formats aren't in there by default?

mikeh
Star Contributor
The second one is for indexing. It's an oversight - they'll be in a later release / service pack.

Mike

t_broyer
Champ in-the-making
mikeh wrote: "The second one is for indexing."

Well, I'd rather say that both are for indexing: the first one is for the content (thus indexing for full-text search) and the second one for the metadata (thus indexing… the metadata: title, author, etc.)

Mike, could you confirm?

mikeh
Star Contributor
t_broyer wrote: "Well, I'd rather say that both are for indexing: the first one is for the content (thus indexing for full-text search) and the second one for the metadata (thus indexing… the metadata: title, author, etc.) Mike, could you confirm?"

Yes, sorry - you're quite correct. For some reason I was only thinking about the metadata.

Mike

snow099
Champ in-the-making
I know of one piece of software that can convert PowerPoint to many formats such as MPEG, MOV, Flash and so on; give it a try. PowerPoint Converter