Files imported via CIFS not indexed correctly by Lucene

useeliger
Champ in-the-making
I have found that files (for example PDFs) imported via CIFS are not indexed correctly by Lucene.

When I import the same file via the Alfresco web interface, the file is indexed correctly and I can search across all of its content.

Is this a known issue?
15 REPLIES

useeliger
Champ in-the-making
Does really no one else have this problem?

Perhaps I did not describe my problem correctly. I will try again:

When I add content to Alfresco using the Web-Client (add content), the document is indexed correctly by Lucene. I can search for any word in the document and I find my test document.

But when I import a copy of the same document into the same space in Alfresco using CIFS, the document is not indexed correctly by Lucene.

My test: when I search for the word 'Paris', I find only the document I imported via the Web-Client. But when I search for 'video', I find both.

If you want, I can send you my test PDFs so that you can test it yourself.

rivarola
Champ on-the-rise
Hello,

What is the size of your PDF document in Alfresco when you import it through CIFS? On some platforms there seems to be a bug where the file size is zero. As a consequence, many operations fail on those files.

useeliger
Champ in-the-making
Hi,

Both files have exactly the same size (144 KB), so that's not the problem.

Again: the file which I import through CIFS is also indexed by Lucene, but not in the same way as the one which I import through the Web-Client.

I have two files that are identical except for the file name (AlfrescoLuceneTestdocument.pdf and AlfrescoLuceneTestdocument - Kopie.pdf). The content is exactly the same.

Both documents contain the same words. But when I search for the word 'Paris', I find only the file imported through the Web-Client.
When I search for the word 'video', I find both.

rivarola
Champ on-the-rise
when I search for the word 'Paris', I will find only the file imported through Web-Client.
When I search for the word 'video', I will find both.

Paris… video…
Hmm… Does your document contain a Paris Hilton video? Alfresco may filter those items :lol:

As nobody here can explain this behaviour, you could open a JIRA issue to raise the problem with the Alfresco engineers.

andy
Champ on-the-rise
Hi

It could be that the files generate different tokens (by default only the first 10,000 tokens are indexed). This limit can be increased in the config. If the document structure is different, it could be that the pdf->text transformation does not work, or that it uses a different route the second time and thus produces a different result.
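
To illustrate the token limit mentioned above, here is a minimal Python sketch. It is a toy model, not Alfresco or Lucene code: the `index_tokens` function and the sample document are made up for illustration, and a real analyzer tokenises far more carefully than `split()`.

```python
# Toy model of a per-field token limit; NOT Alfresco or Lucene code.
MAX_FIELD_LENGTH = 10_000  # the default limit mentioned above


def index_tokens(text, max_tokens=MAX_FIELD_LENGTH):
    """Return the set of lowercased tokens that would end up searchable."""
    tokens = text.lower().split()
    # Everything past the limit is silently left out of the index.
    return set(tokens[:max_tokens])


# Hypothetical document: 10,000 filler words followed by one distinctive word.
doc = "video " * 10_000 + "paris"
indexed = index_tokens(doc)

print("video" in indexed)  # True: within the first 10,000 tokens
print("paris" in indexed)  # False: it is token 10,001 and was cut off
```

If two transformation routes produced different amounts of text for the same file, a limit like this could make a word searchable in one copy and not in the other.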

Are you using Open Office? What difference does it make with and without it?

Are the files loaded in the same locale that you search in? Locale affects tokenisation for both indexing and search.

Add these docs in reverse order and see what happens.

Andy

useeliger
Champ in-the-making
I did the same test with the Enterprise Edition 2.1 as well and got the same 'error'.

Let me describe the procedure once again:
I use a simple MS Word 2003 document, one page with about 200 words. As you can see, it is not really a complex document.
When I upload this document (I can give it to you if you want) with the web-client, all words in the document are indexed correctly.
When I upload the same document with CIFS, not all words get indexed.

I think this is a serious problem and should be under investigation by Alfresco.

vinodkumar
Champ in-the-making
I am using Alfresco 2.1 CE and I have the same problem. Files loaded through CIFS are not indexed. If I upload the file using the webclient or the PHP API, it is indexed. I have tested with a Word document of approximately 1000 words.

Please give pointers to solve this issue. Thanks in advance for your help.

Thanks,
Vinoda Kumar S.

useeliger
Champ in-the-making
This problem is caused by different locales. If the locale of the webclient is en_us and the locale of CIFS is de_at (this was my environment), then only the documents with locale en_us get indexed correctly.

In other words: if you use a different language on your Windows clients than you use with the Alfresco webclient, then you will face this problem.
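
One way a locale mismatch could produce exactly this symptom is sketched below in Python. This is an assumption for illustration only, not a description of Alfresco's actual analyzers: `analyze_en`, `analyze_de`, the naive "strip a trailing s" rule, and the file names are all hypothetical. The point is only that a term indexed with one analyzer may not match a query processed with another.

```python
# Toy analyzers; hypothetical stand-ins, NOT Alfresco's real per-locale analyzers.
def analyze_en(text):
    # Naive "stemming": strip a trailing "s" from each lowercased word.
    return [w[:-1] if w.endswith("s") else w for w in text.lower().split()]


def analyze_de(text):
    # This toy version only lowercases and does no stemming.
    return text.lower().split()


ANALYZERS = {"en_US": analyze_en, "de_AT": analyze_de}


def build_index(docs):
    """Map each token to the names of the documents containing it."""
    index = {}
    for name, locale, text in docs:
        # Each document is analyzed with the locale it was imported under.
        for token in ANALYZERS[locale](text):
            index.setdefault(token, set()).add(name)
    return index


def search(index, query, locale):
    """Analyze the query with the searcher's locale and look up every term."""
    terms = ANALYZERS[locale](query)
    hits = [index.get(term, set()) for term in terms]
    return set.intersection(*hits) if hits else set()


# Same content, imported under two different locales (hypothetical names).
docs = [
    ("via-webclient.pdf", "en_US", "Paris video"),
    ("via-cifs.pdf", "de_AT", "Paris video"),
]
index = build_index(docs)

print(sorted(search(index, "Paris", "en_US")))  # ['via-webclient.pdf']
print(sorted(search(index, "video", "en_US")))  # ['via-cifs.pdf', 'via-webclient.pdf']
```

In this toy model 'Paris' is stored as 'pari' for one document and as 'paris' for the other, so a search from the en_US side finds only the first, while 'video' is stored identically for both.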

vinodkumar
Champ in-the-making
I am using the webclient and CIFS on the same Windows box, so the locale will be the same, right?

Thanks,
Vinod