
Primitive search missing documents

mwildam
Champ in-the-making
I have added a few documents to a test folder using CIFS.

Then I searched for those documents and discovered that e.g. JPGs containing the search term in the title do not show up in the search results.

Why? And how can this be changed?
10 REPLIES

mwildam
Champ in-the-making
BTW: I used the Alfresco Web Client (JSF variant), but the problem does not seem to be client-specific: a query using the RESTful web services returned the same result set, with those files missing.

mdutoo
Champ on-the-rise
Hi mwildam

Did you try the Advanced search in the JSF client, explicitly putting your title criterion in the title search field?
You can test exact Lucene searches by entering them in the Node Browser (accessible from the Admin panel), following http://wiki.alfresco.com/wiki/Search .
If that doesn't work either, the title metadata must have been badly indexed, but that would be bizarre.
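To make this concrete, here is a minimal sketch of how such Node Browser queries could be assembled. The `@prefix\:property:"term"` and `TEXT:"term"` forms are the ones described on the wiki Search page; the helper names `property_query` and `text_query` are only illustrative and not part of any Alfresco API, and the escaping is deliberately simplistic.

```python
# Hypothetical helpers that only build query strings to paste into the
# Node Browser (with "lucene" selected). They do not talk to Alfresco.

def property_query(prefixed_property, term, quote=True):
    # "cm:title" -> "@cm\:title" (the colon after the prefix is escaped).
    prefix, local_name = prefixed_property.split(":", 1)
    field = "@" + prefix + "\\:" + local_name
    if quote:
        # Escape backslashes and double quotes inside a quoted term.
        term = term.replace("\\", "\\\\").replace('"', '\\"')
        return field + ':"' + term + '"'
    # Unquoted form, e.g. for trying a trailing wildcard.
    return field + ":" + term


def text_query(term):
    # Full-text (content) query.
    term = term.replace("\\", "\\\\").replace('"', '\\"')
    return 'TEXT:"' + term + '"'


if __name__ == "__main__":
    print(property_query("cm:title", "test"))               # @cm\:title:"test"
    print(property_query("cm:name", "test*", quote=False))  # @cm\:name:test*
    print(text_query("test"))                               # TEXT:"test"
```

Running the three queries separately shows whether the name, the title, or only the content is actually indexed for a given file.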

Regards,
Marc

mwildam
Champ in-the-making
I made some more tests…

The Node Browser finds 13 items, the Alfresco Web Client finds 14, Alfresco Share 13.
Note: The test documents are in a folder that should be accessible from both clients. I think (as far as I have seen) the one extra hit in the Web Client was just a script.
However, there are more documents that should match - I went there using CIFS and looked at the file names. I found 24 (!) documents having "Test" in the title, and I just searched for "test". So there is a big difference between what is found and what is there.

I found out that it might depend on the file format or extension. I have a Test.eml that is not found either, and files with the extensions zip, txt, wav, msg, rtf, png, jpg are not found. Is it possible that only a few file types are known and that a simple search only searches the content and not the title?

Using the advanced search and specifying the title returns just the script and the folder "test"…

So neither the simple nor the advanced search works.

mwildam
Champ in-the-making
I tested several ways of uploading a jpg, just to make sure that it is not "just" an issue with CIFS. Nevertheless, none of them works.

mwildam
Champ in-the-making
Discovered the following:

a) If you upload a document using the web client, it is indexed; if you use CIFS, it is not.

b) You must use *<searchcriteria>* in order to find the files containing test in the name.

mdutoo
Champ on-the-rise
Hi mwildam

a) Surprising, but various causes have been discussed at http://forums.alfresco.com/en/viewtopic.php?f=3&t=9577
b) Also, tokenization may come into the picture. Since it works for you using wildcards (*), I guess you're not tokenizing. To enable tokenization, put quotes around your criterion and it should work the way you want; however, wildcards (*?) won't work anymore this way. See the wiki Search page also.
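To illustrate why a wildcard can rescue a search on an untokenized field, here is a toy model in Python. It is not how Lucene or Alfresco are implemented; the function names and the word-splitting rule are assumptions made only for the illustration.

```python
import fnmatch
import re


def index_terms(name, tokenized):
    # Toy indexing step: either split the name into word tokens,
    # or keep the whole lower-cased name as one single term.
    lowered = name.lower()
    if tokenized:
        return [t for t in re.split(r"\W+", lowered) if t]
    return [lowered]


def matches(query, name, tokenized):
    # A query matches if it equals (or, with * and ?, matches) an indexed term.
    pattern = query.lower()
    return any(fnmatch.fnmatchcase(term, pattern)
               for term in index_terms(name, tokenized))


if __name__ == "__main__":
    print(matches("luke", "luke.doc", tokenized=False))   # False
    print(matches("luke*", "luke.doc", tokenized=False))  # True
    print(matches("luke", "luke.doc", tokenized=True))    # True
```

In this model, an untokenized "luke.doc" is the single term "luke.doc", so only "luke*" reaches it, whereas a tokenized name also yields the term "luke".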

Regards,
Marc

mwildam
Champ in-the-making
I found out that when you upload - let's say a file named "luke.doc" (just to give an example) via CIFS then you need to use luke* when you search.

When you use the web client to upload you can use just luke.

However, the behavior changes - sometimes, after searching for something else and then retrying, it does not work any more and you must use luke*.
Unfortunately this is not reproducible - it only happens sometimes. I tried playing around with the browser locales (Firefox) and at first it seemed to make a difference, but by now I am quite sure it has nothing to do with it.

It now seems that after a server restart I need to use the asterisk for the same document.

But I have a few documents in the repository that still - after a server restart - do not need the asterisk at all. On closer examination I found that, for some of those test documents, the word "Test" also appears in the title property of the file itself, which was extracted automatically when importing it using the web client.

OK, I made further tests with another document. At first only the name of the document was set - in my sample "Sepp sei kein Depp.doc", results:

Searched for "Sepp": Document found.
Searched for "kein": Document not found (I thought, maybe a stop word), ok
Searched for "Depp": Document not found, ?!?!?
Searched for "sei": Document found, ?!?!?!?!?

OK, now edited title property to "Sepp sei kein Depp", results:
Searched for "kein": Document found, aha: Maybe only title is correctly indexed…

Edited title property to "Sepp sei Depp", results:
Searched for "kein": Document found ?!?!?!??

Restarting Alfresco Server.

Searched for "kein": Document found ?!??!!?
Searched for other documents with variants of the name: None of those were found where I had not also put the name in the title. So the name does not seem to be searchable after a server restart.
Searched again for "kein": Previously found document not found any more. - AHA! Fulltext indexing of new documents seems to work much faster than updating existing documents, mmhh, but never mind.

Edited Title of Document adding the word "Habakuk":
Searched for "habakuk": No document found - ok I have to wait maybe a while…
Searched for "sepp": Found several documents having "Sepp" in their name (but not in their title or content), but "Sepp.doc" can only be found using "Sepp*" (although none of those documents were uploaded using CIFS).
… waited about 5 minutes … Document still not found.
OK, I am restarting Alfresco Server…
Did not help, I am still waiting a while - maybe it takes longer, so I am waiting…
Waited about half an hour - no change in behavior.

Tested on 2 other installations on 2 different machines: One machine shows the same or similar behavior (I did not test everything in detail), the other seems to work. All run exactly the same version of Alfresco: Labs - v3.0.0 (Stable 1526).

…Correction - even the machine that seems to work starts to show similar behavior after an Alfresco server restart.

So my conclusion: I do not have any idea how the indexing logic works - in other words: It's simply not reliable!
And searching is a very basic feature of a DMS/ECM. A customer will cut my head off if this is not working. We are shortly before our first real project.


What shall I do?

mwildam
Champ in-the-making
We have now tried another installation (the 3rd or 4th one), where it works.

The most obvious difference is that the installation where it works runs on Linux (Debian), while the others, where it doesn't work, run on Windows.
It could also be that all the other installations - where it does not work - already had a previous version of Alfresco installed - although I think that at least on my machine I deleted everything (incl. alf_data).

So is it possible that this is a Windows-specific problem? I noticed that the Java version is also slightly different (but always 1.6).

mdutoo
Champ on-the-rise
Hi mwildam

Search may seem simple, but it is actually a complex and tricky domain, as you found out.
A common problem is that analyzers, and consequently stemming, depend on the locale 1. at indexing time (when the document is uploaded or edited) and 2. at access time.
Moreover, locale-specific analyzers are often not as good as the English one, and this further degrades the perceived efficiency of search. Disabling analyzers is possible by disabling tokenization; however, only exact search is possible afterwards, which can be OK on a given metadata field but not e.g. on full content.
Anyway, if you want to ascertain precisely the effectiveness of search in your context (locale, business, metadata, rights…), at some point you have to define sample documents and expected search behaviour, in order to start fine-tuning.

And I'm not even talking about fuzziness, orthography, facets...
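As a rough illustration of the locale point, here is a toy Python model in which the same document is indexed under one locale and queried under another. The stop-word lists are invented for the example and are not Lucene's real lists; the analyzer is a plain word splitter, not an Alfresco or Lucene analyzer.

```python
import re

# Hypothetical per-locale stop words, for illustration only.
STOP_WORDS = {
    "en": {"the", "a", "is"},
    "de": {"kein", "der", "die", "das"},
}


def analyze(text, locale):
    # Lower-case, split into words, drop the locale's stop words.
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS.get(locale, set())]


def build_index(names, locale):
    # Map each surviving token to the set of documents containing it.
    index = {}
    for name in names:
        for token in analyze(name, locale):
            index.setdefault(token, set()).add(name)
    return index


def search(index, query, locale):
    # The query goes through an analyzer too, possibly a different one.
    terms = analyze(query, locale)
    if not terms:
        return set()
    return set.intersection(*(index.get(t, set()) for t in terms))


if __name__ == "__main__":
    docs = ["Sepp sei kein Depp.doc"]
    index_en = build_index(docs, "en")
    index_de = build_index(docs, "de")
    print(search(index_en, "kein", "en"))  # found: term was indexed and queried as-is
    print(search(index_en, "kein", "de"))  # empty: the query term is dropped
    print(search(index_de, "kein", "en"))  # empty: the term was never indexed
```

The same word can therefore be findable or not depending on which locale was active when the document was indexed and which one is active when the query is run.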
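Such a set of sample documents and expected results could be written down as a small table and checked automatically after every upload, edit or restart. The sketch below is only an outline in Python: `naive_search` is a stand-in that does a substring match on file names, and in practice you would replace it with a call to your own repository (that part is not shown and would be specific to your setup).

```python
# Sample documents and the results we expect for each query (illustrative).
SAMPLE_NAMES = ["Sepp sei kein Depp.doc", "luke.doc"]

EXPECTED = {
    "sepp": ["Sepp sei kein Depp.doc"],
    "depp": ["Sepp sei kein Depp.doc"],
    "luke": ["luke.doc"],
    "habakuk": [],
}


def naive_search(query):
    # Stand-in search: case-insensitive substring match on the file name.
    return [n for n in SAMPLE_NAMES if query.lower() in n.lower()]


def check_expectations(search, expectations):
    # Return (query, missing, unexpected) for every query that deviates.
    failures = []
    for query, expected in expectations.items():
        found = set(search(query))
        wanted = set(expected)
        if found != wanted:
            failures.append((query, sorted(wanted - found), sorted(found - wanted)))
    return failures


if __name__ == "__main__":
    for query, missing, unexpected in check_expectations(naive_search, EXPECTED):
        print(query, "missing:", missing, "unexpected:", unexpected)
```

An empty result means every query behaved as expected; anything else lists exactly which documents were missing or unexpectedly returned, which gives a reproducible basis for fine-tuning.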

Regards,
Marc