Full-text search returns incomplete result list

db78
Champ in-the-making
Hello,

We are connecting to an Alfresco repository (Community Edition 3.2r) through the Web Service API. Our problem is that a full-text search does not return a complete result list. Several of our documents contain, for example, the word "clogging". The search returns some of them, but other documents (mostly PDFs) are missing from the result list. The missing documents are not scans, so this should not be an OCR problem. The query we are firing is:

    PATH:"/app:company_home/cm:Documents/cm:Member//*" AND (@cm\:content.mimetype:application/* AND (TEXT:clogging))

Has anyone had a similar problem with the full-text search?

Cheers,
Daniel

loftux
Star Contributor
You are probably excluding documents with the @cm\:content.mimetype:application/* clause. Text files, for example, have the mimetype text/plain.
Try removing that part and see what you get.
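To make that comparison easy to repeat, here is a minimal sketch that builds the query from the first post both with and without the mimetype clause. `QueryVariants` and `build` are hypothetical names, not part of the Alfresco API, and the path assumes the folder in the original query is `cm:Documents`. It only assembles the strings; they still have to be run through the search service or the node browser.

```java
// Hypothetical helper (not an Alfresco class): assembles the Lucene query
// from this thread with or without the mimetype restriction.
final class QueryVariants {
    // Folder path as given in the original post (assumed to be cm:Documents).
    static final String PATH =
        "PATH:\"/app:company_home/cm:Documents/cm:Member//*\"";
    // The clause suspected of filtering out non-application/* content.
    static final String MIMETYPE = "@cm\\:content.mimetype:application/*";

    static String build(String term, boolean restrictMimetype) {
        StringBuilder query = new StringBuilder(PATH);
        if (restrictMimetype) {
            query.append(" AND ").append(MIMETYPE);
        }
        query.append(" AND TEXT:").append(term);
        return query.toString();
    }

    public static void main(String[] args) {
        // Print both variants so their result counts can be compared.
        System.out.println(build("clogging", true));
        System.out.println(build("clogging", false));
    }
}
```

If the second variant returns documents the first one misses, the mimetype clause is the culprit; if both miss the same PDFs, the cause lies elsewhere.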

db78
Champ in-the-making
Hello Loftux,

thank you for your reply!

As I mentioned before, we have this problem with PDF files as well, so it cannot be caused by filtering the content on the application/* mimetypes. Also, some PDF documents are found, but some that we expect are not …

I am starting to suspect a configuration issue with the repository index. Currently we are using the default configuration. Do you have any other ideas about what the cause could be?

Cheers,
Daniel

mrogers
Star Contributor
Some PDF files are simply images wrapped up in the PDF format, in which case text extraction becomes difficult.

Indexing is a two-stage process: the first step is to extract the text, and the next step is to index that plain text. One simple test is to transform the PDF file to plain text in Alfresco and see if you get anything sensible.

db78
Champ in-the-making
Hello mrogers,

Thanks for the hint, but the documents we expect to find are not images. If you open them in Adobe Reader you can select the text. I still think this might be a problem with the index, because we have many documents in the repository (several gigabytes).

Any other hint?  :roll:

Cheers,
Daniel

loftux
Star Contributor
http://www.google.com/search?q=site:forums.alfresco.com+lucene+results+returned
http://www.google.com/search?q=site:forums.alfresco.com+lucene+results+analyzer
http://www.google.com/search?q=site:forums.alfresco.com+lucene+results+locale

One of these searches may give you a hint ;)

I think it may be because some of the documents are indexed with a different locale than you expect. Try a wildcard search like clogg* and see if you get what you expect. If so, then most likely Lucene has indexed the docs with different locales, and that can give unexpected results.

db78
Champ in-the-making
Hi Loftux,

Thanks for your reply and the idea. Is this possible if we only host documents written in English? The wildcard searches "clog*" and "*clog*" do not work either.

The document I am searching for also contains the words "fuel sulphur content". I tried a full-text search for the exact phrase and the document was not found (only the other documents that contain this string). I also tried "fuelsulphurcontent" in case the spaces had been removed, but no result. And the document isn't a special one. It only contains text in a simple structure, which can be selected and searched in the PDF reader.

But(!) if I narrow the path to the space closer to where the document is stored (e.g. PATH:"/app:company_home/cm:Documents/cm:Member/cm:Working_Group/cm:Marine//*" AND (@cm\:content.mimetype:application/* AND (TEXT:"clogging"))), the search is successful! But(!) only if I use one of the latest versions of Alfresco Labs! No result with r3.2.

Interesting … but I am lost  :?

Cheers,
Daniel

db78
Champ in-the-making
Hi,

Additionally, I re-checked with the new Alfresco version and, to correct my last post, the document is also found when the space where it is located is not specified! So the conclusion is that the cause is likely the old Alfresco version 3.2 we were using.

Anyway, thank you for your help!

Cheers,
Daniel

alf_kin
Champ in-the-making
Same problem for me. When I run the full-text search in the node browser in /alfresco, the result set is not empty. If I perform the same query through a web script, the result set is empty. Did adding locales work for you? This is my code:


    sp.addStore(Repository.getStoreRef());
    sp.setLanguage(SearchService.LANGUAGE_LUCENE);
    sp.setQuery("PATH:\"/app:company_home/st:sites/cm:swsdp/cm:documentLibrary//*\" AND @cm\\:content.mimetype:application/* AND (" + queryString.trim() + ")");
    typesRecords = registry.getSearchService().query(sp);


where


    queryString = "TEXT:\"" + token + "\" ";


This query returns a result set only for some words, whereas the same query in the Alfresco node browser does return results …
Any hints?
Thank you
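One thing worth ruling out when the node browser and the web script disagree is that the two are not actually running the same string. The sketch below is only an illustration: `TextQuery`, `escapePhrase` and `build` are hypothetical names, not Alfresco API. It assembles the query from the code above in one place and escapes backslashes and double quotes in the token, since either would otherwise break the quoted TEXT phrase. The printed string can be pasted into the node browser unchanged for a like-for-like comparison.

```java
// Hypothetical helper (not an Alfresco class): builds the web-script query
// as one string so the exact same text can be tried in the node browser.
final class TextQuery {
    // Escape backslash and double quote, the two characters that would
    // terminate or corrupt a quoted Lucene phrase.
    static String escapePhrase(String token) {
        return token.trim()
                    .replace("\\", "\\\\")
                    .replace("\"", "\\\"");
    }

    // folderPath is the repository path without the trailing //* selector.
    static String build(String folderPath, String token) {
        return "PATH:\"" + folderPath + "//*\""
            + " AND @cm\\:content.mimetype:application/*"
            + " AND (TEXT:\"" + escapePhrase(token) + "\")";
    }

    public static void main(String[] args) {
        String query = build(
            "/app:company_home/st:sites/cm:swsdp/cm:documentLibrary",
            "fuel sulphur content");
        // Paste this output into the node browser to compare results.
        System.out.println(query);
    }
}
```

If the identical string still behaves differently in the two places, the difference is more likely in the search parameters (store, language, locales) than in the query text.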