I am running Alfresco Labs 3 Stable on Debian 4.0 x86_64 with MySQL 5 and the Tomcat bundle.
I am currently testing with ABBYY FineReader, creating "text below picture" PDFs and saving them to Alfresco via CIFS. Alfresco indexes some words but not others. What does that depend on?
If you want to see what's indexed, there are two ways:
- use Luke (a Lucene tool that lets you inspect your index content)
- look at the txt files generated in the tomcat temp/Alfresco directory
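For the second option, a small script can scan the extracted text files for a search term. This is only a sketch: the directory path, the `.txt` suffix, and the function name `files_containing` are assumptions, so adjust them to your installation.

```python
from pathlib import Path


def files_containing(directory, term, suffix=".txt"):
    """Return sorted names of files under `directory` whose text contains `term` (case-insensitive)."""
    root = Path(directory)
    if not root.is_dir():
        return []
    needle = term.lower()
    hits = []
    for path in root.rglob("*" + suffix):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going if a file has an unexpected encoding
        text = path.read_text(encoding="utf-8", errors="replace")
        if needle in text.lower():
            hits.append(path.name)
    return sorted(hits)


if __name__ == "__main__":
    # Hypothetical location; use the temp/Alfresco directory of your own Tomcat.
    for name in files_containing("/opt/alfresco/tomcat/temp/Alfresco", "vienna"):
        print(name)
```

If a word you can read in the PDF never shows up in any of these files, the problem is in the text extraction and not in the search itself.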
We have the same problem, and after some tests I found that it occurs when such a document is added via CIFS. The search results differ depending on whether you add an OCRed PDF via CIFS or via the Web-Client or another client.
The document added via CIFS cannot be found with the same search string as the same document added via the client. Often (maybe always) a wildcard search helps: if you search for vienna, for example, you have to use *vienna* to find the document. Phrase searches such as "vacation in vienna" also do not work with PDFs added via CIFS, but the same search works correctly if the file was added via the Web-Client.
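One possible explanation for this pattern, offered only as an assumption and not confirmed for Alfresco, is that the text extracted on the CIFS path has lost its word breaks, so several words end up in the index as one long token. The sketch below uses a toy whitespace tokenizer (not Lucene's analyzer) to show why an exact term or phrase search would then fail while a `*vienna*` wildcard still matches. Python's `fnmatch` stands in for the wildcard search, and the sample strings are made up.

```python
from fnmatch import fnmatch


def tokenize(text):
    """Toy analyzer: lowercase and split on whitespace (Lucene's real analyzers do more)."""
    return text.lower().split()


def term_search(tokens, term):
    """Exact term lookup, like searching for: vienna"""
    return term.lower() in tokens


def wildcard_search(tokens, pattern):
    """Wildcard lookup, like searching for: *vienna*"""
    return any(fnmatch(tok, pattern.lower()) for tok in tokens)


def phrase_search(tokens, phrase):
    """Phrase lookup: the words must appear as consecutive tokens."""
    words = phrase.lower().split()
    return any(tokens[i:i + len(words)] == words
               for i in range(len(tokens) - len(words) + 1))


# Hypothetical extraction results for the same page.
clean = tokenize("Vacation in Vienna")   # word breaks preserved
glued = tokenize("VacationinVienna")     # word breaks lost (assumed failure mode)

print(term_search(clean, "vienna"), term_search(glued, "vienna"))
print(wildcard_search(glued, "*vienna*"))
print(phrase_search(clean, "vacation in vienna"), phrase_search(glued, "vacation in vienna"))
```

If the real index shows long run-together terms for the CIFS copy, that would support this explanation; if the terms look identical in both copies, the cause lies elsewhere.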
This is a very strange problem that causes a lot of confusion when testing and using the system, because one of the most important things about an ECM ("E" for enterprise) should be the ability to find the documents you added to the repository, and not xx% of the time but always, 100%.
Possible causes:
- problem with language settings? We use a German XP to access Alfresco CIFS and add PDF files via drag & drop, and we use "english" as the selected language for the Web-Client
- problem with the PDF to text extraction
- ……
The same problem occurs with Enterprise 3.0 and Community 3.2.
Any ideas how to solve this? There seem to be some topics in the forum regarding this problem.
The idea is to look at what's in the Lucene index with Luke (http://www.getopt.org/luke/), once when the document is added through the web interface and once when it is added through CIFS, and compare the two.
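Once you have the terms (or the extracted text) for both copies of the document, a quick set comparison shows which words went missing. A minimal sketch, assuming you have saved the two texts yourself; the function name `missing_terms` and the sample strings are illustrative only, and the regex tokenizer is a simplification of what Lucene does.

```python
import re


def terms(text):
    """Lowercased set of word-like tokens (a simplification of Lucene analysis)."""
    return set(re.findall(r"\w+", text.lower()))


def missing_terms(reference_text, other_text):
    """Terms present in the reference text but absent from the other one, sorted."""
    return sorted(terms(reference_text) - terms(other_text))


if __name__ == "__main__":
    # Hypothetical samples standing in for the web-client and CIFS versions.
    web_text = "Vacation in Vienna 2009"
    cifs_text = "Vacationin Vienna 2009"
    print(missing_terms(web_text, cifs_text))
```

Run it in both directions: terms that exist only in the CIFS copy (such as run-together words) are just as telling as terms that are missing from it.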
Also have a look at the document's metadata in the Alfresco node explorer, and check the content field (it contains information about the language, which the indexer may use).
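To compare the content field of the two copies, something like the following may help. It assumes the field is displayed as a single string of `key=value` pairs separated by `|` (for example with `mimetype`, `encoding`, and `locale` entries); check that against what the node explorer shows on your system. The sample values are made up, and `parse_content_field` and `differing_keys` are hypothetical helpers.

```python
def parse_content_field(value):
    """Split 'k1=v1|k2=v2|...' into a dict; parts without '=' are skipped."""
    result = {}
    for part in value.split("|"):
        key, sep, val = part.partition("=")
        if sep:
            result[key.strip()] = val.strip()
    return result


def differing_keys(a, b):
    """Keys whose values differ between two parsed content fields, sorted."""
    return sorted(k for k in set(a) | set(b) if a.get(k) != b.get(k))


if __name__ == "__main__":
    # Made-up values for illustration; paste the real strings from both documents.
    web = parse_content_field(
        "contentUrl=store://a.bin|mimetype=application/pdf|size=1024|encoding=UTF-8|locale=en_US_")
    cifs = parse_content_field(
        "contentUrl=store://b.bin|mimetype=application/pdf|size=1024|encoding=UTF-8|locale=de_DE_")
    print(differing_keys(web, cifs))
```

The content URL will always differ between two copies; a difference in the locale, encoding, or mimetype entries is the interesting part.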