Lucene search and content indexing in PDF documents

ricardoc-moreda
Champ in-the-making
Hi everyone,

I'm having trouble getting reliable results from Alfresco's Lucene search on PDF documents.

Examples:
Search Language:     lucene
Search:    PATH:"/app:company_home/cm:Empresa/cm:Expediente/cm://*"

Results (14 rows)
Parent Node Name
_x0032_010 workspace://SpacesStore/837eda52-bc75-4fba-b78a-2a7e694b6542 workspace://SpacesStore/0ec3f10e-c165-4a6e-ac32-d61f6539af33
_x0030_2 workspace://SpacesStore/8f6b470c-5dc1-49af-b5a2-33f7653c6c03 workspace://SpacesStore/837eda52-bc75-4fba-b78a-2a7e694b6542
_x0031_8 workspace://SpacesStore/15e04bc9-9389-4618-aba6-faac4b6bf845 workspace://SpacesStore/8f6b470c-5dc1-49af-b5a2-33f7653c6c03
Manual_Alfresco.pdf workspace://SpacesStore/bb8bd6ec-5a5b-4a8e-9531-03f1f427b57b workspace://SpacesStore/15e04bc9-9389-4618-aba6-faac4b6bf845
Printing.pdf workspace://SpacesStore/871d8e51-d55e-428b-acf4-bcf0b2d093f5 workspace://SpacesStore/15e04bc9-9389-4618-aba6-faac4b6bf845
ManualAlfresco.pdf workspace://SpacesStore/3b51672f-59df-497e-a799-cef3e0c3ca6b workspace://SpacesStore/15e04bc9-9389-4618-aba6-faac4b6bf845
_x0032_010 workspace://SpacesStore/3be9f159-d0e4-4ac8-a627-cd8a79858a65 workspace://SpacesStore/73c12541-b8c6-4aad-8b1c-0aed36325e84
_x0030_2 workspace://SpacesStore/d39a9a97-1ad6-4f3d-ab49-fa7f30ad2476 workspace://SpacesStore/3be9f159-d0e4-4ac8-a627-cd8a79858a65
_x0031_8 workspace://SpacesStore/d46ad196-9355-481c-941c-162d77d28346 workspace://SpacesStore/d39a9a97-1ad6-4f3d-ab49-fa7f30ad2476
Find_accessed_file_in_past_1_or_2_minutes.pdf workspace://SpacesStore/06e8d206-33f7-4a9e-9392-1bcc7beab0c8 workspace://SpacesStore/d46ad196-9355-481c-941c-162d77d28346
_x0032_010 workspace://SpacesStore/9d033cea-e913-4fc6-83e2-67bfca1efa0d workspace://SpacesStore/35f6551d-2e35-4275-b994-b73fd998f864
_x0030_2 workspace://SpacesStore/e177b105-c82d-4a77-b91c-b0597af95063 workspace://SpacesStore/9d033cea-e913-4fc6-83e2-67bfca1efa0d
_x0031_8 workspace://SpacesStore/301cca69-dc63-43e0-bf8c-1bde0f95aa2f workspace://SpacesStore/e177b105-c82d-4a77-b91c-b0597af95063
Printing_x0020__x0028_copy_x0029_.pdf workspace://SpacesStore/c1055e34-77c7-4f61-8b3c-fc7be662b69e workspace://SpacesStore/301cca69-dc63-43e0-bf8c-1bde0f95aa2f

Ignoring the folder results (the _x003* entries), all documents listed here have properties with the value "admin". So, if indexing works properly, they should all appear in a search for that word.
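
For anyone reading the result names: the _x0032_010 style entries are ISO 9075 encoded node names, where each _xHHHH_ sequence stands for one character given by its hexadecimal code. Here is a minimal Python sketch of the decoding, just to make the listing readable. The function name decode_iso9075 is my own and this is not Alfresco's implementation.

```python
import re

# Matches one ISO 9075 escape such as _x0032_ (exactly four hex digits).
_ESCAPE = re.compile(r"_x([0-9A-Fa-f]{4})_")


def decode_iso9075(name: str) -> str:
    """Replace each _xHHHH_ escape with the character it encodes."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), name)


# Names copied from the search results above.
for encoded in [
    "_x0032_010",
    "_x0030_2",
    "_x0031_8",
    "Printing_x0020__x0028_copy_x0029_.pdf",
]:
    print(encoded, "->", decode_iso9075(encoded))
```

So the three encoded entries are the folders 2010, 02 and 18, and the last file is "Printing (copy).pdf".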

However, with:
Search Language:     lucene
Search:    PATH:"/app:company_home/cm:Empresa/cm:Expediente/cm://*" AND (TEXT:*admin*)

Results (2 rows)
Parent Node Name
Printing.pdf workspace://SpacesStore/871d8e51-d55e-428b-acf4-bcf0b2d093f5 workspace://SpacesStore/15e04bc9-9389-4618-aba6-faac4b6bf845
Printing_x0020__x0028_copy_x0029_.pdf workspace://SpacesStore/c1055e34-77c7-4f61-8b3c-fc7be662b69e workspace://SpacesStore/301cca69-dc63-43e0-bf8c-1bde0f95aa2f

Only two documents appear!

Based on what I read in

http://wiki.alfresco.com/wiki/Full-Text_Search_Configuration
and
http://forums.alfresco.com/en/viewtopic.php?f=4&t=23735&p=77638&hilit=+TEXT+lucene#p77638

I changed the index settings in the contentModel.xml and ccdraModel.xml models to:

<index enabled="true">
   <atomic>false</atomic>
   <stored>true</stored>
   <tokenised>false</tokenised>
</index>

I did a full reindex.

However, problems persist.

And if I search for another string that exists in a document's metadata, such as:
{http://www.empresa.pt/model/content/1.0}assuntoDocEntrada Ass2

the search returns nothing:
Search Language:     lucene
Search:    PATH:"/app:company_home/cm:Empresa/cm:Expediente/cm://*" AND ( TEXT:*Ass2*)

Results (0 rows)
Parent Node Name

Note that before the model changes I mentioned, the original definition was:
<property name="cc:assuntoDocEntrada">
  <title>Assunto do documento de entrada</title>
  <type>d:text</type>
  <mandatory>true</mandatory>
  <index enabled="true">
    <atomic>false</atomic>
    <stored>false</stored>
    <tokenised>true</tokenised>

  </index>
</property>
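
One thing that may be worth testing here: as far as I understand the Search wiki page, the TEXT field covers properties of type d:content, so a metadata value like Ass2 might need a query on the property itself, using the @prefix\:name field syntax. Below is a small Python sketch that only builds such a query string. The helper names escape_qname and property_query are hypothetical, and I am assuming the cc prefix from the property definition above is the one registered for the model.

```python
def escape_qname(qname: str) -> str:
    """Backslash-escape the colon of a prefixed QName for use as a Lucene field."""
    return qname.replace(":", "\\:")


def property_query(path: str, qname: str, value: str) -> str:
    """Combine a PATH clause with a quoted match on a single property."""
    return 'PATH:"{}" AND @{}:"{}"'.format(path, escape_qname(qname), value)


# Path and property name are taken from the searches and model shown above.
query = property_query(
    "/app:company_home/cm:Empresa/cm:Expediente/cm://*",
    "cc:assuntoDocEntrada",
    "Ass2",
)
print(query)
```

If a query built this way returns the document while TEXT:*Ass2* does not, that would suggest the property is indexed and the difference lies in which field is being searched.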

Any idea?


Regards,


Ricardo Cardoso
2 REPLIES 2

ricardoc-moreda
Champ in-the-making
I have now set the following on the custom properties:
<index enabled="true">
  <atomic>true</atomic>
  <stored>false</stored>
  <tokenised>both</tokenised>
</index>

After a full reindex, Lucene searches now return three of the five documents, one more than before.

My version is 3.2.0 (2039) schema 2019.

Could the version be the issue?

ricardoc-moreda
Champ in-the-making
For example, I uploaded two test files with the same characteristics (size, PDF conversion engine, and the application that converted the document to PDF).
Both are in the same space.

I get the following:

Search: PATH:"/app:company_home/cm:Empresa/cm:EntradasPendentes/cm:Evora//*"

Results (2 rows)
Parent Node Name
actions-article.pdf workspace://SpacesStore/1e8a97a4-a7b7-4 … 08cd3b2fbc workspace://SpacesStore/d1822abb-4be2-4 … 602c2806f8
content-article.pdf workspace://SpacesStore/c724c44d-880b-4 … ca1b99dbe1 workspace://SpacesStore/d1822abb-4be2-4 … 602c2806f8

****
Search: PATH:"/app:company_home/cm:Empresa/cm:EntradasPendentes/cm:Evora//*" AND ( TEXT:*admin* )

Results (1 rows)
Parent Node Name
content-article.pdf workspace://SpacesStore/c724c44d-880b-4 … ca1b99dbe1 workspace://SpacesStore/d1822abb-4be2-4 … 602c2806f8

****
Search: PATH:"/app:company_home/cm:Empresa/cm:EntradasPendentes/cm:Evora//*" AND ( TEXT:admin )

Results (0 rows)
Parent Node Name

****

Note that both have the properties:
{http://www.alfresco.org/model/content/1.0}creator admin
{http://www.alfresco.org/model/content/1.0}modifier admin


In contentModel.xml:

<property name="cm:creator">
  <title>Creator</title>
  <type>d:text</type>
  <protected>true</protected>
  <mandatory enforced="true">true</mandatory>
  <index enabled="true">
    <atomic>true</atomic>
    <stored>false</stored>
    <tokenised>both</tokenised>
  </index>
</property>

<property name="cm:modifier">
  <title>Modifier</title>
  <type>d:text</type>
  <protected>true</protected>
  <mandatory enforced="true">true</mandatory>
  <index enabled="true">
    <atomic>true</atomic>
    <stored>false</stored>
    <tokenised>both</tokenised>
  </index>
</property>

As the examples above show, the Lucene search fails even with Alfresco's default properties, in this case metadata properties.

Does anyone have any idea what is wrong? Are there any settings I should review?