Content search in Alfresco

ss26
Champ in-the-making

Hi,

I need to support content search of INDD files in Alfresco. By content search I mean this: I have an INDD file containing images with text superimposed on them, and any user should be able to find the file by searching for the text that appears in those images. Is there a feature in Alfresco that can meet this requirement?

Thanks,

S

1 ACCEPTED ANSWER

kaynezhang
World-Class Innovator

In my opinion:
1. A better solution is to use a library to combine your original PDF with the extracted OCR text into a searchable PDF, and save that PDF into the repository. It can then be searched directly.
2. If you store the extracted text file and the original PDF separately, I think you can reimplement the webscript /api/solr/textContent, which is used to get the content of a node property as text during indexing. In your implementation, return your extracted text file directly for this kind of PDF document.
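As a rough illustration of the first option, here is a minimal Python sketch that assembles a command line for OCRmyPDF, a tool that adds a hidden text layer to an image-only PDF. The thread does not say which OCR tool or library was used, so OCRmyPDF, the function names, and the file names are all assumptions made for this example.

```python
import shutil
import subprocess
from pathlib import Path


def build_ocr_command(src_pdf, dst_pdf, sidecar_txt=None, language="eng"):
    """Return the argv list for an OCRmyPDF run; nothing is executed here."""
    cmd = ["ocrmypdf", "--language", language]
    if sidecar_txt is not None:
        # --sidecar also writes the recognised text to a plain-text file.
        cmd += ["--sidecar", str(sidecar_txt)]
    cmd += [str(src_pdf), str(dst_pdf)]
    return cmd


def make_searchable(src_pdf, dst_pdf, sidecar_txt=None, language="eng"):
    """Run OCRmyPDF and return the path of the searchable PDF."""
    # Assumption: the ocrmypdf executable is installed and on PATH.
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf executable not found on PATH")
    cmd = build_ocr_command(src_pdf, dst_pdf, sidecar_txt, language)
    subprocess.run(cmd, check=True)
    return Path(dst_pdf)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(" ".join(build_ocr_command("layout.pdf", "layout-searchable.pdf")))
```

The resulting PDF could then be uploaded to the repository in place of the image-only one, so that the standard PDF text extraction has something to index.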

8 REPLIES

kaynezhang
World-Class Innovator

Why not export your INDD file to PDF from Adobe InDesign and save the PDF to Alfresco?
Or you can create both formats and save them in Alfresco; the PDF can be saved as a rendition of the INDD file.

Then integrate an OCR converter to OCR the PDF.

ss26
Champ in-the-making

Thanks for your input.

Because of certain constraints, we can't use Adobe InDesign to convert INDD to PDF. Instead, we used exiftool and ImageMagick to convert the INDD to PDF. But this PDF contains the text as an image, so it's not searchable.

So I'm using OCR to extract the text from the PDF. Now I need to add this text to the file's metadata to make it searchable. Can you please advise on how to do that?

ss26
Champ in-the-making

Thank you, kaynezhang!

I was able to extract the text and add it to the metadata using the webscript.
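For readers who want to attach OCR text to a node's metadata without writing a custom webscript, one alternative is the public Alfresco REST API (v1), which updates node properties with a PUT on the node. This is not the approach described in the thread. The sketch below only builds the request; the property name `acme:ocrText`, the host, the node ID, and the credentials are hypothetical, and the property would have to exist in a deployed content model and be configured as indexed.

```python
import base64
import json
import urllib.request

NODES_PATH = "/alfresco/api/-default-/public/alfresco/versions/1/nodes/"


def build_update_request(base_url, node_id, ocr_text, user, password,
                         prop="acme:ocrText"):
    """Build (but do not send) a PUT request that sets one node property."""
    # "acme:ocrText" is a hypothetical custom property, not a stock one.
    url = base_url.rstrip("/") + NODES_PATH + node_id
    body = json.dumps({"properties": {prop: ocr_text}}).encode("utf-8")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )


if __name__ == "__main__":
    # Hypothetical host, node ID, and credentials, for illustration only.
    req = build_update_request("http://localhost:8080", "some-node-id",
                               "text recognised by OCR", "admin", "admin")
    print(req.get_method(), req.full_url)
    # To actually send it: urllib.request.urlopen(req)
```

Keeping request construction separate from sending makes it easy to inspect the URL and JSON body before pointing the script at a real repository.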