
Provide additional index text for a file

eXtreme
Champ on-the-rise

Hello,

I am rather new to Alfresco and I have an interesting problem regarding document indexing.

I already have in place a system to scan documents and extract the content from them.

What I would want is to save the document in Alfresco and provide the already extracted text, so that this text is what gets indexed.

Is there any way to tell Alfresco or Solr to use the text provided by me for the index?

Thank you!

1 ACCEPTED ANSWER

sufo
Star Contributor

Short answer to your question #3: if you want to implement your own transformer (T-Engine), this is the place to start reading: https://docs.alfresco.com/transform/concepts/extend-transforms.html

Some more thoughts:
If your scanning system can produce PDF documents with the recognized text embedded (many scanning and OCR solutions do this, as text over image or image over text), it is enough to upload these PDFs to Alfresco and it will provide the text content to Solr for indexing. Alfresco uses Apache Tika for text extraction from various formats, and there are also various other transformers. You can see which transformations are supported in your repository at this URL: /alfresco/service/mimetypes?mimetype=application/pdf#application/pdf (look for "Transformable to" and text/plain in the details for your chosen mimetype).
Solr requests the text content for full-text indexing via this repository URL: /alfresco/s/api/solr/textContent?nodeId=xxx (more info on nodeId here: nodeId <-> noderef), and the transformation to text/plain then happens inside Alfresco.
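To make the two endpoints above concrete, here is a small sketch that just builds those URLs. The host and port are assumptions for a default local install, and the sample nodeId is made up; adjust both for your setup:

```java
// Illustration only: the two repository URLs mentioned above, built as strings.
// BASE is an assumption (default local install), not a fixed Alfresco value.
public class AlfrescoEndpoints {
    static final String BASE = "http://localhost:8080";

    // URL Solr calls on the repository to fetch the plain-text content of a node
    static String textContentUrl(long nodeId) {
        return BASE + "/alfresco/s/api/solr/textContent?nodeId=" + nodeId;
    }

    // URL to inspect which transformations the repository supports for a mimetype
    static String mimetypeInfoUrl(String mimetype) {
        return BASE + "/alfresco/service/mimetypes?mimetype=" + mimetype;
    }
}
```

You can open the mimetypes URL in a browser to inspect the supported transformations for any format you upload.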


6 REPLIES

cristinamr
World-Class Innovator

Hi and welcome!

Being honest, I think you are mixing concepts.

On one side you have your document, with its data, in Alfresco. On the other side you have an index created in Solr to return that file in your UI. So the document data is managed on the Alfresco side, not in Solr.

What I would want is to save the document in Alfresco and provide the already extracted text so that it can be indexed by that

You will need to save the data in a custom property, for example. That way the information becomes available to the Alfresco system. To reach that point you will need:

  1. Create your own custom content model (with a property to store the information)
  2. A Java-backed webscript that loads the information from your system and saves it into the document
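As an illustration of step 1, a minimal content model could look like this (the "scan" namespace, prefix, type and property names here are made-up placeholders, not part of any standard Alfresco model):

```xml
<model name="scan:scanModel" xmlns="http://www.alfresco.org/model/dictionary/1.0">
   <imports>
      <import uri="http://www.alfresco.org/model/dictionary/1.0" prefix="d"/>
      <import uri="http://www.alfresco.org/model/content/1.0" prefix="cm"/>
   </imports>
   <namespaces>
      <namespace uri="http://example.com/model/scan/1.0" prefix="scan"/>
   </namespaces>
   <types>
      <type name="scan:scannedDocument">
         <parent>cm:content</parent>
         <properties>
            <!-- property to hold the text your scanning system extracted -->
            <property name="scan:extractedText">
               <type>d:text</type>
               <index enabled="true">
                  <tokenised>true</tokenised>
               </index>
            </property>
         </properties>
      </type>
   </types>
</model>
```

The Java-backed webscript from step 2 would then set the scan:extractedText property on the node after your system has processed the scan.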

I hope this can be helpful.

Cheers.

--
VenziaIT: helping companies since 2005! Our ECM products: AQuA & Seidoc

Hello and thank you for your response!

From what I understand you are suggesting:

  •  create a custom content model, with a property that contains the full document text, and set it accordingly
  •  in the context of Alfresco Share, create an advanced search field that can search in the new property
  •  when searching keywords, the document should appear

1. Are these new properties sent to Solr for indexing by default, or are there extra steps to be done?

The problem that I can see with this approach is that when searching for keywords, I can find the document, but the actual keywords would not be highlighted (when searching in Alfresco Share).

What I am hoping to achieve is to tap into the actual document data extraction mechanism of Alfresco and provide the data myself.

2. One thing that I am not sure about is who does the data extraction: Alfresco or Solr?

  • does Alfresco extract the data from a document and then send that data to Solr for indexing?
  • or does Alfresco send the entire document (with or without properties) to Solr, and Solr does all the data extraction and indexing work (data extraction using Apache Tika)?
  • from what I can gather, both are capable of doing data extraction, and that confuses me :)

3. Can I tap into that mechanism, with code or some API, to actually do some of the work myself?

My end goal here is to provide an intuitive user experience around the search functionality, both in discoverability and user friendliness (by showing the searched text highlighted).


heiko_robert
Star Collaborator

Is there really a need to store the doc and the extracted text independently?

If not, you should go with PDFs that already contain the text, as suggested above. This should be the preferred way to go. You could store the original doc inside the PDF container to keep the original format for compliance reasons, and there are nice tools which support that conversion.

If you really need to keep the original doc as it is stored in Alfresco (e.g. TIFFs), the best way to go would be to implement a specific Alfresco transformer / T-Engine which is clever enough to read and store the full text using its own mechanism. When Solr requests the text from Alfresco to create the index entry, the transformer should know how to retrieve it. I would avoid storing the text as a separate node property; that would blow up either your database or your contentstore. We do the same with our SmartTransformer to OCR docs such as specific images stored in Alfresco. The SmartTransformer has its own storage/cache for the document text extraction. If you do a reindex, the transformer just reads the text from its own cache, avoiding an additional transformation for a content URL which has already been transformed.
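The caching idea described above can be sketched roughly like this. This is a simplified, hypothetical cache keyed by content URL (the real SmartTransformer and the T-Engine wiring are of course more involved); it only shows why a reindex does not trigger a second extraction:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Sketch: a transformer-side text cache keyed by content URL, so that a
// reindex reuses already-extracted text instead of running OCR/extraction again.
public class CachingTextExtractor {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Function<String, String> extractor; // the expensive OCR/extraction step

    public CachingTextExtractor(Function<String, String> extractor) {
        this.extractor = extractor;
    }

    // Returns cached text if this content URL was seen before; extracts otherwise.
    public String textFor(String contentUrl) {
        return cache.computeIfAbsent(contentUrl, extractor);
    }
}
```

On the first request for a content URL the extractor runs; every later request (e.g. during a reindex) is served from the cache.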