02-16-2021 06:44 AM
Hello,
I am rather new to Alfresco and I have an interesting problem regarding document indexing.
I already have a system in place that scans documents and extracts their content.
What I would like is to save the document in Alfresco and supply the already extracted text so that it is used for the index.
Is there any way to tell Alfresco or Solr to use the text provided by me for the index?
Thank you!
02-16-2021 07:09 AM
Hi and welcome!
To be honest, I think you are mixing two concepts.
On one side you have your document in Alfresco with its data. On the other side you have an index created in Solr, which is used to return that file in your UI. So the document data is managed on the Alfresco side, not in Solr.
What I would like is to save the document in Alfresco and supply the already extracted text so that it is used for the index.
You will need to save the data in a custom property, for example. That makes the information available to the Alfresco system. To reach that point you will need:
I hope this can be helpful.
Cheers.
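To make the custom-property idea a bit more concrete, here is a minimal sketch in Python. It only builds the path and JSON body for a node update through the public REST API (v1); it does not send anything. The property name my:extractedText and the node id are hypothetical, and the property would first have to be defined in a custom content model deployed to your repository.

```python
import json
from typing import Tuple

# Hypothetical property name (assumption): it must be declared in a custom
# content model in your repository before an update like this can succeed.
EXTRACTED_TEXT_PROP = "my:extractedText"

# Path template of the node-update endpoint in the public REST API (v1).
NODE_UPDATE_PATH = "/alfresco/api/-default-/public/alfresco/versions/1/nodes/{node_id}"


def build_property_update(node_id: str, extracted_text: str) -> Tuple[str, str]:
    """Return (path, JSON body) for a PUT request that sets the custom property."""
    path = NODE_UPDATE_PATH.format(node_id=node_id)
    body = json.dumps({"properties": {EXTRACTED_TEXT_PROP: extracted_text}})
    return path, body


if __name__ == "__main__":
    # "hypothetical-node-id" is a placeholder, not a real node.
    path, body = build_property_update("hypothetical-node-id", "text from the scanner")
    print("PUT", path)
    print(body)
```

Sending the request (with authentication) is left out on purpose, since that part depends on your installation.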
02-17-2021 01:41 AM
Hello and thank you for your response!
From what I understand you are suggesting:
1. Are these new properties sent to Solr for indexing by default, or are there extra steps to be done?
The problem I can see with this approach is that when searching for keywords I can find the document, but the actual keywords would not be highlighted (searching in Alfresco Share).
What I am hoping to achieve is to tap into Alfresco's actual document data extraction mechanism and supply the data myself.
2. One thing I am not sure about is who does the data extraction: Alfresco or Solr?
3. Can I tap into that mechanism, with code or some API, to do some of the work myself?
My end goal here is to provide an intuitive search experience, in terms of both discoverability and user friendliness (by showing the searched text highlighted).
02-17-2021 02:38 PM
Short answer to your question #3: if you want to implement your own transformer (T-Engine), this is the place to start reading: https://docs.alfresco.com/transform/concepts/extend-transforms.html
Some more thoughts:
If your scanning system can output PDF documents with the recognized text embedded (many scanning and OCR solutions do this, either text over image or image over text), it is enough to upload those PDFs to Alfresco, which will then provide the text content to Solr for indexing. Alfresco uses Apache Tika for text extraction from various formats, and there are various other transformers as well. You can see which transformations your repository supports at this URL: /alfresco/service/mimetypes?mimetype=application/pdf#application/pdf (look for "Transformable to" and text/plain in the details for your chosen mimetype).
Solr requests the text content for full-text indexing via this repository URL: /alfresco/s/api/solr/textContent?nodeId=xxx (more info on nodeId here: nodeId <-> noderef), and the transformation to text/plain then happens in Alfresco.
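If you want to poke at these two URLs from a script, a small helper along these lines may be handy. It is only a sketch that assembles the URLs mentioned above; the base URL is a placeholder, the nodeId is assumed to be the numeric node id rather than the NodeRef, and access to the textContent URL may well be restricted in your installation.

```python
from urllib.parse import urlencode


def mimetypes_url(base_url: str, mimetype: str) -> str:
    """URL of the repository page listing supported transformations for a mimetype."""
    # safe="/" keeps the slash in "application/pdf" readable in the query string.
    query = urlencode({"mimetype": mimetype}, safe="/")
    return f"{base_url.rstrip('/')}/alfresco/service/mimetypes?{query}#{mimetype}"


def text_content_url(base_url: str, node_id: int) -> str:
    """URL Solr uses to request the text content of a node for full-text indexing."""
    # Assumption: node_id is the numeric node id, not a NodeRef string.
    query = urlencode({"nodeId": node_id})
    return f"{base_url.rstrip('/')}/alfresco/s/api/solr/textContent?{query}"


if __name__ == "__main__":
    base = "http://localhost:8080"  # placeholder, adjust to your repository
    print(mimetypes_url(base, "application/pdf"))
    print(text_content_url(base, 42))
```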
02-18-2021 08:33 AM
Is there really a need to store the doc and the extracted text independently?
If not, you should go with PDFs that already contain the text, as already suggested. This should be the preferred way to go. You could store the original doc inside the PDF container to keep the original format for compliance reasons, and there are nice tools which support that conversion.
If you really need to keep the original doc as it is stored in Alfresco (e.g. TIFs), the best way to go would be to implement a specific Alfresco transformer / T-Engine which is clever enough to read and store the full text using its own mechanism. When Solr requests the text from Alfresco to create the index entry, the transformer should know how to retrieve the text. I would avoid storing the text as a separate node property. That would bloat either your database or your contentstore. We do the same with our SmartTransformer to OCR docs like specific images stored in Alfresco. The SmartTransformer has its own storage/cache for the document text extraction. If you do a reindex, the transformer just reads the text from its own cache to avoid any additional transformation for a content URL which has already been transformed.
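The caching idea can be sketched roughly like this in Python. This is not how the SmartTransformer is implemented, just an illustration of the principle: text is looked up by content URL, and the expensive extraction only runs on a cache miss. The class name, the in-memory dict (a real transformer would need persistent storage) and the fake OCR function are all assumptions made for the example.

```python
from typing import Callable, Dict


class CachingTextExtractor:
    """Keeps extracted text per content URL so repeated requests skip extraction."""

    def __init__(self, extract: Callable[[str], str]) -> None:
        self._extract = extract            # expensive step, e.g. OCR (hypothetical)
        self._cache: Dict[str, str] = {}   # content URL -> extracted text

    def store(self, content_url: str, text: str) -> None:
        """Register text that was already extracted elsewhere, e.g. by a scanner."""
        self._cache[content_url] = text

    def get_text(self, content_url: str) -> str:
        """Return cached text, running the extraction only on a cache miss."""
        if content_url not in self._cache:
            self._cache[content_url] = self._extract(content_url)
        return self._cache[content_url]


if __name__ == "__main__":
    def fake_ocr(content_url: str) -> str:
        # Stand-in for a real OCR call; the returned text is made up.
        print("running OCR for", content_url)
        return "recognized text of " + content_url

    extractor = CachingTextExtractor(fake_ocr)
    extractor.store("store://scan-1.tif", "text supplied by the scanning system")
    print(extractor.get_text("store://scan-1.tif"))  # served from the cache
    print(extractor.get_text("store://scan-2.tif"))  # triggers fake_ocr once
    print(extractor.get_text("store://scan-2.tif"))  # served from the cache
```

The store method is the interesting part for the original question: text that the scanning system already produced could be registered up front, so a later reindex would never need to transform the document again.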