We have a little transformer that interacts with Tesseract OCR and is able to extract text from image PDFs. Although Tesseract does not do layout analysis, this approach is quite useful for generating indexable text from documents that would otherwise not be searchable at all.
The problem is that, due to the way Lucene works, whenever a document is updated, no matter what the update changes, the node is fully reindexed.
That might not be a big deal for a fast transformer, but OCR transformation can easily take over 10 seconds per page, so a script that changes 1000 documents can easily keep your server busy for over 3 hours. It also means that, because of the batching strategy Alfresco uses for index writing, you might not notice any "improvement" in your searches until the whole job is done. If you upload a "TXT/Office/PDF with text" file right after those 1000 OCR-requiring files have changed, you won't be able to search for the content of that last file until all the OCR finishes…
Does SOLR indexing change this in any way? I'm not sure, but I guess not, since I suppose the transformation process is always done on the Alfresco server in a similar way. Can someone confirm?
What I want to do to solve this is create an "ExternalizedTransformer". I intend to build a transformer that simply emits a placeholder text such as "ExternalizedTransformerPending".
Then a secondary process/API or whatever would search for those nodes and transform them to text. After processing each one, it would attach an aspect with a "bnv:extractedText" property holding the text.
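To make the idea concrete, here is a minimal sketch of what that secondary process would do, with the Alfresco query machinery and the Tesseract call stubbed out as plain maps and a dummy method; the property and placeholder names come from this post, everything else is hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the secondary OCR worker: find nodes still carrying the
// placeholder text, run OCR on them (stubbed here) and store the result
// in the bnv:extractedText property. In Alfresco this would be a scheduled
// job querying the index; the maps below are illustrative stand-ins.
public class ExternalOcrWorker {
    static final String PLACEHOLDER = "ExternalizedTransformerPending";
    static final String EXTRACTED_TEXT_PROP = "bnv:extractedText";

    // Stub standing in for the real Tesseract invocation.
    static String runOcr(String nodeId) {
        return "ocr-text-for-" + nodeId;
    }

    // indexedText maps nodeId -> last indexed text;
    // properties maps nodeId -> that node's property map.
    static void processPending(Map<String, String> indexedText,
                               Map<String, Map<String, String>> properties) {
        for (Map.Entry<String, String> e : indexedText.entrySet()) {
            if (PLACEHOLDER.equals(e.getValue())) {
                properties.computeIfAbsent(e.getKey(), k -> new HashMap<>())
                          .put(EXTRACTED_TEXT_PROP, runOcr(e.getKey()));
            }
        }
    }

    public static void main(String[] args) {
        Map<String, String> indexed = new HashMap<>();
        indexed.put("node-1", PLACEHOLDER);
        indexed.put("node-2", "already has real text");
        Map<String, Map<String, String>> props = new HashMap<>();
        processPending(indexed, props);
        System.out.println(props.get("node-1").get(EXTRACTED_TEXT_PROP)); // ocr-text-for-node-1
        System.out.println(props.containsKey("node-2"));                  // false
    }
}
```

The point of the loop is that only placeholder nodes pay the OCR cost; nodes that already have real text are skipped entirely.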
Here comes the tricky part:
After the update, the transformer will obviously fire again, but this time "ExternalizedTransformer" should detect that "bnv:extractedText" exists and use its value as the transformer's payload. From then on, whenever the node is updated, the transformer will recover its text payload from the property, which should be lightning fast.
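The decision logic of "the tricky part" boils down to a single cache check. The sketch below shows just that logic, with the node's properties modelled as a plain map instead of Alfresco's NodeService (which is exactly the access I don't know how to get from the transformer, see below); names other than the placeholder and the property come from this post:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the ExternalizedTransformer's core decision: if the node already
// carries bnv:extractedText, return it; otherwise emit the placeholder so
// the node gets indexed immediately and picked up by the external OCR job.
public class ExternalizedTransformerSketch {
    static final String PLACEHOLDER = "ExternalizedTransformerPending";
    static final String EXTRACTED_TEXT_PROP = "bnv:extractedText";

    // Returns the text payload the transformer would emit for a node.
    static String transform(Map<String, String> nodeProperties) {
        String cached = nodeProperties.get(EXTRACTED_TEXT_PROP);
        if (cached != null) {
            // Fast path: OCR was already done by the external process.
            return cached;
        }
        // First pass: placeholder text, so indexing finishes quickly.
        return PLACEHOLDER;
    }

    public static void main(String[] args) {
        Map<String, String> fresh = new HashMap<>();
        System.out.println(transform(fresh)); // ExternalizedTransformerPending

        Map<String, String> processed = new HashMap<>();
        processed.put(EXTRACTED_TEXT_PROP, "OCR result text");
        System.out.println(transform(processed)); // OCR result text
    }
}
```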
Obviously I will need to complement this with a behaviour attached to version updates that deletes the "bnv:extractedText" property whenever the content changes, but that doesn't seem like a big deal.
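The invalidation rule itself is trivial; in Alfresco it would live in a behaviour bound to a content-update policy, but stripped of the framework it is just this (again with the node's properties as a plain map for illustration):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the invalidation step: when the binary content changes, the
// cached bnv:extractedText is stale and must be dropped, so the next
// transform falls back to the placeholder and the external OCR runs again.
// The class and method names are hypothetical stand-ins for the behaviour.
public class ExtractedTextInvalidation {
    static final String EXTRACTED_TEXT_PROP = "bnv:extractedText";

    // Would be invoked by the behaviour whenever the node's content changes.
    static void onContentUpdate(Map<String, String> nodeProperties) {
        nodeProperties.remove(EXTRACTED_TEXT_PROP);
    }

    public static void main(String[] args) {
        Map<String, String> props = new HashMap<>();
        props.put(EXTRACTED_TEXT_PROP, "stale OCR text");
        onContentUpdate(props);
        System.out.println(props.containsKey(EXTRACTED_TEXT_PROP)); // false
    }
}
```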
The problem is: how do I get a reference to the node being transformed? Java transformers usually extend AbstractContentTransformer2, but those only give you access to a ContentReader, not the node itself. Having access to the node is essential for "the tricky part": if I can't detect the aspect, I can't use this approach.
Any suggestions or ideas?
Thank you