I'm relatively new to Alfresco and have recently set up an environment where scanned bitmaps are transformed to text/plain using Tesseract OCR. This works brilliantly for single-page documents scanned into PNG, JPEG, TIFF, etc.
For multi-page documents, my scanner creates a PDF. However, the standard PDF-to-text/plain transformer (PDFBox) doesn't do OCR. I'd always want documents in this "scanner" space to be run through OCR, probably via a custom transformer, which I have no trouble coding. However, I don't want to remove or override the PDFBox transformer for the majority of PDFs, which already contain extractable text.
So what's the best solution here? I'm thinking I could extend the PDFBox transformer to include OCR and merge the results, but this seems a bit messy. Is there a way to chain multiple transformers together for a given MIME type? Or is there a way to select a specific transformer based on the space?
Or should I create a rule that runs on PDFs in this space, OCRs them, and places the text in a dedicated property that's indexed for search? Or perhaps there's something else I'm completely oblivious to.
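To make the first option (extending the PDFBox transformer) a bit more concrete, here's a rough sketch of the decision logic I have in mind, written in plain Java with no Alfresco, PDFBox, or Tesseract calls. Everything in it is hypothetical: the class name OcrFallback, the two extractor functions (stand-ins for the PDFBox text extraction and my OCR transformer), and the minVisibleChars threshold are my own placeholders, not anything from the Alfresco API. It falls back to OCR rather than merging both results, which seemed simpler.

```java
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

public class OcrFallback {

    // Counts non-whitespace characters, so a text layer that contains only
    // blank lines or page breaks is not mistaken for real content.
    public static long countVisibleChars(String text) {
        if (text == null) {
            return 0;
        }
        return text.chars().filter(c -> !Character.isWhitespace(c)).count();
    }

    // Tries the embedded text layer first and only falls back to OCR when the
    // layer yields fewer than minVisibleChars visible characters.
    // Both extractors are hypothetical stand-ins for the real transformers.
    public static String extractText(byte[] pdf,
                                     Function<byte[], String> textLayerExtractor,
                                     Function<byte[], String> ocrExtractor,
                                     int minVisibleChars) {
        String embedded = textLayerExtractor.apply(pdf);
        if (countVisibleChars(embedded) >= minVisibleChars) {
            return embedded;
        }
        return ocrExtractor.apply(pdf);
    }

    public static void main(String[] args) {
        // Stub extractors for demonstration only: the "text layer" is just the
        // raw bytes decoded as UTF-8, and the "OCR" returns a fixed string.
        Function<byte[], String> fakeTextLayer =
                bytes -> new String(bytes, StandardCharsets.UTF_8);
        Function<byte[], String> fakeOcr = bytes -> "text recovered by OCR";

        byte[] withText = "Invoice 123".getBytes(StandardCharsets.UTF_8);
        byte[] scannedOnly = " \n\n ".getBytes(StandardCharsets.UTF_8);

        System.out.println(extractText(withText, fakeTextLayer, fakeOcr, 5));
        System.out.println(extractText(scannedOnly, fakeTextLayer, fakeOcr, 5));
    }
}
```

The upside of this shape is that ordinary PDFs never pay the OCR cost. The downside is that it doesn't handle mixed PDFs (some pages with text, some scanned), which is where a per-page check or a real merge would be needed.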
Suggestions?
Thanks!
-Patrick