Hello,
The upcoming release of Alfresco uses a newer version of the PDFBox library (1.5 in my snapshot from community-trunk; the most recent version published by Apache is 1.6). Since the PDFBox enhancement was included in 1.3.1, I'd expect this problem to be resolved with that release.
I actually can't say how people other than us are handling this, as information sharing at this level of detail is rather limited. What we do for OCR solutions in portals or ECM systems is to have an OCR tool extract the text and use that text for indexing only. We keep the scanned original file as-is and do not update it in any way. This prevents these types of issues because it makes us independent of the way a proprietary, conversion-based OCR tool handles passages it can't process.
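To illustrate the idea of indexing the OCR text while leaving the scan untouched, here is a minimal sketch in Python. It is not our actual implementation and not Alfresco code: the names index_scanned_file, search and ocr_extract are hypothetical, the "index" is just an in-memory dict, and ocr_extract stands in for whatever OCR tool you call.

```python
import hashlib
from pathlib import Path
from typing import Callable, Dict, List


def index_scanned_file(
    path: Path,
    ocr_extract: Callable[[bytes], str],  # hypothetical wrapper around an OCR tool
    index: Dict[str, dict],
) -> str:
    """Store OCR text in a separate index; the original file is only read."""
    data = path.read_bytes()  # read-only access, the scan is never rewritten
    digest = hashlib.sha256(data).hexdigest()  # key that identifies the original
    index[digest] = {"source": str(path), "text": ocr_extract(data)}
    return digest


def search(index: Dict[str, dict], term: str) -> List[str]:
    """Return the paths of originals whose OCR text contains the term."""
    term = term.lower()
    return [e["source"] for e in index.values() if term in e["text"].lower()]
```

Because the OCR output only ever lands in the index, a passage the OCR tool can't process costs you at most a search hit; the stored document itself stays byte-for-byte what was scanned.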
I have no information on whether Kofax is affected by this pitfall or how it circumvents it. I also have no comprehensive information on what types of embedded objects other OCR tools use for image elements.
I'd advise waiting for the community release of Alfresco 4.0 in the coming weeks and verifying this then.
Regards