Text is extracted from MS Office documents using an OpenOffice server, which successfully handles Word, PowerPoint and Excel files. PDFBox is used to extract text from PDF documents. For HTML documents, text is extracted using the built-in HTML-to-text support in the Java Swing library.
So the "quality" of extraction depends directly on the quality of those third-party libraries and services.
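
For the HTML case, here is a minimal sketch of what HTML-to-text extraction with Swing can look like. It assumes the standard javax.swing.text.html parser classes (ParserDelegator with an HTMLEditorKit.ParserCallback); the exact Swing classes used by the product are not stated above, and the class name HtmlTextSketch and the helper extractText are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class HtmlTextSketch {

    // Hypothetical helper: collects the text nodes reported by Swing's HTML parser.
    // This is an assumption about how such extraction could be done, not the
    // product's actual code.
    static String extractText(String html) throws IOException {
        final StringBuilder text = new StringBuilder();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Separate consecutive text runs with a single space.
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(data);
            }
        };

        Reader reader = new StringReader(html);
        // The third argument tells the parser to ignore any charset declared in the HTML.
        new ParserDelegator().parse(reader, callback, true);
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>";
        System.out.println(extractText(html));
    }
}
```

Because the parser only reports text runs, tags and attributes are dropped. How well malformed markup is handled is up to the Swing parser, which is the dependency on third-party quality mentioned above.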
Thanks,
Kevin