Hyland Connect

abruzzi · 09-04-2014

A number of years ago (v3.1.2) I integrated our OCR server (Abbyy Recognition Server) with Alfresco. The integration was a quick hack that I wrote. Basically it used a PHP script to make the SOAP call to the OCR server, and the PHP was called by a RuntimeExec bean.

Now I'm trying to significantly beef up the integration, so I'm starting by creating my own class to do the SOAP call through Java. I'm new to Java, so I'm going slowly. I'm using the javax.xml.soap saaj-api to build and process the SOAP call. The actual file is sent base64 encoded in the XML of the SOAP request. I've used the following code to get the file content from the ContentReader, encode it, and place it into the SOAP message:


        String fileStr = reader.getContentString();
        byte[] fileBin = fileStr.getBytes("US-ASCII");

        String fileB64 = DatatypeConverter.printBase64Binary(fileBin);

        soapBodyFileContents.addTextNode(fileB64);

With the new Java code, the server complains that it is getting an invalid file. So, using Wireshark (a packet sniffer), I captured the conversation between my code and the OCR server, and also the one from the old (working) PHP code. Since both captures process the exact same file, the base64 encoding should be identical. Instead, the two are very similar but not the same. I've tried different encodings in the .getBytes() method; the base64 changes, but it is never identical to the PHP version and it never works:

Java Output:


<file>
   <FileContents>JVBERi0xLjQKJT8/Pz8NCjEgMCBvYmoKPDwgCi9UeXB…</FileContents>
</file>

PHP Output:


<file>
   <FileContents>JVBERi0xLjQKJeLjz9MNCjEgMCBvYmoKPDwgCi9UeXB…</FileContents>
</file>

You can see they are very similar, but not identical. My main question is: is there some mistake in how I'm getting the content of the file out of the ContentReader (reader) and processing it that might be causing my problem? Like I said, I'm pretty weak at Java.

mrogers · 09-04-2014

The difference is down to content encoding. You are converting between character encodings too many times. fileStr will be a Java String, so the binary content has already been decoded as text. You then convert that to US-ASCII bytes (which cannot represent every character), and then you base64-encode that ASCII stream.
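Decoding the two captured prefixes supports this: the PHP capture carries the bytes E2 E3 CF D3 after the second '%', while the Java capture carries four '?' (3F) bytes in the same position. The sketch below reproduces both prefixes from the same 15 header bytes. It assumes Java 8+ for java.util.Base64, and it assumes the content is decoded as UTF-8 on the way into the String; the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class EncodingRoundTrip {

    // Lossy path, mirroring the question: bytes -> String -> US-ASCII bytes -> Base64.
    public static String viaString(byte[] raw) {
        // Assumption: the text decoding step uses UTF-8. Invalid byte sequences
        // become U+FFFD, which US-ASCII then cannot represent and turns into '?'.
        String text = new String(raw, StandardCharsets.UTF_8);
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        return Base64.getEncoder().encodeToString(ascii);
    }

    // Lossless path: Base64 straight from the raw bytes, no text decoding at all.
    public static String direct(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    public static void main(String[] args) {
        // "%PDF-1.4\n%" followed by four binary marker bytes and "\r"
        byte[] header = {
            '%', 'P', 'D', 'F', '-', '1', '.', '4', '\n', '%',
            (byte) 0xE2, (byte) 0xE3, (byte) 0xCF, (byte) 0xD3, '\r'
        };
        System.out.println(viaString(header)); // JVBERi0xLjQKJT8/Pz8N (matches the Java capture)
        System.out.println(direct(header));    // JVBERi0xLjQKJeLjz9MN (matches the PHP capture)
    }
}
```

Any binary file pushed through a String is at risk of the same damage, whichever charset is passed to getBytes().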

But why use SOAP at all? Use CMIS or web scripts instead.

abruzzi · 09-08-2014

SOAP is required by the OCR server; short of writing to the Windows COM API, it is the only way to programmatically process a file. All the methods for generating base64 encoding seem to need me to step through a byte array. When I run the .getContentString() method, what encoding is Alfresco using?

kaynezhang · 09-08-2014

When content is uploaded into Alfresco, Alfresco tries to guess its encoding (the default encoding is UTF-8). That encoding is then used by the getContentString() method.
You can call getContentInputStream() instead of getContentString(), read the stream into a byte array, and then encode the byte array with a base64 encoder.


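            // Read the raw bytes from the stream; no character decoding is involved
            InputStream is = reader.getContentInputStream();
            ByteArrayOutputStream os = new ByteArrayOutputStream();
            // FileCopyUtils.copy closes both streams when it is done
            org.springframework.util.FileCopyUtils.copy(is, os);
            byte[] bytes = os.toByteArray();
            // Base64-encode the raw bytes (same encoder call as in the question)
            String fileB64 = DatatypeConverter.printBase64Binary(bytes);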

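For comparison, here is a minimal JDK-only sketch of the same idea that needs no Spring helper. It assumes Java 8+ for java.util.Base64; the class and method names are hypothetical. In Alfresco the InputStream would presumably be the one returned by reader.getContentInputStream(); a ByteArrayInputStream stands in for it here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class StreamBase64 {

    // Base64-encodes everything readable from 'in' without ever treating it as text.
    public static String encode(InputStream in) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (InputStream src = in;
             OutputStream b64 = Base64.getEncoder().wrap(sink)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = src.read(buf)) != -1) {
                b64.write(buf, 0, n);
            }
        } // closing the wrapper writes the final Base64 padding, if any
        // Base64 output is pure ASCII, so this conversion is lossless
        return new String(sink.toByteArray(), StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the content stream: the first bytes of a PDF
        byte[] sample = {
            '%', 'P', 'D', 'F', '-', '1', '.', '4', '\n', '%',
            (byte) 0xE2, (byte) 0xE3, (byte) 0xCF, (byte) 0xD3, '\r'
        };
        System.out.println(encode(new ByteArrayInputStream(sample))); // JVBERi0xLjQKJeLjz9MN
    }
}
```

The resulting string can then be passed to addTextNode() as in the original code.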
Using ContentReader for transformation