Determining the encoding of a text document

hbf
Champ on-the-rise
Hi,

For an AMP I am developing, I need a way to determine the encoding of a document from an input stream. I want to store the document in Alfresco and need to know both its mime-type (which I know how to determine) and its encoding.

In the Alfresco API I've found that the MimetypeService provides a way:

mimetypeService.getContentCharsetFinder().getCharset(streamSupportingMark, type)

In the code I see that a certain CharactersetFinder implementation (GuessEncodingCharsetFinder) is being run on the stream. In my case it "fails": I have an HTML document containing

<meta http-equiv="content-type" content="text/html; charset=iso-8859-1" />

but GuessEncodingCharsetFinder is not making use of this as it is a "last resort" encoding guesser that ignores meta-information present in the file (if I am not mistaken).

Is there a plan to add a CharactersetFinder that looks for a "charset=" in the meta-area and "guesses" from this?

Or am I on the wrong track using MimetypeService's getContentCharsetFinder()? I am not sure…

If getContentCharsetFinder() *is* the right approach and I write my own CharactersetFinder, how can I configure Alfresco to use it? (I don't want to change core-services-context.xml.) Of course, I'd contribute my finder…

Many thanks,
Kaspar
3 REPLIES

kevinr
Star Contributor
Yes, you will need to write your own charset finder for that. If you contribute it, we can add it to the core :)

You can override any bean using a *-custom.xml context file in your alfresco extension folder. Take a look at the Repository Configuration page off the Developer Guide link in my sig below.

Kevin

derek
Star Contributor
You don't have to change core-services-context.xml to change the bean. If you write your own finder, let's say org.alfresco.encoding.HtmlCharsetFinder, you can add the following bean to your custom-repository-context.xml:

    <bean id="charset.finder" class="org.alfresco.repo.content.encoding.ContentCharsetFinder">
      <property name="defaultCharset">
         <value>UTF-8</value>
      </property>
      <property name="mimetypeService">
         <ref bean="mimetypeService"/>
      </property>
      <property name="charactersetFinders">
         <list>
            <bean class="org.alfresco.encoding.GuessEncodingCharsetFinder" />
            <bean class="org.alfresco.encoding.HtmlCharsetFinder" />
         </list>
      </property>
    </bean>
From the Javadocs:

    /**
     * Worker method for implementations to override.  All exceptions will be reported and
     * absorbed and <tt>null</tt> returned.
     * <p>
     * The interface contract is that the data buffer must not be altered in any way.
     *
     * @param buffer            the buffer of data no bigger than the requested
     *                          {@linkplain #getBestBufferSize() best buffer size}.  This can,
     *                          very efficiently, be turned into an <tt>InputStream</tt> using a
     *                          <tt>ByteArrayInputStream</tt>.
     * @return                  Returns the charset or <tt>null</tt> if an accurate conclusion
     *                          is not possible
     * @throws Exception        Any exception, checked or not
     */
    protected abstract Charset detectCharsetImpl(byte[] buffer) throws Exception;
All finders will be attempted, in order, until one returns a non-null result. So any contributed finders would be appreciated and added to that list, provided that they don't produce poor guesses: a null guess is better than an incorrect guess. Your finder can also specialize in HTML and just return null for everything else.
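To make the contract concrete, here is a minimal sketch of the detection logic such an HTML finder might use. It is deliberately standalone so it runs without Alfresco on the classpath: the class name HtmlMetaCharsetSniffer and the regular expression are assumptions for illustration, not part of Alfresco. In a real finder the body of detectCharset would presumably go into the detectCharsetImpl(byte[] buffer) override quoted above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Hypothetical helper (not an Alfresco class): looks for a charset
 * declaration inside an HTML meta tag and returns null when unsure.
 */
final class HtmlMetaCharsetSniffer {

    // Matches charset=<name> inside a <meta ...> tag, with optional quotes.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta\\b[^>]*?\\bcharset\\s*=\\s*[\"']?\\s*([A-Za-z0-9._:\\-]+)",
            Pattern.CASE_INSENSITIVE);

    private HtmlMetaCharsetSniffer() {
    }

    /** Returns the declared charset, or null if none can be determined. */
    static Charset detectCharset(byte[] buffer) {
        // ISO-8859-1 maps every byte to exactly one char, so decoding never
        // fails and ASCII markup stays readable. The buffer is only read.
        String head = new String(buffer, StandardCharsets.ISO_8859_1);
        Matcher matcher = META_CHARSET.matcher(head);
        if (!matcher.find()) {
            return null; // not HTML, or no declaration: leave it to other finders
        }
        try {
            return Charset.forName(matcher.group(1));
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return null;
        }
    }
}
```

Because finders are tried in order until one returns non-null, a finder like this would probably need to sit ahead of any finder that nearly always produces a guess; otherwise it may never be consulted.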

Regards

derek
Star Contributor
Another thing: The default buffer size given to your decoder is 8K.  To detect HTML encoding, you don't need that much, so you can configure your detector bean with much less.
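To illustrate why a small buffer is enough, here is a sketch of sampling only the first bytes of a mark-supporting stream and then rewinding it, so the full content can still be read afterwards. The class StreamPeek and its peek method are hypothetical names for this example and say nothing about how Alfresco fills the buffer internally.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

/** Hypothetical helper (not an Alfresco class) for sampling a stream prefix. */
final class StreamPeek {

    private StreamPeek() {
    }

    /** Reads up to maxBytes from the stream, then resets it to where it was. */
    static byte[] peek(InputStream in, int maxBytes) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(maxBytes);
        try {
            byte[] buffer = new byte[maxBytes];
            int total = 0;
            // A single read may return fewer bytes than requested, so loop.
            while (total < maxBytes) {
                int n = in.read(buffer, total, maxBytes - total);
                if (n < 0) {
                    break; // end of stream reached before maxBytes
                }
                total += n;
            }
            return Arrays.copyOf(buffer, total);
        } finally {
            in.reset(); // rewind so the caller can still read everything
        }
    }
}
```

For HTML, the meta declaration normally sits near the top of the document, so a prefix of around 1K is likely sufficient in most cases.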