Hyland Connect

bfranke · 10-23-2007

Apologies if this has been posted elsewhere, but I did my best to search.
The virtual Tomcat server in Alfresco WCM 2.1 seems to be forcing content to UTF-8, even if it was checked in with a different encoding.
I checked in an HTML file as SHIFT-JIS, but previewing it through the virtual Tomcat garbles the output.

To reproduce, from the content details screen:

View in Browser - GOOD; displays file exactly as checked in
http://localhost:8080/alfresco/d/d/avm/mysite-live/-1;www;avm_webapps;ROOT;corp;tos_shiftjis.html/to...

Preview File - NO GOOD; seems to add UTF-8 BOM which garbles file
http://mysite-live.www--sandbox.172-100-100-100.ip.alfrescodemo.net:8180/corp/tos_shiftjis.html

Note that the latter goes through 8180, the virtual Tomcat run by Alfresco.
Hopefully there is some configuration we are missing that would avoid this forced UTF-8 encoding?

jcox · 10-31-2007

The virtualization server is really just Tomcat with some extra
stuff behind the scenes to deal with virtualization. All of the
rules governing how it interacts with character sets are exactly
the same as Tomcat 5.5.25 (i.e.: Servlet/JSP 2.4/2.0) because
it's the exact same codebase.

First, it might be worthwhile to inspect your files with a low-level
tool (e.g.: 'dd' or 'vim -b') to see if the HTML files in question
actually do contain a BOM or not.
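If you'd rather do that check from Java, here is a minimal sketch that reads the first few bytes of a file without decoding them and reports whether they are the UTF-8 BOM (EF BB BF). The class name BomCheck and its methods are made up for illustration; the file path comes from the command line.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

final class BomCheck {
    /** True if the given bytes begin with the UTF-8 BOM: EF BB BF. */
    static boolean startsWithUtf8Bom(byte[] head) {
        return head.length >= 3
            && (head[0] & 0xFF) == 0xEF
            && (head[1] & 0xFF) == 0xBB
            && (head[2] & 0xFF) == 0xBF;
    }

    /** Reads up to the first three raw bytes of a file (no charset decoding). */
    static byte[] readHead(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[3];
            int n = 0;
            while (n < buf.length) {
                int r = in.read(buf, n, buf.length - n);
                if (r < 0) break;          // file shorter than three bytes
                n += r;
            }
            return Arrays.copyOf(buf, n);
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get(args[0]);
        boolean bom = startsWithUtf8Bom(readHead(file));
        System.out.println(file + ": " + (bom ? "UTF-8 BOM present" : "no UTF-8 BOM"));
    }
}
```

Run it against the file as stored in the repository and against a saved copy of what the preview URL returns; if only the second one reports a BOM, something on the serving side is adding it.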

Here's some information on character sets in the context
of a servlet/JSP. Note: your examples are HTML pages,
but because they're being delivered by a servlet container,
the servlet spec's rules on charset specification/conversion
still apply (as do those of HTML):

http://java.sun.com/developer/technicalArticles/Intl/HTTPCharset/
http://java.sun.com/developer/technicalArticles/Intl/MultilingualJSP/
http://www.w3.org/TR/REC-html40/charset.html

Going beyond this question a bit is the issue of how servlet containers
deal with form-related I18N issues. When a browser does a POST,
it should send a Content-Type header that looks like this:

Content-type: application/x-www-form-urlencoded; charset=YOUR-CHARSET

However, early versions of Microsoft Internet Explorer (i.e.: IE)
failed to include the ';' between the application type and the
charset specifier. As a result, many websites came to handle
the correct header badly. To deal with that, both IE and Firefox
send back form data encoded using whatever encoding the page was
supplied with (Mozilla attempted to include the proper header, but
there were so many compatibility issues, they were forced to yank it).
Therefore, if you ever do end up dealing with I18N issues in the
context of forms, my advice to you is to set page charsets everywhere
(both HTTP headers and HTML metadata).
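To see why the page charset matters for forms, here is a small sketch using only java.net.URLEncoder and URLDecoder. It encodes the same text the way a browser would for a UTF-8 page and for a Shift_JIS page, then decodes the Shift_JIS form data with the wrong charset. The class name FormEncodingDemo is hypothetical, and it assumes your JRE ships the Shift_JIS charset (standard JDKs do).

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

final class FormEncodingDemo {
    /** Percent-encodes a form value using the given page charset. */
    static String encode(String value, String charset) {
        try {
            return URLEncoder.encode(value, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("charset not available: " + charset, e);
        }
    }

    /** Decodes a percent-encoded form value, assuming the given charset. */
    static String decode(String encoded, String charset) {
        try {
            return URLDecoder.decode(encoded, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("charset not available: " + charset, e);
        }
    }

    public static void main(String[] args) {
        // Three Japanese characters, written as escapes so the source file stays ASCII.
        String text = "\u65E5\u672C\u8A9E";

        String fromUtf8Page = encode(text, "UTF-8");
        String fromSjisPage = encode(text, "Shift_JIS");
        System.out.println("UTF-8 page sends:     " + fromUtf8Page);
        System.out.println("Shift_JIS page sends: " + fromSjisPage);

        // The server has to guess the charset when the browser omits it.
        System.out.println("right guess round-trips: " + text.equals(decode(fromSjisPage, "Shift_JIS")));
        System.out.println("wrong guess round-trips: " + text.equals(decode(fromSjisPage, "UTF-8")));
    }
}
```

The two encoded strings differ, and only the matching charset gets the original text back, which is why the server needs to know what charset each page was served with.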

While we're on the topic of I18N in general, it's also worth knowing
that while UTF-16 and UTF-32 commonly rely on a BOM to signal byte
order, it's optional with UTF-8. Astonishingly (or perhaps not so
astonishingly if you're
a bit cynical about Sun), Java is intolerant of UTF-8 streams
that include a BOM, even though they are perfectly legal
(see: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4508058 ).
It's up to the app to deal with it…. If you're in the mood for some
stomach-churning rationalization, check out Sun's reason for not fixing it:
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6378911
Ultimately, they claim to be hemmed in by the possibility that others may
have brittle workarounds in place, and they didn't want to break them…
so everybody else relying on the standard has to bump their head
and institute their own workaround, …one despondent engineer by one,
globally, forever (or until the open sourcing of Java starts being felt,
whichever comes first).
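For what it's worth, the usual workaround looks something like the sketch below: decode as UTF-8, then drop a leading U+FEFF if the decoder left one there. The class name Utf8BomStrip is made up for illustration, and this is just one way of doing it, not an official fix.

```java
import java.nio.charset.StandardCharsets;

final class Utf8BomStrip {
    /** Decodes UTF-8 bytes and removes a leading BOM character if present. */
    static String decodeUtf8(byte[] data) {
        String s = new String(data, StandardCharsets.UTF_8);
        // Java's UTF-8 decoder passes the BOM through as the character U+FEFF,
        // so the application has to remove it itself.
        if (!s.isEmpty() && s.charAt(0) == '\uFEFF') {
            return s.substring(1);
        }
        return s;
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};

        String naive = new String(withBom, StandardCharsets.UTF_8);
        System.out.println("naive length:    " + naive.length());
        System.out.println("stripped length: " + decodeUtf8(withBom).length());
    }
}
```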

In short, check your HTML files, check your web.xml settings,
and check your HTML meta declarations. Use low-level tools
that allow you to see the exact byte stream you get back from
the server, rather than merely inspecting things in your browser
(that eliminates a whole other set of variables). An example of
a low-level tool that might be useful to you for advanced debugging
of webserver configuration problems is netcat.
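If netcat isn't handy, roughly the same thing can be done with a plain socket. The sketch below sends a bare HTTP/1.0 GET and prints the response headers plus the first body bytes in hex, so a BOM or a charset in the Content-Type header is visible exactly as sent. The class name RawHttpDump and its helpers are hypothetical; pass your own host, port, and path on the command line (the Host header matters here, since the virtualization server picks the sandbox from the hostname).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

final class RawHttpDump {
    /** Builds a minimal HTTP/1.0 GET request as raw bytes. */
    static byte[] buildRequest(String host, int port, String path) {
        String req = "GET " + path + " HTTP/1.0\r\n"
                   + "Host: " + host + ":" + port + "\r\n"
                   + "Connection: close\r\n\r\n";
        return req.getBytes(StandardCharsets.US_ASCII);
    }

    /** Sends the request and returns every byte the server sends back. */
    static byte[] fetch(String host, int port, String path) throws IOException {
        try (Socket socket = new Socket(host, port)) {
            OutputStream out = socket.getOutputStream();
            out.write(buildRequest(host, port, path));
            out.flush();
            InputStream in = socket.getInputStream();
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            return buf.toByteArray();
        }
    }

    /** Index of the first body byte (just past the blank line), or -1 if none. */
    static int bodyOffset(byte[] response) {
        for (int i = 0; i + 3 < response.length; i++) {
            if (response[i] == 13 && response[i + 1] == 10
                    && response[i + 2] == 13 && response[i + 3] == 10) {
                return i + 4;
            }
        }
        return -1;
    }

    /** Hex dump of up to count bytes starting at from, e.g. "EF BB BF". */
    static String hex(byte[] data, int from, int count) {
        StringBuilder sb = new StringBuilder();
        int end = Math.min(data.length, from + count);
        for (int i = from; i < end; i++) {
            if (i > from) sb.append(' ');
            sb.append(String.format("%02X", data[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] response = fetch(args[0], Integer.parseInt(args[1]), args[2]);
        int body = bodyOffset(response);
        if (body < 0) {
            System.out.println("no header/body separator found");
            return;
        }
        System.out.print(new String(response, 0, body, StandardCharsets.ISO_8859_1));
        System.out.println("first body bytes: " + hex(response, body, 16));
    }
}
```

A body that starts with EF BB BF has a UTF-8 BOM in front of it; for a Shift_JIS HTML file you would normally expect it to start with plain ASCII markup instead.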

I hope this helps,
- Jon

Virtual Tomcat forcing UTF-8