Apologies if this has been posted elsewhere; I did my best to search. The virtual Tomcat server in Alfresco WCM 2.1 seems to be forcing content to UTF-8, even if it was checked in as something else. I checked in an HTML file as SHIFT-JIS, but previewing it from the virtual Tomcat produces garbled output.
Note that the latter goes through port 8180, the virtual Tomcat run by Alfresco. Hopefully there is some configuration we are missing that would avoid this forced UTF-8 encoding?
The virtualization server is really just Tomcat with some extra stuff behind the scenes to deal with virtualization. All of the rules governing how it interacts with character sets are exactly the same as Tomcat 5.5.25 (i.e.: Servlet/JSP 2.4/2.0) because it's the exact same codebase.
First, it might be worthwhile to inspect your files with a low-level tool (e.g.: 'dd' or 'vim -b') to see if the HTML files in question actually do contain a BOM or not.
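If a hex editor isn't handy, the same check can be sketched in a few lines of Java. This is only an illustration: the class name BomCheck and its two helpers are made up for this post, and it assumes Java 7 or later for java.nio.file. It prints the first few bytes of a file in hex and reports whether they are the UTF-8 BOM (EF BB BF).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper for this thread, not part of Alfresco or Tomcat.
class BomCheck {

    // True if the data starts with the three UTF-8 BOM bytes EF BB BF.
    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= 3
            && (data[0] & 0xFF) == 0xEF
            && (data[1] & 0xFF) == 0xBB
            && (data[2] & 0xFF) == 0xBF;
    }

    // Space-separated hex of the first n bytes (fewer if the data is shorter).
    static String hexPrefix(byte[] data, int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < Math.min(n, data.length); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", data[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: BomCheck <file>");
            return;
        }
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("first bytes: " + hexPrefix(data, 8));
        System.out.println("UTF-8 BOM:   " + hasUtf8Bom(data));
    }
}
```

A genuine Shift-JIS file should not start with EF BB BF; if yours does, something re-encoded it before or during check-in.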
Here's some information on character sets from the context of a servlet/JSP. Note: your examples are for HTML pages but because they're being delivered by a servlet container, the servlet spec's rules on charset specification/conversion still apply (as do those of HTML):
Going beyond this question a bit is the issue of how servlet containers deal with form-related I18N issues. When a browser does a POST, it should send a Content-Type header that looks like this:
However, early versions of Microsoft Internet Explorer (IE) failed to include the ';' between the application type and the charset specifier. As a result, many websites came to handle the correct header badly. To deal with that, both IE and Firefox send back form data encoded using whatever encoding the page was supplied with (Mozilla attempted to include the proper header, but there were so many compatibility issues that they were forced to yank it). Therefore, if you ever do end up dealing with I18N issues in the context of forms, my advice to you is to set page charsets everywhere (both HTTP headers and HTML metadata).
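To make the failure mode concrete, here is a small sketch of what happens when the server decodes form data with a different charset than the one the page (and therefore the browser) used. It uses only java.net.URLEncoder and java.net.URLDecoder; the class FormCharsetDemo and its method names are invented for illustration, and it assumes your JDK ships the Shift_JIS charset (most full JDKs do, stripped-down runtimes may not).

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical demo class; it only simulates the browser/server exchange.
class FormCharsetDemo {

    // Roughly what a browser sends for one form value on a page served in pageCharset.
    static String encodeAsBrowser(String value, String pageCharset) {
        try {
            return URLEncoder.encode(value, pageCharset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("charset not available: " + pageCharset, e);
        }
    }

    // What the server ends up with if it assumes assumedCharset when decoding.
    static String decodeAsServer(String wire, String assumedCharset) {
        try {
            return URLDecoder.decode(wire, assumedCharset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("charset not available: " + assumedCharset, e);
        }
    }

    public static void main(String[] args) {
        // Three Japanese characters, written as escapes so the source file stays ASCII.
        String original = "\u65E5\u672C\u8A9E";
        String wire = encodeAsBrowser(original, "Shift_JIS");
        System.out.println("on the wire: " + wire);
        System.out.println("decoded as Shift_JIS matches: "
            + decodeAsServer(wire, "Shift_JIS").equals(original));
        System.out.println("decoded as UTF-8 matches:     "
            + decodeAsServer(wire, "UTF-8").equals(original));
    }
}
```

The bytes on the wire carry no charset label of their own, which is why the server has to know (or be told) what the page was served as. In a servlet, the usual lever for POST bodies is calling request.setCharacterEncoding(...) before reading any parameters.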
While we're on the topic of I18N in general, it's also worth knowing that while UTF-16 and UTF-32 streams normally carry a BOM to signal byte order, it's optional with UTF-8. Astonishingly (or perhaps not so astonishingly if you're a bit cynical about Sun), Java is intolerant of UTF-8 streams that include a BOM, even though they are perfectly legal (see: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4508058 ). It's up to the app to deal with it. If you're in the mood for some stomach-churning rationalization, check out Sun's reason for not fixing it ( http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6378911 ). Ultimately, they claim to be hemmed in by the possibility that others may have brittle workarounds in place, and they didn't want to break them. So everybody else relying on the standard has to bump their head and institute their own workaround, one despondent engineer at a time, globally, forever (or until the open sourcing of Java starts being felt, whichever comes first).
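A workaround of the sort described above might look roughly like this. It is a minimal sketch, not anything Alfresco or Tomcat ships: the class Utf8BomSkipper is a made-up name, and it assumes Java 8 or later (for UncheckedIOException). It wraps the stream in a PushbackReader, peeks at the first decoded character, and drops it only if it is U+FEFF.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: a UTF-8 reader that swallows a leading BOM if present.
class Utf8BomSkipper {

    static Reader open(InputStream in) throws IOException {
        PushbackReader reader =
            new PushbackReader(new InputStreamReader(in, StandardCharsets.UTF_8), 1);
        int first = reader.read();
        // Push the character back unless the stream was empty or it was the BOM.
        if (first != -1 && first != 0xFEFF) {
            reader.unread(first);
        }
        return reader;
    }

    // Convenience wrapper for in-memory data.
    static String decode(byte[] data) {
        try (Reader reader = open(new ByteArrayInputStream(data))) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x68, 0x69};
        String naive = new String(withBom, StandardCharsets.UTF_8);
        System.out.println("naive decode length:    " + naive.length());
        System.out.println("BOM-skipping length:    " + decode(withBom).length());
    }
}
```

The naive decode keeps U+FEFF as the first character of the string, which is exactly the thing that trips up parsers downstream.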
In short, check your HTML files, check your web.xml settings, and check your HTML meta declarations. Use low-level tools that allow you to see the exact bytestream you get back from the server, rather than merely inspecting things in your browser (that eliminates a whole other set of variables). An example of a low-level tool that might be useful to you for advanced debugging of webserver configuration problems is netcat.
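If netcat isn't available, a rough stand-in can be written with a plain socket. The sketch below is only that: RawHttpDump is an invented name, the host, port and path come from the command line (for the virtualization server you would use whatever host name your preview URL has, with port 8180), and it speaks HTTP/1.0 on the assumption that the server then simply closes the connection when it is done. It dumps the whole response, headers included, as hex so you can see the exact bytes and the Content-Type header the server really sent.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical netcat-like helper for inspecting raw HTTP responses.
class RawHttpDump {

    // Minimal HTTP/1.0 GET request; the Host header matters for virtual hosts.
    static String buildRequest(String host, String path) {
        return "GET " + path + " HTTP/1.0\r\n"
             + "Host: " + host + "\r\n"
             + "\r\n";
    }

    // Hex dump, 16 bytes per line.
    static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < data.length; i++) {
            if (i > 0) {
                sb.append(i % 16 == 0 ? '\n' : ' ');
            }
            sb.append(String.format("%02X", data[i] & 0xFF));
        }
        return sb.toString();
    }

    // Sends the request and reads until the server closes the connection.
    static byte[] fetch(String host, int port, String path) throws IOException {
        try (Socket socket = new Socket(host, port)) {
            OutputStream out = socket.getOutputStream();
            out.write(buildRequest(host, path).getBytes(StandardCharsets.US_ASCII));
            out.flush();
            InputStream in = socket.getInputStream();
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            return buf.toByteArray();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 3) {
            System.err.println("usage: RawHttpDump <host> <port> <path>");
            return;
        }
        byte[] response = fetch(args[0], Integer.parseInt(args[1]), args[2]);
        System.out.println(toHex(response));
    }
}
```

Comparing the dump from the virtualization server against the dump from wherever the file displays correctly should show quickly whether the body bytes differ or only the declared charset does.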