
Character Encoding with Tomcat

Character Sets

A character set is a set of textual and graphic symbols, each of which is mapped to a nonnegative integer.

Character Encoding

A character encoding maps a character set to units of a specific width and defines byte serialization and ordering rules. Many character sets have more than one encoding. For example, Java programs can represent Japanese character sets using the EUC-JP or Shift-JIS encodings, among others. Each encoding has rules for representing and serializing a character set.

The ISO 8859 series defines 15 character encodings that can represent texts in dozens of languages. Each ISO 8859 character encoding can have up to 256 characters. ISO 8859-1 (Latin-1) comprises the ASCII character set, characters with diacritics (accents, diaereses, cedillas, circumflexes, and so on), and additional symbols.
UTF-8 (Unicode Transformation Format, 8-bit form) is a variable-width character encoding that encodes each Unicode code point as one to four bytes. A byte whose high-order bit is zero represents a 7-bit ASCII character; otherwise, the byte is part of a multibyte sequence that encodes a single character.
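As a rough illustration of the variable width, the following sketch counts the UTF-8 bytes needed for a few sample characters. The class name Utf8Widths and the choice of samples are made up for this example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Tomcat or the servlet API.
final class Utf8Widths {

    // Returns how many bytes UTF-8 needs to encode the given text.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source file itself pure ASCII.
        String[] samples = {
            "A",            // U+0041, plain ASCII
            "\u00FC",       // U+00FC, u with diaeresis
            "\u20AC",       // U+20AC, euro sign
            "\uD83D\uDE00"  // U+1F600, beyond 16 bits (a surrogate pair in Java)
        };
        for (String s : samples) {
            System.out.printf("U+%04X -> %d byte(s)%n", s.codePointAt(0), utf8Length(s));
        }
    }
}
```

The four samples need 1, 2, 3, and 4 bytes respectively.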
UTF-8 is compatible with the majority of existing web content and provides access to the Unicode character set. Current versions of browsers and email clients support UTF-8. In addition, many new web standards specify UTF-8 as their character encoding. For example, UTF-8 is one of the two required encodings for XML documents (the other is UTF-16).
Web components usually use PrintWriter to produce responses; unless told otherwise, the response's PrintWriter encodes using ISO 8859-1. Servlets can also output binary data using OutputStream classes, which perform no encoding. An application whose characters cannot be represented in the default encoding must explicitly set a different encoding.
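To see why the writer's encoding matters, here is a minimal sketch that uses only the JDK: a PrintWriter over an in-memory stream stands in for the servlet response writer. WriterEncodingDemo and its write method are hypothetical names, and the behaviour shown is that of java.io.OutputStreamWriter, not of Tomcat itself.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a response writer; JDK classes only.
final class WriterEncodingDemo {

    // Writes text through a PrintWriter that encodes with the given charset
    // and returns the bytes that would go out on the wire.
    static byte[] write(String text, Charset charset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (PrintWriter out = new PrintWriter(new OutputStreamWriter(sink, charset))) {
            out.print(text);
        } // closing the writer flushes it into the sink
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        // U+00FC fits into ISO-8859-1 as one byte, but takes two in UTF-8.
        System.out.println(write("\u00FC", StandardCharsets.ISO_8859_1).length);
        System.out.println(write("\u00FC", StandardCharsets.UTF_8).length);
        // The euro sign has no ISO-8859-1 code, so the writer substitutes '?'.
        System.out.println((char) write("\u20AC", StandardCharsets.ISO_8859_1)[0]);
    }
}
```

A character that the chosen encoding cannot represent is silently replaced, which is why the encoding has to be set before any output is written.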

For web components, three encodings must be considered:
  • Request
  • Page (JSP pages)
  • Response


German Character Encoding Issue
Issue: We faced a problem with the encoding of German characters in our application's GUI editor.
When the GUI inserted values into the database tables, the German characters were not stored correctly, so they could not be retrieved in the proper form.

Solution: Add one of the following tags to the affected JSP:

<meta http-equiv="Content-Type"  content="text/html;  charset=ISO-8859-1"/>
Or
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
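Symptoms like this typically appear when bytes written in one encoding are read back in another. The sketch below reproduces the effect under the assumption that the text was sent as UTF-8 and decoded as ISO-8859-1; the actual pair of encodings in the application described above is not known, and MojibakeDemo is a hypothetical name.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo of an encoding mismatch; JDK classes only.
final class MojibakeDemo {

    // Encodes text as UTF-8, then wrongly decodes the bytes as ISO-8859-1.
    static String misread(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Reverses misread(): recovers the bytes and decodes them as UTF-8.
    static String repair(String garbled) {
        byte[] original = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String word = "Gr\u00FC\u00DFe";   // German greeting with u-umlaut and sharp s
        String garbled = misread(word);
        System.out.println(garbled.length()); // 7 characters instead of 5
        System.out.println(repair(garbled).equals(word));
    }
}
```

Each non-ASCII character turns into two unrelated Latin-1 characters, which matches the kind of corruption usually seen in the database.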

Understanding Character Encoding Issues in Tomcat


Questions

1. Why
   1. What is the default character encoding of the request or response body?
   2. Why does everything have to be this way?
2. How
   1. How do I change how GET parameters are interpreted?
   2. How do I change how POST parameters are interpreted?
   3. What can you recommend to just make everything work? (How to use UTF-8 everywhere.)
   4. How can I test if my configuration will work correctly?
   5. How can I send higher characters in HTTP headers?

What is the default character encoding of the request or response body?
For servlets, the default is ISO-8859-1.
The character encoding for the body of an HTTP message (request or response) is specified in the Content-Type header field.
An example of such a header is Content-Type: text/html; charset=ISO-8859-1, which explicitly states that the default (ISO-8859-1) is being used.


Why does everything have to be this way?
Everything covered in this page comes down to practical interpretation of a number of specifications. Here are a couple of references before we cover exactly where these items are located in them.
·  URI Syntax
·  HTML 4

Default encoding for request and response bodies
Default Encoding for GET
The character set for HTTP query strings (that's the technical term for 'GET parameters') is defined by the URI Syntax specification as US-ASCII; any character outside that set must be percent-encoded.

·  ISO-8859-1 and ASCII are compatible for character codes 0x20 to 0x7E, so they are often used interchangeably. Most of the web uses ISO-8859-1 as the default for query strings.
·  Many browsers are starting to offer (default) options of encoding URIs using UTF-8 instead of ISO-8859-1. Some browsers appear to use the encoding of the current page to encode URIs for links (see the note below regarding browser behavior for POST encoding).
·  HTML 4.0 recommends the use of UTF-8 to encode the query string.
·  When in doubt, use POST for any data you think might have problems surviving a trip through the query string.
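The ambiguity can be made concrete with java.net.URLEncoder and java.net.URLDecoder: the same character produces different percent-escapes depending on the encoding, and decoding with the wrong one corrupts it. This is only a sketch; QueryStringDemo and its helper methods are hypothetical wrappers around the JDK calls.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical wrappers around the JDK's form/query-string codec.
final class QueryStringDemo {

    // Percent-encodes a parameter value using the named charset.
    static String encode(String value, String charset) {
        try {
            return URLEncoder.encode(value, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    // Decodes a percent-encoded parameter value using the named charset.
    static String decode(String value, String charset) {
        try {
            return URLDecoder.decode(value, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        String umlaut = "\u00FC";
        System.out.println(encode(umlaut, "ISO-8859-1")); // %FC
        System.out.println(encode(umlaut, "UTF-8"));      // %C3%BC
        // A server that assumes ISO-8859-1 sees two characters here, not one.
        System.out.println(decode("%C3%BC", "ISO-8859-1").length());
    }
}
```

Nothing in the query string itself says which encoding the browser used, so client and server have to agree on it by configuration.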

Default Encoding for POST
ISO-8859-1 is defined as the default character set for HTTP request and response bodies in the servlet specification.
Some notes about the character encoding of a POST request:
1.      Section 3.4.1 of HTTP/1.1 states that recipients of an HTTP message must respect the character encoding specified by the sender in the Content-Type header if the encoding is supported. A missing charset parameter allows the recipient to "guess" what encoding is appropriate.
2.      Most web browsers today do not specify the character set of a request, even when it is something other than ISO-8859-1. This seems to be in violation of the HTTP specification. Most web browsers appear to send a request body using the encoding of the page used to generate the POST (for instance, the <form> element came from a page with a specific encoding... it is that encoding which is used to submit the POST data for that form).
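The fallback rule can be sketched as a small helper that reads the charset parameter from a Content-Type value and returns ISO-8859-1 when none is given. ContentTypeCharset and charsetOf are hypothetical names; this is a simplified illustration, not the parser that Tomcat actually uses.

```java
import java.util.Locale;

// Hypothetical helper; a simplified illustration, not Tomcat's own parser.
final class ContentTypeCharset {

    // Default named by the servlet specification for request bodies.
    static final String DEFAULT_CHARSET = "ISO-8859-1";

    // Returns the charset parameter of a Content-Type value, or the default.
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return DEFAULT_CHARSET;
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type itself; parameters start at index 1.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = param.substring("charset=".length()).trim();
                // The value may be quoted, as in charset="utf-8".
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                if (!value.isEmpty()) {
                    return value;
                }
            }
        }
        return DEFAULT_CHARSET;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=UTF-8"));
        System.out.println(charsetOf("application/x-www-form-urlencoded"));
    }
}
```

The second call shows the common browser case: no charset parameter is sent, so the server falls back to the default even if the form data is really UTF-8.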
How do I change how GET parameters are interpreted?
Tomcat 7 and earlier use ISO-8859-1 as the default character encoding of the entire URL, including the query string ("GET parameters"); Tomcat 8 and later default to UTF-8.
There are two ways to specify how GET parameters are interpreted:
1.      Set the URIEncoding attribute on the <Connector> element in server.xml to something specific (e.g. URIEncoding="UTF-8").
2.      Set the useBodyEncodingForURI attribute on the <Connector> element in server.xml to true. This will cause the Connector to use the request body's encoding for GET parameters.
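As a sketch, the two options might look like this in conf/server.xml. The port, protocol, connectionTimeout, and redirectPort values are assumptions borrowed from a typical default configuration; only URIEncoding and useBodyEncodingForURI come from the text above.

```xml
<!-- Option 1: always decode the URL and query string as UTF-8 -->
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8443"
           URIEncoding="UTF-8" />

<!-- Option 2: reuse the request body's encoding for GET parameters -->
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8443"
           useBodyEncodingForURI="true" />
```

Only one of the two elements would be used for a given port; they are shown together for comparison.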
How do I change how POST parameters are interpreted?
POST requests should specify the encoding of the parameters and values they send. Since many clients fail to set an explicit encoding, the default is used (ISO-8859-1).
In many cases this is not the preferred interpretation, so you can employ a javax.servlet.Filter to set the request encoding. Writing such a filter is trivial, and Tomcat already ships with an example filter. Its location depends on the Tomcat version:
5.x:
webapps/servlets-examples/WEB-INF/classes/filters/SetCharacterEncodingFilter.java
webapps/jsp-examples/WEB-INF/classes/filters/SetCharacterEncodingFilter.java
6.x:
webapps/examples/WEB-INF/classes/filters/SetCharacterEncodingFilter.java
5.5.36+, 6.0.36+, 7.x:
org.apache.catalina.filters.SetCharacterEncodingFilter
Since 7.0.20, the filter has been part of core Tomcat rather than the examples, so it is available to any web application without the need to compile and bundle it separately. See the Tomcat documentation for the list of provided filters. The filter was also backported to older Tomcat versions, starting with 5.5.36 and 6.0.36.

Note: The request encoding setting only takes effect if it is applied before the parameters are parsed. Once parsing has happened, it cannot be undone. Parameter parsing is triggered by the first call that asks for a parameter name or value. Make sure that the filter is positioned before any other filter that asks for request parameters. The position depends on the order of the filter-mapping declarations in the WEB-INF/web.xml file, though since the Servlet 3.0 specification there are additional options to control the order. To check the actual order, you can throw an exception from your page and look for the filter names in its stack trace.
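A minimal sketch of registering the bundled filter in WEB-INF/web.xml follows. It assumes the filter's encoding init parameter, and the filter name and the /* URL pattern are chosen here for illustration. Placing the filter-mapping before all other mappings keeps it ahead of filters that read parameters.

```xml
<filter>
  <filter-name>setCharacterEncodingFilter</filter-name>
  <filter-class>org.apache.catalina.filters.SetCharacterEncodingFilter</filter-class>
  <init-param>
    <param-name>encoding</param-name>
    <param-value>UTF-8</param-value>
  </init-param>
</filter>

<!-- Declare this mapping first so the encoding is set before parameters are parsed -->
<filter-mapping>
  <filter-name>setCharacterEncodingFilter</filter-name>
  <url-pattern>/*</url-pattern>
</filter-mapping>
```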


What can you recommend to just make everything work? (How to use UTF-8 everywhere).

Using UTF-8 as your character encoding for everything is a safe bet. This should work for pretty much every situation.
In order to completely switch to using UTF-8, you need to make the following changes:
  1. Set URIEncoding="UTF-8" on your <Connector> in server.xml. References: HTTP Connector, AJP Connector.
  2. Use a character encoding filter with the default encoding set to UTF-8.
  3. Change all your JSPs to include the charset name in their contentType.
     For example, use <%@page contentType="text/html; charset=UTF-8" %> for the usual JSP pages and <jsp:directive.page contentType="text/html; charset=UTF-8" /> for the pages in XML syntax (aka JSP Documents).
  4. Change all your servlets to set the content type for responses and to name UTF-8 as the charset in that content type.
     Use response.setContentType("text/html; charset=UTF-8") or response.setCharacterEncoding("UTF-8").
  5. Change any content-generation libraries you use (Velocity, Freemarker, etc.) to use UTF-8 and to specify UTF-8 in the content type of the responses that they generate.
  6. Disable any valves or filters that may read request parameters before your character encoding filter or JSP page has a chance to set the encoding to UTF-8. For more information see http://www.mail-archive.com/users@tomcat.apache.org/msg21117.html.


How can I test if my configuration will work correctly?
The following sample JSP (saved as index.jsp, the name its form posts to) should work on a clean Tomcat install for any input. If you set URIEncoding="UTF-8" on the connector, it will also work with method="GET".
<%@ page contentType="text/html; charset=UTF-8" %>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
   <head>
     <title>Character encoding test page</title>
   </head>
   <body>
     <p>Data posted to this form was:
     <%
       request.setCharacterEncoding("UTF-8");
       out.print(request.getParameter("mydata"));
     %>

     </p>
     <form method="POST" action="index.jsp">
       <input type="text" name="mydata">
       <input type="submit" value="Submit" />
       <input type="reset" value="Reset" />
     </form>
   </body>
</html>

How can I send higher characters in my HTTP headers?
You have to encode them in some way before you insert them into a header. URL-encoding, where each byte is written as % followed by its two hexadecimal digits (high digit, then low digit), would be a good idea.
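One possible shape for such an encoding step is sketched below. It assumes UTF-8 as the underlying byte encoding and leaves only unreserved ASCII characters unescaped; HeaderValueCodec and both method names are hypothetical, and the receiving side must apply the matching decode step.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical codec for header values; assumes UTF-8 on both sides.
final class HeaderValueCodec {

    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    // Writes every byte that is not an unreserved ASCII character as %XX.
    static String encode(String value) {
        StringBuilder out = new StringBuilder();
        for (byte b : value.getBytes(StandardCharsets.UTF_8)) {
            int u = b & 0xFF; // treat the byte as unsigned
            boolean unreserved = (u >= 'a' && u <= 'z') || (u >= 'A' && u <= 'Z')
                    || (u >= '0' && u <= '9') || u == '-' || u == '.' || u == '_' || u == '~';
            if (unreserved) {
                out.append((char) u);
            } else {
                out.append('%').append(HEX[u >> 4]).append(HEX[u & 0x0F]);
            }
        }
        return out.toString();
    }

    // Turns %XX escapes back into bytes and decodes the result as UTF-8.
    static String decode(String encoded) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        for (int i = 0; i < encoded.length(); i++) {
            char c = encoded.charAt(i);
            if (c == '%') {
                if (i + 2 >= encoded.length()) {
                    throw new IllegalArgumentException("Truncated escape at index " + i);
                }
                bytes.write(Integer.parseInt(encoded.substring(i + 1, i + 3), 16));
                i += 2; // skip the two hex digits just consumed
            } else {
                bytes.write(c);
            }
        }
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String header = encode("Gr\u00FC\u00DFe");
        System.out.println(header); // Gr%C3%BC%C3%9Fe
        System.out.println(decode(header).equals("Gr\u00FC\u00DFe"));
    }
}
```

The encoded value contains only ASCII, so it passes through the header unchanged regardless of which encoding the server assumes.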


