Character Sets
A character set is a set of textual
and graphic symbols, each of which is mapped to a set of nonnegative integers.
Character Encoding
A character encoding maps
a character set to units of a specific width and defines byte serialization and
ordering rules. Many character sets have more than one encoding. For example,
Java programs can represent Japanese character sets using the EUC-JP or
Shift-JIS encodings, among others. Each encoding has rules for representing and
serializing a character set.
The ISO 8859 series defines 15 character encodings that can
represent texts in dozens of languages. Each ISO 8859 character encoding can
have up to 256 characters. ISO 8859-1 (Latin-1) comprises the ASCII character
set, characters with diacritics (accents, diaereses, cedillas, circumflexes,
and so on), and additional symbols.
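As a rough illustration of what fits into ISO 8859-1, the sketch below asks the JDK's built-in Latin-1 encoder whether it can represent a given character. The class and method names (`Latin1Coverage`, `fitsLatin1`) are made up for this example, and the characters are written as `\u` escapes so the result does not depend on the encoding of the source file.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class Latin1Coverage {
    // Returns true if the character has a code in ISO 8859-1.
    static boolean fitsLatin1(char c) {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
        return encoder.canEncode(c);
    }

    public static void main(String[] args) {
        System.out.println(fitsLatin1('A'));      // plain ASCII letter
        System.out.println(fitsLatin1('\u00FC')); // u with diaeresis
        System.out.println(fitsLatin1('\u20AC')); // euro sign
    }
}
```

The euro sign is the usual example of a character that Latin-1 cannot hold, which is one reason an application may need a different encoding.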
UTF-8 (Unicode Transformation Format, 8-bit
form) is a variable-width character encoding that encodes each Unicode
code point as one to four bytes. A byte in UTF-8 is equivalent to 7-bit ASCII
if its high-order bit is zero; otherwise, the byte is part of a multibyte
sequence of two to four bytes.
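The variable width can be seen by encoding a few characters and counting the resulting bytes. This is a minimal sketch; `Utf8Widths` and `utf8Length` are names invented for the example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class Utf8Widths {
    // Number of bytes the string occupies once encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));            // ASCII letter
        System.out.println(utf8Length("\u00FC"));       // u with diaeresis
        System.out.println(utf8Length("\u20AC"));       // euro sign
        System.out.println(utf8Length("\uD83D\uDE00")); // U+1F600, a surrogate pair in Java
    }
}
```

The four calls return 1, 2, 3, and 4 bytes respectively.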
UTF-8 is compatible with the majority of
existing web content and provides access to the Unicode character set. Current
versions of browsers and email clients support UTF-8. In addition, many new web
standards specify UTF-8 as their character encoding. For example, UTF-8 is one
of the two required encodings for XML documents (the other is UTF-16).
Web components usually use PrintWriter to produce responses; PrintWriter
automatically encodes using ISO 8859-1. Servlets can also output binary data
using OutputStream classes, which perform no encoding. An application that uses
a character set that cannot use the default encoding must explicitly set a
different encoding.
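The effect of the writer's encoding can be reproduced outside a servlet container with plain java.io classes. The sketch below is not servlet code: it wraps an in-memory stream in a PrintWriter with an explicit charset, and `WriterEncodingDemo` and `write` are names invented for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class WriterEncodingDemo {
    // Writes the text through a PrintWriter that encodes with the given charset
    // and returns the bytes that reached the underlying stream.
    static byte[] write(String text, Charset charset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (PrintWriter out = new PrintWriter(new OutputStreamWriter(sink, charset))) {
            out.print(text);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String euro = "\u20AC"; // euro sign, not representable in ISO 8859-1
        System.out.println(write(euro, StandardCharsets.ISO_8859_1).length);
        System.out.println(write(euro, StandardCharsets.UTF_8).length);
    }
}
```

With ISO 8859-1 the euro sign is silently replaced by a single `?` byte, while UTF-8 produces the three bytes E2 82 AC. Silent replacement of this kind is what a response writer does when its encoding cannot represent the text.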
For web components, three encodings must be considered: the request encoding,
the page encoding (for JSP pages), and the response encoding.
German characters encoding issue
Issue: We faced a problem with the encoding of German characters in our
application's GUI editor. When the GUI inserted values into the database
tables, the German characters were not stored correctly, so they could not be
retrieved in their proper form.
Solution: Add one of the following meta tags to the affected JSP:
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
Or
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
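Garbled German characters of this kind typically appear when text is written in one encoding and read back in another. The post does not say which encodings were involved, so the sketch below assumes the common UTF-8 versus ISO 8859-1 mix-up; `MojibakeDemo`, `misread`, and `repair` are names invented for the example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class MojibakeDemo {
    // Encodes the text as UTF-8, then wrongly decodes the bytes as ISO 8859-1.
    static String misread(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Reverses the mistake above: recovers the bytes and decodes them as UTF-8.
    static String repair(String garbled) {
        byte[] bytes = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String name = "M\u00FCller";          // "Mueller" spelled with u-umlaut
        String garbled = misread(name);
        System.out.println(garbled.length()); // one character longer than the original
        System.out.println(repair(garbled).equals(name));
    }
}
```

The umlaut turns into the two characters U+00C3 and U+00BC, the familiar "Ã¼" pattern. Declaring the same charset on the page, the request, and the database connection avoids the mismatch in the first place.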
Understanding Character Encoding Issues in Tomcat
Questions
1. Why
   1. What is the default character encoding of the request or response body?
   2. Why does everything have to be this way?
2. How
   1. How do I change how GET parameters are interpreted?
   2. How do I change how POST parameters are interpreted?
   3. What can you recommend to just make everything work? (How to use UTF-8 everywhere.)
   4. How can I test if my configuration will work correctly?
   5. How can I send higher characters in HTTP headers?
What is the default character encoding of the request or response body?
For servlets: ISO-8859-1.
The character encoding for the body of an HTTP message (request or response)
is specified in the Content-Type header field. An example of such a header is
Content-Type: text/html; charset=ISO-8859-1
which explicitly states that the default (ISO-8859-1) is being used.
Why does everything have to be this way?
Everything covered on this page comes down to the practical interpretation of
a number of specifications. Here are a couple of references before we cover
exactly where these items are located in them.
· HTML 4
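Default encoding for request and response bodies
Default encoding for GET
The character set for HTTP query strings (that's the technical term for 'GET
parameters') is defined by the URI syntax specification to be US-ASCII; any
character outside US-ASCII must be percent-encoded.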
· ISO-8859-1 and ASCII are compatible for character codes 0x20 to 0x7E, so
they are often used interchangeably. Most of the web uses ISO-8859-1 as the
default for query strings.
· Many browsers are starting to offer (default) options of encoding URIs using
UTF-8 instead of ISO-8859-1. Some browsers appear to use the encoding of the
current page to encode URIs for links (see the note below regarding browser
behavior for POST encoding).
· HTML 4.0 recommends the use of UTF-8 to encode the query string.
· When in doubt, use POST for any data you think might have problems surviving
a trip through the query string.
Default Encoding for POST
ISO-8859-1 is defined as the default character set for HTTP request and
response bodies in the servlet specification.
Some notes about the character encoding of a POST request:
1. Section 3.4.1 of HTTP/1.1 states that recipients of an HTTP message must
respect the character encoding specified by the sender in the Content-Type
header if the encoding is supported. A missing charset parameter allows the
recipient to "guess" what encoding is appropriate.
2. Most web browsers today do not specify the character set of a request, even
when it is something other than ISO-8859-1. This seems to be in violation of
the HTTP specification. Most web browsers appear to send a request body using
the encoding of the page used to generate the POST (for instance, if the
<form> element came from a page with a specific encoding, it is that encoding
which is used to submit the POST data for that form).
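The rule in note 1 can be sketched as a small lookup: use the charset parameter of the Content-Type header when it is present, and fall back to ISO-8859-1 otherwise. This is a simplified illustration, not how Tomcat parses headers; `ContentTypeCharset` and `charsetOf` are names invented for the example, and an unknown charset name makes `Charset.forName` throw an exception.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper class, for illustration only.
class ContentTypeCharset {
    // Returns the charset named in a Content-Type value, or ISO-8859-1
    // (the servlet default) when no charset parameter is present.
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String param = part.trim();
                if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = param.substring("charset=".length()).replace("\"", "").trim();
                    return Charset.forName(name);
                }
            }
        }
        return StandardCharsets.ISO_8859_1;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=UTF-8"));
        System.out.println(charsetOf("application/x-www-form-urlencoded"));
    }
}
```

The second call shows the situation described in note 2: the browser sent no charset, so the server falls back to the default even if the form data is really UTF-8.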
How do I change how GET parameters are interpreted?
In the Tomcat versions covered here (up to 7.x), Tomcat uses ISO-8859-1 as the
default character encoding of the entire URL, including the query string ("GET
parameters"). Tomcat 8 and later default to UTF-8.
There are two ways to specify how GET parameters are interpreted:
1. Set the URIEncoding attribute on the <Connector> element in server.xml to
something specific (e.g. URIEncoding="UTF-8").
2. Set the useBodyEncodingForURI attribute on the <Connector> element in
server.xml to true. This will cause the Connector to use the request body's
encoding for GET parameters.
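What the connector setting changes can be seen by decoding the same percent-encoded value with two different charsets. The sketch below uses the JDK's URLDecoder (the overload taking a Charset needs Java 10 or later) rather than Tomcat itself; `QueryDecodingDemo`, `asUtf8`, and `asLatin1` are names invented for the example.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class QueryDecodingDemo {
    // Interprets the percent-encoded bytes as UTF-8.
    static String asUtf8(String raw) {
        return URLDecoder.decode(raw, StandardCharsets.UTF_8);
    }

    // Interprets the same bytes as ISO-8859-1.
    static String asLatin1(String raw) {
        return URLDecoder.decode(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String raw = "M%C3%BCller"; // "Mueller" with u-umlaut, percent-encoded as UTF-8
        System.out.println(asUtf8(raw).length());   // 6 characters
        System.out.println(asLatin1(raw).length()); // 7 characters, the umlaut became two
    }
}
```

A browser that percent-encodes in UTF-8 combined with a connector that decodes as ISO-8859-1 yields the seven-character garbled form.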
How do I change how POST parameters are interpreted?
POST requests should specify the encoding of the parameters and values they
send. Since many clients fail to set an explicit encoding, the default is used
(ISO-8859-1).
In many cases this is not the preferred interpretation, so one can employ a
javax.servlet.Filter to set the request encoding. Writing such a filter is
trivial. Furthermore, Tomcat already comes with such an example filter. Please
take a look at:
5.x
webapps/servlets-examples/WEB-INF/classes/filters/SetCharacterEncodingFilter.java
webapps/jsp-examples/WEB-INF/classes/filters/SetCharacterEncodingFilter.java
6.x
webapps/examples/WEB-INF/classes/filters/SetCharacterEncodingFilter.java
5.5.36+, 6.0.36+, 7.x
Since 7.0.20 the filter has been a first-class citizen: it was moved from the
examples into core Tomcat and is available to any web application without the
need to compile and bundle it separately. See the documentation for the list of
filters provided by Tomcat. The class name is:
org.apache.catalina.filters.SetCharacterEncodingFilter
It was also ported to older Tomcat versions and is available there starting
with versions 5.5.36 and 6.0.36.
Note: The request encoding setting is effective only if it is applied before
the parameters are parsed. Once parsing has happened, there is no way back.
Parameter parsing is triggered by the first method call that asks for a
parameter name or value. Make sure that the filter is positioned before any
other filters that ask for request parameters. The position depends on the
order of the filter-mapping declarations in the WEB-INF/web.xml file, though
since the Servlet 3.0 specification there are additional options to control the
order. To check the actual order, you can throw an Exception from your page and
check its stack trace for filter names.
What can you recommend to just make everything work? (How to use UTF-8 everywhere.)
Using UTF-8 as your character encoding for everything is a safe bet. This
should work for pretty much every situation.
In order to completely switch to using UTF-8, you need to make the following
changes:
- Set URIEncoding="UTF-8" on your <Connector> in server.xml. References: HTTP
Connector, AJP Connector.
- Use a character encoding filter with the default encoding set to UTF-8.
- Change all your JSPs to include the charset name in their contentType. For
example, use <%@page contentType="text/html; charset=UTF-8" %> for the usual
JSP pages and <jsp:directive.page contentType="text/html; charset=UTF-8" /> for
the pages in XML syntax (aka JSP Documents).
- Change all your servlets to set the content type for responses and to include
the charset name UTF-8 in the content type. Use
response.setContentType("text/html; charset=UTF-8") or
response.setCharacterEncoding("UTF-8"), and call it before getWriter().
- Change any content-generation libraries you use (Velocity, Freemarker, etc.)
to use UTF-8 and to specify UTF-8 in the content type of the responses that
they generate.
- Disable any valves or filters that may read request parameters before your
character encoding filter or JSP page has a chance to set the encoding to
UTF-8. For more information see
http://www.mail-archive.com/users@tomcat.apache.org/msg21117.html.
How can I test if my configuration will work correctly?
The following sample JSP should work on a clean Tomcat install for any input.
If you set URIEncoding="UTF-8" on the connector, it will also work with
method="GET".
<%@ page contentType="text/html; charset=UTF-8" %>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
  <head>
    <title>Character encoding test page</title>
  </head>
  <body>
    <p>Data posted to this form was:
    <%
      request.setCharacterEncoding("UTF-8");
      out.print(request.getParameter("mydata"));
    %>
    </p>
    <form method="POST" action="index.jsp">
      <input type="text" name="mydata">
      <input type="submit" value="Submit" />
      <input type="reset" value="Reset" />
    </form>
  </body>
</html>
How can I send higher characters in my HTTP headers?
You have to encode them in some way before you insert them into a header.
Using URL-encoding (a % sign followed by the two hexadecimal digits of each
byte) would be a good idea.
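One way to do this with the JDK alone is URLEncoder, sketched below under the assumption that both sides agree on UTF-8. The overloads taking a Charset need Java 10 or later, and `HeaderValueEncoding`, `encode`, and `decode` are names invented for the example.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class HeaderValueEncoding {
    // Percent-encodes the value using UTF-8 so that only ASCII remains.
    static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    // Reverses encode() on the receiving side.
    static String decode(String headerValue) {
        return URLDecoder.decode(headerValue, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String encoded = encode("M\u00FCller"); // "Mueller" with u-umlaut
        System.out.println(encoded);            // ASCII-only form, safe for a header
        System.out.println(decode(encoded).length());
    }
}
```

Note that URLEncoder follows the HTML form rules, so a space becomes `+` rather than `%20`; the receiver has to decode with the matching method.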