Question concerning Unicode and or Shift-JIS

Andrew Clover and-google at doxdesk.com
Mon Mar 15 03:55:32 EST 2004


Antioch <sesshomaru_2k3 at SPAMTRAP.hotmail.com> wrote:

> However, when getting user input, the input is natively sent to the
> program in Shift-JIS encoding.

You can change this by setting the encoding of the web page containing
the <form>, either by having the server send an HTTP response header
'Content-Type: text/html;charset=utf-8', or by including a
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
element inside the <head>.
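
As a rough sketch of both options at once, here is a tiny WSGI app that
serves the form page with the charset in the HTTP header and in a <meta>
element. WSGI itself, the name form_page_app, the '/lookup' action and
the 'word' field are all my assumptions for illustration, not anything
from your setup.

```python
# Hypothetical form page; the <meta> element repeats the charset for
# cases where the HTTP header is lost (e.g. the page is saved to disk).
FORM_PAGE = """<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
<title>検索</title>
</head>
<body>
<form method="post" action="/lookup">
<input type="text" name="word" />
<input type="submit" value="検索" />
</form>
</body>
</html>
"""


def form_page_app(environ, start_response):
    """Serve the form page as UTF-8 and say so in the response header."""
    body = FORM_PAGE.encode("utf-8")
    headers = [
        ("Content-Type", "text/html;charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```

The bytes sent and the declared charset have to agree; declaring utf-8
while sending Shift-JIS bytes is worse than declaring nothing.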

If you don't set the charset like this, the browser has to guess which
encoding to use. In the absence of any properly-encoded Japanese text
on the form page that guess could be anything, and the default may
differ depending on the user's locale.

[Detour:]

The browser will send the form submission in the specified encoding,
*unless* the user deliberately goes to the browser's encodings menu
and selects a different one. Unlikely, but possible. The 'proper' way
around this is to write <form accept-charset="utf-8">, which should mean
that the browser should send the submission as UTF-8 regardless of the
encoding of the page containing the form. Unfortunately Internet
Explorer on Windows is broken and stupid, and prefers to use this as
a 'backup' encoding: it uses the current page's encoding for fields
that can be encoded in it, and the accept-charset encoding for fields
containing characters that can't be. Thus a single submission can
contain a mixture of encodings with absolutely no way to determine
which is which.

The IE-compatible but utterly hideous workaround is to avoid
accept-charset and include a hidden field with name '_charset_' in the
form. IE will fill it in with the currently selected encoding when the
form is submitted. Whether it is worth doing this is debatable.
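
If you do go that way, the server side might look roughly like the
following. This is only a sketch: decode_form and the allow-list are
names I made up, and it assumes you have the raw urlencoded request
body as an ASCII string. The submitted charset name is untrusted input,
so it is only honoured when it resolves to an encoding you expect.

```python
import codecs
from urllib.parse import unquote_to_bytes

# Codec names (as reported by codecs.lookup) we are willing to accept.
# Extend this to whatever encodings your pages are actually served in.
ALLOWED_CODECS = {"utf-8", "shift_jis"}


def decode_form(body, default="utf-8"):
    """Decode a urlencoded form body using its '_charset_' field, if any."""
    raw = []
    for pair in body.split("&"):
        if not pair:
            continue
        name, _, value = pair.partition("=")
        # '+' means space in form encoding; literal plus signs arrive as %2B.
        raw.append((unquote_to_bytes(name.replace("+", " ")),
                    unquote_to_bytes(value.replace("+", " "))))

    charset = default
    for name, value in raw:
        if name == b"_charset_":
            candidate = value.decode("ascii", "ignore")
            try:
                codec_name = codecs.lookup(candidate).name
            except (LookupError, ValueError):
                continue  # unknown or malformed name: keep the default
            if codec_name in ALLOWED_CODECS:
                charset = candidate

    # Every field is decoded with the one charset the browser reported.
    return {n.decode(charset, "replace"): v.decode(charset, "replace")
            for n, v in raw}
```

A submission without a usable _charset_ value falls back to the default,
which should match the charset you declared for the form page.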

[end detour.]

> the problem is I don't know how to "decode" the UTF and then recode
> it into Shift-JIS so that I can compare the dictionary values

I'd definitely recommend storing the dictionary values as Unicode strings
rather than trying to compare encoded versions.
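
To illustrate what I mean, a minimal sketch, written for Python 3 where
str is already a Unicode string. The sample entries and the lookup
function are invented for the example; the point is only that each side
is decoded once, at the boundary, and all comparison happens on
characters rather than bytes.

```python
# Hypothetical dictionary data as it might arrive from a Shift-JIS file.
sjis_entries = {
    "日本".encode("shift_jis"): "Japan",
    "猫".encode("shift_jis"): "cat",
}

# Decode the keys once at load time; from here on they are Unicode strings.
dictionary = {key.decode("shift_jis"): meaning
              for key, meaning in sjis_entries.items()}


def lookup(form_bytes, charset="utf-8"):
    """Decode submitted bytes with their own charset, then compare as text."""
    word = form_bytes.decode(charset)
    return dictionary.get(word)
```

The same word then matches whichever encoding the browser happened to
send it in, as long as you decode it with the right charset.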

> I don't know how to decode any of the codecs. I'm sure there's just
> some simple function call

Yep:

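  # Python 2: decode the Shift-JIS byte string into a unicode object...
  characterString = unicode(jisString, 'shift_jis')
  # ...then encode it as a UTF-8 byte string where bytes are needed.
  utfString = characterString.encode('utf-8')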

-- 
Andrew Clover
mailto:and at doxdesk.com
http://www.doxdesk.com/
