Encoding troubles

Bryan bryanjugglercryptographer at yahoo.com
Mon May 17 22:38:51 EDT 2010


Neil Hodgson wrote:
> JB:
>
> > as hyphens (–) and apostrophes (’) are in an odd encoding. When passed
> > to the database using sqlalchemy they appear as – and other
> > characters.
>
>    The encoding is UTF-8. Normally the best way to handle encodings is
> to convert to Unicode strings (unicode(s, "UTF-8")) as soon as possible
> and perform most processing in Unicode.

Good advice to work in Unicode (and in Python 3.X str is unicode), but
I'd guess the encoding he's getting is "Windows-1252". The default
character set of HTTP is ISO-8859-1, but Microsoft likes to use
Windows-1252 in its place.
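
One way to tell the two candidates apart is to look at the raw bytes.
The following is only a sketch in Python 3; the helper name
bytes_in_each_encoding is made up for illustration. It shows what the
en dash and the curly apostrophe look like in each encoding, and that
ISO-8859-1 proper cannot represent them at all:

```python
EN_DASH = "\u2013"      # the dash from the original report
APOSTROPHE = "\u2019"   # the curly apostrophe from the original report

def bytes_in_each_encoding(ch):
    """Return the byte sequence for ch in UTF-8 and in Windows-1252."""
    return {name: ch.encode(name) for name in ("utf-8", "windows-1252")}

def fits_in_latin1(ch):
    """True if ch can be encoded in ISO-8859-1."""
    try:
        ch.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False

print(bytes_in_each_encoding(EN_DASH))
print(bytes_in_each_encoding(APOSTROPHE))
print(fits_in_latin1(EN_DASH))
```

A single byte such as 0x96 or 0x92 in the incoming data points to
Windows-1252; a three-byte sequence starting with 0xE2 0x80 points to
UTF-8.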

What to do about it? First, try specifying utf-8 in the form
containing the textarea, as in

  <form action="process.cgi" accept-charset="utf-8">

Note that specifying ISO-8859-1 will not work, in that Microsoft will
still use Windows-1252. I've heard they've gotten better at supporting
utf-8, but I haven't tested.

When a request comes in, check for a Content-Type header that names
the character set, which should be:

  Content-Type: application/x-www-form-urlencoded; charset=utf-8

Then you can decode to a unicode object as Neil Hodgson explained.
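
A rough sketch of that check in Python 3, using only the standard
library. The function names and the choice of Windows-1252 as the
fallback are my assumptions, not anything the CGI interface requires:

```python
from email.message import Message
from urllib.parse import parse_qs

# Assumption: fall back to Windows-1252 when the request names no charset.
FALLBACK_CHARSET = "windows-1252"

def charset_from_content_type(header_value, default=FALLBACK_CHARSET):
    """Return the charset parameter of a Content-Type value, or the default."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset() or default

def parse_form(body, content_type):
    """Percent-decode a urlencoded body (given as text) using the declared charset."""
    charset = charset_from_content_type(content_type)
    return parse_qs(body, encoding=charset)
```

For example, parse_form("comment=%E2%80%93",
"application/x-www-form-urlencoded; charset=utf-8") yields the en dash
as a single Unicode character under the key "comment".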

In case you still have to deal with Windows-1252, Python knows how to
translate Windows-1252 to the best-fit in Unicode. In current Python
2.x:

  ustring = unicode(raw_string, 'Windows-1252')

In Python 3.X, what comes from a socket is bytes, and str means
unicode:

  ustring = str(raw_bytes, 'Windows-1252')
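
If you cannot tell in advance which of the two encodings a client
used, a common heuristic (my suggestion, not something the HTTP spec
promises) is to try strict UTF-8 first and fall back to Windows-1252
only when that fails. A minimal Python 3 sketch, with a made-up
function name:

```python
def decode_form_bytes(raw_bytes):
    """Decode as UTF-8 if the bytes are valid UTF-8, else as Windows-1252."""
    try:
        return raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        # Python's Windows-1252 codec leaves a few byte values undefined,
        # so replace those rather than raise a second time.
        return raw_bytes.decode("windows-1252", errors="replace")
```

This works because text containing Windows-1252 punctuation bytes such
as 0x92 or 0x96 is almost never valid UTF-8, so the first attempt
fails cleanly instead of producing wrong characters.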


Of course this all assumes that JB's database likes Unicode. If it
chokes, then alternatives include encoding back to utf-8 and storing
as binary, or translating characters to some best-fit in the set the
database supports.
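
Both alternatives can be sketched in a few lines of Python 3. The
mapping table below is a small hand-picked example of "best fit", not
a complete one, and the function names are hypothetical:

```python
# Assumption: a hand-picked mapping of common "smart" punctuation to ASCII.
BEST_FIT = str.maketrans({
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote / apostrophe
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
})

def to_utf8_blob(text):
    """Encode text as UTF-8 bytes, suitable for a binary column."""
    return text.encode("utf-8")

def to_best_fit(text, charset="iso-8859-1"):
    """Map known punctuation to plain equivalents, then encode for the database.

    Characters still outside the target charset become '?'.
    """
    return text.translate(BEST_FIT).encode(charset, errors="replace")
```

Storing UTF-8 as binary loses nothing but pushes the decoding onto
every reader of the column; the best-fit route keeps the column
readable at the cost of discarding the original characters.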


--
--Bryan Olson


