i18n hell

Martin Blais blais at furius.ca
Mon Apr 24 12:16:41 EDT 2006


On 24 Apr 2006 00:38:42 -0700, Serge Orlov <Serge.Orlov at gmail.com> wrote:
> fyleow wrote:
> > I just spent hours trying to figure out why even after I set my SQL
> > table attributes to UTF-8 only garbage kept adding into the database.
> > Apparently you need to execute "SET NAMES 'utf8'" before inserting into
> > the tables.
> >
> > Does anyone have experience working with other languages using Django
> > or Turbogears?  I just need to be able to retrieve and enter text to
> > the database from my page without it being mangled.  I know these
> > frameworks employ ORM so you don't need to write SQL and that worries
> > me because I tried this on Rails and it wouldn't work.
>
> Frequently asked question to people who are burning in i18n hell: are
> you using unicode strings or byte strings? Unicode string means that
> type(your_string) is unicode, it does not mean you keep utf-8 encoded
> text in python byte strings.

I used to live in i18n hell, a while ago, until I understood this:
every time you keep a reference to some kind of string object, ALWAYS
ALWAYS ALWAYS be AWARE of whether it is unencoded (a unicode object)
or an encoded string (a str object), and if the latter, which encoding
it is in.  Then deal with the conversion between the two domains
EXPLICITLY (e.g. encode(), decode()).  If you hold onto a str or
unicode object and you don't know which it is, you are bound to face
unicode hell at some point.  You can use a prefix convention if that
makes it easier for you, but the point is that you CANNOT just "wing
it".  Python makes it too easy to just "wing it" and that creates a
lot of surprises, especially since some methods hide the conversions,
e.g. str.join, which returns a unicode object as soon as one item is
unicode and silently decodes the str items with the default (ASCII)
codec.
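
To make the "explicit conversion at the boundary" idea concrete, here
is a minimal sketch.  It is written in Python 3 spelling, where the
unencoded type is called str and the encoded type is called bytes
(the unicode/str pair discussed above).  The helper names from_wire
and to_wire and the u_/b_ prefixes are hypothetical, just one possible
convention.

```python
# Python 3 naming: str is unencoded text, bytes is encoded data.
# The u_/b_ prefixes and both helper names are illustrative assumptions.

def from_wire(b_data, encoding="utf-8"):
    """Decode bytes arriving from outside (socket, file, database) into text."""
    if not isinstance(b_data, bytes):
        raise TypeError("expected bytes, got %s" % type(b_data).__name__)
    return b_data.decode(encoding)


def to_wire(u_text, encoding="utf-8"):
    """Encode text into bytes just before it leaves the program."""
    if not isinstance(u_text, str):
        raise TypeError("expected str, got %s" % type(u_text).__name__)
    return u_text.encode(encoding)


b_name = b"Fran\xc3\xa7ois"                   # UTF-8 bytes, as read from outside
u_name = from_wire(b_name)                    # text: length counts characters
u_greeting = ", ".join(["Bonjour", u_name])   # all text, no hidden conversion
b_out = to_wire(u_greeting, "latin-1")        # output encoding chosen explicitly
```

The type checks are the point: passing the wrong kind of string fails
immediately at the boundary instead of producing garbage later.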

W.r.t. DB storage, that depends on the specific database and the
DBAPI module you're using.  Read up on them, write a few simple tests
against your DBAPI module, and know what kinds of strings you're
sending in and reading back.  I often use PostgreSQL, and my
configuration always stores strings in UTF-8 in the database.  I have
a lightweight mapping module that disambiguates and does the
encoding/decoding automatically in a consistent way (that decision
belongs in the client code for now, unfortunately, but it is
centralized in my table declaration, which lists the desired
conversions for each column).  See http://furius.ca/antiorm/ for
something simple that works well.
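
The kind of round-trip test and per-column declaration described above
might look roughly like the sketch below.  It is NOT the antiorm API.
It assumes Python 3 and uses the standard-library sqlite3 module as a
stand-in for a PostgreSQL driver so that it runs without a server.
The table, the column names and COLUMN_ENCODINGS are all made up for
illustration.

```python
import sqlite3

# Hypothetical table declaration: column name -> encoding used for storage.
# Columns that are not listed are passed through untouched.
COLUMN_ENCODINGS = {"name": "utf-8", "legacy_note": "latin-1"}


def encode_row(row):
    """Encode the declared text columns to bytes before handing them to the driver."""
    return {col: val.encode(COLUMN_ENCODINGS[col]) if col in COLUMN_ENCODINGS else val
            for col, val in row.items()}


def decode_row(row):
    """Decode the declared columns back to text after reading them from the driver."""
    return {col: val.decode(COLUMN_ENCODINGS[col]) if col in COLUMN_ENCODINGS else val
            for col, val in row.items()}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER, name BLOB, legacy_note BLOB)")

row_in = {"id": 1, "name": "Zo\u00eb", "legacy_note": "caf\u00e9"}
conn.execute(
    "INSERT INTO person (id, name, legacy_note) VALUES (:id, :name, :legacy_note)",
    encode_row(row_in),
)

# Read the row back and look at what the driver really returns.
cur = conn.execute("SELECT id, name, legacy_note FROM person")
columns = [d[0] for d in cur.description]
raw = dict(zip(columns, cur.fetchone()))   # bytes for the two BLOB columns
row_out = decode_row(raw)                  # text again, same as row_in
conn.close()
```

Comparing raw with row_out is the "simple test": it shows exactly
which kind of string goes in and which kind comes back for your
driver, instead of guessing.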

cheers,




--
Martin
Furius Python Training -- http://furius.ca/training/


