Unicode (UTF8) in dbhash on 2.5

Yves Dorfsman yves at zioup.com
Tue Oct 21 10:16:46 EDT 2008


Diez B. Roggisch <deets at nospam.web.de> wrote:

> Please write the following program and meditate at least 30min in front of
> it:

> while True:
>    print "utf-8 is not unicode"

I hope you will have a better day today than yesterday !
Now, I did this:

while True:
  print "¡ Python knows about encoding, but only sometimes !"

My terminal is set up for UTF-8, and... it did print correctly. I expected
that by setting coding: utf-8, all the I/O functions would do the encoding
for me, because if they don't, then I, and everybody who writes a script, will
need to subclass every single I/O class (OK, except for print !).


> Bytestrings are just that - a sequence of 8-bit-values.

It used to be that ints were 8 bits; we did not stay stuck in time, and ints
are now typically longer. I expect a high-level language to let me set the
encoding once and then do simple I/O operations... without having to
encode/decode.
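
For plain files, the kind of "set the encoding once" behaviour I mean looks
roughly like what codecs.open from the standard library offers. A minimal
sketch, assuming a throwaway temporary file (the path and the sample text are
only for illustration), written so that it should run on Python 2.5 as well
as later versions:

```python
import codecs
import os
import tempfile

# Hypothetical scratch file, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

text = u'\u00a1 Python knows about encoding !'

# The encoding is named once; write() then accepts unicode text directly.
f = codecs.open(path, 'w', encoding='utf-8')
try:
    f.write(text)
finally:
    f.close()

# read() hands back unicode text, already decoded.
f = codecs.open(path, 'r', encoding='utf-8')
try:
    read_back = f.read()
finally:
    f.close()
```

This covers ordinary files only; it does nothing for dbhash or other
byte-oriented stores.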

> Now the real world of databases, network-connections and harddrives doesn't
> know about unicode. They only know bytes. So before you can write to them,
> you need to "encode" the unicode data to a byte-stream-representation.
> There are quite a few of these, e.g. latin1, or the aforementioned UTF-8,
> which has the property that it can render *all* unicode characters,
> potentially needing more than one byte per character.
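
To make the quoted distinction concrete, here is a small sketch of one
character seen both as text and as bytes. The euro sign is my own choice of
example, and the variable names are only illustrative; the code is written to
run on Python 2.6+ and Python 3.3+:

```python
euro = u'\u20ac'                      # EURO SIGN, a single code point

# UTF-8 needs more than one byte for this character.
utf8_bytes = euro.encode('utf-8')
assert len(euro) == 1
assert len(utf8_bytes) == 3

# latin1 has no euro sign, so that codec cannot represent the character.
try:
    euro.encode('latin1')
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# Decoding with the same codec recovers the original text.
round_trip = utf8_bytes.decode('utf-8')
```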

Sure, if I write assembly, I'll make sure I get my bits, bytes, and multi-byte
chars right, but that's why I use a high-level language. Here's an example
of an implementation that lets you write Unicode directly to a dbhash; I
hoped there would be something similar in Python:
http://www.oracle.com/technology/documentation/berkeley-db/db/gsg/JAVA/DBEntry.html
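
What I have in mind is roughly the following. This is only a sketch:
UnicodeDB is a hypothetical class of my own, not something in the standard
library, and a plain dict stands in for the object that dbhash.open() would
return (I am assuming that object accepts and returns byte strings like a
mapping does):

```python
# Hypothetical wrapper: UnicodeDB is not part of the standard library.
class UnicodeDB(object):
    """Wrap a mapping that stores byte strings; expose text keys and values."""

    def __init__(self, db, encoding='utf-8'):
        self._db = db              # e.g. the object returned by dbhash.open()
        self._encoding = encoding  # set once, used for every read and write

    def __getitem__(self, key):
        raw = self._db[key.encode(self._encoding)]
        return raw.decode(self._encoding)

    def __setitem__(self, key, value):
        self._db[key.encode(self._encoding)] = value.encode(self._encoding)

    def __contains__(self, key):
        return key.encode(self._encoding) in self._db


# A plain dict stands in for the real database here.
store = {}
db = UnicodeDB(store)
db[u'smiley'] = u'\u263a'
```

With this, the caller only ever sees unicode strings, while the underlying
store still receives bytes.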

> db = dbhash.open('dbfile.db')
> smiley = db[u'smiley'.encode('utf-8')].decode('utf-8')

> print smiley.encode('utf-8')


> The last encode is there to print out the smiley on a terminal - one of
> those pesky bytestream-eaters that don't know about unicode.

What are you talking about ?
I just copied this right from my terminal (LANG=en_CA.utf8):

>>> print unichr(0x020ac)
€
>>> 

Now, I have read that Python 2.6 has better support for Unicode. Does it allow
writing to files, bsddb, etc. without having to encode/decode every time?
This is a big enough issue for me right now that I will manually install 2.6
if it does.

Thanks.

-- 
Yves.
http://www.sollers.ca/blog/2008/no_sound_PulseAudio
http://www.sollers.ca/blog/2008/PulseAudio_pas_de_son/.fr



