Unicode (UTF8) in dbhash on 2.5
Yves Dorfsman
yves at zioup.com
Tue Oct 21 10:16:46 EDT 2008
Diez B. Roggisch <deets at nospam.web.de> wrote:
> Please write the following program and meditate at least 30min in front of
> it:
> while True:
>     print "utf-8 is not unicode"
I hope you will have a better day today than yesterday !
Now, I did this:
while True:
    print "¡ Python knows about encoding, but only sometimes !"
My terminal is set up for UTF-8, and... it did print correctly. I expected
that setting coding: utf-8 would make all the I/O functions do the encoding
for me, because if they don't, then I, and everybody else who writes a
script, will need to subclass every single I/O class (ok, except for print !).
> Bytestrings are just that - a sequence of 8-bit-values.
It used to be that ints were 8 bits; we did not stay stuck in time, and ints
are now typically longer. I expect a high-level language to let me set the
encoding once and then do simple I/O operations... without having to
encode/decode.
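For plain files, the closest thing to "set the encoding once" in the 2.5
standard library is probably codecs.open, which returns a wrapper that encodes
on write and decodes on read. The following is only a minimal sketch: the
scratch path and the sample text are made up for illustration. Note that the
coding: utf-8 line only tells Python how to decode the source file itself; it
does not change what the I/O functions do.

```python
# -*- coding: utf-8 -*-
import codecs
import os
import tempfile

# Hypothetical scratch file, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "greeting.txt")

text = u"\u00a1 Python knows about encoding !"

# The encoding is named once, when the file is opened; write() then
# accepts unicode objects and does the encoding itself.
out = codecs.open(path, "w", encoding="utf-8")
out.write(text)
out.close()

# Reading through the same kind of wrapper hands back unicode objects.
inp = codecs.open(path, "r", encoding="utf-8")
round_trip = inp.read()
inp.close()

# On disk the file still holds UTF-8 bytes, as a binary read shows.
f = open(path, "rb")
raw = f.read()
f.close()
```

This covers ordinary files only; it does nothing for dbhash or bsddb, which
still take byte strings.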
> Now the real world of databases, network-connections and harddrives doesn't
> know about unicode. They only know bytes. So before you can write to them,
> you need to "encode" the unicode data to a byte-stream-representation.
> There are quite a few of these, e.g. latin1, or the aforementioned UTF-8,
> which has the property that it can render *all* unicode characters,
> potentially needing more than one byte per character.
Sure, if I write assembly, I'll make sure I get my bits, bytes and multi-byte
chars right, but that's why I use a high-level language. Here's an example
of an implementation that lets you write Unicode directly to a dbhash; I
hoped there would be something similar in Python:
http://www.oracle.com/technology/documentation/berkeley-db/db/gsg/JAVA/DBEntry.html
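Short of built-in support, one way to get a similar feel in Python is a thin
wrapper that encodes on the way in and decodes on the way out, so the
encoding is named in one place only. This is a sketch under assumptions, not
an existing library class: UnicodeDB is a hypothetical name, and a plain dict
stands in for the object that dbhash.open() would return.

```python
class UnicodeDB(object):
    """Hypothetical wrapper: unicode in and out, encoded bytes underneath."""

    def __init__(self, db, encoding="utf-8"):
        # db is any mapping of byte strings, e.g. what dbhash.open() returns.
        self.db = db
        self.encoding = encoding

    def __getitem__(self, key):
        # Encode the key for the lookup, decode the stored value.
        return self.db[key.encode(self.encoding)].decode(self.encoding)

    def __setitem__(self, key, value):
        # Both key and value are stored as encoded byte strings.
        self.db[key.encode(self.encoding)] = value.encode(self.encoding)


# A plain dict stands in for the database in this illustration.
store = UnicodeDB({})
store[u"smiley"] = u"\u263a"
smiley = store[u"smiley"]
```

The calling code then only ever sees unicode objects; the underlying database
still stores byte strings.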
> db = dbhash.open('dbfile.db')
> smiley = db[u'smiley'.encode('utf-8')].decode('utf-8')
> print smiley.encode('utf-8')
> The last encode is there to print out the smiley on a terminal - one of
> those pesky bytestream-eaters that don't know about unicode.
What are you talking about ?
I just copied this right from my terminal (LANG=en_CA.utf8):
>>> print unichr(0x020ac)
€
>>>
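A likely explanation for why this works without an explicit encode: when
stdout is a terminal, Python 2's print encodes unicode objects using
sys.stdout.encoding, which is taken from the locale. The sketch below only
makes that implicit step visible; the variable names are illustrative.

```python
import sys

# Same character as unichr(0x20ac) above.
euro = u"\u20ac"

# The encoding print picks up from the terminal; this can be None when
# stdout is a pipe or a file instead of a terminal.
terminal_encoding = getattr(sys.stdout, "encoding", None)

# The three bytes a UTF-8 terminal actually receives for the euro sign.
euro_bytes = euro.encode("utf-8")
```

File objects and dbhash do not perform this step on their own, which is why
the explicit encode/decode is needed there.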
Now, I have read that Python 2.6 has better support for Unicode. Does it allow
writing to files, bsddb, etc. without having to encode/decode every time ?
This is a big enough issue for me right now that I will manually install 2.6
if it does.
Thanks.
--
Yves.
http://www.sollers.ca/blog/2008/no_sound_PulseAudio
http://www.sollers.ca/blog/2008/PulseAudio_pas_de_son/.fr