Unicode (UTF8) in dbhash on 2.5

Tue Oct 21 12:13:57 EDT 2008

Yves Dorfsman wrote:

> Diez B. Roggisch <deets at nospam.web.de> wrote:
> 
>> Please write the following program and meditate at least 30min in front
>> of it:
> 
>> while True:
>>    print "utf-8 is not unicode"
> 
> I hope you will have a better day today than yesterday !

I had a good day yesterday. And today. Thanks for asking.

Part of feeling good stemmed from the fact that I didn't "try to put
UTF-8 characters into a Berkeley DB" and claim it fails, when what I
*really* tried was putting unicode strings into it. Unicode and UTF-8 are
two different things, like it or not.
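
To make the distinction concrete, here is a small sketch (the variable
names are mine, and it is written so that it also runs on Python 3): a
unicode string is a sequence of code points, and UTF-8 is just one of
several ways to turn it into bytes.

```python
# A unicode string holds code points; it has no byte representation yet.
text = u"\u20ac"  # EURO SIGN, a single code point

# Encoding picks a concrete byte representation. UTF-8 is only one option.
as_utf8 = text.encode("utf-8")
as_utf16 = text.encode("utf-16-le")

print(len(text))      # number of code points
print(len(as_utf8))   # number of bytes in the UTF-8 encoding
print(len(as_utf16))  # number of bytes in the UTF-16 (little endian) encoding

# Decoding with the matching codec gives back the same unicode string.
assert as_utf8.decode("utf-8") == text
assert as_utf16.decode("utf-16-le") == text
```

The same text yields different byte strings depending on the codec,
which is why a byte-only store cannot be handed "unicode" as such.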

> Now, I did this:
> 
> while True:
>   print "¡ Python knows about encoding, but only sometimes !"
> 
> My terminal is set up in UTF-8, and... It did print correctly. I expected
> that by setting coding: utf-8, all the I/O functions would do the encoding
> for me, because if they don't then I, and everybody who writes a script,
> will need to subclass every single I/O class (ok, except for print !).

You seriously want all IO to be encoded depending on your terminal setting?
What about the database that works in latin1? The CSV file you write to
your vendor, expecting cp1251? And what happens if your process is not
*started* from a terminal? Or a different user starts the script, and all
of a sudden the exported data is messed up?
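
As a rough sketch of what "one encoding per destination" looks like in
practice (the helper names and file names below are made up for the
example, and the temporary directory is only there to keep it
self-contained), the same unicode string is written out twice with two
explicitly chosen encodings:

```python
import os
import tempfile


def write_encoded(path, text, encoding):
    """Encode explicitly and write the resulting bytes in binary mode."""
    f = open(path, "wb")
    try:
        f.write(text.encode(encoding))
    finally:
        f.close()


def read_bytes(path):
    """Return the raw bytes of a file, without any decoding."""
    f = open(path, "rb")
    try:
        return f.read()
    finally:
        f.close()


text = u"caf\u00e9"  # the same unicode string for every consumer

tmpdir = tempfile.mkdtemp()
latin1_path = os.path.join(tmpdir, "for_database.txt")  # hypothetical names
utf8_path = os.path.join(tmpdir, "for_terminal.txt")

# The encoding is chosen per destination, not taken from the terminal.
write_encoded(latin1_path, text, "latin-1")
write_encoded(utf8_path, text, "utf-8")

latin1_bytes = read_bytes(latin1_path)
utf8_bytes = read_bytes(utf8_path)
```

The two files end up with different byte content for the same text, so
no single terminal-derived default could be right for both consumers.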

> 
>> Bytestrings are just that - a sequence of 8-bit-values.
> 
> It used to be that ints were 8 bits; we did not stay stuck in time, and
> ints are now typically longer. I expect a high level language to let me
> set the encoding once, and do simple I/O operations... without having to
> encode/decode.

Sorry to say so, but you must face the sad truth: IO ops *need* an explicit
encoding applied to them, otherwise errors will occur. Ask the Java guys
why they needed to grow encoding parameters on all their toBytes/fromBytes
functions in the IO layer.

There is nothing that can be done about this. Which is not to say that
Python couldn't be enhanced at some places wrt unicode-handling, see below.

> Sure if I write assembly, I'll make sure I get my bits, bytes, multi-bytes
> chars right, but that's why I use a high level language. Here's an example
> of an implementation that lets you write Unicode directly to a dbhash, I
> hoped there would be something similar in python:
>
> http://www.oracle.com/technology/documentation/berkeley-db/db/gsg/JAVA/DBEntry.html

The inner workings of the DB are still only byte-aware. I agree that you
could enhance the Berkeley DB interface in Python so that it takes a
default-encoding parameter and then transcodes all values from and to it.

OTOH you can help yourself by writing a simple wrapper that does that for
you (untested):

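class UnicodeWrapper(object):
    """Transcode unicode keys and values for a byte-only db object."""

    def __init__(self, bdb, encoding="utf-8"):
        self.bdb = bdb            # the underlying db, e.g. from dbhash.open()
        self.encoding = encoding  # encoding used for everything stored

    def __setitem__(self, key, value):
        # Byte strings pass through untouched; unicode gets encoded.
        if isinstance(key, unicode):
            key = key.encode(self.encoding)
        if isinstance(value, unicode):
            value = value.encode(self.encoding)
        self.bdb[key] = value

    def __getitem__(self, key):
        if isinstance(key, unicode):
            key = key.encode(self.encoding)
        # Decode on the way out so the caller gets unicode back.
        return self.bdb[key].decode(self.encoding)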
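
A quick way to try the idea without a real database is to use a plain
dict as a stand-in for the db object. The sketch below is a
self-contained variant, not a tested replacement: the class name
`TranscodingWrapper`, the `text_type` helper and the dict backend are
assumptions made for the example, and it decodes values on read so that
it also runs on Python 3.

```python
text_type = type(u"")  # `unicode` on Python 2, `str` on Python 3


class TranscodingWrapper(object):
    """Hypothetical variant: encode on write, decode values on read."""

    def __init__(self, bdb, encoding="utf-8"):
        self.bdb = bdb
        self.encoding = encoding

    def _to_bytes(self, obj):
        # Byte strings pass through untouched; text gets encoded.
        if isinstance(obj, text_type):
            return obj.encode(self.encoding)
        return obj

    def __setitem__(self, key, value):
        self.bdb[self._to_bytes(key)] = self._to_bytes(value)

    def __getitem__(self, key):
        return self.bdb[self._to_bytes(key)].decode(self.encoding)


backend = {}  # stand-in for the object returned by dbhash.open(...)
db = TranscodingWrapper(backend)
db[u"smiley"] = u"\u263a"

stored = backend[b"smiley"]  # what the byte-only backend actually holds
restored = db[u"smiley"]     # unicode again after decoding
```

The backend only ever sees byte strings, while the calling code deals
with unicode on both sides.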

> 
>> db = dbhash.open('dbfile.db')
>> smiley = db[u'smiley'.encode('utf-8')].decode('utf-8')
> 
>> print smiley.encode('utf-8')
> 
> 
>> The last encode is there to print out the smiley on a terminal - one of
>> those pesky bytestream-eaters that don't know about unicode.
> 
> What are you talking about ?
> I just copied this right from my terminal (LANG=en_CA.utf8):
> 
> >>> print unichr(0x020ac)
> €
> >>>

You are right, that works of course - when running inside a terminal. It
will fail though if the encoding can't be guessed, e.g. because the process
is not spawned from a terminal.

Nothing to do with the terminal though.
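
One way to avoid depending on that guess is to encode yourself and fall
back to a fixed encoding when the stream does not report one. This is
only a sketch: `encode_for_stream` and `PipeLike` are names invented for
the example, and the UTF-8 fallback is an assumption you would adapt to
whatever consumes the output.

```python
def encode_for_stream(text, stream, fallback="utf-8"):
    """Encode text for a byte-oriented stream, guessing as little as possible."""
    # stream.encoding is None (or missing) when it could not be determined.
    encoding = getattr(stream, "encoding", None) or fallback
    # "replace" substitutes "?" for characters the codec cannot represent.
    return text.encode(encoding, "replace")


class PipeLike(object):
    """Stand-in for stdout when the process is not attached to a terminal."""
    encoding = None


euro = u"\u20ac"
via_pipe = encode_for_stream(euro, PipeLike())
via_latin1 = encode_for_stream(euro, PipeLike(), fallback="latin-1")
```

With a real stream you would pass `sys.stdout` instead of the stand-in;
what comes out then depends on the environment, which is exactly the
point.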

Diez