Unicode (UTF8) in dbhash on 2.5

Tue Oct 21 10:51:07 EDT 2008

On 20 Oct, 16:04, "Diez B. Roggisch" <de... at nospam.web.de> wrote:
>
> What is the difference? The dbhash module can only work with *bytestrings*.
> Bytestrings are just that - a sequence of 8-bit-values.

Sounds like a prime candidate for some improvement work. Patches,
anyone? ;-)

> u""-literals are *unicode objects*. These are an abstract sequence of
> characters, smileys or others.

It's important to point this out, though. However...

> Now the real world of databases, network connections and hard drives doesn't
> know about unicode. It only knows bytes. So before you can write to them,
> you need to "encode" the unicode data to a byte-stream representation.

Although this is true, the inquirer probably expected the interfaces
to these things to handle such details. In the case of filesystems,
this can be awkward on, say, Linux or UNIX for various historical
reasons: filenames there are plain byte sequences with no encoding
recorded alongside them. With regard to database systems, some messy
configuration may need to be done for each database, but it would be
nice to see the interface modules doing a bit more of the work.
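
To illustrate the kind of work an interface module could take on, here
is a minimal sketch. It makes several assumptions: it is written in
Python 3 syntax (in 2.5 the text type would be unicode rather than
str), a plain dict stands in for an opened dbhash database, and the
UnicodeStore class is a made-up name for illustration, not something
the dbhash module provides.

```python
# Hypothetical wrapper: not part of dbhash or any standard module.
class UnicodeStore:
    """Wrap a bytes-only mapping so callers can use text keys and values."""

    def __init__(self, db, encoding="utf-8"):
        self.db = db                # any mapping that only accepts bytes
        self.encoding = encoding    # assumed encoding for stored data

    def __setitem__(self, key, value):
        # Encode text to bytes on the way in.
        self.db[key.encode(self.encoding)] = value.encode(self.encoding)

    def __getitem__(self, key):
        # Decode bytes back to text on the way out.
        return self.db[key.encode(self.encoding)].decode(self.encoding)


raw = {}                            # stand-in for an opened database
store = UnicodeStore(raw)
store["smiley"] = "\u263a"          # caller never touches bytes directly
```

The underlying store only ever sees bytes, while the caller reads and
writes text. The encoding still has to be chosen once, somewhere, which
is the configuration step that cannot be avoided entirely.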

[...]

> print smiley.encode('utf-8')
>
> The last encode is there to print out the smiley on a terminal - one of
> those pesky bytestream-eaters that don't know about unicode.

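With respect to output encodings, you don't need to perform an encode
operation if the locale is compatible, as discussed recently in
another thread: when writing to a terminal, print encodes unicode
objects using the output stream's encoding. Encoding manually to UTF-8
may avoid a UnicodeEncodeError, but it doesn't guarantee that the
output will make any sense, since a terminal expecting some other
encoding will display the wrong characters.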
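
A small sketch of that last point, again in Python 3 syntax and with
made-up names (encode_for_output is a hypothetical helper, and
ISO-8859-1 is just an assumed example of a terminal encoding that
isn't UTF-8):

```python
import locale


def encode_for_output(text, encoding=None):
    """Encode text for a byte-oriented output, defaulting to the locale."""
    if encoding is None:
        # Ask the locale which encoding the environment prefers.
        encoding = locale.getpreferredencoding(False)
    # "replace" avoids an exception, at the cost of substituting "?"
    # for characters the target encoding cannot represent.
    return text.encode(encoding, "replace")


smiley = "\u263a"

# Forcing UTF-8 never raises an error...
utf8_bytes = encode_for_output(smiley, "utf-8")

# ...but a terminal that expects ISO-8859-1 reads those same bytes
# as three unrelated characters instead of one smiley.
misread = utf8_bytes.decode("iso-8859-1")

# An encoding that lacks the character yields a replacement instead.
ascii_bytes = encode_for_output(smiley, "ascii")
```

Neither result is an error, and neither is the smiley the user wanted
to see, which is why matching the actual output encoding matters more
than merely avoiding the exception.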

Paul