[Python-3000] Bytes and unicode conversion in C extensions

Jesus Cea jcea at jcea.es
Tue Jul 29 16:32:30 CEST 2008


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Working on the 3.0 version of bsddb, I have the following issue.

Until 3.0, keys and values were strings. For bsddb, they are opaque, and
stored unchanged.

In 3.0 the string type is replaced by unicode, and a new "bytes" type is
added. So code like "db.put('key','value')" needs to be changed to
"db.put(bytes('key', 'utf-8'), bytes('value', 'utf-8'))", or something
similar.

This is ugly and produces code that is incompatible with previous
Python releases.
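
For illustration, a minimal Python 3 sketch of the explicit conversion
described above. The variable names are made up for the example, and the
remark about bytes literals is a side observation rather than part of the
bsddb API.

```python
# Python 3: str (unicode) and bytes are distinct types, so the
# conversion has to be spelled out explicitly.
key = "key"
key_bytes = bytes(key, "utf-8")       # the form quoted in the text
same_bytes = key.encode("utf-8")      # equivalent, slightly shorter

# A bytes literal avoids the conversion call entirely; this literal
# syntax is also accepted by Python 2.6 and later 2.x releases.
literal = b"key"
```

All three spellings give the same bytes object, but a str never compares
equal to it.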

I was wondering what to do. The obvious path would be to put a proxy
object between application code and bsddb, doing the bytes<->unicode
translation on the fly. This could be problematic when dealing with
legacy data, since it might not be a validly encoded byte string. Data
misrepresentation would be dangerous and could go undetected for a long
time, slowly corrupting the database.
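
A small sketch of how that could go wrong, assuming a hypothetical legacy
value that was written as Latin-1 and a proxy that assumes UTF-8. Neither
assumption comes from bsddb itself.

```python
# Hypothetical legacy value: "café" stored as Latin-1 by an old application.
legacy_value = b"caf\xe9"

# A strict UTF-8 proxy cannot decode it at all.
try:
    legacy_value.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# A lenient proxy "works", but writing the decoded text back stores
# different bytes than the ones originally read: silent corruption.
text = legacy_value.decode("utf-8", errors="replace")
round_tripped = text.encode("utf-8")
```

Here the round-tripped value no longer matches what was in the database,
and nothing raised an error along the way.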

Moreover, the data is application specific, so automatic conversion can
introduce incompatibilities and bugs.

Another approach would be to add a new bsddb method to specify the
default encoding to use to convert unicode->bytes, and to do the
conversion internally when getting unicode data as a parameter. The
issue here is that "u'hi' != b'hi'", so the translation must be done
both when storing and when retrieving data.
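
A rough sketch of this second approach, with a plain dict standing in for
the real database. The class name, the "encoding" parameter and the
"decode" flag are all hypothetical and not part of bsddb.

```python
class EncodingStore:
    """Hypothetical wrapper with a configurable default encoding."""

    def __init__(self, encoding="utf-8"):
        self._encoding = encoding
        self._db = {}  # stand-in for the real bsddb database

    def _to_bytes(self, obj):
        # Encode unicode input; pass bytes through untouched.
        return obj.encode(self._encoding) if isinstance(obj, str) else obj

    def put(self, key, value):
        self._db[self._to_bytes(key)] = self._to_bytes(value)

    def get(self, key, decode=False):
        raw = self._db[self._to_bytes(key)]
        # Without decoding here, callers get bytes back and a comparison
        # against the str they stored fails.
        return raw.decode(self._encoding) if decode else raw


store = EncodingStore()
store.put("hi", "there")
raw = store.get("hi")                   # bytes
decoded = store.get("hi", decode=True)  # str
```

Only the decoded result compares equal to the original str, which is why
the translation is needed in both directions.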

These problems arise because string != bytes now. In fact the
approach in 3.0 is the right one, and any attempt to hide this difference
with proxy objects or automatic conversion is going to bite us someday.

So I'm seriously thinking of accepting *ONLY* "bytes" in the bsddb API
(when working under Python 3.0), and doing the proxy thing *ONLY* in the
testsuite, to be able to reuse it.
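
One way this could look, again with a dict standing in for the database.
Both class names are hypothetical and only sketch the idea of a strict
bytes-only API plus a converting wrapper that lives in the testsuite.

```python
class BytesOnlyStore:
    """Hypothetical strict API: rejects anything that is not bytes."""

    def __init__(self):
        self._db = {}  # stand-in for the real database

    @staticmethod
    def _check(name, obj):
        if not isinstance(obj, bytes):
            raise TypeError("%s must be bytes, not %s"
                            % (name, type(obj).__name__))

    def put(self, key, value):
        self._check("key", key)
        self._check("value", value)
        self._db[key] = value

    def get(self, key):
        self._check("key", key)
        return self._db[key]


class StrProxyForTests:
    """Hypothetical testsuite-only wrapper, so str-based tests can be reused."""

    def __init__(self, store, encoding="utf-8"):
        self._store = store
        self._encoding = encoding

    def put(self, key, value):
        self._store.put(key.encode(self._encoding),
                        value.encode(self._encoding))

    def get(self, key):
        return self._store.get(key.encode(self._encoding)).decode(self._encoding)


store = BytesOnlyStore()
try:
    store.put("key", "value")   # str is refused by the real API
    rejected = False
except TypeError:
    rejected = True

proxy = StrProxyForTests(store)
proxy.put("key", "value")       # tests keep using str
```

Application code sees the strict behaviour, while the conversion stays
confined to the tests.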

What do you think?

PS: Since most of the time keys/values are 7-bit, a direct "ascii"
encoding would be fine... until we are required to store an 8-bit value.
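
A short illustration of that limit, using chr(210) as the example 8-bit
character:

```python
# 7-bit text encodes without trouble.
seven_bit = "key".encode("ascii")

# A character above 127 cannot be represented in ASCII.
try:
    chr(210).encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```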

PPS: In dbm (gdbm) I'm seeing automatic unicode->bytes conversion, but NO
bytes->unicode. See the problem when storing non-ASCII data:

"""
Python 3.0b2 (r30b2:65080, Jul 19 2008, 03:39:09)
[GCC 4.2.3] on sunos5
Type "help", "copyright", "credits" or "license" for more information.
>>> import dbm
>>> a=dbm.open("z","c")
>>> a
<_gdbm.gdbm object at 0x82fb560>
>>> a["a"]="b"
>>> a["b"]="c"
>>> a.sync()
>>> a.close()
>>> a=dbm.open("z","w")
>>> a.keys()
[b'b', b'a']
>>> a["c"]=chr(210)
>>> a["c"]
b'\xc3\x92'
>>> a["c"]==chr(210)
False
"""

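The bytes returned in that session match the UTF-8 encoding of chr(210),
which suggests (an inference from the output, not something checked in the
dbm source) that the value was encoded as UTF-8 on the way in and handed
back undecoded. A sketch that reproduces the comparison without dbm:

```python
original = chr(210)                  # the str that was stored
stored = original.encode("utf-8")    # what the session shows coming back

# bytes never compare equal to str in Python 3, hence the False above.
same_as_str = (stored == original)

# Decoding explicitly recovers the original value.
recovered = stored.decode("utf-8")
```
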
- --
Jesus Cea Avion                         _/_/      _/_/_/        _/_/_/
jcea at jcea.es - http://www.jcea.es/     _/_/    _/_/  _/_/    _/_/  _/_/
jabber / xmpp:jcea at jabber.org         _/_/    _/_/          _/_/_/_/_/
.                              _/_/  _/_/    _/_/          _/_/  _/_/
"Things are not so easy"      _/_/  _/_/    _/_/  _/_/    _/_/  _/_/
"My name is Dump, Core Dump"   _/_/_/        _/_/_/      _/_/  _/_/
"El amor es poner tu felicidad en la felicidad de otro" - Leibniz
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.8 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iQCVAwUBSI8p+Jlgi5GaxT1NAQKTggP/R+swZ429fecTyNahJj6dw9nJfMgg7YcE
NbkueWM4zhUhhKa03sCT9ACiFHaXhmPyF2Q75wrGeI+WZxtafbYj+sjhjyCXpikn
cptAnWxXMEchqshwGafXoUi9eyVLMxihvulDf9rXJIqWLR8oRqoRaiJJPWf39ZCk
VhF+L1uKWiw=
=A3en
-----END PGP SIGNATURE-----