[Python-3000] Immutable bytes type and dbm modules

Guido van Rossum guido at python.org
Tue Aug 7 04:06:34 CEST 2007


On 8/6/07, Mike Klaas <mike.klaas at gmail.com> wrote:
> On 6-Aug-07, at 5:39 PM, Guido van Rossum wrote:
> > Given that the *dbm types strive for emulating dicts, I think it makes
> > sense to use strings for the keys, and bytes for the values; this
> > makes them more plug-compatible with real dicts. (We should ideally
> > also change the keys() method etc. to return views.) This of course
> > requires that we know the encoding used for the keys. Perhaps it would
> > be acceptable to pick a conservative default encoding (e.g. ASCII) and
> > add an encoding argument to the open() method.
> >
> > Perhaps this will work? It seems better than using str8 or bytes
> > for the keys.
>
> There are some scenarios that might be difficult under such a regime.
>
> The Berkeley API provides a means for efficiently mapping a bytestring
> to another bytestring.  Often, the data is not text, and the
> performance of the database is sensitive to the means of serialization.
>
> For instance, it is quite common to use integers as keys.  If you are
> inserting keys in order, it is about a hundred times faster to encode
> the ints in big-endian byte order than little-endian:

I'm assuming that this speed difference says something about the
implementation of the underlying dbm package. Which package did you
use to measure this?
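
(For reference, a small sketch of one plausible reason byte order could
matter, assuming the store keeps its keys sorted byte-wise, as a btree
would. It does not reproduce the hundred-fold figure, which depends on
the package; it only shows that big-endian packing makes increasing
ints arrive as increasing byte strings, while little-endian does not.
The sample keys are arbitrary.)

```python
import struct

# Arbitrary increasing integer keys, chosen to cross byte boundaries.
keys = [1, 2, 255, 256, 65536]

# Pack each key as an unsigned 64-bit integer in both byte orders.
big = [struct.pack('>Q', k) for k in keys]
little = [struct.pack('<Q', k) for k in keys]

# Byte-wise sorting leaves the big-endian keys in insertion order,
# so in-order inserts always land at the end of a sorted structure.
big_sorted = sorted(big) == big

# The little-endian keys are scattered by a byte-wise sort.
little_sorted = sorted(little) == little

print(big_sorted, little_sorted)
```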

> import struct
>
> class MyIntDB(object):
>     # self.db, serializer and unserializer are assumed to exist elsewhere.
>     def __setitem__(self, key, item):
>         self.db.put(struct.pack('>Q', key), serializer(item))
>     def __getitem__(self, key):
>         return unserializer(self.db.get(struct.pack('>Q', key)))
>
> How do you envision these types of tasks being accomplished with
> unicode keys?  It is conceivable that one could write a custom
> unicode encoding that accomplishes this, convert the key to unicode,
> and pass the custom encoding name to the constructor.

Well, the *easiest* (I don't know about simplest) way to use ints as
keys is of course to use the decimal representation. You'd use
str(key) instead of struct.pack(). This would of course not maintain
key order -- is that important? If you need to be compatible with
struct.pack(), and we were to choose Unicode strings for the keys in
the API, then you might have to do something like
struct.pack(...).decode("latin-1") and specify latin-1 as the
database's key encoding.
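
A rough sketch of the latin-1 trick, to make it concrete. The helper
names are made up for illustration and are not part of any dbm API.
Going from packed bytes to a string is the decode direction; latin-1
maps each byte 0-255 to the code point with the same number, so the
round trip is lossless and string comparison follows byte comparison.

```python
import struct

def int_to_str_key(key):
    """Hypothetical helper: pack an int big-endian, then view it as text."""
    packed = struct.pack('>Q', key)      # 8 bytes, most significant first
    return packed.decode('latin-1')      # one character per byte

def str_key_to_int(text):
    """Hypothetical helper: reverse of int_to_str_key."""
    return struct.unpack('>Q', text.encode('latin-1'))[0]

# Decimal strings do not sort in numeric order...
decimal_order = sorted(str(k) for k in [9, 10, 100])

# ...but the latin-1 view of big-endian packed ints does.
packed_order = [str_key_to_int(s)
                for s in sorted(int_to_str_key(k) for k in [100, 9, 10])]

print(decimal_order, packed_order)
```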

Of course this may not be compatible with an external constraint (e.g.
another application that already has a key format) but in that case
you may have to use arbitrary tricks anyway (the latin-1 encoding
might still be helpful).

However, I give you that a pure bytes API would be more convenient at times.

How about we define two APIs, one using raw bytes and one using strings +
a given encoding?

Or perhaps a special value of the encoding argument passed to
*dbm.open() (maybe None, maybe the default, maybe "raw" or "bytes"?)
to specify that the key values are to be bytes?
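
To make the second idea concrete, here is a minimal sketch, not a
proposal for the actual implementation. The class name and the choice
of None as the "raw" marker are assumptions picked for illustration,
and a plain dict stands in for the underlying database.

```python
class EncodedKeyDB:
    """Hypothetical wrapper: str keys with an encoding, or bytes keys if
    encoding is None. Values are always stored as given (bytes)."""

    def __init__(self, store, encoding="ascii"):
        self._store = store          # any mapping from bytes to bytes
        self._encoding = encoding    # None selects the raw bytes mode

    def _to_bytes(self, key):
        if self._encoding is None:
            if not isinstance(key, bytes):
                raise TypeError("raw mode expects bytes keys")
            return key
        if not isinstance(key, str):
            raise TypeError("text mode expects str keys")
        return key.encode(self._encoding)

    def __setitem__(self, key, value):
        self._store[self._to_bytes(key)] = value

    def __getitem__(self, key):
        return self._store[self._to_bytes(key)]

    def keys(self):
        # Return keys in the same type the caller is expected to pass in.
        if self._encoding is None:
            return list(self._store)
        return [k.decode(self._encoding) for k in self._store]


text_db = EncodedKeyDB({})                 # conservative ASCII default
text_db["spam"] = b"eggs"

raw_db = EncodedKeyDB({}, encoding=None)   # keys are bytes, untouched
raw_db[b"\x00\x01"] = b"payload"
```

With this shape the struct.pack() use case needs no encoding tricks in
raw mode, while the default stays plug-compatible with dicts keyed by
strings.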

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

