[Python-Dev] Array Enhancements

Fri, 05 Apr 2002 16:21:56 -0500

> *** Adding a new typecode 't' to implement a bit array.  Implementation
> would be an array of bytes, but it would be 1 bit per element.  't' is
> for 'truth value' since 'b' for 'boolean' was already taken by 'b' for
> 'byte'.  Accepted values would be 0, 1, True, False.  Looking at the
> arraymodule.c, this seems like the most work out of any of these
> suggestions because all of the other types are easily addressable. 
> Slicing operations are going to be tricky to do quickly and correctly. 
> It was already acknowledged by GvR that this would be desirable.

And I'll repeat: I doubt any of the Python developers has time to
implement this, but we'll gladly take patches.  (Of course, if it
takes a lot of work to punch the patches into shape, we'll gladly sit
on them forever -- so it's up to you to make the patches work. :-)
Be sure to use the latest CVS.
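
To make the packing concrete, here is a rough pure-Python sketch of
the 1-bit-per-element storage, written for a current Python 3.  The
class name BitArray and its layout (least significant bit first) are
assumptions made for illustration, not what a patch to arraymodule.c
would have to do, and slicing (the hard part) is deliberately left out:

```python
class BitArray:
    """Fixed-length sequence of truth values, packed 8 per byte."""

    def __init__(self, length):
        self._length = length
        # Round up so a partial final byte still has storage.
        self._data = bytearray((length + 7) // 8)

    def __len__(self):
        return self._length

    def _locate(self, index):
        # Map a bit index to (byte offset, bit mask within that byte).
        if index < 0:
            index += self._length
        if not 0 <= index < self._length:
            raise IndexError("bit index out of range")
        return index >> 3, 1 << (index & 7)

    def __getitem__(self, index):
        byte, mask = self._locate(index)
        return bool(self._data[byte] & mask)

    def __setitem__(self, index, value):
        # Accept only 0, 1, True, False, as the proposal suggests.
        if value not in (0, 1):
            raise ValueError("bit must be 0, 1, True or False")
        byte, mask = self._locate(index)
        if value:
            self._data[byte] |= mask
        else:
            self._data[byte] &= ~mask & 0xFF


bits = BitArray(10)
bits[3] = True
bits[9] = 1
```

Ten elements occupy two bytes here; single-element access is easy,
while a slice that starts in the middle of a byte needs shifting
across byte boundaries.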

> *** Adding pickle support.  This is a no brainer I think.  Currently we
> have to call tostring() if we want to serialize the data.

The array module was designed to allow *efficient* reading and writing
of array data from/to files, using the fromfile() and tofile()
methods.  It will be hard to beat these.  But maybe you can use them
as part of the pickling.

It would be nice if a pickled array could be unpickled by previous
Python versions, but that may be too slow.  E.g. here's a way to pickle
arrays today:

    import pickle
    import copy_reg
    import array

    def reduce_array(a):
        return array.array, (a.typecode, a.tolist())

    copy_reg.pickle(type(array.array('i')), reduce_array, array.array)

    a = array.array('i', range(10))
    print a
    s = pickle.dumps(a)
    b = pickle.loads(s)
    print b
    print b == a

but that's very slow because it first converts the array to a list,
which is then pickled.  For large arrays this takes too much space to
consider.
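
The list conversion can be avoided by pickling the raw buffer
instead.  A minimal sketch, assuming a current Python 3 (where the
module is spelled copyreg and the methods are tobytes()/frombytes());
rebuild_array and reduce_array_raw are hypothetical helper names:

```python
import array
import copyreg
import pickle


def rebuild_array(typecode, raw):
    # Hypothetical helper: recreate the array from its raw machine bytes.
    result = array.array(typecode)
    result.frombytes(raw)
    return result


def reduce_array_raw(a):
    # tobytes() copies the buffer once; no per-item Python objects are built.
    return rebuild_array, (a.typecode, a.tobytes())


copyreg.pickle(array.array, reduce_array_raw)

a = array.array('i', range(10))
b = pickle.loads(pickle.dumps(a))
```

The price is that the bytes are in the writing machine's native
layout, so this version is only safe between platforms that agree on
item size and byte order.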

You'll have to consider: is it important to be able to read pickled
arrays on previous Python releases, or is that not a requirement?  If
it's not, you should probably add a new pickle code for pickled
arrays, and do an implementation that writes:

- the pickle code (1 char)
- the array typecode (1 char)
- the array length (is 4 bytes enough?)
- the array itemsize (some typecodes aren't the same size across platforms)
- the array data (length * itemsize bytes)

(Please don't bother making a different version for non-binary pickles.)

But you'll also have to consider byte ordering and other
cross-platform issues -- pickles are supposed to be 100% cross
platform portable!
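
The layout above can be tried out with the struct module before
touching the pickle code.  This is only a sketch for a current
Python 3: the pickle code itself is omitted, the byte-order flag is an
addition of this sketch (one possible answer to the portability
concern), the length is stored in 8 bytes to sidestep the "is 4
enough?" question, and dump_array/load_array are hypothetical names:

```python
import array
import struct
import sys

# Hypothetical header: typecode (1 char), byte-order flag (1 char),
# itemsize (1 byte), length (8 bytes).  The header itself is big-endian.
HEADER = struct.Struct(">ccBQ")


def native_order():
    return b"<" if sys.byteorder == "little" else b">"


def dump_array(a):
    header = HEADER.pack(a.typecode.encode("ascii"), native_order(),
                         a.itemsize, len(a))
    return header + a.tobytes()


def load_array(blob):
    typecode, order, itemsize, length = HEADER.unpack_from(blob)
    result = array.array(typecode.decode("ascii"))
    if result.itemsize != itemsize:
        # A real implementation would have to convert item by item here.
        raise ValueError("item size differs on this platform")
    result.frombytes(blob[HEADER.size:HEADER.size + length * itemsize])
    if order != native_order():
        result.byteswap()  # data was written on a machine of the other order
    return result


blob = dump_array(array.array('d', [1.5, -2.0, 3.25]))
restored = load_array(blob)
```

Recording the itemsize lets the reader detect a mismatch instead of
silently misreading the data; what to do about the mismatch is the
open design question.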

> *** Changing the 'l' and 'L' typecodes to use LONG_LONG.  There isn't a
> consistent way to get an 8 byte integer out of the array module.  About
> half of our machines at work are Alphas where long is 8 bytes, and the
> rest are Sparcs and x86's where long is 4 bytes.

I recommend using a new typecode instead -- changing an existing
typecode will break existing code.

> *** I'd really like it if the array module gave a "hard commitment" to
> the sizes of the elements instead of just saying "at least n bytes". 
> None of the other array modules do this either.  I know Python has been
> ported to a bazillion platforms, but what are the exceptions to 'char'
> being 8 bits, 'short' being 16 bits, 'int' being 32 bits, 'long long'
> or __int64 being 64 bits, 'float' being 32 bits, and 'double' being 64
> bits?  I know that an int is 16 bits on Win16, but does Python live
> there any more?  Even so, there is a 32 bit int type on Win16 as well.
> 
> I guess changing the docs to give a "hard commitment" to this isn't
> such a big deal to me personally, because the above are true for every
> platform I think I'll need this for (alpha, x86, sparc, mips).

It's hard to make such a commitment without complicating the code,
since C doesn't make hard commitments.  What do we do on a platform
where there simply isn't a 16-bit integer type?
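
Short of a hard commitment, code that needs a specific width can ask
at run time via the itemsize attribute.  A minimal sketch, assuming a
current Python 3 in which a 'q' typecode is available; typecode_for is
a hypothetical helper, not part of the array module:

```python
import array


def typecode_for(nbytes, candidates="bhilq"):
    """Return the first signed integer typecode that is nbytes wide here."""
    for code in candidates:
        if array.array(code).itemsize == nbytes:
            return code
    raise ValueError("no %d-byte integer typecode on this platform" % nbytes)


code = typecode_for(4)
samples = array.array(code, [1, 2, 3])
```

On a platform with no type of the requested width the helper raises
ValueError, which is exactly the case a hard commitment would have to
deal with inside the module.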

Anyway, why do you need this?  If it's in the context of pickling,
maybe we can define pickling of arrays as only pickling the minimum
guaranteed data width.  (But then I'd like to get a warning when I'm
pickling an array that contains out-of-bounds values, on platforms
where the internal item width doesn't match the external width.)
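
Such a warning could look roughly like this.  The sketch assumes a
current Python 3, uses the minimum sizes given in the array module
documentation, and covers only the integer typecodes listed in the
table; check_portable is a hypothetical name:

```python
import array
import warnings

# Minimum item sizes in bytes, as documented for the array module.
MIN_BYTES = {'b': 1, 'B': 1, 'h': 2, 'H': 2,
             'i': 2, 'I': 2, 'l': 4, 'L': 4}


def check_portable(a):
    """Warn if any item does not fit the minimum guaranteed width."""
    bits = 8 * MIN_BYTES[a.typecode]
    if a.typecode.isupper():          # unsigned typecodes are upper case
        low, high = 0, (1 << bits) - 1
    else:
        low, high = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    bad = [x for x in a if not low <= x <= high]
    if bad:
        warnings.warn("%d item(s) exceed the guaranteed %d-bit range"
                      % (len(bad), bits))
    return not bad
```

For example, an 'i' array holding 100000 passes on a machine with
32-bit ints but would be flagged, because only 16 bits are guaranteed.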

> *** In the absence of fixing the 'l' and 'L' types, adding new
> typecodes ('n' and 'N' perhaps) that do use LONG_LONG.  This seems more
> backwards compatible, but all it really does is make the 'l' and 'L'
> typecodes duplicate either 'i' or 'n' depending on the platform
> specific sizeof(long).  In other words, if an 'n' typecode was added,
> who would want to use the 'l' one?  I suppose someone who knew they
> wanted a platform specific long.

For example, when they have companion C code that interprets the array
data as an array of C longs.

> *** I really need complex types. And more than the functionality
> provided by Numeric/Numarray, I need complex integer types.  We
> frequently read hardware that gives us complex 16 or 32 bit integers,
> and there are times when we would use 32 or 64 bit fixed point complex
> numbers.  Sometimes we scale our "soft decision" data so that it would
> fit fine in a complex 8 bit integer.  This could be easily added in one
> of two ways: either adding a 'z' prefix to the existing typecodes, or
> by creating new typecodes like such:
> 
>    'u' - complex bytes (8 bit)

Ehm, 'u' is already taken (Unicode).

>    'v' - complex shorts (16 bit)
>    'w' - complex ints (32 bit)
>    'x' - complex LONG_LONGs (64 bit)
>    'y' - complex floats (32 bits)
>    'z' - complex doubles (64 bits)
> 
> The downside to a 'z' prefix is that typecodes could now be 2
> characters 'zi', 'zb', and that would be a bigger change to the
> implementation.  It's also silly to have complex unsigned types (who
> wants complex numbers that are only in one quadrant?).

(Beats me.  But then, I don't have any use for complex numbers myself.
And why would anyone want to use complex ints? :-)

> The downside to adding 'u', 'v', 'w', 'x', 'y', 'z' is that they aren't
> very mnemonic, and the namespace for typecodes is getting pretty big.

The user-visible typecode could be a string, and the internal type
code could be something with the high bit set or whatever.  Then you'd
have localized changes.

> Also, I'm unsure how to get the elements in and out of the typecode for
> 'x' above (a 2-tuple of PyInt/PyLongs?).  Python's complex type is
> sufficient to hold the others without losing precision.

Or use a new custom class that behaves more like a complex number.
At least for input, you should also accept regular complexes.
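
As an illustration only, complex integers can be prototyped on top of
an ordinary array by interleaving real and imaginary parts.  This
sketch assumes a current Python 3; ComplexIntArray is a hypothetical
class that returns plain 2-tuples and accepts either a 2-tuple or a
regular complex with integral parts on input:

```python
import array


class ComplexIntArray:
    """Complex integers stored as interleaved (real, imag) pairs."""

    def __init__(self, typecode, length):
        self._data = array.array(typecode, [0]) * (2 * length)

    def __len__(self):
        return len(self._data) // 2

    def _offset(self, index):
        if not 0 <= index < len(self):
            raise IndexError("index out of range")
        return 2 * index

    def __getitem__(self, index):
        pos = self._offset(index)
        return self._data[pos], self._data[pos + 1]

    def __setitem__(self, index, value):
        pos = self._offset(index)
        if isinstance(value, complex):
            # Regular complexes are accepted only if nothing is lost.
            if value.real != int(value.real) or value.imag != int(value.imag):
                raise ValueError("complex value has non-integral parts")
            value = int(value.real), int(value.imag)
        real, imag = value
        self._data[pos] = real
        self._data[pos + 1] = imag


z = ComplexIntArray('h', 4)
z[0] = (3, -4)
z[1] = 2 + 5j
```

Returning tuples keeps 64-bit components exact, which a Python complex
(a pair of doubles) cannot do.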

> *** The ability to construct an array object from an existing C
> pointer.  We get our memory in all kinds of ways (valloc for page
> aligned DMA transfers, shmem etc...), and it would be nice not to copy
> in and copy out in some cases.

But then you get into ownership issues.  Who owns that memory?  Who
can free it?  What if someone calls a method on the array that
requires the memory to be resized?

But it's a useful thing to be able to do, I agree, and it shouldn't be
too hard to add a flag that says "I don't own this memory" -- which
would mean that the buffer can't be resized at all.
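
The "can't be resized" rule can be observed in a current Python 3
without any C code.  In this sketch a bytearray merely stands in for
memory that was allocated elsewhere (an assumption for illustration),
and a memoryview plays the role of the array that does not own it:

```python
import struct

owner = bytearray(4 * 8)             # stands in for externally owned memory
view = memoryview(owner).cast('d')   # typed view of 4 doubles, no copy made
view[0] = 1.5                        # writes straight into owner's bytes

try:
    owner.extend(b"\0" * 8)          # resizing while the view exists
    resized = True
except BufferError:
    resized = False
```

The view has no append() or extend() of its own, and the owner refuses
to resize while the view is alive, so neither side can invalidate the
other's pointer.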

> *** Adding an additional signature to the array constructor that
> specifies a length instead of initial values.  
> 
>    a = array.array('d', [1, 2, 3])
> 
> would work as it currently does, but
> 
>    a = array.array('d', 30000)
> 
> would create an array of 30000 doubles.  My current hack to accomplish
> this is to create a small array and use the Sequence operation * to
> create an array of the size I really want:
> 
>    a = array.array('d', [0])*300000
> 
> Besides creating a (small) temporary array, this calls memcpy 300000
> times.  Yuk.

Yuck indeed, and I've often wanted this myself.  That could be a
simple patch.

> *** In the absence of the last one, adding a new constructor:
> 
>    a = array.xarray('d', 30000)
> 
> would create an array of 30000 doubles.

Nah, just overload the constructor.

> *** If a signature for creating large arrays is accepted, an optional
> keyword parameter to specify the value of each element:
> 
>    a = array.xarray('d', 30000, val=1.0)
> 
> The default val would be None indicating not to waste time initializing
> the array.

Sure, except I'm not sure if it's worth leaving the memory
uninitialized; this is entirely unheard of in Python.  (Except when
using the C constructor and an existing buffer, of course.)
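
Until the constructor is overloaded, the proposed signature can be
approximated in Python.  A minimal sketch for a current Python 3;
new_array is a hypothetical helper, and the zero fill assumes that
all-zero bytes represent 0 (true for integers and IEEE-754 floats):

```python
import array


def new_array(typecode, length, val=None):
    """Create an array of `length` items, always initialized."""
    if val is None:
        # Zero-fill from one block of bytes instead of leaving the
        # memory uninitialized.
        itemsize = array.array(typecode).itemsize
        return array.array(typecode, bytes(length * itemsize))
    # Otherwise repeat a one-element array holding the fill value.
    return array.array(typecode, [val]) * length


a = new_array('d', 30000)
ones = new_array('d', 30000, val=1.0)
```

Here val=None means "fill with zeros" rather than "don't initialize",
in line with the point that uninitialized memory is unheard of in
Python.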

> *** Multi-dimensional arrays would be valuable too, but this might be
> treading too much in the territory of the Numarray guys.

Yeah, and it would be a major change in the array module implementation.

> (I really wish there was a completed one size fits all of my needs
> array module.)

Since arrays are all about compromises that trade flexibility for
speed and memory footprint, you can't have a one size fits all. :-)

>  I would propose the following for multi-dimensional arrays:
> 
>    a = array.array('d', 20000, 20000)
> 
> or:
> 
>    a = array.xarray('d', 20000, 20000)

I just realized that multi-dimensional __getitem__ shouldn't be a big
deal.  The question is, given the above declaration, what a[0] should
return: the same as a[0, 0] or a copy of a[0, 0:20000] or a reference
to a[0, 0:20000].
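
For concreteness, here is what one of those answers looks like on top
of a flat array.  The sketch assumes a current Python 3 and row-major
storage; Array2D is a hypothetical wrapper, and it arbitrarily picks
"a[0] returns a copy of the row":

```python
import array


class Array2D:
    """Two-dimensional view of a flat array, stored row by row."""

    def __init__(self, typecode, rows, cols):
        self.rows, self.cols = rows, cols
        self._data = array.array(typecode, [0]) * (rows * cols)

    def _check(self, row, col):
        if not (0 <= row < self.rows and 0 <= col < self.cols):
            raise IndexError("index out of range")

    def __getitem__(self, key):
        if isinstance(key, tuple):           # a[i, j] -> a single item
            row, col = key
            self._check(row, col)
            return self._data[row * self.cols + col]
        self._check(key, 0)                  # a[i] -> copy of row i
        start = key * self.cols
        return self._data[start:start + self.cols]

    def __setitem__(self, key, value):
        row, col = key
        self._check(row, col)
        self._data[row * self.cols + col] = value


m = Array2D('d', 3, 4)
m[1, 2] = 7.5
row = m[1]
```

Because slicing an array copies, changing row afterwards leaves m
untouched; returning a reference instead would need a separate view
type that shares the buffer.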

> Well if someone authoritative tells me that all of the above is a
> great idea, I'll start working on a patch and scratch my plans to
> create a "not in house" xarray module.

It all depends on the quality of the patch.  By the time you're done
you may have completely rewritten the array module, and then the
question is, wouldn't your own xarray module have been quicker to
implement, because it doesn't need to preserve backwards
compatibility?

--Guido van Rossum (home page: http://www.python.org/~guido/)