[Numpy-discussion] A one-byte string dtype?

Tue Jan 21 07:30:08 EST 2014

On Tue, Jan 21, 2014 at 11:41:30AM +0000, Nathaniel Smith wrote:
> On 21 Jan 2014 11:13, "Oscar Benjamin" <oscar.j.benjamin at gmail.com> wrote:
> > If the Numpy array would manage the buffers itself then that per string memory
> > overhead would be eliminated in exchange for an 8 byte pointer and at least 1
> > byte to represent the length of the string (assuming you can somehow use
> > Pascal strings when short enough - null bytes cannot be used). This gives an
> > overhead of 9 bytes per string (or 5 on 32 bit). In this case you save memory
> > if the strings are more than 3 characters long and you get at least a 50%
> > saving for strings longer than 9 characters.
> 
> There are various optimisations possible as well.
> 
> For ASCII strings of up to length 8, one could also use tagged pointers to
> eliminate the lookaside buffer entirely. (Alignment rules mean that
> pointers to allocated buffers always have the low bits zero; so you can
> make a rule that if the low bit is set to one, then this means the
> "pointer" itself should be interpreted as containing the string data; use
> the spare bit in the other bytes to encode the length.)
> 
> In some cases it may also make sense to let identical strings share
> buffers, though this adds some overhead for reference counting and
> interning.
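
To make the tagged-pointer idea concrete, here is a rough pure-Python sketch.
It is only an illustration, not anything NumPy implements: the names
pack_inline/unpack_inline and the exact bit layout (tag in bit 0, eight 7-bit
ASCII characters in bits 1-56, length in bits 57-60) are assumptions chosen
for the example.

```python
TAG_BIT = 1        # low bit set: the word holds string data, not an address
MAX_INLINE = 8     # eight 7-bit ASCII characters fit in 56 bits
LENGTH_SHIFT = 57  # bits 57-60 hold the length (0..8)


def pack_inline(s):
    """Pack an ASCII string of up to 8 characters into one 64-bit word."""
    data = s.encode('ascii')  # raises UnicodeEncodeError for non-ASCII text
    if len(data) > MAX_INLINE:
        raise ValueError('too long to store inline')
    word = TAG_BIT
    for i, byte in enumerate(data):
        word |= byte << (1 + 7 * i)  # each character uses 7 bits
    word |= len(data) << LENGTH_SHIFT
    return word


def unpack_inline(word):
    """Recover the string from a word produced by pack_inline."""
    if not word & TAG_BIT:
        raise ValueError('low bit clear: this would be a real pointer')
    length = (word >> LENGTH_SHIFT) & 0xF
    chars = [(word >> (1 + 7 * i)) & 0x7F for i in range(length)]
    return bytes(chars).decode('ascii')


word = pack_inline('CGA')
assert word < 2 ** 64                # still fits in a pointer-sized slot
assert unpack_inline(word) == 'CGA'  # round-trips
```

Because the length is stored explicitly, an inline string may contain null
bytes, which a null-terminated layout could not represent.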

Would this new dtype have an opaque memory representation? What would happen
in the following:

>>> a = numpy.array(['CGA', 'GAT'], dtype='s')

>>> memoryview(a)

>>> with open('file', 'wb') as fout:
...     a.tofile(fout)

>>> with open('file', 'rb') as fin:
...     a = numpy.fromfile(fin, dtype='s')

Should there be a different function for creating such an array from reading a
text file? Or would you just need to use fromiter:

>>> with open('file', encoding='utf-8') as fin:
...     a = numpy.fromiter(fin, dtype='s')

>>> with open('file', 'w', encoding='utf-8') as fout:
...     fout.writelines(line + '\n' for line in a)

(Note that the above would not be reversible if the strings contain newlines.)
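
The newline problem can be seen without any new dtype. The following is a
small sketch using a plain list of Python strings and an in-memory text buffer
(io.StringIO standing in for the file); the variable names are only for
illustration.

```python
import io

strings = ['CGA', 'GA\nT']  # the second string contains a newline

# Write one string per line, as in the writelines example.
buf = io.StringIO()
buf.writelines(s + '\n' for s in strings)

# Read the lines back, stripping the terminator that was added.
buf.seek(0)
recovered = [line.rstrip('\n') for line in buf]

# The embedded newline has split one string into two.
assert recovered == ['CGA', 'GA', 'T']
assert recovered != strings
```

A line-oriented text format would therefore need some escaping scheme, or a
different record separator, to round-trip arbitrary strings.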

I think it would be less confusing to use dtype='u' than dtype='s', in order
to signify that it is an optimised form of the 'U' dtype as far as access from
Python code is concerned. Calling it 's' only really makes sense if there is a
plan to deprecate dtype='S'.

How would it behave in Python 2? Would it return unicode strings there as
well?

Oscar