[Numpy-discussion] using loadtxt to load a text file in to a numpy array

Charles R Harris charlesr.harris at gmail.com
Sat Jan 25 11:33:40 EST 2014


On Thu, Jan 23, 2014 at 11:49 AM, Chris Barker <chris.barker at noaa.gov> wrote:

> Thanks for poking into this all. I've lost track a bit, but I think:
>
> The 'S' type is clearly broken on py3 (at least). I think that gives us
> room to change it, and backward compatibility is less of an issue because it's
> broken already -- do we need to preserve bug-for-bug compatibility? Maybe,
> but I suspect in this case, not -- the code that "works fine" on py3 with
> the 'S' type is probably only lucky that it hasn't encountered the issues
> yet.
>
> And no matter how you slice it, code being ported to py3 needs to deal
> with text handling issues.
>
> But here is where we stand:
>
> The 'S' dtype:
>
>  - was designed for one-byte-per-char text data.
>  - was mapped to the py2 string type.
>  - used the classic C null-terminated approach.
>  - can be used for arbitrary bytes (as the py2 string type can), but not
> quite, as it truncates trailing null bytes -- so it really is a bad idea to
> use it that way.
>
> Under py3:
>   The 'S' type maps to the py3 bytes type, because that's the closest to
> the py2 string type. But it also does some inconsistent things with
> encoding, and does treat a lot of other things as text. But the py3 bytes
> type does not have the same text handling as the py2 string type, so things
> like:
>
> s = 'a string'
> np.array((s,), dtype='S')[0] == s
>
> Gives you False, rather than True on py2. This is because a py3 string is
> translated to the 'S' type (presumably with the default encoding, another
> maybe-not-a-good-idea), but returns a bytes object, which does not compare
> true to a py3 string. You can work around this with various calls to
> encode() and decode(), and/or using b'a string', but that is ugly, kludgy,
> and doesn't work well with the py3 text model.
>
>
> The py2 => py3 transition separated bytes and strings: strings are
> unicode, and bytes are not to be used for text (directly). While there is
> some text-related functionality still in bytes, the core devs are quite
> clear that that is for special cases only, and not for general text
> processing.
>
> I don't think numpy should fight this, but rather embrace the py3 text
> model. The most natural way to do that is to use the existing 'U' dtype for
> text. Really the best solution for most cases. (Like the above case)
>
> However, there is a use case for a more efficient way to deal with text.
> There are a couple ways to go about that that have been brought up here:
>
> 1: have a more efficient unicode dtype: variable length,
> multiple encoding options, etc....
>     - This is a fine idea that would support better text handling in
> numpy, and _maybe_ better interaction with external libraries (HDF, etc...)
>
> 2: Have a one-byte-per-char text dtype:
>   - This would be much easier to implement and fit into the current numpy
> model, and satisfy a lot of common use cases for scientific data sets.
>
>
> We could certainly do both, but I'd like to see (2) get done sooner than
> later....
>

This is pretty much my sense of things at the moment. I think 1) is needed
in the long term but that 2) is a quick fix that solves most problems in
the short term.
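
For anyone who wants to check the behaviour summarized above, here is a
minimal sketch, assuming Python 3 and a current numpy. The variable names are
only illustrative. It compares plain Python objects obtained with tolist()
so that the result does not depend on how numpy scalars compare.

```python
import numpy as np

s = 'a string'

# 'S' stores bytes: on Python 3 the str is encoded on the way in
# and a bytes object comes back out.
as_bytes = np.array((s,), dtype='S')
elem_s = as_bytes.tolist()[0]      # plain Python bytes
s_matches = (elem_s == s)          # bytes never compare equal to str on py3

# 'U' stores unicode and round-trips to a py3 str.
as_text = np.array((s,), dtype='U')
elem_u = as_text.tolist()[0]       # plain Python str
u_matches = (elem_u == s)

# Trailing null bytes are dropped when an element is read back,
# even though they are still present in the underlying buffer.
raw = np.array([b'ab\x00\x00'], dtype='S4')
read_back = raw.tolist()[0]
buffer_bytes = raw.tobytes()

print(s_matches, u_matches, read_back, buffer_bytes)
```

The last part is why using 'S' for arbitrary binary data is risky: the two
trailing nulls are in the buffer but are not returned when the element is
read.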


>
> A related issue is whether numpy needs a dtype analogous to py3 bytes --
> I'm still not sure of the use-case there, so can't comment -- would it need
> to be fixed length (fitting into the numpy data model better) or variable
> length, or ??? Some folks are (apparently) using the current 'S' type in
> this way, but I think that's ripe for errors, due to the null bytes issue.
> Though maybe there is a null-bytes-are-special binary format that isn't
> text -- I have no idea.
>
> So what do we do with 'S'? It really is pretty broken, so we have a
> couple choices:
>
>  (1)  deprecate it, so that it stays around for backward compatibility
> but encourage people to either use 'U' for text, or one of the new dtypes
> that are yet to be implemented (maybe 's' for a one-byte-per-char dtype),
> and use either uint8 or the new bytes dtype that is yet to be implemented.
>
>  (2) fix it -- in this case, I think we need to be clear what it is:
>      -- A one-byte-char-text type? If so, it should map to a py3 string,
> and have a defined encoding (ascii or latin-1, probably), or even better a
> settable encoding (but only for one-byte-per-char encodings -- I don't
> think utf-8 is a good idea here, as a utf-8 encoded string is of unknown
> length). (There is some room for debate here, as the 'S' type is fixed
> length and truncates anyway, maybe it's fine for it to truncate utf-8 -- as
> long as it doesn't partially truncate in the middle of a character.)
>

I think we should make it a one character encoded type compatible with str
in python 2, and maybe latin-1 in python 3. I'm thinking latin-1 because of
pep 393 where it is effectively a UCS-1, but ascii might be a bit more
flexible because it is a subset of utf-8 and might serve better in python 2.
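
To make the trade-off concrete, here is a small sketch in plain Python 3
(the sample word is just a hypothetical example). It shows that latin-1 uses
one byte per character, that ascii text has identical bytes under all three
encodings, and that cutting utf-8 at a fixed byte width can split a
character.

```python
word = "caf\u00e9"   # "cafe" with an accented e, hypothetical sample text

latin1 = word.encode("latin-1")   # one byte per character
utf8 = word.encode("utf-8")       # the accented character takes two bytes

print(len(word), len(latin1), len(utf8))

# Pure ASCII text has the same bytes under ascii, latin-1 and utf-8.
plain = "cafe"
same_bytes = (plain.encode("ascii")
              == plain.encode("latin-1")
              == plain.encode("utf-8"))

# Cutting utf-8 at a fixed byte width can split a character,
# leaving bytes that no longer decode.
truncated = utf8[:4]
try:
    truncated.decode("utf-8")
    split_ok = True
except UnicodeDecodeError:
    split_ok = False

print(same_bytes, split_ok)
```

With a one-byte-per-char encoding, a fixed item size of N bytes always holds
exactly N characters, which is what makes it fit the current fixed-width
model.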


>    -- a bytes type? in which case, we should clean out all the
> automatic conversions to and from text that are in it now.
>
>
I'm not sure what to do about a bytes type.


> I vote for it being our one-byte text type -- it almost is already, and it
> would make the easiest transition for folks from py2 to py3. But backward
> compatibility is backward compatibility.
>
>
Not sure what to do here. It would be nice if S was a string type of given
encoding. Might be worth an experiment to see how much breaks.
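
As a rough idea of what such an experiment could start from, here is a
sketch that keeps text in a 'U' array on the Python side and converts to and
from an 'S' array with an explicit encoding, using the existing
np.char.encode and np.char.decode functions. The helper names and the choice
of latin-1 are assumptions for illustration, not a proposed API.

```python
import numpy as np

ENCODING = "latin-1"  # assumption: a one-byte-per-char encoding


def to_bytes_array(text_array, encoding=ENCODING):
    """Hypothetical helper: encode a 'U' array into an 'S' array."""
    return np.char.encode(text_array, encoding)


def to_text_array(bytes_array, encoding=ENCODING):
    """Hypothetical helper: decode an 'S' array back into a 'U' array."""
    return np.char.decode(bytes_array, encoding)


names = np.array(["caf\u00e9", "na\u00efve"])   # dtype '<U5'
stored = to_bytes_array(names)                  # one byte per character
restored = to_text_array(stored)

print(names.dtype, stored.dtype, restored.dtype)
print(names.dtype.itemsize, stored.dtype.itemsize)
```

The 'U' array uses four bytes per character and the encoded array one, which
is the efficiency argument for a one-byte text dtype. A real dtype would
carry the encoding itself instead of relying on helpers like these.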


> > numpy arrays need a decode and encode method
>
>
>> I'm not sure that they do. Rather there needs to be a text dtype that
>> knows what encoding to use in order to have a binary interface as
>> exposed by .tostring() and friends but produce unicode strings
>> when indexed from Python code. Having both a text and a binary
>> interface to the same data implies having an encoding.
>
>
> I agree with Oscar here -- let's not conflate encoded and decoded data --
> the py3 text model is a fine one, we should work with it as much
> as practical.
>
> UNLESS: if we do add a bytes dtype, then it would be a reasonable use case
> to use it to store encoded text (just like the py3 bytes type), in which
> case it would be good to have encode() and decode() methods or ufuncs --
> probably ufuncs. But that should be for special-purpose stuff at the I/O
> interface.
>
>
Chuck