[Numpy-discussion] using loadtxt to load a text file in to a numpy array

Oscar Benjamin oscar.j.benjamin at gmail.com
Thu Jan 16 05:43:05 EST 2014


On Wed, Jan 15, 2014 at 11:40:58AM -0800, Chris Barker wrote:
> On Wed, Jan 15, 2014 at 9:57 AM, Charles R Harris <charlesr.harris at gmail.com> wrote:
> 
> 
> > There was a discussion of this long ago and UCS-4 was chosen as the numpy
> > standard. There are just too many complications that arise in supporting
> > both.
> >
> 
> fair enough -- but loadtxt appears to be broken just the same. Any
> proposals for that?
> 
> My proposal:
> 
> loadtxt accepts an encoding argument.
> 
> default is ascii -- that's what it's doing now, anyway, yes?

No, it's loading the file, reading a line, encoding the line with latin-1, and
then putting the repr of the resulting byte-string as a unicode string into a
UCS-4 array (dtype='<Ux'). I can't see any good reason for that behaviour.

> 
> If the file is encoded ascii, then a one-byte-per character dtype is used
> for text data, unless the user specifies otherwise (do they need to specify
> anyway?)
> 
> If the file has another encoding, then the default dtype for text is unicode.

That's a silly idea. There's already dtype='S' for ascii, which gives one byte
per character.

However, numpy.loadtxt(dtype='S') doesn't actually use ascii IIUC. It loads
the file as text with the default system encoding, encodes the text with
latin-1 and stores the resulting bytes into a dtype='S' array. I think it
should just open the file in binary, read the bytes, and store them in the
dtype='S' array. The current behaviour strikes me as a hangover from the
Python 2.x 8-bit text model.
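
Roughly something like this for the 'S' case (an untested sketch with a
made-up helper name, not a patch against loadtxt):

    import numpy as np

    def load_bytes_lines(filename):
        # Proposed dtype='S' behaviour: open in binary and never decode, so
        # the stored bytes are exactly the bytes that are in the file.
        with open(filename, 'rb') as fin:
            lines = [line.rstrip(b'\r\n') for line in fin if line.strip()]
        return np.array(lines, dtype='S')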

> Not sure about other one-byte per character encodings (e.g. latin-1)
> 
> The defaults may be moot, if loadtxt doesn't have auto-detection of
> text in a file anyway.
> 
> This all required that there be an obvious way for the user to spell the
> one-byte-per character dtype -- I think 'S' will do it.

They should use 'S' and not encoding='ascii'. If the user provides an
encoding, then it should be used to open the file and decode it to unicode,
resulting in a dtype='U' array. (Python 3 handles all of this for you.)
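
I.e. something along these lines for the 'U' case (again just a sketch with a
hypothetical name):

    import io
    import numpy as np

    def load_text_lines(filename, encoding=None):
        # Proposed dtype='U' behaviour: decode with the user's encoding (or
        # the system default) and store real unicode strings.
        with io.open(filename, 'r', encoding=encoding) as fin:
            lines = [line.rstrip('\r\n') for line in fin if line.strip()]
        return np.array(lines, dtype='U')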

> Note to OP: what happens if you specify 'S' for your dtype, rather than str
> - it works for me on py2:
> 
> In [16]: np.loadtxt('pathlist.txt', dtype='S')
> Out[16]:
> array(['C:\\Users\\Documents\\Project\\mytextfile1.txt',
>        'C:\\Users\\Documents\\Project\\mytextfile2.txt',
>        'C:\\Users\\Documents\\Project\\mytextfile3.txt'],
>       dtype='|S42')

It only seems to work because you're using ascii data. On Py3 you'll have byte
strings corresponding to the text in the file encoded as latin-1 (regardless
of the encoding used in the file). loadtxt doesn't open the file in binary or
specify an encoding, so the file will be opened with the system default
encoding as determined by the standard builtins.open. The resulting text is
decoded according to that encoding and then re-encoded as latin-1, which will
corrupt the binary form of the data if the system encoding is not compatible
with latin-1 (e.g. ascii and latin-1 will work but utf-8 will not).
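
You can see the problematic decode/re-encode step in isolation (independent
of loadtxt, and assuming a utf-8 locale for the sake of the example):

    raw = 'Å'.encode('utf-8')     # the bytes actually in the file: b'\xc3\x85'
    text = raw.decode('utf-8')    # what open() gives you under a utf-8 locale
    text.encode('latin-1')        # b'\xc5' -- no longer the bytes in the file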

> 
> Note: this leaves us with what to pass back to the user when they index
> into an array of type 'S*' -- a bytes object or a unicode object (decoded
> as ascii). I think a unicode object, in keeping with proper py3 behavior.
> This would be like we currently do with, say floating point numbers:
> 
> We can store/operate with 32 bit floats, but when you pass it back as a
> python type, you get the native python float -- 64bit.
> 
> NOTE: another option is to use latin-1 all around, rather than ascii -- you
> may get garbage from the higher value bytes, but it won't barf on you.

I guess you're alluding to the idea that reading/writing files as latin-1 will
pretend to seamlessly decode/encode any bytes, preserving binary data in any
round-trip. This concept is already broken if you intend to do any processing,
indexing or slicing of the array. Additionally, the current loadtxt behaviour
fails to achieve this round-trip even for the 'S' dtype, even if you don't do
any processing:

$ ipython3
Python 3.2.3 (default, Sep 25 2013, 18:22:43) 
Type "copyright", "credits" or "license" for more information.

IPython 0.12.1 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: with open('tmp.py', 'w') as fout:  # Implicitly utf-8 here
   ...:     fout.write('Åå\n' * 3)
   ...:     

In [2]: import numpy

In [3]: a = numpy.loadtxt('tmp.py')
<snip>
ValueError: could not convert string to float: b'\xc5\xe5'

In [4]: a = numpy.loadtxt('tmp.py', dtype='S')

In [5]: a
Out[5]: 
array([b'\xc5\xe5', b'\xc5\xe5', b'\xc5\xe5'], 
      dtype='|S2')

In [6]: a.tostring()
Out[6]: b'\xc5\xe5\xc5\xe5\xc5\xe5'

In [7]: with open('tmp.py', 'rb') as fin:
   ...:     text = fin.read()
   ...:     

In [8]: text
Out[8]: b'\xc3\x85\xc3\xa5\n\xc3\x85\xc3\xa5\n\xc3\x85\xc3\xa5\n'


This is a mess. I don't know how to handle backwards compatibility, but the
sensible way to handle this in *both* Python 2 and 3 is that dtype='S' opens
the file in binary, reads byte strings, and stores them in an array with
dtype='S'. dtype='U' should open the file as text with an encoding argument
(or the system default if not supplied), decode the bytes, and create an array
with dtype='U'. The only reasonable difference between Python 2 and 3 is which
of these two behaviours dtype=str should do.
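
In other words, dtype=str would just dispatch to one of the two behaviours
above depending on the Python version, something like (a hypothetical helper,
not existing numpy code):

    import sys

    def normalise_text_dtype(dtype):
        # Hypothetical dispatch: dtype=str means bytes ('S') on Python 2 and
        # unicode ('U') on Python 3; 'S' and 'U' themselves always mean the
        # binary and decoded readers respectively.
        if dtype is str:
            return 'S' if sys.version_info[0] == 2 else 'U'
        return dtype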


Oscar


