[Numpy-discussion] using loadtxt to load a text file in to a numpy array

Fri Jan 17 12:13:05 EST 2014

On Fri, Jan 17, 2014 at 10:58:25AM -0500, josef.pktd at gmail.com wrote:
> On Fri, Jan 17, 2014 at 10:26 AM, Oscar Benjamin
> <oscar.j.benjamin at gmail.com> wrote:
> > On Fri, Jan 17, 2014 at 03:12:32PM +0100, Julian Taylor wrote:
> >
> > You don't show how you created the file. I think that in your case the
> > content of 'filenames.txt' is correctly encoded latin-1.
> 
> I had created it with os.listdir but deleted some lines

You used os.listdir to generate the unicode strings that you write to the
file. The underlying Win32 API returns filenames encoded as utf-16 but Python
takes care of decoding them under the hood so you just get abstract unicode
strings here in Python 3.

It is the write method of the file object that encodes the unicode strings and
hence determines the byte content of 'filenames5.txt'. You can check the
fout.encoding attribute to see what encoding it uses by default.

> Running the full script again I still get the same correct answer for fn
> ------------
> import os
> if 1:
>     with open('filenames5.txt', 'w') as fout:
>          fout.writelines([f + '\n' for f in os.listdir('.')])
> with open('filenames.txt') as fin:
>      print(fin.read())
> 
> import numpy
> 
> #filenames = numpy.loadtxt('filenames.txt')
> filenames = numpy.loadtxt('filenames5.txt', dtype='S')
> fn = open(filenames[-1])

The question is what do you get when you do:

In [1]: with open('tmp.txt', 'w') as fout:
   ...:     print(fout.encoding)
   ...:     
UTF-8

I get utf-8 by default if no encoding is specified. This means that when I
write to the file like so, the string is encoded as utf-8:

In [2]: with open('tmp.txt', 'w') as fout:
   ...:     fout.write('Õscar')
   ...:  

If I read it back in binary I get different bytes from you:

In [3]: with open('tmp.txt', 'rb') as fin:
   ...:     print(fin.read())
   ...:     
b'\xc3\x95scar'

numpy.loadtxt will correctly decode those bytes as utf-8:

In [5]: b'\xc3\x95scar'.decode('utf-8')
Out[5]: 'Õscar'

But then it re-encodes them with latin-1 before storing them in the array:

In [6]: b'\xc3\x95scar'.decode('utf-8').encode('latin-1')
Out[6]: b'\xd5scar'

This byte string will not be recognised by my Linux OS (POSIX uses bytes for
filenames and an exact match is needed). So if I pass that to open() it will
fail.
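
To make the mismatch concrete, here is a minimal sketch using only the
standard library (no numpy involved). It just compares the two encodings of
the same name and does not touch the filesystem:

```python
# The same abstract string gives different bytes under different encodings.
name = 'Õscar'

utf8_bytes = name.encode('utf-8')      # the bytes written to the file on my system
latin1_bytes = name.encode('latin-1')  # the bytes that end up in the array

print(utf8_bytes)    # b'\xc3\x95scar'
print(latin1_bytes)  # b'\xd5scar'

# A POSIX filesystem compares filenames byte for byte, so as far as the OS
# is concerned these are two different names.
print(utf8_bytes == latin1_bytes)  # False
```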

<snip>
> 
> I get similar problems when I use a file that someone else has
> written, however I haven't seen much problems if I do everything on
> Windows.

If you use a proper explicit encoding then you can savetxt from any system and
loadtxt on any other without corruption.
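
For example, a minimal sketch of a round trip with the encoding spelled out at
both ends. This uses plain file objects rather than savetxt/loadtxt, and the
file name 'names_utf8.txt' is made up for illustration:

```python
import os
import tempfile

names = ['Õscar', 'plain.txt']

# Hypothetical file name, placed in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'names_utf8.txt')

# State the encoding explicitly instead of relying on the platform default.
with open(path, 'w', encoding='utf-8') as fout:
    fout.writelines(n + '\n' for n in names)

# The reader names the same encoding, whatever system it runs on.
with open(path, encoding='utf-8') as fin:
    loaded = [line.rstrip('\n') for line in fin]

print(loaded == names)  # True
```

The point is only that both sides name the encoding. Which one you pick
matters less than agreeing on it.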

> The main problems I get and where I don't know how it's supposed to
> work in the best way is when we get "foreign"  data.

Text data needs to have metadata specifying the encoding. This is something
that people who pass data around need to think about.

Oscar