[Numpy-discussion] using loadtxt to load a text file in to a numpy array

Fri Jan 17 10:26:05 EST 2014

On Fri, Jan 17, 2014 at 03:12:32PM +0100, Julian Taylor wrote:
> On Fri, Jan 17, 2014 at 2:40 PM, Oscar Benjamin
> <oscar.j.benjamin at gmail.com> wrote:
>
> > On Fri, Jan 17, 2014 at 02:10:19PM +0100, Julian Taylor wrote:
> > >
> > > no, the right solution is to add an encoding argument.
> > > It's a 4 line patch for python2 and a 2 line patch for python3 and the
> > > issue is solved, I'll file a PR later.
> >
> > What is the encoding argument for? Is it to be used to decode, process the
> > text and then re-encode it for an array with dtype='S'?
> >
>
> it is only used to decode the file into text, nothing more.
> loadtxt is supposed to load text files, it should never have to deal with
> bytes ever.
> But I haven't looked into the function deeply yet, there might be ugly
> surprises.
>
> The output of the array is determined by the dtype argument and not by the
> encoding argument.

If the dtype is 'S' then the output should be bytes and you therefore
need to encode the text; there's no such thing as storing text in
bytes without an encoding.
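
As a minimal sketch of that point (the variable names here are only
illustrative, and the two encodings are just examples), the same text
gives different bytes depending on the encoding you pick, so an 'S'
array only makes sense once you know which one was used:

```python
import numpy as np

text = "Õscar.txt"

# The same text becomes different byte strings under different encodings.
latin1_bytes = text.encode("latin-1")  # 9 bytes: one byte for 'Õ'
utf8_bytes = text.encode("utf-8")      # 10 bytes: two bytes for 'Õ'

# An 'S' array stores the raw bytes; it does not record the encoding.
arr = np.array([latin1_bytes, utf8_bytes], dtype="S")

# Recovering the text requires knowing which encoding each item used.
recovered = [arr[0].decode("latin-1"), arr[1].decode("utf-8")]
print(arr.dtype, recovered)
```

Decoding either element with the other's codec would give either an
error or the wrong text.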

Strictly speaking the 'U' dtype uses the encoding 'ucs-4' or 'utf-32'
which just happens to be as simple as expressing the corresponding
unicode code points as int32 so it's reasonable to think of it as "not
encoded" in some sense (although endianness becomes an issue in
utf-32).
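
A small sketch of what I mean, assuming explicit byte-order prefixes on
the dtype so the result does not depend on the machine (I use an
unsigned 32-bit view here, and the sample string is arbitrary):

```python
import numpy as np

s = "Õ€"

# Explicit little- and big-endian unicode dtypes.
little = np.array([s], dtype="<U2")
big = np.array([s], dtype=">U2")

# The buffer of a 'U' array is the utf-32 encoding in the dtype's byte order.
assert little.tobytes() == s.encode("utf-32-le")
assert big.tobytes() == s.encode("utf-32-be")

# Viewing the same memory as 32-bit integers gives the code points directly.
codepoints = little.view("<u4").tolist()
print(codepoints, [ord(c) for c in s])
```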

On 17 January 2014 14:11,  <josef.pktd at gmail.com> wrote:
> Windows seems to use consistent en/decoding throughout (example run in IDLE)

The reason for the Py3k bytes/text overhaul is that there were lots of
situations where things *seemed* to work until someone happened to use
a character you hadn't tried. "Seems to" doesn't cut it! :)

> Python 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:55:48) [MSC v.1600
> 32 bit (Intel)] on win32
>
> >>> filenames = numpy.loadtxt('filenames.txt', dtype='S')
> >>> filenames
> array([b'weighted_kde.py', b'_proportion.log.py', b'__init__.py',
>        b'\xd5scar.txt'],
>       dtype='|S18')
> >>> fn = open(filenames[-1])
> >>> fn.read()
> '1,2,3,hello\n5,6,7,Õscar\n'
> >>> fn
> <_io.TextIOWrapper name=b'\xd5scar.txt' mode='r' encoding='cp1252'>

You don't show how you created the file. I think that in your case the
content of 'filenames.txt' happens to be correctly encoded as latin-1.

My guess is that you did the same as me and opened it in text mode and
wrote the unicode string allowing Python to encode it for you. Judging
by the encoding on fn above I'd say that it wrote the file with cp1252
which is mostly compatible with latin-1. Try it with a byte that is
incompatible between cp1252 and latin-1 e.g.:

In [3]: b'\x80'.decode('cp1252')
Out[3]: '€'

In [4]: b'\x80'.decode('latin-1')
Out[4]: '\x80'

In [5]: b'\x80'.decode('cp1252').encode('latin-1')
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
/users/enojb/<ipython-input-5-cfd8b16d6d9f> in <module>()
----> 1 b'\x80'.decode('cp1252').encode('latin-1')

UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac' in
position 0: ordinal not in range(256)
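
To see exactly which bytes are affected, here is a rough sketch that
compares the two codecs over all 256 byte values. The helper
decode_or_none is just something made up for this example; it relies on
Python's cp1252 codec raising UnicodeDecodeError for the bytes it leaves
undefined:

```python
def decode_or_none(byte_value, encoding):
    """Decode a single byte, returning None if the codec rejects it."""
    try:
        return bytes([byte_value]).decode(encoding)
    except UnicodeDecodeError:
        return None

# Bytes that the two codecs decode differently (or that only one accepts).
differing = [
    b for b in range(256)
    if decode_or_none(b, "cp1252") != decode_or_none(b, "latin-1")
]

# Bytes that Python's cp1252 codec refuses to decode at all.
undefined_in_cp1252 = [
    b for b in range(256) if decode_or_none(b, "cp1252") is None
]

print([hex(b) for b in differing])
print([hex(b) for b in undefined_in_cp1252])
```

Only bytes in the range 0x80-0x9f differ, so a test file needs a
character from that range (such as the euro sign) to expose the
mismatch.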

Oscar