[Numpy-discussion] using loadtxt to load a text file in to a numpy array

Julian Taylor jtaylor.debian at googlemail.com
Fri Jan 17 04:38:15 EST 2014


This thread is getting a little out of hand, which is my fault for initially
mixing different topics in one mail, so let me try to summarize.
We have three issues here:

- A loadtxt bug when loading strings in Python 3.
This has nothing to do with encodings or dtypes; it is a bug that should be
fixed, no more, no less.

The fix is probably to remove a repr() somewhere and convert the data to
unicode as the user requested, since str == unicode in Python 3. This is the
normal change you must account for when migrating to Python 3.
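
To illustrate the kind of symptom a stray repr()/str() on bytes produces in
Python 3, here is a small sketch. It is not the actual loadtxt code, and the
sample value is made up:

```python
# A value as it would come out of a file opened in binary mode.
raw = b"abc"

# Converting with str() (or repr()) keeps the b'...' wrapper in the text.
wrong = str(raw)

# Decoding explicitly yields the text the user actually asked for.
right = raw.decode("ascii")

print(wrong)  # b'abc'
print(right)  # abc
```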

- No possibility to specify the encoding of a file in loadtxt.
This is a missing feature. Currently loadtxt uses the system default, which is
good and should stay that way.
It is only missing an option to tell it to treat the file differently.
There should be little debate about the default: it should not change, and
especially not to latin1. The system default exists for a good reason. Note
that on Linux it is UTF-8, which is a good choice. I'm not familiar with
Windows, but all programs should at least have the option to use UTF-8 as
output too.
This has nothing to do with indexing or any kind of processing of the numpy
arrays.

The fix should be trivial: just add an encoding keyword argument and pass it
on to Python.
The workaround should be to pass a file object to loadtxt instead of a file
name. Python's open() already takes an encoding argument.
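
A minimal sketch of that workaround, assuming a NumPy version whose loadtxt
accepts a file object opened in text mode. The file name and the sample
contents are made up for illustration:

```python
import os
import tempfile

import numpy as np

# Hypothetical sample data: two non-ASCII strings stored as Latin-1.
path = os.path.join(tempfile.mkdtemp(), "names.txt")
with open(path, "w", encoding="latin-1") as fh:
    fh.write("café\nnaïve\n")

# Let Python do the decoding: open the file with an explicit encoding
# and hand the file object, not the file name, to loadtxt.
with open(path, encoding="latin-1") as fh:
    names = np.loadtxt(fh, dtype=str)

print(names)
```

Because the decoding happens in open(), loadtxt only ever sees text and does
not need to know which encoding the file was written in.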

- Inconvenience in dealing with strings in Python 3.
Bytes are not strings in Python 3, which means ASCII data is either a byte
array, which can be inconvenient to deal with, or 4-byte unicode, which wastes
space.
A proposal to fix this would be to add a one- or two-byte dtype with a
specific encoding that behaves similarly to bytes but converts to string when
outputting to Python, for comparisons etc.
For backward compatibility we *cannot* change 'S'. Maybe we could change the
meaning of 'a', but it would be safer to add a new dtype; possibly 'S' can
be deprecated in favor of 'B' when we have a specific encoding dtype.
The main issue is probably: is it worth it, and who does the work?
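
To make the trade-off concrete, here is a small sketch comparing the two
existing dtypes on ASCII data. The sample words are made up:

```python
import numpy as np

words = ["spam", "eggs"]

# 'S' stores one byte per character; elements come back as bytes.
as_bytes = np.array(words, dtype="S4")

# 'U' stores four bytes per character; elements come back as str.
as_unicode = np.array(words, dtype="U4")

print(as_bytes.itemsize)    # 4
print(as_unicode.itemsize)  # 16

# With 'S', getting a Python 3 string requires an explicit decode.
first = as_bytes[0].decode("ascii")
print(first == as_unicode[0])  # True
```

The 'U' array is four times larger for the same ASCII content, while the 'S'
array needs a decode step before it can be used as a Python 3 string.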