[Numpy-discussion] bug in genfromtxt for python 3.2

Wed Mar 30 14:12:18 EDT 2011

On Wed, Mar 30, 2011 at 7:37 PM, Matthew Brett <matthew.brett at gmail.com> wrote:
> Hi,
>
> On Wed, Mar 30, 2011 at 10:02 AM, Ralf Gommers
> <ralf.gommers at googlemail.com> wrote:
>> On Wed, Mar 30, 2011 at 3:39 AM, Matthew Brett <matthew.brett at gmail.com> wrote:
>>> Hi,
>>>
>>> On Mon, Mar 28, 2011 at 11:29 PM,  <josef.pktd at gmail.com> wrote:
>>>> numpy/lib/test_io.py only uses StringIO in the test, no actual csv file
>>>>
>>>> If I give the filename then I get a TypeError: Can't convert 'bytes'
>>>> object to str implicitly
>>>>
>>>>
>>>> from the statsmodels mailing list example
>>>>
>>>>> >>> data = recfromtxt(open('./star98.csv', "U"), delimiter=",", skip_header=1, dtype=float)
>>>>> Traceback (most recent call last):
>>>>>  File "<pyshell#30>", line 1, in <module>
>>>>>    data = recfromtxt(open('./star98.csv', "U"), delimiter=",",
>>>>> skip_header=1, dtype=float)
>>>>>  File "C:\Programs\Python32\lib\site-packages\numpy\lib\npyio.py",
>>>>> line 1633, in recfromtxt
>>>>>    output = genfromtxt(fname, **kwargs)
>>>>>  File "C:\Programs\Python32\lib\site-packages\numpy\lib\npyio.py",
>>>>> line 1181, in genfromtxt
>>>>>    first_values = split_line(first_line)
>>>>>  File "C:\Programs\Python32\lib\site-packages\numpy\lib\_iotools.py",
>>>>> line 206, in _delimited_splitter
>>>>>    line = line.split(self.comments)[0].strip(asbytes(" \r\n"))
>>>>> TypeError: Can't convert 'bytes' object to str implicitly
>>>
>>> Is the right fix for this to open a 'filename' passed to genfromtxt,
>>> as 'binary' (bytes)?
>>>
>>> If so I will submit a pull request with a fix and a test.
>>
>> Seems to work and is what was intended I think, see Pauli's
>> changes/notes in commit 0f2e7db0.
>>
>> This is ticket #1607 by the way.
>
> Thanks for making a ticket.  I've submitted a pull request for the fix
> and linked to it from the ticket.
>
> The reason I asked whether this was the correct fix was:
>
> imagine I'm working with a non-latin default encoding, and I've opened a file:
>
> fobj = open('my_nonlatin.txt', 'rt')
>
> in python 3.2.  That might contain numbers and non-latin text.  I
> can't pass that into 'genfromtxt' because it will give me the error
> above.  I can pass it in as binary but then I'll get garbled text.

I admit the string/bytes thing is still a little confusing to me, but
isn't that always going to be a problem (even with python 2.x)?
There's no way for genfromtxt to know what the encoding of an
arbitrary file is. So your choices are garbled text or an error.
Garbled text is better.
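
To make the trade-off concrete, here is a minimal sketch in plain Python 3
(no numpy involved). The sample line and the choice of latin-1 as the "wrong"
guess are assumptions for illustration only; they are not taken from
genfromtxt itself. It shows that a wrong guess garbles the text field but
leaves the ASCII number intact, while a strict ASCII decode simply fails.

```python
# One line of a hypothetical UTF-8 file: a number and a non-ASCII label.
raw = "12.5,caf\u00e9\n".encode("utf-8")

# Guessing the wrong encoding garbles the label, but the number survives
# because digits, '.' and ',' are plain ASCII in both encodings.
garbled = raw.decode("latin-1")
num, label = garbled.strip().split(",")
value = float(num)

# Insisting on ASCII gives an error instead of garbled text.
try:
    raw.decode("ascii")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(value, repr(label), strict_failed)
```

So with the wrong guess the numeric columns are still usable, which is why I
would rather have garbled text than an error.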

It may help to explicitly say in the docstring that this is an ASCII
routine (as it does in the source code).
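
For reference, the str/bytes mismatch from the traceback can be reproduced
without numpy. The helper below is a hypothetical stand-in that only imitates
the bytes-based splitting step; it is not the actual code in _iotools.py, and
its name and defaults are made up for this sketch.

```python
def split_line(line, comments=b"#", delimiter=b","):
    # Bytes-based splitting: drop anything after the comment marker,
    # strip spaces and line endings, then split on the delimiter.
    line = line.split(comments)[0].strip(b" \r\n")
    return line.split(delimiter)

# A line read from a file opened in text mode is a str, so splitting it
# on a bytes separator raises TypeError under Python 3.
err = None
try:
    split_line("1.0,2.0,3.0\n")
except TypeError as exc:
    err = exc

# The same line read in binary mode is bytes and splits cleanly.
fields = split_line(b"1.0,2.0,3.0\n")
values = [float(f.decode("ascii")) for f in fields]

print(type(err).__name__, fields, values)
```

This is consistent with opening a filename in binary mode inside the
function: the splitter then only ever sees bytes.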

Ralf

> Should those functions also allow unicode-providing files (perhaps
> with binary as default for speed)?