[Numpy-discussion] ANN: Numpy 1.6.0 beta 2

Matthew Brett matthew.brett at gmail.com
Tue Apr 5 19:51:55 EDT 2011


Hi,

On Tue, Apr 5, 2011 at 4:12 PM, Christopher Barker
<Chris.Barker at noaa.gov> wrote:
> On 4/5/11 3:36 PM, josef.pktd at gmail.com wrote:
>>> I disagree that U makes no sense for binary file reading.
>
> I wasn't saying that it made no sense to have a "U" mode for binary file
> reading, what I meant is that by the python2 definition, it made no
> sense. In Python 2, the ONLY difference between binary and text mode is
> line-feed translation.

I think it's right to say that, in Python 2, a text file and a binary
file do not differ at all on Unix, and differ on Windows only by the
'\r\n' -> '\n' translation.

The difference between 'rt' and 'U' is (this is for my own benefit):

For 'rt', a '\r' does not cause a line break; with 'U', it does.
For 'rt' _not_ on Windows, '\r\n' stays the same; with 'U', it is
translated to '\n'.
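
A rough Python 3 analogue, as a sketch only: Python 3 has no exact
equivalent of the Python 2 modes, so this assumes the newline argument
of io.TextIOWrapper (the same argument that open() takes) is a fair
stand-in, with newline=None playing the role of 'U' and newline='\n'
the role of 'rt' off Windows. The helper name text_lines is made up for
the illustration.

```python
import io

def text_lines(data, newline):
    # Decode bytes as ASCII text and split into lines under the given newline mode.
    wrapper = io.TextIOWrapper(io.BytesIO(data), encoding="ascii", newline=newline)
    return wrapper.readlines()

raw = b"a\rb\r\nc\n"

# newline=None: universal newlines; '\r' and '\r\n' both end a line and become '\n'.
universal = text_lines(raw, None)

# newline='\n': only '\n' ends a line and nothing is translated.
unix_only = text_lines(raw, "\n")

print(universal)  # ['a\n', 'b\n', 'c\n']
print(unix_only)  # ['a\rb\r\n', 'c\n']
```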

> As for Python 3:
>
>>> In python 3:
>>>
>>> 'b' means, "return byte objects"
>>> 't' means "return decoded strings"
>>>
>>> 'U' means two things:
>>>
>>> 1) When iterating by line, split lines at any of '\r', '\r\n', '\n'
>>> 2) When returning lines split this way, convert '\r' and '\r\n' to '\n'
>
> a) 'U' is default -- it's essentially the same as 't' (in PY3), so 't'
> means "return decoded and line-feed translated unicode objects"

Right - my argument is that, in Python 3, the behaviors implied by 'U'
and 't' are conceptually separable. 'U' controls how lines are split
and how line terminators are translated; 't' controls whether the text
is decoded or not.
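
To make the separation concrete, here is a minimal sketch of 'U'-style
handling applied to bytes that are never decoded. This is not something
Python 3's open() offers; translate_newlines is a hypothetical helper,
and it assumes an ASCII-compatible encoding such as utf-8, where the
bytes 0x0D and 0x0A only ever mean '\r' and '\n'.

```python
def translate_newlines(data):
    # Replace '\r\n' first so that it does not turn into two '\n' bytes.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")

raw = b"a\rb\r\nc\n"
translated = translate_newlines(raw)

# After translation, every line ends in a single b'\n'.
lines = translated.splitlines(keepends=True)

print(translated)  # b'a\nb\nc\n'
print(lines)       # [b'a\n', b'b\n', b'c\n']
```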

> b) I think the line-feed conversion is done regardless of if you are
> iterating by lines, i.e. with a full-on .read(). At least that's how it
> works in py2 -- not running py3 here to test.

Yes, that looks right.

>>> If you support returning lines from a binary file (which python 3
>>> does), then I think 'U' is a sensible thing to allow - as in this
>>> case.
>
> but what is a "binary file"?

In Python 3, a binary file is a file which is not decoded, and which
returns bytes.  It still has a concept of a 'line', as defined by line
terminators - you can iterate over the file, or call .readlines().  In
Python 2, as you say, a binary file is essentially the same as a text
file, with the single exception of the Windows '\r\n' -> '\n'
translation.
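
As a small illustration of lines in a Python 3 binary file, the sketch
below uses io.BytesIO as a stand-in for a file opened with 'rb' (an
assumption made to avoid touching the disk). Only b'\n' ends a line,
and no bytes are changed.

```python
import io

raw = b"a\rb\r\nc\n"
binary_file = io.BytesIO(raw)  # stands in for open(path, 'rb')

# Iterating a binary file yields bytes objects, split after each b'\n'.
lines = list(binary_file)

print(lines)  # [b'a\rb\r\n', b'c\n']
```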

> I THINK what you are proposing is that we'd want to be able to have both
> linefeed translation and no decoding done. But I think that's impossible
> -- aren't the linefeeds themselves encoded differently with different
> encodings?

Right - so obviously if you open a utf-16 file as binary, terrible
things may happen, because a '\n' byte is then only part of a
character. This is what Pauli was pointing out before.  His point was
that utf-8 is the standard, and that we probably would not hit many
other encodings.  I agree with you, if you are saying that it would be
good to be able to deal with them if we can - presumably by allowing
'rt' file objects, which produce Python 3 strings.
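
For the record, a short sketch of the utf-16 problem in Python 3, again
using io.BytesIO in place of a real file (my assumption, for the sake
of a self-contained example). Splitting the raw bytes at b'\n' cuts a
two-byte character in half, whereas decoding first keeps lines intact.

```python
import io

raw = "a\nb".encode("utf-16-le")  # b'a\x00\n\x00b\x00'

# Binary mode: the split happens right after the 0x0A byte, in the middle
# of the two-byte newline character.
binary_lines = io.BytesIO(raw).readlines()
print(binary_lines)  # [b'a\x00\n', b'\x00b\x00']

# Text mode with the encoding given: decoding happens first, so lines are intact.
decoded_lines = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-16-le").readlines()
print(decoded_lines)  # ['a\n', 'b']
```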

>> U looks appropriate in this case, better than the workarounds.
>> However, to me the python 3.2 docs seem to say that U only works for
>> text mode
>
> Agreed -- but I don't see the problem -- your files are either encoded
> in something that might treat newlines differently (UCS32, maybe?), in
> which case you'd want it decoded, or you are working with ascii or ansi
> or utf-8, in which case you can specify the encoding anyway.
>
> I don't understand why we'd want a binary blob for text parsing -- the
> parsing code is going to have to know something about the encoding to
> work -- it might as well get passed in to the file open call, and work
> with unicode. I suppose if we still want to assume ascii for parsing,
> then we could use 't' and then re-encode to ascii to work with it. Which
> I agree does seem heavy handed just for fixing newlines.
>
> Also, one problem I've often had with encodings is what happens if I
> think I have ascii, but really have a couple characters above 127 --
> then the default is to get an error in decoding. I'd like to be able to
> pass in a flag that either skips the un-decodable characters or replaces
> them with something, but it doesn't look like you can do that with the
> file open function in py3.
>
>> The line terminator is always b'\n' for binary files;
>
> Once you really make the distinction between text and binary, the concept
> of a "line terminator" doesn't really make sense anyway.

Well - I was arguing that, given that we can iterate over lines in
binary files, there must be a concept of what a line is in a binary
file, and that means that we need the concept of a line terminator.

I realize this is a discussion that would have to happen on the
python-dev list...

See you,

Matthew


