[Patches] [ python-Patches-476814 ] foreign-platform newline support

noreply@sourceforge.net noreply@sourceforge.net
Thu, 13 Dec 2001 15:57:27 -0800


Patches item #476814, was opened at 2001-10-31 08:41
You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=305470&aid=476814&group_id=5470

>Category: Core (C code)
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Jack Jansen (jackjansen)
>Assigned to: Jack Jansen (jackjansen)
Summary: foreign-platform newline support

Initial Comment:
This patch enables Python to interpret all known
newline conventions,
CR, LF or CRLF, on all platforms.

This support is enabled by configuring with
--with-universal-newlines
(so by default it is off, and everything should behave
as usual).

With universal newline support enabled two things
happen:
- When importing or otherwise parsing .py files any
newline convention
  is accepted.
- Python code can pass a new "t" mode parameter to
open() which
  reads files with any newline convention. "t" cannot
be combined with
  any other mode flags like "w" or "+", for obvious
reasons.

File objects have a new attribute "newlines" which
contains the type of
newlines encountered in the file (or None when no
newline has been seen,
or "mixed" if there were various types of newlines).

Also included is a test script which tests both file
I/O and parsing.

----------------------------------------------------------------------

>Comment By: Tim Peters (tim_one)
Date: 2001-12-13 15:57

Message:
Logged In: YES 
user_id=31435

Back to Jack -- and sorry for sitting on it so long.  
Clearly this isn't making it into 2.2 in the core.  As I 
said on Python-Dev, I believe this needs a PEP:  the design 
decisions are debatable, so *should* be debated outside the 
Mac community too.  Note, though, that I can't stop you 
from adding it to the 2.2 Mac distribution (if you want it 
badly enough there).

If a PEP won't be written, I suggest finding someone else 
to review it again; maybe Guido.  Note that the patch needs 
doc changes too.  The patch to regrtest.py doesn't belong 
here (I assume it just slipped in).  There seems a lot of 
code in support of the f_newlinetypes member, and the value 
of that member isn't clear -- I can't imagine a good use 
for it (maybe it's a Mac thing?).  The implementation of 
Py_UniversalNewlineFread appears incorrect to me:  it reads 
n bytes *every* time around the outer loop, no matter how 
few characters are still required, and n doesn't change 
inside the loop.  The business about the GIL may be due to 
the lack of docs:  are, or are not, people supposed to 
release the GIL themselves around calls to these guys?  
It's not documented, and it appears your intent differed 
from my guess.  Finally, it would be better to call ferror
() after calling fread() instead of before it <wink>.

----------------------------------------------------------------------

Comment By: Jack Jansen (jackjansen)
Date: 2001-11-14 07:13

Message:
Logged In: YES 
user_id=45365

Here's a new version of the patch. To address your issues
one by one:
- get_line and Py_UniversalNewlineFgets are too difficult to
integrate, at leat,
I don't see how I could do it. The storage management of
get_line gets in the way.

- The global lock comment I don't understand. The
Universal... routines are
replacements for fgets() and fread(), so have nothing to do
with the interpreter lock.

- The logic of all three routines (get_line too) has changed
and I've put comments in.
I hope this addresses some of the points.

- If universal_newline is false for a certain PyFileObject
we now immedeately take
a quick exit via fgets() or fread().

There's also a new test script, that tests some more border
cases (like lines longer
than 100 characters, and a lone CR just before end of file).

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2001-11-05 00:16

Message:
Logged In: YES 
user_id=31435

It would be better if get_line just called 
Py_UniversalNewlineFgets (when appropriate) instead of 
duplicating its logic inline.

Py_UniversalNewlineFgets and Py_UniversalNewlineFread 
should deal with releasing the global lock themselves -- 
the correct granularity for lock release/reacquire is 
around the C-level input routines (esp. for fread).

The new routines never check for I/O errors!  Why not?  It 
seems essential.

The new Fgets checks for EOF at the end of the loop instead 
of the top.  This is surprising, and I stared a long time 
in vain trying to guess why.  Setting

newlinetypes |= NEWLINE_CR;

immediately after seeing an '\r' would be as fast (instead 
of waiting to see EOF and then inferring the prior 
existence of '\r' indirectly from the state of the 
skipnextlf flag).

Speaking of which <wink>, the fobj tests in the inner loop 
waste cycles.  Set the local flag vrbls whether or not fobj 
is NULL.  When you're *out* of the inner loop you can 
simply decline to store the new masks when fobj is NULL 
(and you're already doing the latter anyway).  A test and 
branch inside the loop is much more expensive than or'ing 
in a flag bit inside the loop, ditto harder to understand.

Floating the univ_newline test out of the loop (and 
duplicating the loop body, one way for univ_newline true 
and the other for it false) would also save a test and 
branch on every character.

Doing fread one character at a time is very inefficient.  
Since you know you need to obtain n characters in the end, 
and that these transformations require reading at least n 
characters, you could very profitably read n characters in 
one gulp at the start, then switch to k at a time where k 
is the number of \r\n pairs seen since the last fread 
call.  This is easier to code than it sounds <wink>.

It would be fine by me if you included (and initialized) 
the new file-object fields all the time, whether or not 
universal newlines are configured.  I'd rather waste a few 
bytes in a file object than see #ifdefs spread thru the 
code.

I'll be damned if I can think of a quick way to do this 
stuff on Windows -- native Windows fgets() is still the 
only Windows handle we have on avoiding crushing thread 
overhead inside MS's C library.  I'll think some more about 
it (the thrust still being to eliminate the 't' mode flag, 
as whined about <wink> on Python-Dev).

----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2001-10-31 09:38

Message:
Logged In: YES 
user_id=6380

Tim, can you review this or pass it on to someone else who
has time?

Jack developed this patch after a discussion in which I was
involved in some of the design, but I won't have time to
look at it until December.

----------------------------------------------------------------------

You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=305470&aid=476814&group_id=5470