Issue 476814: foreign-platform newline support

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/35434

classification

Title:	foreign-platform newline support
Type:		Stage:
Components:	Interpreter Core	Versions:

process

Status:	closed	Resolution:	accepted
Dependencies:		Superseder:
Assigned To:	barry	Nosy List:	barry, gvanrossum, jackjansen, loewis, tim.peters
Priority:	normal	Keywords:	patch

Created on 2001-10-31 16:41 by jackjansen, last changed 2022-04-10 16:04 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
@univnewlines-20020414	jackjansen, 2002-04-13 23:31	Patch version 7
@univnewlines-doc-20020414	jackjansen, 2002-04-13 23:32	Docs v2
test_univnewlines.py	jackjansen, 2002-04-13 23:34	test script v5, 14-apr-02

Messages (13)
msg38047 - (view)	Author: Jack Jansen (jackjansen) *	Date: 2001-10-31 16:41
This patch enables Python to interpret all known newline conventions, CR, LF or CRLF, on all platforms. This support is enabled by configuring with --with-universal-newlines (so by default it is off, and everything should behave as usual). With universal newline support enabled two things happen: - When importing or otherwise parsing .py files any newline convention is accepted. - Python code can pass a new "t" mode parameter to open() which reads files with any newline convention. "t" cannot be combined with any other mode flags like "w" or "+", for obvious reasons. File objects have a new attribute "newlines" which contains the type of newlines encountered in the file (or None when no newline has been seen, or "mixed" if there were various types of newlines). Also included is a test script which tests both file I/O and parsing.
msg38048 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2001-10-31 17:38
Logged In: YES user_id=6380 Tim, can you review this or pass it on to someone else who has time? Jack developed this patch after a discussion in which I was involved in some of the design, but I won't have time to look at it until December.
msg38049 - (view)	Author: Tim Peters (tim.peters) *	Date: 2001-11-05 08:16
Logged In: YES user_id=31435 It would be better if get_line just called Py_UniversalNewlineFgets (when appropriate) instead of duplicating its logic inline. Py_UniversalNewlineFgets and Py_UniversalNewlineFread should deal with releasing the global lock themselves -- the correct granularity for lock release/reacquire is around the C-level input routines (esp. for fread). The new routines never check for I/O errors! Why not? It seems essential. The new Fgets checks for EOF at the end of the loop instead of the top. This is surprising, and I stared a long time in vain trying to guess why. Setting newlinetypes \|= NEWLINE_CR; immediately after seeing an '\r' would be as fast (instead of waiting to see EOF and then inferring the prior existence of '\r' indirectly from the state of the skipnextlf flag). Speaking of which <wink>, the fobj tests in the inner loop waste cycles. Set the local flag vrbls whether or not fobj is NULL. When you're out of the inner loop you can simply decline to store the new masks when fobj is NULL (and you're already doing the latter anyway). A test and branch inside the loop is much more expensive than or'ing in a flag bit inside the loop, ditto harder to understand. Floating the univ_newline test out of the loop (and duplicating the loop body, one way for univ_newline true and the other for it false) would also save a test and branch on every character. Doing fread one character at a time is very inefficient. Since you know you need to obtain n characters in the end, and that these transformations require reading at least n characters, you could very profitably read n characters in one gulp at the start, then switch to k at a time where k is the number of \r\n pairs seen since the last fread call. This is easier to code than it sounds <wink>. It would be fine by me if you included (and initialized) the new file-object fields all the time, whether or not universal newlines are configured. I'd rather waste a few bytes in a file object than see #ifdefs spread thru the code. I'll be damned if I can think of a quick way to do this stuff on Windows -- native Windows fgets() is still the only Windows handle we have on avoiding crushing thread overhead inside MS's C library. I'll think some more about it (the thrust still being to eliminate the 't' mode flag, as whined about <wink> on Python-Dev).
msg38050 - (view)	Author: Jack Jansen (jackjansen) *	Date: 2001-11-14 15:13
Logged In: YES user_id=45365 Here's a new version of the patch. To address your issues one by one: - get_line and Py_UniversalNewlineFgets are too difficult to integrate, at leat, I don't see how I could do it. The storage management of get_line gets in the way. - The global lock comment I don't understand. The Universal... routines are replacements for fgets() and fread(), so have nothing to do with the interpreter lock. - The logic of all three routines (get_line too) has changed and I've put comments in. I hope this addresses some of the points. - If universal_newline is false for a certain PyFileObject we now immedeately take a quick exit via fgets() or fread(). There's also a new test script, that tests some more border cases (like lines longer than 100 characters, and a lone CR just before end of file).
msg38051 - (view)	Author: Tim Peters (tim.peters) *	Date: 2001-12-13 23:57
Logged In: YES user_id=31435 Back to Jack -- and sorry for sitting on it so long. Clearly this isn't making it into 2.2 in the core. As I said on Python-Dev, I believe this needs a PEP: the design decisions are debatable, so should be debated outside the Mac community too. Note, though, that I can't stop you from adding it to the 2.2 Mac distribution (if you want it badly enough there). If a PEP won't be written, I suggest finding someone else to review it again; maybe Guido. Note that the patch needs doc changes too. The patch to regrtest.py doesn't belong here (I assume it just slipped in). There seems a lot of code in support of the f_newlinetypes member, and the value of that member isn't clear -- I can't imagine a good use for it (maybe it's a Mac thing?). The implementation of Py_UniversalNewlineFread appears incorrect to me: it reads n bytes every time around the outer loop, no matter how few characters are still required, and n doesn't change inside the loop. The business about the GIL may be due to the lack of docs: are, or are not, people supposed to release the GIL themselves around calls to these guys? It's not documented, and it appears your intent differed from my guess. Finally, it would be better to call ferror () after calling fread() instead of before it <wink>.
msg38052 - (view)	Author: Jack Jansen (jackjansen) *	Date: 2002-01-16 22:47
Logged In: YES user_id=45365 This version of the patch addresses the bug in Py_UniversalNewlineFread and fixes up some minor details. Tim's other issues are addressed (at least: I think they are:-) in a forthcoming PEP.
msg38053 - (view)	Author: Jack Jansen (jackjansen) *	Date: 2002-03-13 22:44
Logged In: YES user_id=45365 A new version of the patch. Main differences are that U is now the mode character to trigger universal newline input and --with-universal-newlines is default on.
msg38054 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2002-03-25 21:07
Logged In: YES user_id=6380 Thanks! But there's no documentation. Could I twist your arm for a separate doc patch? I'm tempted to give this a +1, but I'd like to hear from MvL and MAL to see if they foresee any interaction with their PEP 262 implemetation.
msg38055 - (view)	Author: Jack Jansen (jackjansen) *	Date: 2002-03-28 23:17
Logged In: YES user_id=45365 New doc patch, and new version of the patch that mainly allows the U to be specified (no-op) in non-univ-newline-builds.
msg38056 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2002-04-11 20:31
Logged In: YES user_id=21627 What is the rationale for making this a compile-time option? It seems to complicate things, with no apparent advantage. If this is for backwards compatibility, don't make it an option: nobody will rebuild Python just to work around a compatibility problem. Apart from that, the patch looks goo.d
msg38057 - (view)	Author: Jack Jansen (jackjansen) *	Date: 2002-04-13 23:31
Logged In: YES user_id=45365 A final tweaks: return a tuple of newline values in stead of 'mixed'.
msg38058 - (view)	Author: Jack Jansen (jackjansen) *	Date: 2002-04-14 20:15
Logged In: YES user_id=45365 Barry, I've checked in the code fixes, but I don't dare check in the documentation fixes myself, as I have no TeX and hence no way to test them. Could you do this, please, and also check that I'm following all the relevant guidelines? Thanks!
msg38059 - (view)	Author: Barry A. Warsaw (barry) *	Date: 2002-05-22 20:30
Logged In: YES user_id=12800 I'll be committing the documentation changes momentarily. You were pretty close, and fortunately I had Dr. Doco sitting next to me to clean up a few nits. I'm also going to check in a change to the builtin `file' docstring.

History
Date	User	Action	Args
2022-04-10 16:04:35	admin	set	github: 35434
2001-10-31 16:41:38	jackjansen	create