negative lookahead question

Bengt Richter bokr at oz.net
Mon Apr 21 19:02:54 EDT 2003


On Mon, 21 Apr 2003 11:24:34 -0500, Skip Montanaro <skip at pobox.com> wrote:

>This re.sub call lives in Lib/smtplib.py as a way to make line endings
>canonical: 
>
>    re.sub(r'(?:\r\n|\n|\r(?!\n))', CRLF, data)
>
>This certainly seems to do what's desired; however, it looks overly complex
What is actually desired? Something that will work on a binary image of a text
file from any of Unix, Windows and Mac? How about the mixed uglies that happen
if you capture Windows text in binary with \r\n CRLFs, send it to Unix as
binary, and cook it coming back to Windows? This error doesn't even need a round
trip to Unix if file modes on Windows are mixed for some reason. The result is
typically \r\n -> \r\n -> \r\r\n. Should that be canonicalized as a single CRLF?
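
A minimal sketch of how the \r\r\n case can arise. The helper
cook_for_windows is hypothetical; it only imitates the '\n' -> '\r\n'
expansion that a Windows text-mode write performs, applied here to data
that already carries CRLF endings because it was read in binary:

```python
# Hypothetical helper: imitates what a Windows text-mode write does to
# newlines. It is an illustration, not a real library function.
def cook_for_windows(text):
    # Text mode on Windows expands every '\n' to '\r\n' on output.
    return text.replace('\n', '\r\n')

original = 'line 0\r\nline 1\r\n'    # CRLF endings, read in binary mode
cooked = cook_for_windows(original)  # written back out in text mode
print(repr(cooked))
```

Each existing \r\n picks up a second \r, which is the input that the
first two patterns turn into two line endings.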

>to me.  First, the non-grouping parens are unnecessary.  Second, I don't
>think the negative lookahead assertion is required.  This simpler function
>call seems to do the trick:
>
>    re.sub(r'\r\n|\n|\r', CRLF, data)
Neither of the above does \r\r\n -> CRLF, whereas

     re.sub(r'\r+\n|\n|\r', CRLF, data)

would, I think:

 >>> import re
 >>> CRLF = '<CRLF>'
 >>> pats = [r'\r\n|\n|\r(?!\n)', r'\r\n|\n|\r', r'\r+\n|\n|\r']
 >>> data = 'line 0\r\nline 1\nline 2\rline 3\r\r\nline 4\n'
 >>> for pat in pats: print(re.sub(pat, CRLF, data))
 ...
 line 0<CRLF>line 1<CRLF>line 2<CRLF>line 3<CRLF><CRLF>line 4<CRLF>
 line 0<CRLF>line 1<CRLF>line 2<CRLF>line 3<CRLF><CRLF>line 4<CRLF>
 line 0<CRLF>line 1<CRLF>line 2<CRLF>line 3<CRLF>line 4<CRLF>

I don't know what ugliness can happen ftp-ing inappropriately between
Mac and Windows, but I suppose you could get \r\n -> \r\n -> \r\n\n.
Maybe the canonicalizing pattern should be r'\r+\n+|\n|\r' ;-)
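
A quick sketch of what that more aggressive pattern would do, using
made-up sample strings. It suggests one cost: a run of \n directly after
\r is folded into a single line ending, so blank lines written with
mixed endings would be lost, while blank lines with consistent endings
survive:

```python
import re

CRLF = '<CRLF>'
pat = re.compile(r'\r+\n+|\n|\r')

# Sample inputs are illustrative only.
samples = [
    'a\r\n\nb',     # the doubled-'\n' damage described above
    'a\r\n\r\nb',   # blank line, consistent Windows endings
    'a\n\nb',       # blank line, consistent Unix endings
    'a\r\n\n\nb',   # CRLF followed by two Unix-style blank lines
]
results = [pat.sub(CRLF, s) for s in samples]
for s, r in zip(samples, results):
    print('%r -> %s' % (s, r))
```

The first and last samples come out as a single <CRLF>; the middle two
keep both line endings.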

>
>A simple test case containing a combination of different line endings seems
>to yield identical results:
>
>    >>> data = 'line 0\r\nline 1\nline 2\rline 3\r\r\nline 4\n'
>    >>> re.sub(r'\r\n|\n|\r(?!\n)',CRLF,data) == re.sub(r'\r\n|\n|\r',CRLF,data)
>    True
>
>Is there a case where the negative lookahead assertion will produce correct
>results but the simpler regular expression won't?
>
Don't see one, but that's no guarantee ;-)
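
One way to gain some confidence is a brute-force comparison. The re
module tries alternatives left to right, so a \r that is followed by \n
is consumed by the \r\n branch before the bare \r branch is ever tried,
which is why the lookahead looks redundant. The sketch below checks that
on every string of up to 8 characters drawn from '\r', '\n' and 'x'; the
function name and the length limit are arbitrary choices, and a clean
run is evidence for short inputs rather than a proof:

```python
import itertools
import re

CRLF = '<CRLF>'
with_lookahead = re.compile(r'\r\n|\n|\r(?!\n)')
without_lookahead = re.compile(r'\r\n|\n|\r')

def first_difference(max_len=8):
    """Return the first string on which the two patterns disagree, or None."""
    for n in range(max_len + 1):
        for chars in itertools.product('\r\nx', repeat=n):
            s = ''.join(chars)
            if with_lookahead.sub(CRLF, s) != without_lookahead.sub(CRLF, s):
                return s
    return None

print(first_difference())
```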

Regards,
Bengt Richter



