[Python-Dev] unicode regex quickie: should a newline be the same thing as a linebreak?

Guido van Rossum guido@python.org
Tue, 30 May 2000 08:59:37 -0500


> From: "Fredrik Lundh" <effbot@telia.com>
> 
> I wrote:
> 
> > what's the best way to deal with this?  I see three
> > alternatives:
> > 
> > a) stick to the old definition, and use chr(10) also for
> >    unicode strings
> > 
> > b) use different definitions for 8-bit strings and unicode
> >    strings; if given an 8-bit string, use chr(10); if given
> >    a 16-bit string, use the LINEBREAK predicate.
> > 
> > c) use LINEBREAK in either case.
> > 
> > I think (c) is the "right thing", but it's the only one that
> > may break existing code...
> 
> I'm probably getting old, but I don't remember if anyone followed
> up on this, and I don't have time to check the archives right now.
> 
> so for the upcoming "feature complete" release, I've decided to
> stick to (a).
> 
> ...
> 
> for the next release, I suggest implementing a fourth alternative:
> 
> d) add a new unicode flag.  if set, use LINEBREAK.  otherwise,
>    use chr(10).
> 
> background: in the current implementation, this decision has to
> be made at compile time, and a compiled expression can be used
> with either 8-bit strings or 16-bit strings.
> 
> a fifth alternative would be to use the locale flag to tell the
> difference between unicode and 8-bit characters:
> 
> e) if locale is not set, use LINEBREAK.  otherwise, use chr(10).
> 
> comments?
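
To make the difference between chr(10) and a broader linebreak
predicate concrete, here is a minimal sketch. It assumes a current
Python 3 interpreter, which postdates this thread. The names
LINEBREAKS and is_linebreak, and the unicode_flag parameter, are
hypothetical and only illustrate alternative (d); the character set
is an approximation of LINEBREAK, taken from what str.splitlines()
treats as a line boundary.

```python
import re

# Hypothetical stand-in for the LINEBREAK predicate discussed above:
# the characters that str.splitlines() treats as line boundaries.
LINEBREAKS = "\n\r\x0b\x0c\x1c\x1d\x1e\x85\u2028\u2029"


def is_linebreak(ch, unicode_flag=False):
    """Sketch of alternative (d): chr(10) only, unless the flag is set."""
    if unicode_flag:
        return ch in LINEBREAKS
    return ch == chr(10)


# U+2028 (LINE SEPARATOR) sits between "one" and "two".
text = "one\u2028two\nthree"

# In today's re module, '$' in MULTILINE mode only recognises chr(10),
# so it matches before the "\n" and at the end of the string, but not
# before U+2028.
re_line_ends = [m.start() for m in re.finditer(r"$", text, re.MULTILINE)]

# str.splitlines() applies the broader, Unicode-aware definition.
split_lines = text.splitlines()

print(re_line_ends)   # [7, 13]
print(split_lines)    # ['one', 'two', 'three']
print(is_linebreak("\u2028"), is_linebreak("\u2028", unicode_flag=True))
```

The same input therefore has two "lines" under the chr(10) rule and
three under the broader rule, which is why (c) could break existing
code while (a) and (d) would not.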

I proposed before that we see what Perl does -- since we're
supposedly following Perl's RE syntax anyway.

--Guido van Rossum (home page: http://www.python.org/~guido/)