UNICODE mode for regular expressions - time to change the default?

Steve Holden steve at holdenweb.com
Thu Apr 5 17:44:01 EDT 2007


John Nagle wrote:
>    Regular expressions are compiled in ASCII mode unless
> Unicode mode is specified to "re.compile".  The difference is that regular
> expressions in ASCII mode don't recognize things like
> Unicode whitespace, even when applied to Unicode strings.
> For example, Unicode character 0x00A0 is a "NO-BREAK SPACE", which is
> a form of whitespace. It's the Unicode equivalent of HTML's "&nbsp;".
> This can create some strange bugs.
> 
>    Is the current default good?  Or is it time to compile all regular
> expressions in Unicode mode by default?  It shouldn't hurt processing of
> ASCII strings to do that.  The current setup is really a legacy of when
> most things in Python didn't work in Unicode mode, and you didn't want to
> introduce Unicode unnecessarily.   It's another one of those obscure
> Unicode "gotchas" that really should go away.
> 
> 					John Nagle

Personally I'd leave it to go away with Python 3.0, when all strings 
will be Unicode.
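
For illustration, here is a minimal sketch of the difference John describes.
It assumes a Python 3 interpreter, where str patterns are matched in Unicode
mode by default and the re.ASCII flag restores the ASCII-only behaviour. On
the 2.x releases current at the time of writing you would instead pass
re.UNICODE to get the second result. The variable names are made up for the
example.

```python
import re

NBSP = "\u00a0"  # NO-BREAK SPACE, the character behind HTML's &nbsp;

# ASCII mode: \s only matches space, tab, newline, CR, form feed, vertical tab.
ascii_ws = re.compile(r"\s", re.ASCII)

# Unicode mode (the default for str patterns in Python 3):
# \s also matches Unicode whitespace such as NO-BREAK SPACE.
unicode_ws = re.compile(r"\s")

text = "foo" + NBSP + "bar"

ascii_hit = ascii_ws.search(text) is not None
unicode_hit = unicode_ws.search(text) is not None

ascii_split = ascii_ws.split(text)      # NBSP is not seen as a separator
unicode_split = unicode_ws.split(text)  # NBSP is treated as whitespace

print("ASCII mode finds whitespace:  ", ascii_hit)
print("Unicode mode finds whitespace:", unicode_hit)
print("ASCII split:  ", ascii_split)
print("Unicode split:", unicode_split)
```

In ASCII mode the string stays in one piece, which is the kind of "strange
bug" John mentions: the text looks like two words on screen but the regular
expression does not see a separator.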

regards
  Steve
-- 
Steve Holden       +44 150 684 7255  +1 800 494 3119
Holden Web LLC/Ltd          http://www.holdenweb.com
Skype: holdenweb     http://del.icio.us/steve.holden
Recent Ramblings       http://holdenweb.blogspot.com



