[Python-3000] Regular expressions, py3k and unicode

Sun Jun 29 00:16:39 CEST 2008

On Sat, Jun 28, 2008 at 1:45 PM, Antoine Pitrou <solipsis at pitrou.net> wrote:
> Several posters (including a certain GvR) in the bug tracker (*) have been
> baffled by an apparent bug where the re.IGNORECASE flag didn't imply
> case-insensitivity for non-ASCII characters. It turns out that, although the
> pattern was a string object and although Py3k is supposed to be
> unicode-friendly, you still need to supply the re.UNICODE flag if you want the
> re module to use unicode-aware case-insensitive matching.
>
> Wouldn't it be more natural that, at least when the pattern is a str object
> rather than a bytes object, the re.UNICODE flag be implied by default?

+1
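
For illustration, here is a minimal sketch of the proposed behaviour. It
assumes an interpreter where re.UNICODE is implied for str patterns, which is
how current Python 3 releases behave; the variable names are only for this
example. re.ASCII is the flag those releases provide to opt back into
ASCII-only matching.

```python
import re

# str pattern with no explicit re.UNICODE flag: on an interpreter where
# Unicode matching is implied for str patterns, IGNORECASE covers
# non-ASCII letters too.
pat = re.compile('Á', re.IGNORECASE)
unicode_match = pat.match('á')

# re.ASCII restricts case folding to ASCII letters, so the same
# comparison no longer matches.
ascii_pat = re.compile('Á', re.IGNORECASE | re.ASCII)
ascii_match = ascii_pat.match('á')

print(unicode_match is not None)
print(ascii_match is not None)
```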

> (*) http://bugs.python.org/issue2834
>
>
> Another question in the same vein: is it normal that we can match a bytes object
> with an str pattern and vice-versa?
>
>  pat = re.compile('Á', re.IGNORECASE | re.UNICODE)
>  pat.match('á'.encode('latin1'))
>  # gives <_sre.SRE_Match object at 0xb7c66c60>
>
>  pat = re.compile('Á'.encode('latin1'), re.IGNORECASE | re.UNICODE)
>  pat.match('á')
>  # gives <_sre.SRE_Match object at 0xb7c66c60>

This made sense in 2.x, where text could be represented by either str or
unicode. It makes a lot less sense now, and I suspect it can cause
widespread confusion. Forbidding it would also be another step in
the direction we're already taking of never allowing implicit
conversion between str and bytes.
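
As a rough sketch of what forbidding the mix looks like in practice, the
snippet below assumes an interpreter that rejects it with a TypeError, as
current Python 3 releases do. The helper mixing_is_rejected is hypothetical
and exists only for this example. The re.UNICODE flag is left out because
those releases do not accept it on a bytes pattern.

```python
import re

def mixing_is_rejected(pattern, subject):
    """Hypothetical helper: True if matching raises TypeError."""
    try:
        re.compile(pattern, re.IGNORECASE).match(subject)
    except TypeError:
        return True
    return False

# str pattern against a bytes subject
str_on_bytes = mixing_is_rejected('Á', 'á'.encode('latin1'))

# bytes pattern against a str subject
bytes_on_str = mixing_is_rejected('Á'.encode('latin1'), 'á')

# pattern and subject of the same type are still accepted
same_type = mixing_is_rejected('Á', 'á')

print(str_on_bytes, bytes_on_str, same_type)
```

Code that really needs to match across the two types has to encode or
decode explicitly first, which keeps the choice of encoding visible.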

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)