[Python-3000] Regular expressions, py3k and unicode
Guido van Rossum
guido at python.org
Sun Jun 29 00:16:39 CEST 2008
On Sat, Jun 28, 2008 at 1:45 PM, Antoine Pitrou <solipsis at pitrou.net> wrote:
> Several posters (including a certain GvR) in the bug tracker (*) have been
> baffled by an apparent bug where the re.IGNORECASE flag didn't imply
> case-insensitivity for non-ASCII characters. It turns out that, although the
> pattern was a string object and although Py3k is supposed to be
> unicode-friendly, you still need to supply the re.UNICODE flag if you want the
> re module to use unicode-aware case-insensitive matching.
>
> Wouldn't it be more natural that, at least when the pattern is a str object
> rather than a bytes object, the re.UNICODE flag be implied by default?
+1
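
A minimal sketch of what the proposed default would look like, assuming an
interpreter in which str patterns are Unicode-aware without re.UNICODE (later
Python 3 releases behave this way; re.ASCII is the opt-out flag there and is
not part of the proposal as quoted). The variable names are illustrative only.

```python
import re

# Assumption: str patterns are Unicode-aware by default, so no re.UNICODE
# flag is needed for case-insensitive matching of non-ASCII letters.
unicode_pat = re.compile('Á', re.IGNORECASE)
default_match = unicode_pat.match('á')

# Assumption: re.ASCII restricts case folding to ASCII letters, which
# reproduces the behaviour that surprised people in the bug report.
ascii_pat = re.compile('Á', re.IGNORECASE | re.ASCII)
ascii_match = ascii_pat.match('á')

print(default_match is not None)  # Unicode-aware: 'Á' matches 'á'
print(ascii_match is None)        # ASCII-only: no match
```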
> (*) http://bugs.python.org/issue2834
>
>
> Another question in the same vein: is it normal that we can match a bytes object
> with a str pattern and vice-versa?
>
> pat = re.compile('Á', re.IGNORECASE | re.UNICODE)
> pat.match('á'.encode('latin1'))
> # gives <_sre.SRE_Match object at 0xb7c66c60>
>
> pat = re.compile('Á'.encode('latin1'), re.IGNORECASE | re.UNICODE)
> pat.match('á')
> # gives <_sre.SRE_Match object at 0xb7c66c60>
This made sense in 2.x where text could be represented by str or
unicode. It makes a lot less sense now, and I suspect it can cause
widespread confusion. Forbidding this would also be another step in
the direction we're already taking of never allowing implicit
conversion between str and bytes.
--
--Guido van Rossum (home page: http://www.python.org/~guido/)