Unicode strings and ascii regular expressions

Mon Jan 30 18:29:15 EST 2006

Fuzzyman wrote:

> Can someone confirm that compiled regular expressions from ascii
> strings will always (and safely) yield unicode values when matched
> against unicode strings ?
>
> I've tested it and it works - but can someone confirm that this is
> consistent and safe ? (No lurking encode errors - I assume it is only a
> decode that is done, in which case is it safe on a system that has a
> non-ascii compatible default encoding ? OTOH it would seem to me that
> that would break *everything*.)
>
> >>> import re
> >>> r = re.compile('(.*)=(.*)')
> >>> s = '£££=£££'.decode('cp1252') # yields a unicode string that can't be encoded as ascii
> >>> c = r.match(s)
> >>> c.groups()   # yields two unicode strings
> (u'\xa3\xa3\xa3', u'\xa3\xa3\xa3')
> >>> print c.groups()[0].encode('cp1252') # which encode safely
> £££

ascii patterns work just fine on unicode strings.  the engine doesn't care
what string type you use for the pattern, and it always returns slices of
the target string, so you get back what you pass in.
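
a minimal sketch of that behaviour, reusing the pattern from the question.
the names `pattern`, `target`, `left` and `right` are only illustrative, and
the u'' literal is an assumption made so the snippet runs unchanged on both
Python 2 and Python 3.3+:

```python
import re

# ASCII-only pattern, written as a plain string literal
pattern = re.compile('(.*)=(.*)')

# target contains U+00A3 (pound sign), which ASCII cannot represent
target = u'\xa3\xa3\xa3=\xa3\xa3\xa3'

match = pattern.match(target)
left, right = match.groups()

# each group is a slice of the target, so it has the target's type
assert type(left) is type(target)
assert left == target[match.start(1):match.end(1)]
assert right == target[match.start(2):match.end(2)]
```

nothing is encoded or decoded along the way; the groups are cut straight
out of the target string by position.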

</F>