Unicode strings and ascii regular expressions

Fuzzyman fuzzyman at gmail.com
Mon Jan 30 18:11:12 EST 2006


Hello all,

Can someone confirm that regular expressions compiled from ASCII
strings will always (and safely) yield unicode values when matched
against unicode strings?

I've tested it and it works, but can someone confirm that this is
consistent and safe? I'm worried about lurking encode errors. I assume
only a decode is done; in that case, is it safe on a system whose
default encoding is not ASCII-compatible? OTOH, it seems to me that
such a default encoding would break *everything*.

>>> import re
>>> r = re.compile('(.*)=(.*)')
>>> s = '£££=£££'.decode('cp1252') # yields a unicode string that can't be encoded as ascii
>>> c = r.match(s)
>>> c.groups()   # yields two unicode strings
(u'\xa3\xa3\xa3', u'\xa3\xa3\xa3')
>>> print c.groups()[0].encode('cp1252') # which encode safely
£££
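
The session above depends on the console encoding being cp1252, since
the pound signs are typed in directly. A variant of the same check that
avoids that dependency might look like the sketch below. It assumes
nothing beyond the standard re module; the names pattern, text, left
and right are only illustrative, and the input is written with \xa3
escapes so the source itself stays pure ASCII. It only demonstrates the
behaviour for this one pattern and input and is not a general proof.

```python
import re

# Pattern written as a plain ASCII string literal, as in the session above.
pattern = re.compile('(.*)=(.*)')

# Unicode input built from escapes, so no console or source encoding is involved.
# u'\xa3' is the pound sign, which cannot be encoded as ASCII.
text = u'\xa3\xa3\xa3=\xa3\xa3\xa3'

match = pattern.match(text)
left, right = match.groups()

# Both groups should come back as the same string type as the input.
assert type(left) is type(text)
assert type(right) is type(text)

# Encoding explicitly with cp1252 gives one byte per pound sign.
encoded = left.encode('cp1252')
assert len(encoded) == 3
```

The assertions make the check independent of how the terminal displays
the result, which a print statement would not be.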


All the best,


Fuzzyman
http://www.voidspace.org.uk/python/index.shtml



