unicode "em space" in regex

"Martin v. Löwis" martin at v.loewis.de
Sun Apr 17 14:03:57 EDT 2005


Xah Lee wrote:
> Thanks. Is it true that any unicode chars can also be used inside regex
> literally?
> 
> e.g.
> re.search(ur' +',mystring,re.U)
> 
> I tested this case and apparently i can. 

Yes. In fact, whether you write u"\u2003" or u" " (a literal em
space) doesn't matter to re.search. Either way you get a Unicode
object with U+2003 in it, which is processed by SRE.

> But is it true that any
> unicode char can be embedded in regex literally. (does this apply to
> the esoteric ones such as other non-printing chars and combining
> forms...)

Yes. To SRE, only the Unicode ordinal values matter. For a match,
a character in the string must have the same ordinal value as the
corresponding character in the expression. No interpretation of the
character is performed, except for the few characters that have
special meaning in regular expressions (e.g. $, \, [, etc.).
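
To illustrate the ordinal-only comparison with combining forms, here
is a small sketch in Python 3 syntax (again an assumption; this thread
is about Python 2). It uses the standard unicodedata module, and all
variable names are illustrative. A precomposed U+00E9 and an "e"
followed by U+0301 render alike but have different ordinals, so one
does not match the other unless you normalize the text yourself.

```python
import re
import unicodedata

precomposed = "caf\u00e9"    # e-acute as a single code point, U+00E9
decomposed = "cafe\u0301"    # 'e' followed by COMBINING ACUTE ACCENT, U+0301

pattern = "\u00e9"

# Same ordinal in pattern and string: match.
hit_precomposed = re.search(pattern, precomposed) is not None
# Visually identical text, different ordinals: no match.
hit_decomposed = re.search(pattern, decomposed) is not None

# Normalizing to NFC first composes 'e' + U+0301 into U+00E9.
normalized = unicodedata.normalize("NFC", decomposed)
hit_normalized = re.search(pattern, normalized) is not None

print(hit_precomposed, hit_decomposed, hit_normalized)

# The exception: characters with special meaning in a pattern.
# A bare '$' matches at the end of the string; re.escape makes it literal.
end_span = re.search("$", "cost: $5").span()
dollar_span = re.search(re.escape("$"), "cost: $5").span()
print(end_span, dollar_span)
```

Whether normalizing is appropriate depends on your data; the re module
will not do it for you.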

Regards,
Martin


