Best ways of managing text encodings in source/regexes?

Tue Nov 27 09:19:00 EST 2007

On Nov 26, 2007 4:27 PM, "Martin v. Löwis" <martin at v.loewis.de> wrote:
> > myASCIIRegex = re.compile('[A-Z]')
> > myUniRegex = re.compile(u'\u2013') # en-dash
> >
> > then read the source file into a unicode string with codecs.read(),
> > then expect re to match against the unicode string using either of
> > those regexes if the string contains the relevant chars?  Or do I need
> > to do make all my regex patterns unicode strings, with u""?
>
> It will work fine if the regular expression restricts itself to ASCII,
> and doesn't rely on any of the locale-specific character classes (such
> as \w). If it's beyond ASCII, or does use such escapes, you better make
> it a Unicode expression.

yes, you have to be careful when writing unicode-senstive regular expressions:
http://effbot.org/zone/unicode-objects.htm

"You can apply the same pattern to either 8-bit (encoded) or Unicode
strings. To create a regular expression pattern that uses Unicode
character classes for \w (and \s, and \b), use the "(?u)" flag prefix,
or the re.UNICODE flag:

    pattern = re.compile("(?u)pattern")
    pattern = re.compile("pattern", re.UNICODE)

"

>
> I'm not actually sure what precisely the semantics is when you match
> an expression compiled from a byte string against a Unicode string,
> or vice versa. I believe it operates on the internal representation,
> so \xf6 in a byte string expression matches with \u00f6 in a Unicode
> string; it won't try to convert one into the other.
>
> Regards,
> Martin
>
> --
> http://mail.python.org/mailman/listinfo/python-list
>