Best ways of managing text encodings in source/regexes?

Mon Nov 26 17:40:45 EST 2007

On Nov 27, 12:27 am, "Martin v. Löwis" <mar... at v.loewis.de> wrote:
> > * When I say "# -*- coding: utf-8 -*-" and confirm my IDE is saving
> > the source file as UTF-8, do I still need to prefix all the strings
> > constructed in the source with u as in myStr = u"blah", even when
> > those strings contain only ASCII or ISO-8859-1 chars?  (It would be a
> > bother for me to do this for the complete source I'm working on, where
> > I rarely need chars outside the ISO-8859-1 range.)
>
> Depends on what you want to achieve. If you don't prefix your strings
> with u, they will stay byte string objects, and won't become Unicode
> strings. That should be fine for strings that are pure ASCII; for
> ISO-8859-1 strings, it is safer to use only Unicode objects to
> represent them.
>
> In Py3k, that will change - string literals will automatically be
> Unicode objects.
>
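
A minimal sketch of that difference, for the archive. The variable names
are only illustrative, and the b"" prefix (Python 2.6 and later) is my
assumption so that the same lines also run on Python 3; in a UTF-8 source
file under Python 2, an unprefixed "café" literal holds the same five
bytes as byte_str here.

```python
# -*- coding: utf-8 -*-
# Explicit prefixes so the sketch behaves the same on Python 2.6+ and 3.3+.

# Byte string: the UTF-8 encoding of "café" (5 bytes).
byte_str = b"caf\xc3\xa9"

# Unicode string: the same text as 4 characters.
text_str = u"caf\u00e9"

# Lengths differ because one counts bytes and the other counts characters.
byte_len = len(byte_str)
text_len = len(text_str)

# Decoding with the right codec turns the bytes into the Unicode string.
decoded = byte_str.decode("utf-8")

# Pure-ASCII byte strings decode to the same characters under any
# ASCII-compatible codec, which is why they are the safe case.
ascii_decoded = b"blah".decode("ascii")
```

So the u prefix only changes anything once a literal leaves ASCII.
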
> > * Will python figure it out if I use different encodings in different
> > modules -- say a main source file which is "# -*- coding: utf-8 -*-"
> > and an imported module which doesn't say this (for which python will
> > presumably use a default encoding)?
>
> Yes, it will. The encoding declaration is per-module.
>
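
A rough way to check this, assuming two throwaway modules (mod_utf8 and
mod_latin1 are hypothetical names) written to a temporary directory. Each
file carries its own coding declaration and is saved in that encoding;
the Unicode literal inside should come out identical after import.

```python
# -*- coding: utf-8 -*-
import importlib
import io
import os
import sys
import tempfile

tmp_dir = tempfile.mkdtemp()

# Same module text for both files; only the declared and on-disk encoding differ.
source = u'# -*- coding: %s -*-\nNAME = u"caf\u00e9"\n'

for mod_name, encoding in [("mod_utf8", "utf-8"), ("mod_latin1", "iso-8859-1")]:
    path = os.path.join(tmp_dir, mod_name + ".py")
    with io.open(path, "w", encoding=encoding) as f:
        f.write(source % encoding)

# Make the temporary directory importable, then import both modules.
sys.path.insert(0, tmp_dir)
mod_utf8 = importlib.import_module("mod_utf8")
mod_latin1 = importlib.import_module("mod_latin1")

# Each module was decoded using its own declaration.
same_name = mod_utf8.NAME == mod_latin1.NAME
```
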
> > * If I want to use a Unicode char in a regex -- say an en-dash, U+2013
> > -- in an ASCII- or ISO-8859-1-encoded source file, can I say
>
> > myASCIIRegex = re.compile('[A-Z]')
> > myUniRegex = re.compile(u'\u2013') # en-dash
>
> > then read the source file into a unicode string with codecs.open(),
> > then expect re to match against the unicode string using either of
> > those regexes if the string contains the relevant chars?  Or do I need
> > to make all my regex patterns unicode strings, with u""?
>
> It will work fine if the regular expression restricts itself to ASCII,
> and doesn't rely on any of the locale-specific character classes (such
> as \w). If it goes beyond ASCII, or does use such escapes, you had
> better make it a Unicode expression.
>
> I'm not actually sure what precisely the semantics are when you match
> an expression compiled from a byte string against a Unicode string,
> or vice versa. I believe it operates on the internal representation,
> so \xf6 in a byte string expression matches with \u00f6 in a Unicode
> string; it won't try to convert one into the other.
>
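
A sketch of the all-Unicode route, which sidesteps the mixed case
entirely. The file name and sample text are made up for illustration, and
I use io.open() (Python 2.6 and later) as an assumption;
codecs.open(path, encoding="utf-8") is the older spelling. Note that
Python 3 refuses to match a bytes pattern against a Unicode string, so
this sketch keeps every pattern Unicode.

```python
# -*- coding: utf-8 -*-
import io
import os
import re
import tempfile

# Write a small UTF-8 file containing an en-dash (U+2013).
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"Pages 10\u201320 cover ASCII")

# Read it back as a Unicode string.
with io.open(path, encoding="utf-8") as f:
    text = f.read()

# Both patterns are Unicode strings, even the ASCII-only one.
upper_regex = re.compile(u"[A-Z]")
dash_regex = re.compile(u"\u2013")  # en-dash

dash_pos = dash_regex.search(text).start()
uppers = upper_regex.findall(text)

# With re.UNICODE, \w follows Unicode character properties, not the locale.
word_regex = re.compile(u"\\w+", re.UNICODE)
words = word_regex.findall(u"caf\u00e9 bar")
```
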
> Regards,
> Martin

Thanks Martin, that's a very helpful response to what I was concerned
might be an overly long query.

Yes, I'd read that in Py3k string literals would become Unicode strings
by default -- I look forward to that...

Tim