Best ways of managing text encodings in source/regexes?

Mon Nov 26 17:27:36 EST 2007

> * When I say "# -*- coding: utf-8 -*-" and confirm my IDE is saving
> the source file as UTF-8, do I still need to prefix all the strings
> constructed in the source with u as in myStr = u"blah", even when
> those strings contain only ASCII or ISO-8859-1 chars?  (It would be a
> bother for me to do this for the complete source I'm working on, where
> I rarely need chars outside the ISO-8859-1 range.)

That depends on what you want to achieve. If you don't prefix your
strings with u, they stay byte string objects and don't become Unicode
strings. That should be fine for strings that are pure ASCII. For
strings with ISO-8859-1 characters beyond ASCII, I recommend using only
Unicode objects: with a UTF-8 source file, an unprefixed literal holds
the UTF-8 bytes exactly as saved, not ISO-8859-1 bytes.
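
A minimal sketch of why the distinction matters (the variable names are
made up for illustration, and escapes are used so the example does not
depend on how this message itself is encoded). The Unicode object has
one length, while the byte string's length depends on the encoding that
produced it:

```python
# The same four-character word as a Unicode object and as byte strings.
text = u"caf\u00e9"  # Unicode object: 4 characters

# What a byte string literal would hold in an ISO-8859-1 source file ...
latin1_bytes = text.encode("iso-8859-1")
# ... and what it would hold in a UTF-8 source file.
utf8_bytes = text.encode("utf-8")

unicode_length = len(text)
latin1_length = len(latin1_bytes)
utf8_length = len(utf8_bytes)

# Pure ASCII text is byte-for-byte identical in both encodings,
# which is why unprefixed ASCII literals are harmless.
ascii_is_same = u"blah".encode("iso-8859-1") == u"blah".encode("utf-8")
```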

In Py3k, that will change - string literals will automatically be
Unicode objects.

> * Will python figure it out if I use different encodings in different
> modules -- say a main source file which is "# -*- coding: utf-8 -*-"
> and an imported module which doesn't say this (for which python will
> presumably use a default encoding)?

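Yes, it will. The encoding declaration is per-module, so each source
file is decoded according to its own declaration; a module without one
is treated as ASCII (see PEP 263).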
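
If you want to check this yourself, here is a rough sketch. The module
names enc_demo_latin1 and enc_demo_utf8 are hypothetical, and the
script assumes a current Python 3 interpreter (it uses
importlib.invalidate_caches). It writes two small modules to a
temporary directory, each saved in the encoding its own declaration
names, then imports both:

```python
import importlib
import io
import os
import sys
import tempfile

# Hypothetical module names mapped to the encoding each file is saved in.
SOURCES = {
    "enc_demo_latin1": "iso-8859-1",
    "enc_demo_utf8": "utf-8",
}

tmpdir = tempfile.mkdtemp()
for name, encoding in SOURCES.items():
    # Each module declares its own encoding and contains a raw e-acute.
    source = u"# -*- coding: %s -*-\nTEXT = u'caf\u00e9'\n" % encoding
    path = os.path.join(tmpdir, name + ".py")
    with io.open(path, "w", encoding=encoding) as handle:
        handle.write(source)

sys.path.insert(0, tmpdir)
importlib.invalidate_caches()
latin1_module = importlib.import_module("enc_demo_latin1")
utf8_module = importlib.import_module("enc_demo_utf8")

# Both modules end up with the same Unicode object, although the bytes
# on disk differ.
same_text = latin1_module.TEXT == utf8_module.TEXT == u"caf\u00e9"
```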

> * If I want to use a Unicode char in a regex -- say an en-dash, U+2013
> -- in an ASCII- or ISO-8859-1-encoded source file, can I say
> 
> myASCIIRegex = re.compile('[A-Z]')
> myUniRegex = re.compile(u'\u2013') # en-dash
> 
> then read the source file into a unicode string with codecs.open(),
> then expect re to match against the unicode string using either of
> those regexes if the string contains the relevant chars?  Or do I need
> to make all my regex patterns unicode strings, with u""?

It will work fine if the regular expression restricts itself to ASCII,
and doesn't rely on any of the locale-specific character classes (such
as \w). If it goes beyond ASCII, or does use such escapes, you had
better make it a Unicode expression.
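
As an illustration, here is a short sketch built on the two patterns
from your question. The sample strings and the word_regex pattern are
my own additions; re.UNICODE asks for Unicode-aware \w, and as far as I
know it is redundant for text patterns in Python 3:

```python
import re

ascii_regex = re.compile('[A-Z]')   # ASCII-only pattern
uni_regex = re.compile(u'\u2013')   # en-dash, written as a Unicode pattern
text = u"Pages 10\u201320"

ascii_match = ascii_regex.search(text)
dash_match = uni_regex.search(text)

# A pattern that relies on \w should be a Unicode pattern compiled with
# re.UNICODE, so that accented letters count as word characters.
word_regex = re.compile(u'\\w+', re.UNICODE)
words = word_regex.findall(u"na\u00efve caf\u00e9")
```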

I'm not actually sure precisely what the semantics are when you match
an expression compiled from a byte string against a Unicode string,
or vice versa. I believe it operates on the internal representation,
so \xf6 in a byte string expression matches \u00f6 in a Unicode
string; it won't try to convert one into the other.
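
One way to sidestep the question is to decode the byte pattern
explicitly before compiling it. The sketch below assumes the pattern
bytes are ISO-8859-1 (that is my assumption, substitute whatever
encoding your patterns really use). The try block only records whether
the interpreter accepts the mixed case at all; Python 3 rejects it with
a TypeError:

```python
import re

byte_pattern = b"\xf6"     # o-umlaut as a single ISO-8859-1 byte
text = u"sch\u00f6n"       # Unicode text containing the same character

# Decode the pattern explicitly (assumed encoding: ISO-8859-1), so that
# pattern and subject are both Unicode.
pattern = re.compile(byte_pattern.decode("iso-8859-1"))
match = pattern.search(text)

# Record whether mixing a byte pattern with Unicode text is accepted.
try:
    re.compile(byte_pattern).search(text)
    mixing_allowed = True
except TypeError:
    mixing_allowed = False
```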

Regards,
Martin