Best ways of managing text encodings in source/regexes?

tinkerbarbet at gmail.com
Mon Nov 26 14:32:16 EST 2007


Hi

I've read around quite a bit about Unicode and Python's support for
it, and I'm still unclear about how it all fits together in certain
scenarios.  Can anyone help clarify?

* When I say "# -*- coding: utf-8 -*-" and confirm my IDE is saving
the source file as UTF-8, do I still need to prefix all the string
literals in the source with u, as in myStr = u"blah", even when those
strings contain only ASCII or ISO-8859-1 chars?  (It would be a bother
to do this throughout the source I'm working on, where I rarely need
chars outside the ISO-8859-1 range.)
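Concretely, this is the kind of thing I mean (a minimal Python 2
sketch, with made-up literals):

# -*- coding: utf-8 -*-
# File saved as UTF-8 by the editor.

plain = "blah"      # byte string; the coding line only tells Python how
                    # to read the source bytes, it doesn't change the type
uni = u"blah"       # unicode string, even though it's pure ASCII
accented = u"café"  # non-ASCII bytes in the literal are decoded as UTF-8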

* Will Python figure it out if I use different encodings in different
modules -- say a main source file which declares "# -*- coding: utf-8
-*-" and an imported module which doesn't (for which Python will
presumably use a default encoding)?  This seems inevitable given that
standard library modules such as re don't declare an encoding,
presumably because their source contains no non-ASCII chars.
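For example (two hypothetical files, just to illustrate the situation
I'm asking about):

# --- main.py --------------------------------------------
# -*- coding: utf-8 -*-
import helper                 # helper.py has no coding declaration
s = u"café" + helper.greet()  # this file's u"" literal is decoded as UTF-8

# --- helper.py (no coding line, ASCII-only source) ------
def greet():
    return u" hello"          # fine: the literal contains only ASCII chars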

* If I want to use a Unicode char in a regex -- say an en-dash, U+2013
-- in an ASCII- or ISO-8859-1-encoded source file, can I say

myASCIIRegex = re.compile('[A-Z]')
myUniRegex = re.compile(u'\u2013') # en-dash

then read the file I'm searching into a unicode string with
codecs.open(..., encoding=...).read(), and expect re to match against
that unicode string using either of those regexes if the string
contains the relevant chars?  Or do I need to make all my regex
patterns unicode strings, with u""?
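In other words, something along these lines (a sketch; 'data.txt' and
its encoding are made up):

import codecs
import re

myASCIIRegex = re.compile('[A-Z]')   # plain byte-string pattern
myUniRegex = re.compile(u'\u2013')   # unicode pattern (en-dash)

text = codecs.open('data.txt', encoding='utf-8').read()  # unicode string

print myASCIIRegex.search(text)  # byte-string pattern against unicode text
print myUniRegex.search(text)    # unicode pattern against unicode text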

I've been trying to understand this for a while so any clarification
would be a great help.

Tim


