Best ways of managing text encodings in source/regexes?

tinkerbarbet at gmail.com
Thu Dec 6 19:36:13 EST 2007


OK, for those interested in this sort of thing, this is what I now
think is necessary to work with Unicode in Python.  Thanks to those
who gave feedback, and to Cliff in particular (but any remaining
misconceptions are my own!)  Here are the results of my attempts to
come to grips with this.  Comments/corrections welcome...

(Note this is not about which characters one expects to match with \w
etc. when compiling regexes with Python's re.UNICODE flag.  It's
about the encoding of one's source strings when building regexes, in
order to match against strings read from files of various encodings.)

I/O: READ INTO/WRITE FROM UNICODE STRING OBJECTS.  Always use codecs
to decode from a specific encoding into a Python Unicode string when
reading, and use codecs to encode from the Unicode string into a
specific encoding when writing the processed data.  A file opened
with codecs.open() delivers Unicode strings from its read() method,
and its write() method puts Unicode strings into the file's specific
encoding (be that an encoding of Unicode such as UTF-8).
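
For example, a minimal sketch of that round trip (the filenames and
encodings here are just placeholders for your own):

import codecs

# Decode Latin-1 bytes into a Unicode string on the way in
f = codecs.open('input.txt', 'r', encoding='latin-1')
text = f.read()      # text is a unicode object
f.close()

# Encode the Unicode string as UTF-8 on the way out
out = codecs.open('output.txt', 'w', encoding='utf-8')
out.write(text)
out.close()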

SOURCE: Save the source as UTF-8 in your editor, tell Python with
"# -*- coding: utf-8 -*-" at the top of the file, and construct all
strings with u'' (or ur'' instead of r'').  Then, when you
concatenate strings constructed in your source with strings read with
codecs, you needn't worry about conversion issues.  (When
concatenating a byte string from your source with a Unicode string,
Python will, without an explicit decode, assume the byte string is
ASCII, which is a subset of Unicode; it will not assume ISO-8859-1 or
any other encoding, so any byte above 127 raises a
UnicodeDecodeError.)
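
You can see the implicit-ASCII assumption bite by mixing a Unicode
string with a byte string that holds non-ASCII bytes (here in a
source file saved as UTF-8):

>>> u"caf" + "é"     # the byte string "é" is '\xc3\xa9' in UTF-8
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)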

Even if you save the source as UTF-8 and tell Python with "# -*-
coding: utf-8 -*-", saying myString = "blah" still gives you a byte
string.  To construct a Unicode string you must say myString =
u"blah" or myString = unicode("blah"), even if your source is UTF-8.
(And unicode() assumes ASCII by default, so a literal containing
non-ASCII characters needs the encoding spelled out, as in
unicode("blah", "utf-8").)

Typing 'u' when constructing all strings isn't too arduous, and it's
less effort than passing selected non-ASCII source strings to
unicode() and needing to remember where to do it.  (You could easily
slip a non-ASCII char into a byte string in your code, because most
editors and default system encodings will allow it.)  Doing
everything in Unicode simplifies life.
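
One way such a slip shows up later, assuming the source file is
saved as UTF-8:

>>> len("café")      # byte string: five bytes, 'caf\xc3\xa9'
5
>>> len(u"café")     # Unicode string: four characters
4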

Since the source is now UTF-8, and given Unicode support in the
editor, it doesn't matter whether you use Unicode escape sequences or
literal Unicode characters when constructing strings, since

>>> u"á" == u"\u00E1"
True

REGEXES: I'm a bit less certain about regexes, but this is how I
think it's going to work: now that my regexes are constructed from
Unicode strings, and those regexes will be compiled to match against
Unicode strings read with codecs, any potential problems with
encoding conversion disappear.  If I put an en-dash into a regex
built using u'', and I happen to have read the file in the ASCII
encoding, which doesn't support en-dashes, the regex simply won't
match, because the pattern doesn't exist in the /Unicode/ string
served up by codecs.  There's no actual problem with my string
encoding handling; it just means I'm looking for the wrong chars in a
Unicode string read from a file not saved in a Unicode encoding.
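
Putting the pieces together, here's a sketch of that pipeline (the
pattern, the filename, and the file's encoding are just illustrative
assumptions):

import codecs
import re

# The pattern is itself a Unicode string; u'\u2013' is an en-dash
pattern = re.compile(u'caf\u00e9 \u2013 bar', re.UNICODE)

f = codecs.open('data.txt', 'r', encoding='utf-8')
for line in f:                       # each line is a unicode object
    if pattern.search(line):
        print line.encode('utf-8'),  # encode explicitly for output
f.close()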

tIM


