unicode strings and such

Garth Grimm garth_grimm at hp.com
Wed Sep 12 20:43:24 EDT 2001


After much trial and error, a co-worker of mine got some code to work (Python 2.0).  Unfortunately, 
neither of us actually knows why.  Our understanding suggests other things should have worked, but 
they didn't.  Here's what worked...

Background notes:

The environment is a Python driven web server.  We build a list of two-element tuples by running one 
file.  The tuples represent a regular expression pattern to match against, and the appropriate patch 
to use for substitution.  The other file loops through the list and modifies a unicode string as 
appropriate.

The data file - written with UTF-8 encoding, representing Japanese characters:

<!--$#-*-mode:python;tab-width:8;py-indent-offset:4;indent-tabs-mode:nil-*-
repairList = [
	( '(ãf~ãf«ãf--)(\d{3})', '\g<1> \g<2>' ),
	( 'ã??ã? ã?.ã?".*ãf~ãf«ãf--', 'hit me' ),
	( '^ã??ã? ã?.ã?"$', '\g<0> hit me again' ),
   ]
-->

The processing file:

<!--$#-*-mode:python;tab-width:8;py-indent-offset:4;indent-tabs-mode:nil-*-
repairList = []
exec self.file("repair_data.html")  # This causes the data file above to be executed

                                                                     for (pattern, patch) in repairList:
     pattern = str(pattern)
     patch = str(patch)
     pattern = unicode(pattern,'utf-8')
     patch = unicode(patch, 'utf-8')
                                        patternRegex = re.compile(pattern,re.UNICODE or re.IGNORECASE)
     qt = patternRegex.sub(patch, qt)
-->

That's what worked.
Here's some notes on other things we tried and thought -- none of which seem to do any good.
1) including encoding:utf-8 in the first line of the data file didn't do anything.
2) prefixing the data in the data file to be u'(ãf~ãf«ãf--)(\d{3})' didn't work at all.
3) why is the str() call needed?

Here's how I thought things should have worked...

a) Use UTF-8 encoding on the data file and use u'^ã??ã? ã?.ã?"$' notation in it.  This would create two-element tuples 
of unicode strings.
b) loop through repairList, assigning references to each element of the tuple to patch and pattern, 
which would be references to unicode strings.
c) go straight to the re.compile() call, since pattern references a unicode string.

Comments about why my thought process was wrong, and exactly what each step of the working way does, 
would be greatly appreciated.

Garth




More information about the Python-list mailing list