unicode strings and such

Wed Sep 12 21:14:20 EDT 2001

On Wed, 12 Sep 2001, Garth Grimm wrote:

> The environment is a Python driven web server.  We build a list of two-element tuples by running one
> file.  The tuples represent a regular expression pattern to match against, and the appropriate patch
> to use for substitution.  The other file loops through the list and modifies a unicode string as
> appropriate.
>
> The data file - written with UTF-8 encoding, representing Japanese characters:
>
> <!--$#-*-mode:python;tab-width:8;py-indent-offset:4;indent-tabs-mode:nil-*-
> repairList = [
> 	( '(ãf~ãf«ãf--)(\d{3})', '\g<1> \g<2>' ),
> 	( 'ã??ã? ã?.ã?".*ãf~ãf«ãf--', 'hit me' ),
> 	( '^ã??ã? ã?.ã?"$', '\g<0> hit me again' ),
>    ]
> -->
>
> The processing file:
>
> <!--$#-*-mode:python;tab-width:8;py-indent-offset:4;indent-tabs-mode:nil-*-
> repairList = []
> exec self.file("repair_data.html")  # This causes the data file above to be executed
>
> for (pattern, patch) in repairList:
>      pattern = str(pattern)
>      patch = str(patch)
>      pattern = unicode(pattern,'utf-8')
>      patch = unicode(patch, 'utf-8')
>      patternRegex = re.compile(pattern, re.UNICODE | re.IGNORECASE)
>      qt = patternRegex.sub(patch, qt)
> -->
>
> That's what worked.
> Here are some notes on other things we tried and thought -- none of which seemed to do any good.
> 1) including encoding:utf-8 in the first line of the data file didn't do anything.

I can't for the life of me understand why that would work, but anyways...

> 2) prefixing the data in the data file to be u'(ãf~ãf«ãf--)(\d{3})' didn't work at all.

No, because then you're saying that the text in the string is UCS-2, not
UTF-8.
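
To make the difference concrete, here is a small sketch written for a current
Python 3 interpreter (an assumption; it is not the interpreter used in this
thread). The sample text is hypothetical. As far as I know, interpreters of
this era read the non-ASCII bytes in a u'' literal one byte per character,
which is what the Latin-1 decode below imitates.

```python
# Hypothetical sample text: katakana "herupu" ("help"), chosen only for illustration.
text = "\u30d8\u30eb\u30d7"

# The bytes an editor would write to a UTF-8 encoded file.
raw = text.encode("utf-8")

# Decoding those bytes as UTF-8 recovers the three original characters.
decoded = raw.decode("utf-8")

# Treating every byte as its own character (Latin-1) yields nine
# unrelated characters, so a pattern built this way cannot match the text.
misread = raw.decode("latin-1")

print(len(raw), len(decoded), len(misread))  # 9 3 9
print(decoded == text, misread == text)      # True False
```

The explicit decode step is what tells Python the bytes are UTF-8; the u''
prefix alone does not.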

> 3) why is the str() call needed?

I don't know, but the text that I'm seeing in the message doesn't seem to be
valid UTF-8 so I can't be sure why.

> Here's how I thought things should have worked...
>
> a) Use UTF-8 encoding on the data file and use u'^ã??ã? ã?.ã?"$' notation in it.  This would create two-element tuples
> of unicode strings.

No. See my response to point 2 above.

> b) loop through repairList, assigning references to each element of the tuple to patch and pattern,
> which would be references to unicode strings.

...

> c) go straight to the re.compile() call, since pattern references a unicode string.

Once the other issues are mopped up, certainly.
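
For illustration, a minimal sketch of that end state, assuming a current
Python 3 interpreter where every string is already Unicode. The names
repair_list and repair, and the sample text, are hypothetical stand-ins and
not the code from the original files.

```python
import re

# Hypothetical stand-in for the Japanese text in the data file.
HELP = "\u30d8\u30eb\u30d7"

# (pattern, replacement) pairs held as text strings, not encoded bytes.
repair_list = [
    ("(" + HELP + r")(\d{3})", r"\g<1> \g<2>"),
]

def repair(text, rules):
    """Apply each (pattern, replacement) rule to text, in order."""
    for pattern, patch in rules:
        # Flags are combined with bitwise |; a boolean "or" would pass
        # only the first flag to re.compile().
        regex = re.compile(pattern, re.UNICODE | re.IGNORECASE)
        text = regex.sub(patch, text)
    return text

# Inserts a space between the word and the three digits that follow it.
print(repair(HELP + "123", repair_list))
```

With the patterns already held as Unicode text, no str() or decode call is
needed before re.compile().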

-- 
Ignacio Vazquez-Abrams  <ignacio at openservices.net>