unicode strings and such

Thu Sep 13 11:11:10 EDT 2001

>>2) prefixing the data in the data file to be u'(ãf~ãf«ãf--)(\d{3})' didn't work at all.
>>
> 
> No, because then you're saying that the text in the string is UCS-2, not
> UTF-8.
>

Hhmmmm.. That brings up a good question.  Does Python support file IO using different character 
encodings?  Now that I think of it, I don't recall any file open methods that took encoding 
parameters.  When reading string literals from files is the byte stream just treated as a sequence 
of integers representing character code points, which by default are Latin-1?
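
For reference, a minimal sketch of encoding-aware file I/O. It assumes a modern Python 3 interpreter, where the built-in open() accepts an encoding argument (the Python 2 of this thread offered codecs.open() for the same purpose). The file name and the sample text are hypothetical, chosen only for illustration.

```python
import os
import tempfile

# Hypothetical sample: three katakana characters, used only for illustration.
text = "\u30d8\u30eb\u30d7"

# Hypothetical file name inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")

# Write the text as UTF-8: each of these characters becomes three bytes on disk.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Reading with the matching encoding recovers the original code points.
with open(path, "r", encoding="utf-8") as f:
    as_utf8 = f.read()

# Reading the same bytes as Latin-1 maps every byte to its own character.
with open(path, "r", encoding="latin-1") as f:
    as_latin1 = f.read()

print(len(as_utf8), len(as_latin1))
```

The same nine bytes come back as three characters or as nine, depending only on the encoding named when the file is opened.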

If true, that clears up one misconception.  It would mean that the following initialization:
repairList = [
  ( '(ãf~ãf«ãf--)(\d{3})', '\g<1> \g<2>' ),
  ( 'ã??ã? ã?.ã?".*ãf~ãf«ãf--', 'hit me' ),
  ( '^ã??ã? ã?.ã?"$', '\g<0> hit me again' ),
]

should be thought of as making each element of the tuples an array of bytes that is being treated as 
a String with each byte representing a Latin-1 character code point.  The u'' notation just 
stipulates that the array of bytes should be treated as a String with byte values representing UCS-2 
character code points.  Correct?

It would also mean that since the data file was written in UTF-8, the array of bytes wouldn't 
really represent anything useful.
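
A rough sketch of that difference, assuming Python 3 (where bytes and text are separate types) and a hypothetical sample string that is not taken from the actual data file. It decodes the same UTF-8 bytes once as Latin-1 and once as UTF-8, then tries a pattern shaped like the first repairList entry against both.

```python
import re

word = "\u30d8\u30eb\u30d7"            # hypothetical sample text, 3 code points
raw = (word + "123").encode("utf-8")   # the bytes as they would sit in a UTF-8 file

wrong = raw.decode("latin-1")  # one character per byte: 9 + 3 = 12 characters
right = raw.decode("utf-8")    # real code points:       3 + 3 = 6 characters

# Pattern written in terms of the real characters, followed by three digits.
pattern = re.compile("(" + word + r")(\d{3})")

fixed = pattern.sub(r"\g<1> \g<2>", right)   # inserts a space before the digits
missed = pattern.search(wrong)               # None: the Latin-1 view never matches

print(len(wrong), len(right), missed is None)
```

The pattern only matches when the pattern and the data have both been decoded with the encoding the bytes were actually written in.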

> 
>>3) why is the str() call needed?
>>
> 
> I don't know, but the text that I'm seeing in the message doesn't seem to be
> valid UTF-8 so I can't be sure why.
> 

Any idea of what the str() is actually doing here?  If I remove the str() statements from the 
following code, I get different results (i.e. no substitutions will take place in qt).  So either 
pattern and patch (after the assignment at the beginning of the loop) aren't actually string 
objects, or str() is doing something more than the API docs state -- "For strings, this returns the 
string itself."

for (pattern, patch) in repairList:
     pattern = str(pattern)
     patch = str(patch)
     pattern = unicode(pattern,'UTF-8')
     patch = unicode(patch,'UTF-8')
     print pattern
     print patch
     # Combine flags with bitwise |; 'or' would pass only re.UNICODE.
     patternRegex = re.compile(pattern, re.UNICODE | re.IGNORECASE)
     qt = patternRegex.sub(patch, qt)
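
A side note on the flags argument to re.compile() in the loop above, separate from the str() question: Python's "or" is a boolean operator that returns its first true operand, so re flags are normally combined with bitwise "|". A small sketch, assuming Python 3 and a hypothetical input string:

```python
import re

# 'or' is boolean: it returns the first truthy operand, so IGNORECASE is dropped.
flags_or = re.UNICODE or re.IGNORECASE

# '|' is bitwise: both flags are set in the result.
flags_bitor = re.UNICODE | re.IGNORECASE

text = "Hit Me"  # hypothetical input, differs from the pattern only in case

no_match = re.compile("hit me", flags_or).search(text)
match = re.compile("hit me", flags_bitor).search(text)

print(no_match is None, match is not None)
```

With "or" the compiled pattern stays case-sensitive, which can look like a substitution silently failing.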

>>c) go straight to the re.compile() call, since pattern references a unicode string.
>>
> 
> Once the other issues are mopped up, certainly.
> 
Yeah, but it's the 'other issues' that I'm having a hard time following. ;-)

Thanks for the help so far.

Garth