unicode strings and such
Garth Grimm
garth_grimm at hp.com
Thu Sep 13 11:11:10 EDT 2001
>>2) prefixing the data in the data file to be u'(ãf~ãf«ãf--)(\d{3})' didn't work at all.
>>
>
> No, because then you're saying that the text in the string is UCS-2, not
> UTF-8.
>
Hhmmmm.. That brings up a good question. Does Python support file IO using different character
encodings? Now that I think of it, I don't recall any file open methods that took encoding
parameters. When reading string literals from files is the byte stream just treated as a sequence
of integers representing character code points, which by default are Latin-1?
If true, that clears up one misconception. It would mean that the following initialization:
repairList = [
    ( '(ãf~ãf«ãf--)(\d{3})', '\g<1> \g<2>' ),
    ( 'ã??ã? ã?.ã?".*ãf~ãf«ãf--', 'hit me' ),
    ( '^ã??ã? ã?.ã?"$', '\g<0> hit me again' ),
]
should be thought of as making each element of the tuples an array of bytes that is being treated as
a String with each byte representing a Latin-1 character code point. The u'' notation just
stipulates that the array of bytes should be treated as a String with byte values representing UCS-2
character code points. Correct?
It would also mean that since the data file was written in UTF-8, the array of bytes wouldn't
really represent anything useful.
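To make the byte-versus-character distinction concrete, here is a minimal sketch. It assumes a modern Python 3 interpreter, where the built-in open() accepts an encoding argument (the code elsewhere in this thread is Python 2, where codecs.open() plays that role). The sample text is hypothetical and is not taken from the actual data file.

```python
import os
import tempfile

# Hypothetical sample text (three katakana characters), for illustration only.
text = u'\u30d8\u30eb\u30d7'

# Write the text to a temporary file as UTF-8 bytes.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, 'wb') as f:
    f.write(text.encode('utf-8'))

# Binary mode returns the raw bytes: three bytes per character here.
with open(path, 'rb') as f:
    raw = f.read()

# Treating each byte as a Latin-1 code point gives nine unrelated characters.
as_latin1 = raw.decode('latin-1')

# Supplying the real encoding while reading recovers the original text.
with open(path, 'r', encoding='utf-8') as f:
    decoded = f.read()

os.remove(path)

print(len(raw), len(as_latin1), len(decoded))
```

The same bytes give nine characters when read as Latin-1 and three when decoded as UTF-8, which is why a pattern built from the undecoded bytes does not line up with text that has been decoded.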
>
>>3) why is the str() call needed?
>>
>
> I don't know, but the text that I'm seeing in the message doesn't seem to be
> valid UTF-8 so I can't be sure why.
>
Any idea of what the str() is actually doing here? If I remove the str() statements from the
following code, I get different results (i.e. no substitutions will take place in qt). So either
pattern and patch (after the assignment at the beginning of the loop) aren't actually string
objects, or str() is doing something more than the API docs state -- "For strings, this returns the
string itself."
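for (pattern, patch) in repairList:
    # The str() calls in question.
    pattern = str(pattern)
    patch = str(patch)
    # Decode the UTF-8 bytes into unicode objects.
    pattern = unicode(pattern, 'UTF-8')
    patch = unicode(patch, 'UTF-8')
    print pattern
    print patch
    # Combine flags with |; "or" would evaluate to re.UNICODE alone.
    patternRegex = re.compile(pattern, re.UNICODE | re.IGNORECASE)
    qt = patternRegex.sub(patch, qt)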
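One detail worth checking when passing several flags to re.compile() is how they are combined. The sketch below assumes Python 3 and uses a hypothetical ASCII pattern and input rather than the real data. It compares the boolean `or` with the bitwise `|`.

```python
import re

# `or` returns its first truthy operand, so IGNORECASE is silently dropped.
flags_with_or = re.UNICODE or re.IGNORECASE
# `|` sets both flag bits.
flags_with_pipe = re.UNICODE | re.IGNORECASE

# Hypothetical pattern, replacement, and input, used only for illustration.
pattern = u'(help)(\\d{3})'
patch = u'\\g<1> \\g<2>'
qt = u'HELP123'

with_or = re.compile(pattern, flags_with_or).sub(patch, qt)
with_pipe = re.compile(pattern, flags_with_pipe).sub(patch, qt)

print(with_or)    # unchanged: the case-sensitive pattern does not match
print(with_pipe)  # the two groups are now separated by a space
```

Only the `|` form matches case-insensitively, so a mismatch in letter case can look like a substitution that never ran.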
>>c) go straight to the re.compile() call, since pattern references a unicode string.
>>
>
> Once the other issues are mopped up, certainly.
>
yeah. but it's the 'other issues' that I'm having a hard time following. ;-)
Thanks for the help so far.
Garth