Unicode perplex

Mon Jun 21 16:47:37 EDT 2004

I've got an interesting little problem that I can't find an
answer to after hunting through the docs (Python 2.3.3). I've
got a string that contains something that kind of
resembles an HTML document. On looking through
it, I find a <meta http-equiv="content-type"
content="text/html; charset=UTF-8"> tag.

The problem is that I've got a normal (8-bit) string whose
bytes are actually UTF-8. How do I turn it into a Unicode
string? Remember that the trick is that it's still going to
have the *same* stream of bytes (at least if the Unicode
string is implemented in UTF-8). I don't need to convert it
with a codec; I need to change the class under the data.
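
To make the situation concrete, here is a minimal sketch of what
I mean. It's written for a current Python, where a byte string is
a `bytes` object, and the sample data and variable names are made
up for illustration. It shows the codec route for comparison:
decoding the bytes as UTF-8 gives a Unicode string, and encoding
that string again gives back the identical byte stream.

```python
# Hypothetical sample: a byte string whose contents are UTF-8 encoded HTML.
raw = (b'<meta http-equiv="content-type" '
       b'content="text/html; charset=UTF-8"><p>caf\xc3\xa9</p>')

# Decoding interprets the bytes as UTF-8 and yields a Unicode string.
text = raw.decode("utf-8")

# The lengths differ because the accented letter takes two bytes in UTF-8
# but counts as a single character in the Unicode string.
print(len(raw), len(text))

# Encoding back to UTF-8 reproduces the original byte stream exactly.
round_trip = text.encode("utf-8")
print(round_trip == raw)
```

Whether the Unicode object really stores its data as UTF-8
internally is an assumption on my part; the sketch only shows that
the round trip preserves the bytes.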

I don't want to have to write a C-language
extension, and I also don't want to have to write
it out to a file and read it back in. The product
involved (FIT) is distributed under the GPL[1], so
packages that don't have the same license (or
that aren't maintained across all systems which
support Python) aren't eligible.

It's also not possible to ask the service caller to
properly specify the string when they pass it to me.

Any ideas?

John Roth

[1] That wasn't my choice, so political comments
aren't relevant. Bitch at Ward Cunningham if you
want to bitch.