Unicode perplex

John Roth newsgroups at jhrothjr.com
Mon Jun 21 17:57:42 EDT 2004


"Irmen de Jong" <irmen at -nospam-remove-this-xs4all.nl> wrote in message
news:40d74e5d$0$568$e4fe514c at news.xs4all.nl...
> John Roth wrote:
>
> > Remember that the trick
> > is that it's still going to have the *same* stream of
> > bytes (at least if the Unicode string is implemented
> > in UTF-8.)
>
> Which it isn't.
>
> AFAIK Python's storage format for Unicode strings is
> some form of 2-byte representation, it certainly isn't
> UTF-8.
>
> So if you want to turn your string into a Python Unicode
> object, you really have to push it through the UTF-8 codec...

I see. I'm really very much a novice at Unicode and all
the codec stuff. If I understand you, I need to get the
UTF-8 codec and use its decode function to turn the bytes
into a Unicode string, and then use the encode function to
turn it back into a standard 8-bit string so I can write
it out (or send it down the pipe or socket...)
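
A minimal sketch of that round trip, assuming a current Python 3
interpreter (where the 8-bit string type is called bytes and str is
the Unicode type). The sample bytes and variable names are made up
for illustration:

```python
import codecs

# Sample input (an assumption for illustration): the UTF-8 bytes for
# "café", as they might arrive from a file, pipe or socket.
raw = b"caf\xc3\xa9"

# Decode: 8-bit byte string -> Unicode string.
text = raw.decode("utf-8")

# Encode: Unicode string -> 8-bit byte string, ready to write out.
out = text.encode("utf-8")

# The same thing done by looking the codec up explicitly. The codec's
# decode/encode functions return a (result, length consumed) pair.
utf8 = codecs.lookup("utf-8")
text2, consumed = utf8.decode(raw)
out2, _ = utf8.encode(text2)

print(len(raw), len(text))   # byte count vs. character count
print(out == raw)            # the round trip restores the byte stream
```

The decoded string is shorter than the byte string here because the
accented character takes two bytes in UTF-8 but counts as a single
character once decoded.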

Thanks. Now that you point it out, it does look kind
of obvious - the second time.

John Roth
>
> --Irmen




