Can I get the 8bit-string representation of any unicode string

Sun Feb 12 10:48:09 EST 2006

wanghz at gmail.com wrote:

> I have a problem when I'm processing unicode strings.  Is it possible
> to get the 8bit-string representation of any unicode string?
>
> Suppose I get a unicode string:
>   a = u'\xc8\xce\xcf\xcd\xc6\xeb';
> then, by
>   a.encode('latin-1');
> I can get the 8bit-string representation of it, that is, the physical
> storage format of this string.
>
> But for another kind of unicode string, say:
>   b = u'\u4efb\u8d24\u9f50';
> I have to:
>   b.encode('utf-8')
> to get the 8bit-string format of it.

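latin-1 and utf-8 are two different 8-bit representations (encodings) of
Unicode.  latin-1 can only represent the first 256 code points, one byte
each; utf-8 can represent every code point, using one to four bytes each.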
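
To make the difference concrete, here is a small sketch using the two
strings from the question.  The variable names (a_latin1, a_utf8, b_utf8,
latin1_ok) are only illustrative, and the u'' literals are kept so the
snippet matches the original post (they are also accepted by Python 3.3
and later).

```python
# The two strings from the question.
a = u'\xc8\xce\xcf\xcd\xc6\xeb'   # six code points, all below U+0100
b = u'\u4efb\u8d24\u9f50'         # three code points, all above U+00FF

a_latin1 = a.encode('latin-1')    # one byte per code point
a_utf8 = a.encode('utf-8')        # two bytes per code point in this range
b_utf8 = b.encode('utf-8')        # three bytes per code point in this range

# Same text, different byte lengths depending on the encoding.
print(len(a_latin1), len(a_utf8), len(b_utf8))

# latin-1 has no byte for anything above U+00FF, so encoding b fails.
try:
    b.encode('latin-1')
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
print(latin1_ok)
```

The same string a comes out as 6 bytes in latin-1 and 12 bytes in utf-8,
while b takes 9 bytes in utf-8 and cannot be encoded as latin-1 at all.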

> Since these unicode strings are given by an external library function,
> I don't know which kind a unicode string belongs to before I get it at
> runtime.  So, I wonder if there is a unified way to get the 8bit-string
> representation, say, byte-by-byte, of any unicode string?

Since the Unicode character set contains about 1.1 million code points,
and a single byte can hold only 256 different values, it should be fairly
obvious that there's no "8 bit byte by byte" (one byte per character)
representation of an arbitrary Unicode string.  You need to decide which
8-bit encoding to use, and stick to that.

</F>