[newbie] String to binary conversion

Tue Aug 7 16:17:37 EDT 2012

On Tuesday, August 7, 2012 at 10:01:05 AM UTC+8, Steven D'Aprano wrote:
> On Mon, 06 Aug 2012 22:46:38 +0200, Mok-Kong Shen wrote:
>
> > If I have a string "abcd" then, with 8-bit encoding of each character,
> > there is a corresponding 32-bit binary integer. How could I best obtain
> > that integer and from that integer backwards again obtain the original
> > string? Thanks in advance.
>
> First you have to know the encoding, as that will define the integers you
> get. There are many 8-bit encodings, but of course they can't all encode
> arbitrary 4-character strings. Since there are tens of thousands of
> different characters, and an 8-bit encoding can only code for 256 of
> them, there are many strings that an encoding cannot handle.
>
> For those, you need multi-byte encodings like UTF-8, UTF-16, etc.
> 
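To make that concrete, here is a small sketch (the names `text` and `sizes` are mine, chosen only for illustration). It shows that the same four characters take a different number of bytes depending on the encoding, so the "four characters, 32 bits" assumption only holds for a one-byte encoding:

```python
# Illustrative sketch; `text` and `sizes` are names chosen for this example.
text = "aπ©d"

sizes = {
    # ISO-8859-7 is a one-byte encoding that happens to contain all four characters.
    "iso-8859-7": len(text.encode("iso-8859-7")),
    # UTF-8 uses one byte per ASCII character and two bytes each for π and ©.
    "utf-8": len(text.encode("utf-8")),
    # UTF-16 (little-endian, no BOM) uses two bytes for each of these characters.
    "utf-16-le": len(text.encode("utf-16-le")),
}

for name, n in sizes.items():
    print(name, n)
```

With these inputs the byte counts should come out as 4, 6 and 8 respectively.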
> Sticking to one-byte encodings: since most of them are compatible with
> ASCII, examples with "abcd" aren't very interesting:
>
> py> 'abcd'.encode('latin1')
> b'abcd'
>
> Even though the bytes object b'abcd' is printed as if it were a string,
> it is actually treated as an array of one-byte ints:
>
> py> b'abcd'[0]
> 97
> 
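Since the original question asked for a single integer, one possible way to get from the bytes to that integer and back is `int.from_bytes` and `int.to_bytes` (Python 3.2 and later). The helper names below are hypothetical, and big-endian byte order is my assumption; the question did not specify one.

```python
# Illustrative sketch; the helper names are hypothetical, not from the thread.
def text_to_int(text, encoding="latin1"):
    """Encode text to bytes, then read those bytes as one big-endian integer."""
    return int.from_bytes(text.encode(encoding), byteorder="big")


def int_to_text(number, length, encoding="latin1"):
    """Turn the integer back into `length` bytes and decode them."""
    return number.to_bytes(length, byteorder="big").decode(encoding)


n = text_to_int("abcd")
print(hex(n))             # 0x61626364
print(int_to_text(n, 4))  # abcd
```

The length has to be carried along separately, because leading zero bytes (a string starting with '\x00') would otherwise be lost in the integer.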
> Here's a more interesting example, using Python 3: it uses at least one
> character (the Greek letter π) which cannot be encoded in Latin1, and two
> which cannot be encoded in ASCII:
>
> py> "aπ©d".encode('iso-8859-7')
> b'a\xf0\xa9d'
> 
> Most encodings will round-trip successfully:
>
> py> text = 'aπ©Z!'
> py> data = text.encode('iso-8859-7')
> py> data.decode('iso-8859-7') == text
> True
>
> (although the ability to round-trip is a property of the encoding itself,
> not of the encoding system).
> 
> Naturally if you encode with one encoding, and then decode with another,
> you are likely to get different strings:
>
> py> text = 'aπ©Z!'
> py> data = text.encode('iso-8859-7')
> py> data.decode('latin1')
> 'að©Z!'
> py> data.decode('iso-8859-14')
> 'aŵ©Z!'
> 
> Both the encode and decode methods take an optional argument, errors,
> which specifies the error handling scheme. The default is errors='strict',
> which raises an exception. Others include 'ignore' and 'replace'.
>
> py> 'aŵðπ©Z!'.encode('ascii', 'ignore')
> b'aZ!'
> py> 'aŵðπ©Z!'.encode('ascii', 'replace')
> b'a????Z!'
> 
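For completeness, a sketch of what the default 'strict' handler does, plus two other standard handlers ('backslashreplace' and 'xmlcharrefreplace') that keep the information instead of discarding it. The variable names are only for illustration.

```python
# Illustrative sketch; variable names are chosen for this example only.
text = "aŵðπ©Z!"

try:
    text.encode("ascii")  # errors='strict' is the default
except UnicodeEncodeError as exc:
    # exc.start and exc.end delimit the part of the input that failed.
    failed = text[exc.start:exc.end]
    print("cannot encode:", failed)

# These handlers substitute an ASCII escape for each unencodable character.
escaped = text.encode("ascii", "backslashreplace")
xml_refs = text.encode("ascii", "xmlcharrefreplace")
print(escaped)
print(xml_refs)
```

Unlike 'ignore' and 'replace', the output of these two handlers still identifies the original characters, although it does not decode back to them automatically with a plain `.decode('ascii')`.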
> -- 
> Steven

I think a UTF-8 or UTF-16 codec is necessary; just recall the Microsoft
encoding codecs of Win98 and NT, which collected taxes all over the world.

In practice, for each character encoding, please develop a codec to and
from UTF-8 or UTF-16.

That means one can convert between any two of the supported character
sets.
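
In Python 3 the str type already plays that pivot role, so a conversion between two byte encodings can be sketched as a decode followed by an encode. The `transcode` helper below is hypothetical (it is not a standard library function), and it only works when the target encoding can represent every character in the data.

```python
# Illustrative sketch; `transcode` is a hypothetical helper, not a standard function.
def transcode(data, source, target, errors="strict"):
    """Convert bytes from one encoding to another by way of a Unicode str."""
    return data.decode(source).encode(target, errors)


greek = b"a\xf0\xa9d"  # "aπ©d" encoded as ISO-8859-7
as_utf8 = transcode(greek, "iso-8859-7", "utf-8")
back = transcode(as_utf8, "utf-8", "iso-8859-7")

print(as_utf8)
print(back == greek)
```

Going through Unicode means N encodings need N codecs rather than one converter for every pair.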