Some questions about decode/encode

Sun Jan 27 05:20:08 EST 2008

On Jan 24, 5:51 pm, John Machin <sjmac... at lexicon.net> wrote:
> On Jan 24, 2:49 pm, glacier <rong.x... at gmail.com> wrote:
>
> > I use Chinese characters as an example here.
>
> > >>>s1='你好吗'
> > >>>repr(s1)
>
> > "'\\xc4\\xe3\\xba\\xc3\\xc2\\xf0'"
>
> > >>>b1=s1.decode('GBK')
>
> > My first question is: what strategy does 'decode' use to decide where
> > to separate the characters? I mean, since s1 is a multi-byte character
> > string, how does it determine whether to split it every 2 bytes or every 1 byte?
>
> The usual strategy for encodings like GBK is:
> 1. If the current byte is less than 0x80, then it's a 1-byte
> character.
> 2. Current byte 0x81 to 0xFE inclusive: current byte and the next byte
> make up a two-byte character.
> 3. Current byte 0x80: undefined (or used e.g. in cp936 for the 1-byte
> euro character)
> 4. Current byte 0xFF: undefined
>
> Cheers,
> John

Thanks John, I will try to write a function to test whether the strategy
above caused the problem I described in my first post :)
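
A minimal sketch of such a test function, following the lead-byte rules John
listed. It assumes Python 3 bytes objects (the session quoted above used a
Python 2 byte string), and the name split_gbk is hypothetical. It does not
validate the second byte of a pair, and it treats both 0x80 and 0xFF as
errors, so it only approximates what a real GBK or cp936 codec does.

```python
def split_gbk(data: bytes) -> list:
    """Split GBK-style bytes into one chunk per character (lead-byte rule only)."""
    chunks = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            width = 1                      # rule 1: single-byte (ASCII range)
        elif 0x81 <= lead <= 0xFE:
            width = 2                      # rule 2: lead byte plus one trail byte
        else:
            # rules 3 and 4: 0x80 and 0xFF are treated as undefined in this sketch
            raise ValueError("undefined lead byte 0x%02X at offset %d" % (lead, i))
        if i + width > len(data):
            raise ValueError("truncated two-byte character at offset %d" % i)
        chunks.append(data[i:i + width])
        i += width
    return chunks


s1 = b"\xc4\xe3\xba\xc3\xc2\xf0"           # the bytes shown by repr(s1) above
chunks = split_gbk(s1)
decoded = [c.decode("gbk") for c in chunks]
```

Comparing "".join(decoded) with s1.decode("gbk") is one way to check whether
the splitting rule and the codec agree on a given input.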