Some questions about decode/encode

Thu Jan 24 01:52:22 EST 2008

On Jan 24, 1:41 pm, Ben Finney <bignose+hates-s... at benfinney.id.au>
wrote:
> Ben Finney <bignose+hates-s... at benfinney.id.au> writes:
> > glacier <rong.x... at gmail.com> writes:
>
> > > I use Chinese characters as an example here.
>
> > > >>>s1='你好吗'
> > > >>>repr(s1)
> > > "'\\xc4\\xe3\\xba\\xc3\\xc2\\xf0'"
> > > >>>b1=s1.decode('GBK')
>
> > > My first question is: what strategy does 'decode' use to work out
> > > where to separate the words? I mean, since s1 is a multi-byte-char
> > > string, how does it decide whether to split the string every 2 bytes
> > > or every 1 byte?
>
> > The codec you specified ("GBK") is, like any character-encoding
> > codec, a precise mapping between characters and bytes. It's almost
> > certainly not aware of "words", only character-to-byte mappings.
>
> To be clear, I should point out that I didn't mean to imply static
> tabular mappings only. The mappings in a character encoding are often
> more complex and algorithmic.
>
> That doesn't make them any less precise, of course; and the core point
> is that a character-mapping codec is *only* about getting between
> characters and bytes, nothing else.
>
> --
>  \                 "He who laughs last, thinks slowest."  -- Anonymous |
>   `\                                                                   |
> _o__)                                                                  |
> Ben Finney

Thanks for your response :)

When I mentioned 'word' in the previous post, I meant character.
Given your reply, what will happen if I decode a long string in
separate pieces?
I mean:
######################################
a='你好吗'*100000    # a byte string (str), not unicode
s1 = u''
cur = 0
while cur < len(a):
    # take at most 1023 bytes per slice
    d = min(len(a)-cur,1023)
    s1 += a[cur:cur+d].decode('mbcs')
    cur += d
######################################

Could the code above produce any bogus characters in s1?
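
My worry is that every character in this example takes 2 bytes, so a 1023-byte slice has to end in the middle of a character, and a plain decode of that slice should fail in strict mode. For comparison, here is a minimal sketch of decoding in slices with an incremental decoder from the codecs module, which holds back an unfinished character until the next slice arrives. The assumptions are mine: I use 'gbk' instead of 'mbcs' (which only exists on Windows), the bytes are the ones from the repr quoted above, and decode_in_chunks is just a name I made up.

```python
import codecs

# GBK bytes for the three example characters (2 bytes each), repeated.
data = b'\xc4\xe3\xba\xc3\xc2\xf0' * 1000

def decode_in_chunks(raw, chunk_size=1023, encoding='gbk'):
    """Decode raw in fixed-size slices (hypothetical helper).

    The incremental decoder keeps a trailing partial character in its
    own buffer and completes it when the next slice is fed in.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    parts = []
    for start in range(0, len(raw), chunk_size):
        parts.append(decoder.decode(raw[start:start + chunk_size]))
    # final=True makes the decoder complain if bytes are still pending.
    parts.append(decoder.decode(b'', final=True))
    return u''.join(parts)

text = decode_in_chunks(data)
```

If this works the way I think it does, text should be identical to what data.decode('gbk') returns in one go.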

Thanks :)