Some questions about decode/encode

glacier rong.xian at gmail.com
Sun Jan 27 08:50:53 EST 2008


On Jan 27, 7:20 pm, John Machin <sjmac... at lexicon.net> wrote:
> On Jan 27, 9:17 pm, glacier <rong.x... at gmail.com> wrote:
>
> > On Jan 24, 3:29 pm, "Gabriel Genellina" <gagsl-... at yahoo.com.ar> wrote:
>
> > > On Thu, 24 Jan 2008 04:52:22 -0200, glacier <rong.x... at gmail.com> wrote:
>
> > > > According to your reply, what will happen if I try to decode a long
> > > > string in separate pieces?
> > > > I mean:
> > > > ######################################
> > > > a = '你好吗' * 100000   # multi-byte GBK text repeated ('你好吗' means 'how are you')
> > > > s1 = u''
> > > > cur = 0
> > > > while cur < len(a):
> > > >     d = min(len(a) - cur, 1023)   # size of this chunk: at most 1023 bytes
> > > >     s1 += a[cur:cur+d].decode('mbcs')
> > > >     cur += d
> > > > ######################################
>
> > > > Could the code above produce any bogus characters in s1?
>
> > > Don't do that. You might be splitting the input string at a point that is
> > > not a character boundary. You won't get bogus output; decode will raise a
> > > UnicodeDecodeError instead.
> > > You can control how errors are handled; see http://docs.python.org/lib/string-methods.html#l2h-237
>
> > > --
> > > Gabriel Genellina
>
> > Thanks Gabriel,
>
> > I think I understand what will happen if I don't split the string at
> > a character boundary.
> > What I'm not sure about is whether the decode method itself can
> > mis-split at a character boundary.
> > Can you tell me?
>
> > Thanks a lot.
>
> *IF* the file is well-formed GBK, then the codec will not mess up when
> decoding it to Unicode. The usual cause of mess is a combination of a
> human and a text editor :-)
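
For what it's worth, the chunk-boundary problem in the loop quoted above goes
away if the chunks are fed to an incremental decoder, which buffers a trailing
partial character and completes it with the next chunk. A minimal sketch,
assuming Python 2.5+ and using the cross-platform 'gbk' codec in place of the
Windows-only 'mbcs' (decode_in_chunks is just an illustrative name):

import codecs

def decode_in_chunks(data, encoding='gbk', chunk_size=1023):
    # Decode a byte string piece by piece without breaking a
    # multi-byte character that straddles a chunk boundary.
    decoder = codecs.getincrementaldecoder(encoding)()
    parts = []
    for start in range(0, len(data), chunk_size):
        # The incremental decoder holds back any incomplete trailing
        # bytes and prepends them to the next chunk.
        parts.append(decoder.decode(data[start:start + chunk_size]))
    parts.append(decoder.decode('', final=True))  # flush; raises if data is truncated
    return u''.join(parts)

With that, slicing at an arbitrary byte offset such as 1023 no longer raises
UnicodeDecodeError, as long as the data as a whole is well-formed.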

I guess the first thing I should do is check whether the file I used for
testing is well-formed GBK. :)
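
A quick way to do that check is simply to try decoding the whole file and see
where it fails, if anywhere. A rough sketch (is_well_formed_gbk and the
filename 'test.txt' are just placeholders):

def is_well_formed_gbk(path):
    data = open(path, 'rb').read()
    try:
        data.decode('gbk')
        return True
    except UnicodeDecodeError, e:
        # e.start is the offset of the first offending byte
        print 'bad GBK sequence at offset %d: %r' % (e.start, data[e.start:e.start + 4])
        return False

print is_well_formed_gbk('test.txt')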


