Some questions about decode/encode

Sun Jan 27 06:20:36 EST 2008

On Jan 27, 9:17 pm, glacier <rong.x... at gmail.com> wrote:
> On Jan 24, 3:29 pm, "Gabriel Genellina" <gagsl-... at yahoo.com.ar> wrote:
>
>
>
> > On Thu, 24 Jan 2008 04:52:22 -0200, glacier <rong.x... at gmail.com> wrote:
>
> > > According to your reply, what will happen if I try to decode a long
> > > string in separate pieces?
> > > I mean:
> > > ######################################
> > > a='你好吗'*100000
> > > s1 = u''
> > > cur = 0
> > > while cur < len(a):
> > >     d = min(len(a)-cur,1023)
> > >     s1 += a[cur:cur+d].decode('mbcs')
> > >     cur += d
> > > ######################################
>
> > > Could the code above produce any bogus characters in s1?
>
> > Don't do that. You might be splitting the input string at a point that is
> > not a character boundary. You won't get bogus output, decode will raise a
> > UnicodeDecodeError instead.
> > You can control how errors are handled, see  http://docs.python.org/lib/string-methods.html#l2h-237
>
> > --
> > Gabriel Genellina
>
> Thanks Gabriel,
>
> I think I understand what will happen if I don't split the string at
> a character boundary.
> What I'm not sure about is whether the decode method itself can
> mis-split a boundary.
> Can you tell me?
>
> Thanks a lot.

*IF* the file is well-formed GBK, then the codec will not mess up when
decoding it to Unicode. The usual cause of mess is a combination of a
human and a text editor :-)
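
If the data really does have to be decoded in fixed-size pieces, one option is an incremental decoder, which holds on to a dangling lead byte and joins it with the next piece. The following is only a minimal sketch, under some assumptions that are not in the original post: it is written for Python 3 (bytes in, str out), it uses the 'gbk' codec rather than 'mbcs' (which exists only on Windows), and the variable names are made up for illustration.

```python
import codecs

# Assumption: the input is well-formed GBK. Each of these three
# characters takes 2 bytes in GBK, so one repetition is 6 bytes.
data = '你好吗'.encode('gbk') * 1000

# Naive chunking: 1023 is odd, so the first chunk ends in the middle
# of a 2-byte character and a plain decode() fails.
try:
    data[:1023].decode('gbk')
    naive_ok = True
except UnicodeDecodeError:
    naive_ok = False

# Incremental decoding: the decoder object buffers an incomplete
# trailing sequence until the rest arrives in the next chunk.
decoder = codecs.getincrementaldecoder('gbk')()
parts = []
for start in range(0, len(data), 1023):
    parts.append(decoder.decode(data[start:start + 1023]))
# final=True flushes the decoder and raises if bytes are left over.
parts.append(decoder.decode(b'', final=True))
text = ''.join(parts)
```

With this approach the chunk size no longer has to line up with character boundaries; a genuinely malformed file would still raise UnicodeDecodeError unless a different error handler is passed when the decoder is created.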