Some questions about decode/encode

glacier rong.xian at gmail.com
Sun Jan 27 08:49:58 EST 2008


On Jan 27, 7:04 pm, John Machin <sjmac... at lexicon.net> wrote:
> On Jan 27, 9:18 pm, glacier <rong.x... at gmail.com> wrote:
>
>
>
>
>
> > On Jan 24, 4:44 pm, Marc 'BlackJack' Rintsch <bj_... at gmx.net> wrote:
>
> > > On Wed, 23 Jan 2008 19:49:01 -0800, glacier wrote:
> > > > My second question is: is there any one who has tested very long mbcs
> > > > decode? I tried to decode a long(20+MB) xml yesterday, which turns out
> > > > to be very strange and cause SAX fail to parse the decoded string.
>
> > > That's because SAX wants bytes, not a decoded string.  Don't decode it
> > > yourself.
>
> > > > However, I use another text editor to convert the file to utf-8 and
> > > > SAX will parse the content successfully.
>
> > > Because now you feed SAX with bytes instead of a unicode string.
>
> > > Ciao,
> > >         Marc 'BlackJack' Rintsch
>
> > Yepp. I feed SAX with the unicode string since SAX didn't support my
> > encoding system(GBK).
>
> Let's go back to the beginning. What is "SAX"? Show us exactly what
> command or code you used.
>
SAX is the package 'xml.sax' distributed with Python 2.5 :)
1. I read the text from a GBK-encoded XML file, then skip the first
line, which declares the encoding.
2. I convert the string to unicode by calling decode('mbcs').
3. I use xml.sax.parseString to parse the string.

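########################################################################
    import xml.sax

    # Read the raw GBK-encoded bytes.
    f = file('e:/temp/456.xml','rb')
    s = f.read()
    f.close()
    # Drop the first line (the XML declaration naming the encoding).
    s = s[s.find('\n')+1:]
    # Wrap the rest in a single root element.
    s = '<root>'+s+'</root>'
    # Decode with the Windows ANSI code page codec.
    s = s.decode('mbcs')
    # handler (defined elsewhere) serves as both content and error handler.
    xml.sax.parseString(s,handler,handler)
########################################################################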
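
Following Marc's point that SAX wants bytes, here is a minimal sketch of an alternative, written for Python 3 rather than the 2.5 used above: decode the body explicitly as GBK, then hand the parser UTF-8 bytes. The names parse_gbk_fragment and TextCollector are hypothetical, and the explicit 'gbk' codec is an assumption standing in for 'mbcs', which exists only on Windows and follows the system's ANSI code page.

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Hypothetical handler that just gathers character data."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


def parse_gbk_fragment(raw, handler):
    """Parse GBK-encoded bytes whose first line is the XML declaration."""
    # Drop the first line (the declaration that names the encoding).
    body = raw[raw.find(b'\n') + 1:]
    # Decode explicitly as GBK (assumption: the file really is GBK).
    text = body.decode('gbk')
    # Wrap in a single root element, as in the original code.
    wrapped = '<root>' + text + '</root>'
    # Feed the parser bytes; with no declaration it assumes UTF-8.
    xml.sax.parseString(wrapped.encode('utf-8'), handler)
```

For example, with a small in-memory document instead of the 20+MB file:

    raw = '<?xml version="1.0" encoding="GBK"?>\n<a>中文</a>'.encode('gbk')
    handler = TextCollector()
    parse_gbk_fragment(raw, handler)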


> How did you let this SAX know that the file was encoded in GBK? An
> argument to SAX? An encoding declaration in the first few lines of the
> file? Some other method? ... precise answer please. Or did you expect
> that this SAX would guess correctly what the encoding was without
> being told?
I didn't tell SAX that the file is encoded in GBK, since I used the
'parseString' method.
>
> What does "didn't support my encoding system" mean? Have you actually
> tried pushing raw undecoded GBK at SAX using a suitable documented
> method of telling SAX that the file is in fact encoded in GBK? If so,
> what was the error message that you got?
I mean that SAX only supports a limited number of encodings, such as
utf-8 and utf-16, which don't include GBK.

>
> How do you know that it's GBK, anyway? Have you considered these
> possible scenarios:
> (1) It's GBK but you are telling SAX that it's GB2312
> (2) It's GB18030 but you are telling SAX it's GBK
>
Frankly speaking, I cannot tell if the file contains any GB18030
characters...^______^
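
One rough way to check would be to try decoding the raw bytes with each candidate codec and see which ones fail. The following is only a sketch in Python 3, with a hypothetical function name; a successful decode does not prove the file was written in that encoding, it only shows that the bytes are valid in it.

```python
def decodable_with(raw, encodings=('gb2312', 'gbk', 'gb18030')):
    """Report, per codec, whether the raw bytes decode without error."""
    results = {}
    for enc in encodings:
        try:
            raw.decode(enc)
            results[enc] = True
        except UnicodeDecodeError:
            results[enc] = False
    return results
```

If 'gbk' fails while 'gb18030' succeeds, the file probably contains characters outside GBK; if 'gb2312' fails while 'gbk' succeeds, it uses characters that GBK added on top of GB2312.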
> HTH,
> John



