[BangPypers] Fwd: Handling unicode characters in xml.dom

Thu Mar 20 06:27:46 CET 2008

Thanks Anand for your help. Forwarding your post to the group.

Regards,
Gurpreet Singh

---------- Forwarded message ----------
From: Anand Balachandran Pillai <abpillai at gmail.com>
Date: Wed, Mar 19, 2008 at 11:48 PM
Subject: Re: [BangPypers] Handling unicode characters in xml.dom
To: Gurpreet Sachdeva <gurpreet.sachdeva at gmail.com>

Hi Gurpreet,

    The problem is that the file contains some junk characters (mostly
Japanese text, since the original file seems to be Japanese) that show up
as control characters when the bytes are read as ascii. When the parser
reaches the first control character (^S), which is not a legal character
in XML 1.0, it stops and produces a "not well-formed (invalid token)" error.
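
To check whether a file really contains such control characters, a small
scan along these lines may help. This is only a sketch in Python 3; the
helper name and the sample bytes are made up for illustration and are not
taken from the original file.

```python
import re

# Below 0x20, XML 1.0 permits only tab (0x09), LF (0x0A) and CR (0x0D).
# Every other byte in that range is matched by this pattern.
ILLEGAL_XML_CONTROL = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def find_illegal_control_bytes(raw):
    """Return (offset, byte value) pairs for control bytes XML 1.0 forbids."""
    return [(m.start(), m.group()[0]) for m in ILLEGAL_XML_CONTROL.finditer(raw)]


# Hypothetical sample containing ^S (0x13) inside element text.
sample = b"<doc>abc\x13def</doc>"
print(find_illegal_control_bytes(sample))
```

For a real file, read it in binary mode (open(path, 'rb')) and pass the
bytes to the helper.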

  The way to solve this is to decode the file and encode it again in an
encoding other than ascii. I tried iso-8859-1 decoding and unicode-escape
encoding, and it works. For this you need the codecs module, since plain
file objects in Python fall back to ascii when asked to write unicode text.

Here is the full code...
---------------------------------------------
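# Python 2 code
import codecs
import xml.dom.minidom as mdom

# Read the original file as a byte string.
data = open('problem.xml').read()

# Bytes written to e are decoded as iso-8859-1 and stored in the
# underlying file re-encoded as unicode-escape.
f = open('problem2.xml', 'w')
e = codecs.EncodedFile(f, 'iso-8859-1', 'unicode-escape')
e.write(data)
e.close()

# unicode-escape turned each line break into the literal text \r\n;
# put real newlines back and rewrite the file.
data = open('problem2.xml').read()
data = data.replace("\\r\\n", '\n')
out = open('problem2.xml', 'w')
out.write(data)
out.close()

print mdom.parse('problem2.xml')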
--------------------------------------------------

The unicode-escape encoding converts the junk characters to hex escape
sequences, but it also escapes each line break to the literal text "\r\n".
So we replace these sequences with a real newline ("\n") again.
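
As a rough illustration of what this re-encoding does to the bytes, here is
a small Python 3 sketch. The sample bytes are invented, not taken from the
original file. Note that unicode-escape also doubles any backslashes and
escapes tabs, so the result is best treated as a workaround rather than a
faithful copy of the data.

```python
# Made-up sample: a Latin-1 byte (0xE9), a control byte (0x13), and a CRLF.
raw = b"caf\xe9\x13 line one\r\nline two"

# Every byte maps to a code point in iso-8859-1, so decoding cannot fail.
text = raw.decode("iso-8859-1")

# unicode_escape turns non-printable and non-ascii characters into
# backslash escapes, leaving plain ascii output.
escaped = text.encode("unicode_escape")
print(escaped)

# The line break is now the four characters \r\n; restore a real newline.
fixed = escaped.replace(b"\\r\\n", b"\n")
print(fixed)
```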

The modified file is saved in problem2.xml.
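
For anyone on Python 3, the same idea can be sketched without
codecs.EncodedFile by working on bytes directly. This is an adaptation
under assumptions, not the code that was run against the original file:
the function name and sample document are hypothetical, and the second
replace() (for files with bare "\n" line endings) is an addition that
would also alter any literal backslash-n text in the data.

```python
import xml.dom.minidom as mdom


def escape_control_bytes(raw):
    """Re-encode raw bytes so that control characters become ascii escapes."""
    escaped = raw.decode("iso-8859-1").encode("unicode_escape")
    # Turn the escaped line breaks back into real newlines.
    return escaped.replace(b"\\r\\n", b"\n").replace(b"\\n", b"\n")


# Hypothetical document with a ^S (0x13) byte in the element text.
sample = b'<?xml version="1.0" encoding="UTF-8"?>\r\n<doc>abc\x13def</doc>\r\n'

doc = mdom.parseString(escape_control_bytes(sample))
# The control byte survives only as the literal text \x13.
print(doc.documentElement.firstChild.data)
```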

Btw, can you forward this to the list? I am on a slow connection, so I am
using the HTML interface to Gmail, and address completion is missing.

HTH,

--Anand

On 3/19/08, Gurpreet Sachdeva <gurpreet.sachdeva at gmail.com> wrote:
> Hi Anand,
>
> Please find attached the xml file that contains the garbage characters. Is
> there a way we can handle them?
>
> Thanks for your help.
> Gurpreet
>
> On Tue, Mar 18, 2008 at 1:22 PM, Anand Balachandran Pillai <
> abpillai at gmail.com> wrote:
>
> > Is the garbage in character data or attribute data?
> >
> > Character data is like <elem>text</elem> and an attribute
> > is like <elem attr="value" />
> >
> > Can you paste the relevant part of the XML file here or, if it is
> > small enough, the complete XML file? Send it directly to me
> > since the list removes attachments.
> >
> > --Anand
> >
> > On Tue, Mar 18, 2008 at 11:05 AM, Gurpreet Sachdeva
> > <gurpreet.sachdeva at gmail.com> wrote:
> > > <?xml version="1.0" encoding="UTF-8"?>
> > >
> > > Still the problem exists.
> > >
> > > - Gurpreet
> > >
> > >
> > >
> > > On Tue, Mar 18, 2008 at 10:44 AM, Anand Balachandran Pillai
> > > <abpillai at gmail.com> wrote:
> > >
> > > > What is the encoding of your XML file? That is, in the
> > > > declaration <?xml version="1.0" encoding="<encoding>"?>,
> > > > what is <encoding>?
> > > >
> > > > Make sure it is an encoding like utf-8 or iso-8859-1
> > > > that lets the parser decode the non-ascii
> > > > characters.
> > > >
> > > > --Anand
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > On Tue, Mar 18, 2008 at 10:38 AM, Gurpreet Sachdeva
> > > > <gurpreet.sachdeva at gmail.com> wrote:
> > > > > Hi,
> > > > >
> > > > > Any idea how to handle the unicode characters existing in an xml
> > > > > file while parsing it?
> > > > >
> > > > > This is what I am doing:
> > > > >
> > > > > from xml.dom import minidom
> > > > >
> > > > > xmlObj = minidom.parse(fileobj)
> > > > >
> > > > > And the script throws an error because of some special characters
> > > > > ['f (3gpÕ¡¤ë'] present in the xml file. Any suggestion/pointers
> > > > > would be appreciated.
> > > > >
> > > > > Thanks and Regards,
> > > > > Gurpreet Singh
> > > > > _______________________________________________
> > > > > BangPypers mailing list
> > > > > BangPypers at python.org
> > > > > http://mail.python.org/mailman/listinfo/bangpypers
> > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > -Anand
> > > >
> > >
> > >
> > >
> > > --
> > > Thanks and Regards,
> > > Gurpreet Singh
> > >
> > >
> >
> >
> >
> > --
> > -Anand
> >
>
>
>
> --
> Thanks and Regards,
> Gurpreet Singh
>
--
-Anand

-- 
Thanks and Regards,
Gurpreet Singh