Parsing unicode (devanagari) text with xml.dom.minidom

rparimi at gmail.com
Sun Mar 8 17:37:30 EDT 2009


On Mar 8, 12:42 am, Stefan Behnel <stefan... at behnel.de> wrote:
> rpar... at gmail.com wrote:
> > I am trying to process an xml file that contains unicode characters
> > (see http://vyakarnam.wordpress.com/). Wordpress allows exporting the
> > entire content of the website into an xml file. Using
> > xml.dom.minidom, I wrote a few lines of python code to parse out the
> > xml file, but am stuck with the following error:
>
> > >>> import xml.dom.minidom
> > >>> dom = xml.dom.minidom.parse("wordpress.2009-02-19.xml")
> > >>> titles = dom.getElementsByTagName("title")
> > >>> for title in titles:
> > ...    print "childNode = ", title.childNodes
> > ...
> > childNode =  [<DOM Text node "Sanskrit N...">]
> > childNode =  [<DOM Text node "Sanskrit N...">]
> > childNode =  []
> > childNode =  []
> > childNode =  [<DOM Text node "1-1-1">]
> > childNode =  Traceback (most recent call last):
> >   File "<stdin>", line 2, in <module>
> > UnicodeEncodeError: 'ascii' codec can't encode characters in position
> > 16-18: ordinal not in range(128)
>
> That's because you are printing it out to your console, in which case you
> need to make sure it's encoded properly for printing. repr() might also help.
>
> Regarding minidom, you might be happier with the xml.etree package that
> comes with Python 2.5 and later (it's also available for older versions).
> It's a lot easier to use, more memory friendly and also much faster.
>
> Stefan
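
A minimal sketch of the encoding advice above, for anyone hitting the same
error. It is written for Python 3 (the session quoted above uses Python 2
syntax). SAMPLE is a made-up stand-in for the WordPress export, and
title_texts() and to_console_bytes() are hypothetical helper names, not part
of minidom. The idea is to pull the text out as unicode first, then either
print repr() of it or encode it explicitly for the console.

```python
import sys
import xml.dom.minidom

# Hypothetical stand-in for the WordPress export; the real file is not shown.
SAMPLE = (u"<rss><title>Sanskrit Notes</title>"
          u"<title>\u0935\u0943\u0926\u094d\u0927\u093f</title>"
          u"<title/></rss>")


def title_texts(dom):
    """Return the text of every <title> element as unicode strings."""
    texts = []
    for title in dom.getElementsByTagName("title"):
        # An empty <title/> has no child nodes, so join whatever is there.
        texts.append(u"".join(
            node.data for node in title.childNodes
            if node.nodeType == node.TEXT_NODE))
    return texts


def to_console_bytes(text, encoding=None):
    """Encode text for the terminal; unencodable characters become '?'."""
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "utf-8"
    return text.encode(encoding, "replace")


dom = xml.dom.minidom.parseString(SAMPLE.encode("utf-8"))
titles = title_texts(dom)
for text in titles:
    # repr() is always safe to print; the bytes are what a UTF-8 console would get.
    print(repr(text), repr(to_console_bytes(text, "utf-8")))
```

With an ASCII-only console the "replace" error handler substitutes "?" for
each Devanagari character instead of raising UnicodeEncodeError.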

Thanks for the reply. I didn't realize that printing to console was
causing the problem. I am now able to parse out the relevant portions
of my xml file. Will also look at the xml.etree module.
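
For reference, a rough sketch of the same title extraction with
xml.etree.ElementTree, written for Python 3. SAMPLE_XML is an invented
fragment; the layout of the real export (channel/item nesting) is an
assumption here.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the structure of the real export is assumed.
SAMPLE_XML = (u"<rss><channel><title>Sanskrit Notes</title>"
              u"<item><title>\u0935\u0943\u0926\u094d\u0927\u093f</title></item>"
              u"<item><title/></item></channel></rss>")

root = ET.fromstring(SAMPLE_XML.encode("utf-8"))

# iter() walks the whole tree; elem.text is None for an empty element,
# so fall back to an empty string.
etree_titles = [elem.text or u"" for elem in root.iter("title")]

for text in etree_titles:
    # repr() avoids depending on what the console encoding can display.
    print(repr(text))
```

To read from a file instead, ET.parse(filename).getroot() gives the same kind
of root element.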

Regards


