Parsing unicode (devanagari) text with xml.dom.minidom
rparimi at gmail.com
Sat Mar 7 20:24:03 EST 2009
Hello,
I am trying to process an xml file that contains unicode characters
(see http://vyakarnam.wordpress.com/). Wordpress allows exporting the
entire content of the website into an xml file. Using
xml.dom.minidom, I wrote a few lines of python code to parse out the
xml file, but am stuck with the following error:
>>> import xml.dom.minidom
>>> dom = xml.dom.minidom.parse("wordpress.2009-02-19.xml")
>>> titles = dom.getElementsByTagName("title")
>>> for title in titles:
...     print "childNode = ", title.childNodes
...
childNode = [<DOM Text node "Sanskrit N...">]
childNode = [<DOM Text node "Sanskrit N...">]
childNode = []
childNode = []
childNode = [<DOM Text node "1-1-1">]
childNode = Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 16-18: ordinal not in range(128)
>>>
The error was raised while printing the following node:
<title>अन् </title>
The xml header tells me that the document is UTF-8:
<?xml version="1.0" encoding="UTF-8"?>
I am running Python 2.5.1 on Mac OS X 10.5.6 and my locale settings are
as below:
$ locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=
I googled around for similar errors, and tried using unicode but that
didn't help either:
>>> foo = unicode(titles[5].childNodes)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 16-18: ordinal not in range(128)
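A possible direction, offered only as a sketch: both tracebacks appear to come from converting the repr of the node list to ASCII, not from the parse itself (positions 16-18 line up with the three Devanagari characters that follow the 16-character prefix `<DOM Text node "`). Reading each text node's `data` attribute and encoding it explicitly may sidestep the problem. The inline `SAMPLE` document and the helper name `title_texts` below are assumptions made for illustration; they are not part of the actual WordPress export.

```python
# -*- coding: utf-8 -*-
import xml.dom.minidom

# Hypothetical stand-in for the WordPress export: three <title> elements,
# one ASCII, one Devanagari (U+0905 U+0928 U+094D plus a space), one empty.
SAMPLE = (
    u'<?xml version="1.0" encoding="UTF-8"?>'
    u'<rss><title>1-1-1</title>'
    u'<title>\u0905\u0928\u094d </title>'
    u'<title></title></rss>'
).encode("utf-8")


def title_texts(dom):
    """Return the text of every <title> element as a unicode string."""
    texts = []
    for title in dom.getElementsByTagName("title"):
        # Use the nodes' data directly instead of printing the node list,
        # whose repr is what gets pushed through the 'ascii' codec.
        parts = [
            node.data
            for node in title.childNodes
            if node.nodeType in (node.TEXT_NODE, node.CDATA_SECTION_NODE)
        ]
        texts.append(u"".join(parts))  # empty titles yield u""
    return texts


dom = xml.dom.minidom.parseString(SAMPLE)
texts = title_texts(dom)

# Encode explicitly to UTF-8 byte strings rather than relying on the
# default codec; these can then be written to a UTF-8 terminal or file.
encoded = [text.encode("utf-8") for text in texts]
```

With the real file, `xml.dom.minidom.parse("wordpress.2009-02-19.xml")` would take the place of `parseString(SAMPLE)`. Under Python 2, `print` on each item of `encoded` writes the raw UTF-8 bytes, which a terminal with a UTF-8 locale should display as Devanagari.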
I'm a novice with unicode, and am not sure how best to
handle the unicode text I'm dealing with (devanagari). Any
suggestions will be helpful.
Thanks