SAXParseException: not well-formed (invalid token)

Thu Aug 30 09:20:15 EDT 2007

	Hi Stefan,

	The XML does specify an encoding
(<?xml version="1.0" encoding="UTF-8"?>).

	Regarding the possibility you mention of recoding the input, could you 
let me know how to do it? I am sorry, I am just starting with Python and I 
don't know how.

	Thanks for your help.
	Pablo
		

On 30/08/2007 14:37, Stefan Behnel wrote:
> Pablo Rey wrote:
>>     I am getting the following error with an XML page:
>>
>>>   File "/home/prey/RAL-CESGA/bin/voms2users/voms2users.py", line 69,
>>> in getItems
>>>     d = minidom.parseString(xml.read())
>>>   File "/usr/lib/python2.2/site-packages/_xmlplus/dom/minidom.py",
>>> line 967, in parseString
>>>     return _doparse(pulldom.parseString, args, kwargs)
>>>   File "/usr/lib/python2.2/site-packages/_xmlplus/dom/minidom.py",
>>> line 954, in _doparse
>>>     toktype, rootNode = events.getEvent()
>>>   File "/usr/lib/python2.2/site-packages/_xmlplus/dom/pulldom.py",
>>> line 265, in getEvent
>>>     self.parser.feed(buf)
>>>   File "/usr/lib/python2.2/site-packages/_xmlplus/sax/expatreader.py",
>>> line 208, in feed
>>>     self._err_handler.fatalError(exc)
>>>   File "/usr/lib/python2.2/site-packages/_xmlplus/sax/handler.py",
>>> line 38, in fatalError
>>>     raise exception
>>> xml.sax._exceptions.SAXParseException: <unknown>:553:48: not
>>> well-formed (invalid token)
>>
>>> def getItems(page):
>>>     opener = urllib.URLopener(key_file=HOSTKEY, cert_file=HOSTCERT)
>>>     try:
>>>        xml = opener.open(page)
>>>     except:
>>>        return []
>>>
>>>     d = minidom.parseString(xml.read())
>>>     items = d.getElementsByTagName('item')
>>>     data = []
>>>     for i in items:
>>>        data.append(getText(i.childNodes))
>>>
>>>     return data
>>     The page is
>> https://lcg-voms.cern.ch:8443/voms/cms/services/VOMSCompatibility?method=getGridmapUsers
>> and the line with the invalid character is (the invalid character is the
>> final é of Université):
>>
>> <item>/C=BE/O=BEGRID/OU=Physique/OU=Univesité Catholique de
>> Louvain/CN=Roberfroid</item>
>>
>>
>>     I have tried several options but I am not able to avoid this
>> problem. Any ideas?
> 
> Looks like the page is not well-formed XML (i.e. not XML at all). If it
> doesn't specify an encoding (<?xml encoding="..."?>), you can try recoding the
> input, possibly decoding it from latin-1 and re-encoding it as UTF-8 before
> passing it to the SAX parser.
> 
> Alternatively, tell the page authors to fix their page.
> 
> Stefan
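
For reference, the recoding Stefan describes could be sketched roughly as
below. This is only an illustration under some assumptions: it is written
for a current Python 3 interpreter (not the Python 2.2 in the traceback),
it assumes the stray bytes really are Latin-1, and the helper name
recode_to_utf8 is made up for this example. The sample document is a
stand-in for the real page.

```python
from xml.dom import minidom


def recode_to_utf8(raw, fallback="latin-1"):
    """Return the bytes unchanged if they are valid UTF-8; otherwise
    decode them with the fallback encoding (assumed Latin-1 here) and
    re-encode the result as UTF-8."""
    try:
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        return raw.decode(fallback).encode("utf-8")


# Stand-in for the server response: it declares UTF-8 but contains a
# Latin-1 encoded e-acute (byte 0xE9), which makes expat reject it.
sample = (b'<?xml version="1.0" encoding="UTF-8"?>'
          b'<list><item>Universit\xe9</item></list>')

doc = minidom.parseString(recode_to_utf8(sample))
text = doc.getElementsByTagName("item")[0].firstChild.data
```

In getItems() this would mean reading the response into a variable first
and passing recode_to_utf8(data) to minidom.parseString() instead of the
raw xml.read() result. It only papers over the problem; the page is still
declaring an encoding it does not use.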