xml processing : too slow...

Alex Martelli aleax at aleax.it
Thu Jul 25 12:37:54 EDT 2002


Shagshag13 wrote:

>> p.Parse('<fict>%s</fict>' % line, 1)
>>
>> should be satisfactory for checking this kind of "sort of
>> well-formedness", unless there are yet more specs as yet
>> unexpressed.
> 
> that's why i had done :
>>>> anotherline = '<root>' + line + '</root>'
>>>> p.Parse(anotherline, 1)
> Traceback (most recent call last):
>   File "<pyshell#14>", line 1, in ?
>     p.Parse(anotherline, 1)
> ExpatError: junk after document element: line 1, column 0
> 
> but it still doesn't work, and neither does:

But ARE you making a new parser object p for each line you
have to parse?  I don't see the expat.ParserCreate call here.
I've already indicated a few posts ago that you need that.


>>>> p.Parse('<fict>%s</fict>' % line, 1)
> Traceback (most recent call last):
>   File "<pyshell#185>", line 1, in ?
>     p.Parse('<fict>%s</fict>' % line, 1)
> ExpatError: junk after document element: line 1, column 0

Try with a newly created parser each and every time, as I said.
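
A minimal sketch of what I mean, assuming each line is to be checked
on its own. The function name is_well_formed is made up for
illustration; the '<fict>' wrapper is the one from my earlier post.
An expat parser object handles exactly one document, so the sketch
creates a new one on every call instead of reusing p:

```python
from xml.parsers import expat

def is_well_formed(line):
    """Return True if line, wrapped in a dummy root element, parses cleanly.

    Hypothetical helper, not part of expat itself.
    """
    # A fresh parser per call: once Parse() has been called with a true
    # final flag, that parser object cannot be used for another document.
    p = expat.ParserCreate()
    try:
        p.Parse('<fict>%s</fict>' % line, True)
    except expat.ExpatError:
        return False
    return True
```

Calling is_well_formed(line) for each line of the file then gives one
independent verdict per line, and a failure on one line does not
poison the parser used for the next.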


>> How would that help you diagnose e.g.
>>         <bah thisis=notvalid>of course not</bah>
>> as not being well formed?  This is not well formed because
>> it lacks quotes around an attribute's value.  Or:
>>         <bah thisis="notvalid">&either</bah>
>> now THIS is not well formed because reference '&either'
>> is not terminated with a semicolon.  Etc, etc.
> 
> that's right i didn't address this kind of thing... :(

If you need to, then expat is most likely your best bet
(rxp might be another, but I don't know enough about it
to suggest it).  If you don't care either way, expat is
probably best anyway.  If you HAVE to accept so-called
"XML" that in fact has these or other kinds of
non-well-formedness, it's obviously a different issue.
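
As a rough illustration, here is how expat reacts to the two
fragments quoted above plus one well-formed variant. The helper name
check and the samples list are assumptions made for this sketch; it
returns None when the document parses and the error text otherwise:

```python
from xml.parsers import expat

# One well-formed fragment, then the two broken ones quoted above.
samples = [
    '<bah thisis="valid">fine</bah>',
    '<bah thisis=notvalid>of course not</bah>',   # unquoted attribute value
    '<bah thisis="notvalid">&either</bah>',       # reference lacks its ';'
]

def check(doc):
    """Return None if doc is well formed, else expat's error message.

    Hypothetical helper for illustration only.
    """
    p = expat.ParserCreate()   # new parser for every document
    try:
        p.Parse(doc, True)
    except expat.ExpatError as err:
        return str(err)
    return None

for doc in samples:
    print(doc, '->', check(doc))
```

The message includes the line and column where expat gave up, which
is usually enough to locate the offending markup.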


Alex



