[Expat-discuss] junk after document element at line 2053

Dan Bolser dmb at mrc-dunn.cam.ac.uk
Tue May 18 10:33:57 EDT 2004


On Tue, 18 May 2004, Greg Martin wrote:

>I haven't seen anything in the C API that would allow for ignoring
>well-formedness. It would seem unlikely that a parser would allow
>something like that since the spec says : "Validating and non-validating
>processors alike MUST report violations of this specification's
>well-formedness constraints in the content of the document entity and
>any other parsed entities that they read." (see :
>http://www.w3.org/TR/REC-xml/ )
>
>I suppose it could be argued that all it says is that violations must be
>reported - it doesn't say parsing has to fail ...  In the C API there is
>a newer function call XML_ParserReset which will allow the reuse of a
>parser. The header says that "All handlers are cleared from the parser,
>except for the unknownEncodingHandler" (see : expat.h) so you would need
>to re-register your handlers but you wouldn't have the overhead of
>creating a new parser for each file.


The real snag is the multiple XML documents in each file (or is that what
you mean?). It would be nice to be able to set a 'severity' switch, so the
parser keeps on going regardless.
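
A minimal sketch of the one-parser-per-document route, written in Python
rather than Perl on the assumption that its xml.parsers.expat module
behaves like XML::Parser here, since both wrap Expat. It also assumes that
every document in the file begins with an XML declaration, so "<?xml "
can serve as the boundary (this would misfire if that string ever appeared
inside a comment or CDATA section). The names split_documents and
root_names are made up for illustration.

```python
import re
import xml.parsers.expat

# Assumption: each concatenated document starts with an XML declaration.
DECL = re.compile(r"<\?xml\s")

def split_documents(text):
    """Yield each concatenated XML document as its own string."""
    starts = [m.start() for m in DECL.finditer(text)]
    for begin, end in zip(starts, starts[1:] + [len(text)]):
        yield text[begin:end]

def root_names(text):
    """Parse every document with a fresh parser; return the root element names."""
    roots = []
    for doc in split_documents(text):
        depth = 0

        def start(name, attrs):
            nonlocal depth
            if depth == 0:
                roots.append(name)  # first element seen is the document element
            depth += 1

        def end(name):
            nonlocal depth
            depth -= 1

        # A new parser per document, so each one is checked for
        # well-formedness on its own and prologs stay legal.
        parser = xml.parsers.expat.ParserCreate()
        parser.StartElementHandler = start
        parser.EndElementHandler = end
        parser.Parse(doc, True)
    return roots
```

Nothing is relaxed in the parser itself: each piece is still parsed
strictly, it is just handed over one document at a time.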

One other thing: I often have to deal with lines of character data being
arbitrarily broken over multiple character event calls (even when each
string is very short). Is there any way to reset the parser's internal
character buffer to ensure this doesn't happen? Otherwise I will just keep
using my reworked code, which builds up character data as it comes and
processes it on close tag events.
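
The build-it-up-and-flush-on-close-tag approach can be sketched roughly as
follows. This is in Python's xml.parsers.expat rather than Perl, assuming
the two bindings split character data the same way because both sit on
Expat. The helper name collect_text is hypothetical, and the sketch only
handles leaf elements (text directly inside an element with no child
elements mixed in).

```python
import xml.parsers.expat

def collect_text(xml_text):
    """Return (element name, full text) pairs for elements that contain text."""
    results = []
    pieces = []  # character data fragments seen since the last tag

    def start(name, attrs):
        pieces.clear()

    def chars(data):
        # May be called several times for one run of text, so only store here.
        pieces.append(data)

    def end(name):
        text = "".join(pieces).strip()
        if text:
            results.append((name, text))
        pieces.clear()

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.CharacterDataHandler = chars
    parser.EndElementHandler = end
    parser.Parse(xml_text, True)
    return results
```

The Python binding also has a buffer_text attribute that joins the
fragments before calling the handler. Whether XML::Parser offers an
equivalent switch I do not know, so the manual buffering above is the
portable fallback.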

Cheers,
Dan.


>
>
>
>-----Original Message-----
>From: Dan Bolser [mailto:dmb at mrc-dunn.cam.ac.uk]
>Sent: Tuesday, May 18, 2004 2:56 AM
>To: Greg Martin
>Cc: expat-discuss at libexpat.org
>Subject: RE: [Expat-discuss] junk after document element at line 2053
>
>
>On Mon, 17 May 2004, Greg Martin wrote:
>
>>A well-formed XML document has only one top level tag (as you've
>>discovered). I think that you can only have a prolog at the beginning of
>>a document (which would probably justify the name prolog) which would
>>mean that if you wrapped three files in a top-level tag and any had
>>prologs it probably wouldn't be well-formed either. If there was the
>>possibility of any of the files having a prolog you might be better off
>>to instantiate a new parser for each file.
>
>Yup, I found this out too... I guess by prolog you mean something like:
>
><?xml version="1.0"?>
><!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "NCBI_BlastOutput.dtd">
>
>Sadly this occurs for every XML document in the file, and makes the parser
>unhappy even when I wrap the whole file in a top level tag.
>
>In the end I stripped out all the lines like the above from the file
>(from 1000's of individual XML documents), then I did something like
>
>cat "<Start>" multi_xml_document_files_(with_prologs_removed) "</Start>" | my_xml_parser.plx
>
>Except that exact syntax won't work, but you get the idea.
>
>How could I request some XML::Parser options to make its checking less
>strict? Is this a bad road to go down?
>
>Thanks very much,
>Dan.
>
>>
>>-----Original Message-----
>>From: expat-discuss-bounces at libexpat.org
>>[mailto:expat-discuss-bounces at libexpat.org]On Behalf Of Dan Bolser
>>Sent: Sunday, May 16, 2004 4:51 PM
>>To: expat-discuss at libexpat.org
>>Subject: [Expat-discuss] junk after document element at line 2053
>>
>>
>>
>>
>>junk after document element at line 2053, column 0, byte 107114 at
>>/usr/lib/perl5/vendor_perl/5.8.0/i386-linux-thread-multi/XML/Parser.pm
>>line 185
>>
>>
>>I get the above where the first xml document ends and the next begins.
>>
>>I am trying to parse the file with Perl's XML::Parser.
>>
>>I want the parser to simply keep going past the first document and onto
>>the second...
>>
>>Could I just wrap the whole file in XML document tags?
>>
>>Sorry for my ignorance, but how can I do this?
>>
>>Suppose file1, file2 and file3 all contain multiple concatenated XML
>>documents, how do I create a fourth file (file4) to 'pull in' file[1-3] ?
>>
>>This sounds familiar, but I have ~ zero XML experience.
>>
>>Thanks for any suggestions,
>>
>>Dan.
>>
>>
>>_______________________________________________
>>Expat-discuss mailing list
>>Expat-discuss at libexpat.org
>>http://mail.libexpat.org/mailman/listinfo/expat-discuss
>>
>>
>>
>
>
>



