[Expat-discuss] junk after document element at line 2053

Tue May 18 10:02:09 EDT 2004

I haven't seen anything in the C API that would allow well-formedness errors to be ignored. It seems unlikely that a parser would allow something like that, since the spec says:
"Validating and non-validating processors alike MUST report violations of this specification's well-formedness constraints in the content of the document entity and any other parsed entities that they read."
(see: http://www.w3.org/TR/REC-xml/)

I suppose it could be argued that all it says is that violations must be reported; it doesn't say parsing has to fail...
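
For what it's worth, here is a small sketch of how Expat behaves in practice. It uses Python's standard-library binding to Expat (xml.parsers.expat) rather than the C API or Perl's XML::Parser, and the function name and sample input are made up for illustration. Handlers fire for the first document, then the error is raised where the second root element begins and parsing stops.

```python
import xml.parsers.expat as expat


def parse_reporting_error(data):
    """Parse data with a single Expat parser.

    Returns (start tags seen before any error, error details or None).
    """
    seen = []
    parser = expat.ParserCreate()
    parser.StartElementHandler = lambda name, attrs: seen.append(name)
    try:
        parser.Parse(data, True)  # True marks this as the final chunk
    except expat.ExpatError as err:
        # err.code is Expat's numeric error code; ErrorString gives its text.
        return seen, (expat.ErrorString(err.code), err.lineno)
    return seen, None


# Hypothetical input: two complete documents, one after the other.
two_docs = "<a><b/></a>\n<c/>\n"
seen, error = parse_reporting_error(two_docs)
print(seen)      # ['a', 'b']
print(error[0])  # junk after document element
```

So the content of the first document is delivered, but nothing from the second one is.
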
In the C API there is a newer function, XML_ParserReset, which allows a parser to be reused. The header says that "All handlers are cleared from the parser, except for the unknownEncodingHandler" (see: expat.h), so you would need to re-register your handlers, but you wouldn't have the overhead of creating a new parser for each file.
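
A rough sketch of the parser-per-document idea, again using Python's xml.parsers.expat rather than the C API or XML::Parser. It does not reset a parser; it simply creates a fresh one for each document and registers the handler each time. It assumes every document starts with an XML declaration and that the string "<?xml " never appears inside document content. The function names and sample data are hypothetical.

```python
import xml.parsers.expat as expat


def split_documents(text, marker="<?xml "):
    """Split concatenated XML documents at each XML declaration.

    Assumes every document begins with `marker`; any text before the
    first marker is dropped.
    """
    starts = []
    pos = text.find(marker)
    while pos != -1:
        starts.append(pos)
        pos = text.find(marker, pos + 1)
    if not starts:
        return [text] if text.strip() else []
    ends = starts[1:] + [len(text)]
    return [text[s:e] for s, e in zip(starts, ends)]


def count_elements(doc):
    """Parse one document with a new parser and count its start tags."""
    count = 0

    def start(name, attrs):
        nonlocal count
        count += 1

    parser = expat.ParserCreate()        # fresh parser for this document
    parser.StartElementHandler = start   # handlers registered every time
    parser.Parse(doc, True)
    return count


# Hypothetical sample: two documents concatenated in one string.
sample = (
    '<?xml version="1.0"?>\n<BlastOutput><Hit/></BlastOutput>\n'
    '<?xml version="1.0"?>\n<BlastOutput><Hit/><Hit/></BlastOutput>\n'
)
print([count_elements(doc) for doc in split_documents(sample)])  # [2, 3]
```

Because each document is parsed separately, the prologs can stay where they are.
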

-----Original Message-----
From: Dan Bolser [mailto:dmb at mrc-dunn.cam.ac.uk]
Sent: Tuesday, May 18, 2004 2:56 AM
To: Greg Martin
Cc: expat-discuss at libexpat.org
Subject: RE: [Expat-discuss] junk after document element at line 2053

On Mon, 17 May 2004, Greg Martin wrote:

>A well-formed XML document has only one top level tag (as you've
>discovered). I think that you can only have a prolog at the beginning of
>a document (which would probably justify the name prolog) which would
>mean that if you wrapped three files in a top-level tag and any had
>prologs it probably wouldn't be well-formed either. If there was the
>possibility of any of the files having a prolog you might be better off
>to instantiate a new parser for each file.

Yup, I found this out too... I guess by prolog you mean something like:

<?xml version="1.0"?>
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "NCBI_BlastOutput.dtd">

Sadly this occurs for every XML document in the file, and makes the parser
unhappy even when I wrap the whole file in a top level tag.

In the end I stripped out all the lines like the above from the file
(from 1000s of individual XML documents), then I did something like

(echo "<Start>"; cat multi_xml_document_files_with_prologs_removed; echo "</Start>") | my_xml_parser.plx

The file name there is only a placeholder, but you get the idea.
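
The strip-and-wrap approach can be sketched in a few lines. This uses Python's xml.parsers.expat binding rather than XML::Parser, and it assumes the XML declaration and the DOCTYPE each sit on a single line of their own, as in the example above. The function names, the prefix list and the sample data are hypothetical.

```python
import xml.parsers.expat as expat

# Assumption: prolog lines start with one of these prefixes and do not
# continue onto a following line.
PROLOG_PREFIXES = ("<?xml ", "<!DOCTYPE")


def wrap_documents(lines, root="Start"):
    """Drop prolog lines, then wrap what is left in a single root element."""
    kept = [line for line in lines
            if not line.lstrip().startswith(PROLOG_PREFIXES)]
    return "<{0}>\n{1}</{0}>\n".format(root, "".join(kept))


def start_tags(text):
    """Parse text with Expat and return the start tags in document order."""
    names = []
    parser = expat.ParserCreate()
    parser.StartElementHandler = lambda name, attrs: names.append(name)
    parser.Parse(text, True)
    return names


# Hypothetical sample: the same small document concatenated twice.
concatenated = (
    '<?xml version="1.0"?>\n'
    '<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" '
    '"NCBI_BlastOutput.dtd">\n'
    '<BlastOutput><Hit/></BlastOutput>\n'
) * 2

wrapped = wrap_documents(concatenated.splitlines(keepends=True))
print(start_tags(wrapped))
# ['Start', 'BlastOutput', 'Hit', 'BlastOutput', 'Hit']
```
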

How could I request some XML::Parser options to make its checking less
strict? Is this a bad road to go down?

Thanks very much,
Dan.

>
>-----Original Message-----
>From: expat-discuss-bounces at libexpat.org
>[mailto:expat-discuss-bounces at libexpat.org]On Behalf Of Dan Bolser
>Sent: Sunday, May 16, 2004 4:51 PM
>To: expat-discuss at libexpat.org
>Subject: [Expat-discuss] junk after document element at line 2053
>
>
>
>
>junk after document element at line 2053, column 0, byte 107114 at
>/usr/lib/perl5/vendor_perl/5.8.0/i386-linux-thread-multi/XML/Parser.pm
>line 185
>
>
>I get the above where the first xml document ends and the next begins.
>
>I am trying to parse the file with perl XML::Parser
>
>I want the parser to simply keep going past the first document and onto
>the second...
>
>Could I just wrap the whole file in XML document tags?
>
>Sorry for my ignorance, but how can I do this?
>
>Suppose file1, file2 and file3 all contain multiple concatenated XML
>documents, how do I create a fourth file (file4) to 'pull in' file[1-3] ?
>
>This sounds familiar, but I have ~ zero XML experience.
>
>Thanks for any suggestions,
>
>Dan.
>
>
>_______________________________________________
>Expat-discuss mailing list
>Expat-discuss at libexpat.org
>http://mail.libexpat.org/mailman/listinfo/expat-discuss
>