SAX/Python : read an xml from the end to the top

Diez B. Roggisch deets at nospam.web.de
Tue Mar 7 03:39:32 EST 2006


kepioo schrieb:
> I currently have an xml input file containing lots of data. My objectiv
> is to write a script that reports in another xml file only the data I
> am interested in. Doing this is really easy using SAX.
> 
> The input file is continuously updated. However, the other xml file
> should be updated only on request.
> 
> Everytime we run the script, we track the new elements in the input
> file and report them in the output file.
> 
> My idea was to :
> _ detect in the output file the last event reported
> _ read the input file from the end
> _ report all the new events ( since the last time the script was run).
> 
> 
> 
> Question : IS it possible to read an XML file and process it from the
> end to the beginning, using SAX????

No. And in no other XML-related technology I know of.

Generally speaking, I'd say your approach is inherently flawed. XML as a language requires well-formed documents to have 
exactly one root element. This makes it unsuitable
for e.g. logging-files, as these have no explicit "end" - except the implicit last log-entry. So you will always have 
something like this:

--- begin ---
<root>
   <entry/>
   <entry/>
--- end ---

I don't know _what_ you do, but unless you always write the whole XML-file completely new, you can't possibly write that 
  closing end-tag. So you end up with an malformed xml-document. Or you _do_ write all the file contents new each time - 
but then you'd be able to reverse the order of elements so that the last came first. But I doubt the latter, as it 
imposes a great performance-bottleneck with little gain.

SAX won't puke on you for your file being malformed, as it only learns about that when it is to late. So - you might use 
it, as when that happens you are already finished with your actual task.

But you will always have to parse it from the beginning, to catch the document header, and there is no fast-forward 
build into SAX.

So - what are your options?

  - use seperate output files for each entry, that are well-formed in themselves. Beware if you've got plenty of them 
(few K to M) that some FS might not deal well with that

  - if you can keep the file open reading all the time (because you are kind of a background process), you can read the 
contents, create a buffer and search for start-tags in that yourself. Then you can snip out the necessary portions, 
complete them with a xml-header and feed them separately.

  - if you can't keep it open, you can simulate that using the seed-function

Both the last options are somewhat cumbersome, as you have to do a lot of parsing yourself - the exact purpose one chose 
XML the first time... From that follows the last advice:

  - ditch XML. Either totally, or at least as format for the whole file. Instead, use some protocol like this:

--- begin ---
Chunk-Length: 100
<?xml version="1.0"?>
<root>... ( a 100 byte size xml document)
</root>
Chunk-Length: 200
<?xml version="1.0"?>
<root>... ( a 200 byte size xml document)
</root>
...

Then you can easily read through your document, skip unnecessary entries and extract the ones you want. Or, when keeping 
the file open, know exactly what to read for the next chunk.

Diez



More information about the Python-list mailing list