[Expat-bugs] [ expat-Bugs-2020141 ] Characters lost during parsing

SourceForge.net noreply at sourceforge.net
Sat Mar 3 20:24:05 CET 2012


Bugs item #2020141, was opened at 2008-07-16 18:22
Message generated for change (Settings changed) made by kwaclaw
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=110127&aid=2020141&group_id=10127

Category: None
Group: None
Status: Closed
>Resolution: Rejected
Priority: 5
Private: No
Submitted By: Andrew D. Arenson (arenson9)
Assigned to: Sebastian Pipping (hartwork)
Summary: Characters lost during parsing

Initial Comment:
Characters can be lost during parsing.

My example file is too big to attach through this interface, so I've made it available at:

http://miniscd.uits.iupui.edu/aarenson/example6.xml

I put the Perl program I used to demonstrate the error as an attachment on this submission, but here it is as well:

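#!/usr/bin/perl -w
use strict;
use XML::Parser;

# State shared by the handlers below.
my $curTag = "";   # lowercased name of the most recently opened element
my $ID     = "";   # character data last seen inside <globalid>

my $XmlFile = shift @ARGV;
my $xp = XML::Parser->new(Handlers => {Start => \&start,
                                       End   => \&end,
                                       Char  => \&cdata});
$xp->parsefile($XmlFile);

# Start handler: remember which element we are in.
sub start { $curTag = lc($_[1]); }

# End handler: we are no longer inside that element.
sub end { $curTag = ""; }

# Char handler: report every call whose data is exactly ".5".
sub cdata {
    my ($xp,$data) = @_;

    if ($curTag eq "globalid") { $ID = $data; }

    if ($data eq ".5") {
        print "ID: $ID; TAG: $curTag\n";
    }
}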


When I use the above program on the example XML file, the last value in the XML file, '52.5', gets parsed as just '.5'.


I wonder if this is related to something that was reported twelve months ago on the XML::Parser bug list at:

   http://rt.cpan.org/Public/Bug/Display.html?id=28585

That bug report on the XML::Parser bug list has not been opened. It is still listed as New.

I'm sorry I don't know the version number of Expat that I'm using or how to determine it.


----------------------------------------------------------------------

Comment By: Andrew D. Arenson (arenson9)
Date: 2008-08-04 14:11

Message:
Logged In: YES 
user_id=466852
Originator: YES

Thanks! Problem solved.

With your modified show_err2.pl file, I was still able to demonstrate the
error, but your pointing out the cause (the character data handler can
receive the content of an element split among several handler calls) was
what I needed to devise a solution. I've uploaded show_err3.pl in case
anyone reading this is interested in the solution.

For completeness, here's the output when running show_err2.pl:

bash> ./show_err2.pl example6.xml
ID: LRB0008; TAG: ofc; DATA: 52.5
ID: LRB0014; TAG: ofc; DATA: 52
ID: LRB0049; TAG: ofc; DATA: 52.5
ID: LRB0061; TAG: ofc; DATA: 52
ID: LRB0079; TAG: ofc; DATA: 52.5
ID: PMP008; TAG: ofc; DATA: 52.5
ID: PMP017; TAG: ofc; DATA: 52.5
ID: PMP043; TAG: ofc; DATA: 52
ID: PMP043; TAG: ofc; DATA: .5

Here's the output when running show_err3.pl:

bash> ./show_err3.pl example6.xml
ID: LRB0008; TAG: ofc; DATA: 52.5; CC: 0
ID: LRB0014; TAG: ofc; DATA: 52; CC: 0
ID: LRB0049; TAG: ofc; DATA: 52.5; CC: 0
ID: LRB0061; TAG: ofc; DATA: 52; CC: 0
ID: LRB0079; TAG: ofc; DATA: 52.5; CC: 0
ID: PMP008; TAG: ofc; DATA: 52.5; CC: 0
ID: PMP017; TAG: ofc; DATA: 52.5; CC: 0
ID: PMP043; TAG: ofc; DATA: 52; CC: 0
ID: PMP043; TAG: ofc; DATA: 52.5; CC: 1

Basically, my solution was to keep track of how many times the character
data handler has been called since the start handler was last called. If
it has been called more than once, I append the most recent character data
to the character data received in the earlier calls.
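
For readers who want to see the idea in code, here is a minimal sketch of
that kind of accumulation. It is not the uploaded show_err3.pl, which is not
reproduced in this thread. It uses a slightly different arrangement: the
text is appended in the Char handler and only examined in the End handler.
The handler names (on_start, on_char, on_end), the @found list, and the
CALLS field are made up for illustration, and the sketch assumes the
elements of interest contain text only, with no child elements.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $cur_tag = "";   # lowercased name of the element currently open
my $text    = "";   # character data accumulated since the last Start
my $calls   = 0;    # Char handler calls since the last Start
my $id      = "";   # full text of the most recent <globalid>
my @found;          # one line per element whose full text is "52.5"

# Start handler: a new element begins, so reset the buffer and counter.
sub on_start {
    my ($xp, $name) = @_;
    $cur_tag = lc $name;
    $text    = "";
    $calls   = 0;
}

# Char handler: append, because this call may carry only part of the text.
sub on_char {
    my ($xp, $data) = @_;
    $text .= $data;
    $calls++;
}

# End handler: only here is the element's text known to be complete.
sub on_end {
    my ($xp, $name) = @_;
    $id = $text if $cur_tag eq "globalid";
    push @found, "ID: $id; TAG: $cur_tag; DATA: $text; CALLS: $calls"
        if $text eq "52.5";
    $cur_tag = "";
    $text    = "";
}

# Parse only when a file name is given on the command line.
if (@ARGV) {
    require XML::Parser;
    my $parser = XML::Parser->new(Handlers => {Start => \&on_start,
                                               End   => \&on_end,
                                               Char  => \&on_char});
    $parser->parsefile($ARGV[0]);
    print "$_\n" for @found;
}
```

Saved under a name of your choice and run with the XML file as its only
argument, this prints one line per element whose complete text is "52.5",
however many Char calls delivered it.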

So, why does Expat work this way? Why not call the character data
handler only once? Is it some sort of efficiency thing when working with
streams? Is it part of wanting to be able to handle values of any size?
File Added: show_err3.pl

----------------------------------------------------------------------

Comment By: Sebastian Pipping (hartwork)
Date: 2008-08-03 07:47

Message:
Logged In: YES 
user_id=1022691
Originator: NO

Hello Andrew. Are you aware that character data handlers can receive
the content of an element split among several handler calls?
For instance your "52.5" text here could come in as two calls serving
"52" first and then ".5".

I modified the script you attached to help tracing this. Please let us
know if this is happening on your machine. (On mine "52.5" is served
as a single unit from the XML file you provided.)

Best regards, Sebastian
File Added: show_err_2.pl

----------------------------------------------------------------------
