[Expat-discuss] Large data sets (Expat v2.0.0; compiled cygwin)
Sebastian Pipping
webmaster at hartwork.org
Tue May 15 19:54:10 CEST 2007
Ben Keitch wrote:
> Can someone help me with this code? It is trying to convert an XML file of
> book data to tab-delimited. Should be simple, but it seems to mangle about
> 200 of the 10000 records I give it. Supplying each record by itself, it
> works fine. I don't understand why, but not being a C programmer, I dare say
> I am mangling pointers, or there is a multithread issue I don't understand.
------------------------------------------------------------
I don't see any threads in your code, so I don't
think threading is the cause. To me it looks like
the problem is the way you talk to Expat.
------------------------------------------------------------
> <record>
> <ISBN10>0816044384</ISBN10>
> <ISBN13>9780816044382</ISBN13>
> <EAN>9780816044382</EAN>
> ...
> </record>
>
> the data given to the data handler (and printed to stderr) is:
>
> Data: 9780816 Data: 044382
> Error : isbn10: 0816044384 isbn: 382 isbn13: 044382
> Data:
> Data: 9780816044382
------------------------------------------------------------
Expat reports (and must do so) all text in the
XML file *including whitespace*. It may also hand
you the text of a single element in several pieces.
In your code I found this line in the character
data handler:
fprintf(stderr,"Data: %s\t",temp);
Let me use square brackets to visualize what Expat
is passing you:
<ISBN13>[9780816][044382]</ISBN13>[
]<EAN>[9780816044382]</EAN>
That also matches your output: the text between
</ISBN13> and <EAN> is the only place where the
newline can come from. So to solve this you will
have to
* Concatenate the chunks passed to the character
data handler until the element ends
* Keep track of which element you are in
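The two steps above can be sketched roughly like this.
It is only an illustration, not your code: the names
state, on_start, on_chars and on_end are made up, the
signatures assume a default Expat build where XML_Char
is plain char, and skipping "record" is a guess based
on your sample. You would register the handlers with
XML_SetElementHandler and XML_SetCharacterDataHandler
and pass the state with XML_SetUserData.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define BUFSIZE 64   /* assumed capacity; size it for your longest field */

/* Hypothetical parser state, handed to every handler as userData. */
struct state {
    int    in_field;        /* 1 while inside an element whose text we want */
    char   text[BUFSIZE];   /* text collected for the current element */
    size_t used;            /* bytes stored in text, not counting the '\0' */
};

/* Start handler: a new element begins, so start from an empty buffer. */
static void on_start(void *userData, const char *name, const char **atts)
{
    struct state *st = userData;
    (void)atts;
    st->in_field = strcmp(name, "record") != 0;  /* skip the container */
    st->used = 0;
    st->text[0] = '\0';
}

/* Character data handler: may run several times per element, and s is
   NOT NUL-terminated, so append at most len bytes and terminate it here. */
static void on_chars(void *userData, const char *s, int len)
{
    struct state *st = userData;
    size_t room, n;

    if (!st->in_field || len <= 0)
        return;                      /* e.g. the newline between elements */
    assert(st->used < BUFSIZE);      /* invariant kept by this function */
    room = BUFSIZE - 1 - st->used;
    n = (size_t)len;
    if (n > room)
        n = room;                    /* truncate rather than overflow */
    memcpy(st->text + st->used, s, n);
    st->used += n;
    st->text[st->used] = '\0';
}

/* End handler: only now is the text of the element complete. */
static void on_end(void *userData, const char *name)
{
    struct state *st = userData;
    if (st->in_field)
        printf("%s\t%s\n", name, st->text);
    st->in_field = 0;
}
```

The point is that nothing is printed or stored before
the end handler runs, so it no longer matters into how
many pieces Expat splits the text.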
But there is more you might want to re-consider in
your current code:
* If you use a fixed buffer of <BUFSIZE> bytes
you have to make sure you never copy <len>
bytes into it when <len> is greater than the
space left (-> buffer overflows!).
* <struct book> wastes memory if a book can have
only two of the numbers rather than all of them.
I would suggest switching to pointers and dynamic
allocation then.
* In <void storeData> you make many calls to strcmp
but forgot to put "else" before each following "if".
Currently a tag is compared against "ISBN10",
"ISBN13" and so on, even if it already matched
"ISBN10".
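To make the last two points a bit more concrete, here
is a minimal sketch of what I mean. It is not a drop-in
replacement for your storeData: the field names and the
helpers store_data and free_book are made up for this
mail, and I assume the value passed in is already a
complete, NUL-terminated string.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record: each field stays NULL until a value is seen,
   so a book with only two numbers allocates only two strings. */
struct book {
    char *isbn10;
    char *isbn13;
    char *ean;
};

/* Store a heap copy of value in the field named by tag.
   Returns 1 if stored, 0 for an unknown tag, -1 if malloc failed. */
static int store_data(struct book *b, const char *tag, const char *value)
{
    char **slot;
    char *copy;
    size_t n;

    assert(b != NULL && tag != NULL && value != NULL);

    if (strcmp(tag, "ISBN10") == 0)
        slot = &b->isbn10;
    else if (strcmp(tag, "ISBN13") == 0)  /* only tried if ISBN10 failed */
        slot = &b->isbn13;
    else if (strcmp(tag, "EAN") == 0)
        slot = &b->ean;
    else
        return 0;

    n = strlen(value) + 1;                /* include the '\0' */
    copy = malloc(n);
    if (copy == NULL)
        return -1;
    memcpy(copy, value, n);
    free(*slot);                          /* free(NULL) is a no-op */
    *slot = copy;
    return 1;
}

/* Release all fields and reset them to NULL. */
static void free_book(struct book *b)
{
    free(b->isbn10);
    free(b->isbn13);
    free(b->ean);
    b->isbn10 = b->isbn13 = b->ean = NULL;
}
```

Start each record from a struct book with all fields
set to NULL and call free_book once the record has been
written out.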
Sebastian