[Expat-discuss] Large data sets (Expat v2.0.0; compiled cygwin)

Sebastian Pipping webmaster at hartwork.org
Tue May 15 19:54:10 CEST 2007


Ben Keitch wrote:
> Can someone help me with this code? It is trying to convert an XML file of
> book data to tab-delimited. Should be simple, but it seems to mangle about
> 200 of the 10000 records I give it. Supplying each record by itself, it
> works fine. I don't understand why, but not being a C programmer, I dare say
> I am mangling pointers, or there is a multithread issue I don't understand.

------------------------------------------------------------
I don't see any threads in your code, so I don't
think threading is the cause. To me it looks like
the problem is the way you talk to Expat.
------------------------------------------------------------



> <record>
> <ISBN10>0816044384</ISBN10>
> <ISBN13>9780816044382</ISBN13>
> <EAN>9780816044382</EAN>
> ...
> </record>
>
> the data given to the data handler (and printed to stderr) is:
>
> Data: 9780816   Data: 044382
> Error : isbn10: 0816044384      isbn: 382       isbn13: 044382
> Data:
> Data: 9780816044382

------------------------------------------------------------
Expat reports (and must do so) all text in the
XML file *including whitespace*, and it may deliver
the text of a single element in several chunks.
In your code I found this line in the character
data handler:

   fprintf(stderr,"Data: %s\t",temp);

Let me use square brackets to visualize what Expat
is passing you:

   <ISBN13>[9780816][044382]</ISBN13>[
   ]<EAN>[9780816044382]</EAN>

That also makes sense to me, since the whitespace
between </ISBN13> and <EAN> is the only place the
newline can come from. So to solve this you will have to
 * Concatenate the chunks passed to the character
   data handler until the end tag is reported
 * Keep track of which element you are in
But there is more you might want to reconsider in
your current code:
 * If you use a fixed buffer of <BUFSIZE> bytes,
   you have to make sure you never write <len>
   bytes into it when <len> is greater than the
   space left in the buffer (-> buffer overflow!).
 * <struct book> is a pure waste of memory in case
   a book can have only two of the numbers rather
   than all of them. In that case I would suggest
   switching to pointers and dynamic allocation.
 * In <void storeData> you make many calls to strcmp
   but forgot to put "else" before each following "if".
   Currently a tag is compared against "ISBN10",
   "ISBN13" and so on, even if it has already
   matched "ISBN10".
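
To illustrate the last two points, here is a minimal sketch of a
<struct book> with pointer fields and an else-if chain. The field
names and the helpers copy_text, store_data and free_book are made up
for this example; I am only assuming the three tags shown in your
sample record.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical layout: a field stays NULL until a value is seen. */
struct book {
    char *isbn10;
    char *isbn13;
    char *ean;
};

/* Return a freshly allocated copy of text, or NULL if malloc fails. */
static char *copy_text(const char *text)
{
    size_t n = strlen(text) + 1;
    char *p = malloc(n);
    if (p != NULL)
        memcpy(p, text, n);
    return p;
}

/* Store text in the field named by tag.
   Returns 1 on success, 0 for an unknown tag or failed allocation. */
static int store_data(struct book *b, const char *tag, const char *text)
{
    char **slot;
    char *copy;

    /* The "else" stops the comparisons after the first match. */
    if (strcmp(tag, "ISBN10") == 0)
        slot = &b->isbn10;
    else if (strcmp(tag, "ISBN13") == 0)
        slot = &b->isbn13;
    else if (strcmp(tag, "EAN") == 0)
        slot = &b->ean;
    else
        return 0;

    copy = copy_text(text);
    if (copy == NULL)
        return 0;

    free(*slot);        /* drop an earlier value, if any */
    *slot = copy;
    return 1;
}

/* Release all fields and reset them to NULL. */
static void free_book(struct book *b)
{
    free(b->isbn10);
    free(b->isbn13);
    free(b->ean);
    b->isbn10 = b->isbn13 = b->ean = NULL;
}
```

A book that has only some of the numbers then costs three pointers
plus the strings actually present, and a field that was never set can
be recognised by its NULL pointer.
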



Sebastian

