From noreply at sourceforge.net Mon Dec 5 00:40:22 2005
From: noreply at sourceforge.net (SourceForge.net)
Date: Sun, 04 Dec 2005 15:40:22 -0800
Subject: [Expat-bugs] [ expat-Bugs-1284386 ] Byte count in large XML files fails
Message-ID:

Bugs item #1284386, was opened at 2005-09-08 01:01
Message generated for change (Comment added) made by pointsman
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=110127&aid=1284386&group_id=10127

Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update.

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Rolf Ade (pointsman)
Assigned to: Karl Waclawek (kwaclaw)
Summary: Byte count in large XML files fails

Initial Comment:
XML_GetCurrentByteIndex(XML_Parser parser) returns a long, which on most 32-bit systems, at least, is 32 bits wide. That means that for XML input larger than 2 GB, XML_GetCurrentByteIndex() does not return the right number.

Sure, such big XML files will be parsed in chunks, so it is possible to keep track of the number of overflows yourself, but come on. It is surely a limbo dance of its own to introduce long long into a source as portable as expat, but that would be the fix.

If you switch to long long (where available) for this, please also consider XML_GetCurrentLineNumber() and XML_GetCurrentColumnNumber(). They return an int, which on most 32-bit systems is limited to 2 Gig. I have not stumbled over these two limits in real life, though, as I in fact did with XML_GetCurrentByteIndex().

----------------------------------------------------------------------

>Comment By: Rolf Ade (pointsman)
Date: 2005-12-04 23:40

Message:
Logged In: YES
user_id=13222

Karl,

I followed it. I must confess, I have no clear opinion about the backward compatibility problem. For me, that is not a problem.
I distribute the expat sources with my software and link it statically (so no problem even for binary distribution).

One note: I don't think that XML_GetCurrentLineNumber() and XML_GetCurrentColumnNumber() are seldom-used calls. Any reasonable XML processing software should be able to deal with being fed XML input that is not well-formed. I'd guess most programmers use those two calls to provide a useful error message to their users.

I'm happy with any solution. I feel it may be too much bloat (as you said) to have all those functions twice with different return types. So I'd favour a build flag (with the default changed from 32 bit to 64 bit right after the 2.0 release). But I'm not really sure.

Yes, I still have to provide some bits to check for a long long on Unix platforms, I know.

rolf

----------------------------------------------------------------------

Comment By: Karl Waclawek (kwaclaw)
Date: 2005-11-30 14:31

Message:
Logged In: YES
user_id=290026

Rolf,

I opened a discussion about this on the discussion list. The main issue - IMO - is backwards compatibility. I don't know how many apps on Linux, for instance, rely on the Expat API staying binary compatible.

Karl

----------------------------------------------------------------------

Comment By: Rolf Ade (pointsman)
Date: 2005-11-28 02:01

Message:
Logged In: YES
user_id=13222

XML_GetCurrentByteIndex() could return -1: Of course! You're right. And it makes sense. A freshly created or reset parser, before the first XML_Parse() call, returns -1 from XML_GetCurrentByteIndex() to signal this fact: it is not at the start of the document; rather, parsing has not started yet. Nice detail. I should have looked at the implementation before replying. Note: that detail isn't mentioned in the documentation.

I'm fine with a signed long long. 2^63 should be big enough for the next few weeks ;-).

Re the defines: Basically yes.
It's just that I'm pretty sure we need one more round: some configure check for long long and, depending on the result, defining XML_?Int64 as long long or just long. I'll look something up (but, being busy chasing a deadline, I may need a bit of time).

----------------------------------------------------------------------

Comment By: Karl Waclawek (kwaclaw)
Date: 2005-11-28 01:00

Message:
Logged In: YES
user_id=290026

On a 32bit CPU, 64bit integer operations are considerably slower than 32 bit operations. On the other hand, XMLUpdatePosition isn't called that often - mostly when you actually request the line/column number. So, I agree - no configuration necessary.

For the other point: If you look at the XML_GetCurrentByteIndex() code, it can return -1, and it is calculated using a subtraction. So in practice and theory, it must be a signed integer.

XML_GetCurrentByteCount is derived from a subtraction as well, but we know it will be positive because eventEndPtr should always be larger than eventPtr. So we could risk using an unsigned integer.

Just playing around, I added this to expat_external.h:

#ifdef XML_USE_MSC_EXTENSIONS
typedef __int64 XML_Int64;
typedef unsigned __int64 XML_UInt64;
#else
typedef long long XML_Int64;
typedef unsigned long long XML_UInt64;
#endif

What do you think?

Karl

----------------------------------------------------------------------

Comment By: Rolf Ade (pointsman)
Date: 2005-11-28 00:33

Message:
Logged In: YES
user_id=13222

Configurable: No
There is nearly no overhead: just a few variables that are (at most) 8 bytes long instead of 4 bytes. Speed-wise as well: not measurable.

long long acceptable everywhere: Probably not
Some very old or limited embedded systems may not have a long long (or equivalent). Therefore we probably need defines.

Byte index could be negative: I don't think so.
How could that happen? The byte index starts at 0 and grows. Or am I missing something?
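
On the question of whether the byte index can be negative, the following is a minimal sketch that models the computation Karl describes (a subtraction, with -1 before parsing has started). It is not Expat's source: `DemoParser`, `DemoIndex` and `demo_current_byte_index` are hypothetical names, and the `parseEndPtr` field is an assumption about how the subtraction is anchored.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a signed 64-bit index type. */
typedef long long DemoIndex;

/* Simplified, assumed model of the parser state involved. */
typedef struct {
    const char *eventPtr;         /* start of the current event; NULL before parsing starts */
    const char *parseEndPtr;      /* end of the data parsed so far within the buffer */
    DemoIndex parseEndByteIndex;  /* absolute document offset that parseEndPtr corresponds to */
} DemoParser;

/* Returns the absolute offset of the current event, or -1 if no event exists yet. */
static DemoIndex demo_current_byte_index(const DemoParser *p)
{
    assert(p != NULL);
    if (p->eventPtr != NULL) {
        /* The index is derived by subtracting the distance back to the event. */
        return p->parseEndByteIndex - (DemoIndex)(p->parseEndPtr - p->eventPtr);
    }
    return -1; /* sentinel: nothing parsed yet, hence the signed return type */
}
```

Because -1 is a legitimate return value in this model, an unsigned type would turn the "not started" sentinel into a very large offset.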

----------------------------------------------------------------------

Comment By: Karl Waclawek (kwaclaw)
Date: 2005-11-27 23:21

Message:
Logged In: YES
user_id=290026

Just some notes, so that I don't forget:
- Should it be configurable? Some may not want the overhead of a 64bit integer, especially for line number and column number.
- Is long long acceptable everywhere else (other than VC++)?
- The byte index could be negative, but not line/column number and byte count, right?

----------------------------------------------------------------------

Comment By: Karl Waclawek (kwaclaw)
Date: 2005-11-27 20:31

Message:
Logged In: YES
user_id=290026

You are right, Rolf, it should be 64 bits even on a 32bit platform. I guess I should make a note in the docs that Expat supports > 2GB files, as long as each chunk passed to the XML_Parse routines is smaller than 2GB.

There are also issues around compiling Expat on a 64bit platform, but at least for VC++, someone has provided a patch (bug # 1105135) which looks like it should work on other platforms as well (just a bunch of type casts).

One issue I have already seen is that VC++ 6.0 does not know about long long.

Thanks for having a look at the cross-platform issue. I am trying to get Expat 2.0 released despite Fred not being active on Expat anymore.

Karl

----------------------------------------------------------------------

Comment By: Rolf Ade (pointsman)
Date: 2005-11-27 20:22

Message:
Logged In: YES
user_id=13222

Karl,

Most reasonable 32bit platforms have support for file sizes > 2 GB these days. It was in fact a 32bit platform on which I stumbled over the problem.

So much for your easy question. Much harder is how to solve this in a portable way. I'm afraid that may need platform-dependent #defines (with a fallback to long). I'll go digging into what other portable software does in this case and will come back with a more concrete proposal.

rolf

----------------------------------------------------------------------

Comment By: Karl Waclawek (kwaclaw)
Date: 2005-11-27 19:22

Message:
Logged In: YES
user_id=290026

Rolf,

should the type be a 64 bit integer on all platforms, or 32bit on 32bit platforms and 64bit on 64bit platforms?

I think we are talking about m_parseEndByteIndex, POSITION.lineNumber and POSITION.columnNumber. Options could be size_t, ptrdiff_t.

MS VC++ 6.0 does not know about long long, but it knows about __int64. Is there an ANSI definition for 64 bit ints? What do you suggest that works on all platforms?

Karl

----------------------------------------------------------------------

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=110127&aid=1284386&group_id=10127

From noreply at sourceforge.net Mon Dec 5 14:31:13 2005
From: noreply at sourceforge.net (SourceForge.net)
Date: Mon, 05 Dec 2005 05:31:13 -0800
Subject: [Expat-bugs] [ expat-Bugs-569461 ] OASIS XML Test Suite
Message-ID:

Bugs item #569461, was opened at 2002-06-15 12:34
Message generated for change (Comment added) made by nobody
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=110127&aid=569461&group_id=10127

Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update.

Category: None
Group: None
Status: Open
Resolution: Accepted
Priority: 5
Submitted By: Rolf Ade (pointsman)
Assigned to: Karl Waclawek (kwaclaw)
Summary: OASIS XML Test Suite

Initial Comment:
I've tested expat-1.95.3 (with xmltok.c updated to rev. 1.17 because of bug 566240; all other files are the original 1.95.3) against the recently updated OASIS XML test suite (XML 1.0 (Second Edition) errata 20020320, W3C Conformance Test Suite 20020606), available via http://www.w3.org/XML/Test/, and found a few new problems that are not triggered by older versions of this test suite.

As in previous reports, I checked all not-well-formedness tests (which should all raise an error) and all valid tests (which should all pass) of the test suites xmltest, ibm, sun and oasis with xmlwf -p. Especially for the well-formedness tests, I have _not_ checked throughout whether the error reason reported by expat is the expected one; I only checked mechanically whether the test raised an error, regardless of the exact error reason. This method is clearly not perfect, and this time we have an example that underlines this.

ibm/not-wf/P32/ibm32n09.xml
This is a new test, not included in previous versions. The problem is that the standalone document declaration has the value "yes" and there is an external markup declaration of an entity (other than amp, lt, gt, apos, quot). xmlwf -p doesn't report an error. The not-well-formedness problem is that standalone="yes" means that all information needed to build the XML infoset must be found in the document entity (standalone="yes" doesn't mean that the document must not have an external subset or external PEs, only that these external entities do not change, through attribute defaults or, as in this case, entity declarations, the information in the document entity). See the last sentence of "Well-Formedness Constraint: Entity Declared" (P68).

ibm/not-wf/P68/ibm68n06.xml
Same reason as the test before. This test _was_ present in previous versions of the test suite. But with the previous version of the external subset of this test, xmlwf reported a "syntax error" in the external subset, which I simply can't understand (possibly another expat bug?), and which is clearly not the expected error. In the new version of the test suite, this external subset now has an XML declaration with an explicit encoding (the older version had only an XML declaration without encoding) and is accepted by expat.

xmltest/not-wf/not-sa/010.xml
xmltest/not-wf/not-sa/011.xml
These tests are new in this edition of the test suite.
Unfortunately, neither test seems to be documented, either in the test files themselves or in the documentation file xmlconf-20020606.htm. As far as I can see, these tests check "Validity Constraint: Proper Declaration/PE Nesting" (P29).

xmltest/not-wf/not-sa/005.xml
This test raised an error with previous expat versions, but no longer does, due to the changes discussed in bug 548690. This is intentional, according to the 548690 discussion. This test is now listed under "XML Documents with Optional Errors". The test suite documentation says: "Conforming XML 1.0 Processors are permitted to ignore certain errors, or to report them at user option. In this section of this test report are found descriptions of test cases which fit into this category. Processor behavior on such test cases does not affect conformance to the XML 1.0 (Second Edition) Recommendation, except as noted." So, according to this, it's OK that expat doesn't report an error for this case. Since both reporting an error and not reporting one are OK, it may be debatable which behavior is more convenient for the expat user. (Karl: ;-))

sun/not-wf/not-sa03.xml
This is a new test in this edition of the test suite. Unfortunately, this test seems not to be documented. As far as I can see, it tests the same thing as xmltest/not-wf/not-sa/005.xml.

Tests that are still wrong, as in previous versions, are
ibm/not-wf/misc/432gewf.xml
sun/not-wf/uri01.xml
These have already been discussed in the past.

Well, that's all.

rolf

----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2005-12-05 05:31

Message:
Logged In: NO

Can anyone tell me why I am getting 33 failed cases? I will list all the errors I got. Can anyone tell me how to rectify them, and why these errors arose?

ibm/valid/P02/ibm02v01.xml
ibm/valid/p28/ibm28v02.xml
/p29/ 02.xml
/p29/ 01.xml
similarly p54, p56 , p58, p57,p70, p82
ibm/invalid/p49../p58/...
xmltest/valid/sa/069.xml...76.xml...90.xml...91.xml....

----------------------------------------------------------------------

Comment By: Karl Waclawek (kwaclaw)
Date: 2003-09-24 16:48

Message:
Logged In: YES
user_id=290026

Just changed the summary to be more generic, as this bug will probably stay open permanently, assuming we will never pass 100% of all test cases.

----------------------------------------------------------------------

Comment By: Karl Waclawek (kwaclaw)
Date: 2003-01-29 13:26

Message:
Logged In: YES
user_id=290026

This is just to report that the new release Expat 1.95.6 passes the OASIS test suite (same version - 20020606) with the exact same results as Expat 1.95.5.

----------------------------------------------------------------------

Comment By: Karl Waclawek (kwaclaw)
Date: 2003-01-21 11:21

Message:
Logged In: YES
user_id=290026

For comparison purposes, I ran the xmltest.sh test script against release 1.95.5. The results are attached as TestResults_1_95_5.txt.

Discussion of results:

There are many cases where "output differs" is reported, but these are due to xmlwf having a different definition of "canonical XML" than the one used in the test suite.

Leaving these out, along with those that were discussed already in this thread, we have the following errors reported:
(Note: the two test cases ibm/not-wf/P32/ibm32n09.xml and ibm/not-wf/P68/ibm68n06.xml are not reported anymore by the script)

* In ibm/invalid/P49/: ibm49i02.xml:7:1: error in processing external entity reference:
The associated DTD file does not exist - an error in the test suite.

The next three documents are not UTF-8 encoded, and do not have an XML declaration, so Expat rejects them, which is correct. An error in the test suite.

* In xmltest/valid/sa/: 049.xml:2:0: not well-formed (invalid token)
* In xmltest/valid/sa/: 050.xml:2:0: not well-formed (invalid token)
* In xmltest/valid/sa/: 051.xml:2:0: not well-formed (invalid token)

The next two documents are classified as invalid but well-formed; however, they contain faulty UTF-16 encoding, so they should be classified as not well-formed. Expat seems correct here.

* In sun/invalid/: utf16b.xml:2:0: not well-formed (invalid token)
* In sun/invalid/: utf16l.xml:1:40: not well-formed (invalid token)

The next three are not marked with the reason why they should fail, so the script thinks they are not well-formed, but in fact they are well-formed:

* Well formed: oasis/p06fail1.xml
* Well formed: oasis/p08fail1.xml
* Well formed: oasis/p08fail2.xml

So, no new test case errors have really been added for release 1.95.5.

----------------------------------------------------------------------

Comment By: Karl Waclawek (kwaclaw)
Date: 2003-01-19 11:01

Message:
Logged In: YES
user_id=290026

Just a comment: This bug report will likely stay open until Expat passes the OASIS test suite without any problem at all. Since no parser currently achieves this, there is a good chance this bug report will stay open for a long time to come.

----------------------------------------------------------------------

Comment By: Karl Waclawek (kwaclaw)
Date: 2002-07-07 08:39

Message:
Logged In: YES
user_id=290026

Patch #587161 should fix some of the problems, but I specifically made no attempt to fix the problems Expat has with:
- xmltest/not-wf/not-sa/010.xml and
- xmltest/not-wf/not-sa/011.xml.

Reason: It turns out, after consulting with the mailing list for the XML test suite, public-xml-testsuite at w3.org, that these two violate WFC: PE Between Declarations. There is no quick and easy fix for this in Expat, and I would have to spend some time thinking about it, which I don't have at the moment.

Karl

----------------------------------------------------------------------

Comment By: Karl Waclawek (kwaclaw)
Date: 2002-06-17 20:15

Message:
Logged In: YES
user_id=290026

Assigned to me, but only for the three test cases described in my last message.

Karl

----------------------------------------------------------------------

Comment By: Rolf Ade (pointsman)
Date: 2002-06-17 16:21

Message:
Logged In: YES
user_id=13222

Agreed

----------------------------------------------------------------------

Comment By: Karl Waclawek (kwaclaw)
Date: 2002-06-17 12:05

Message:
Logged In: YES
user_id=290026

Given an improved understanding of section 4.1 in the XML spec, I will try to fix the following test cases in the next Expat release:
ibm/not-wf/P32/ibm32n09.xml,
ibm/not-wf/P68/ibm68n06.xml and
sun/not-wf/not-sa03.xml

In my opinion, the third one is not the same type as xmltest/not-wf/not-sa/005.xml, but the same type as the other two.

About the test cases xmltest/not-wf/not-sa/010.xml and xmltest/not-wf/not-sa/011.xml: If they really check validity constraint P29, as Rolf has suggested, then it is OK that Expat does not report an error.

So, if I am successful, we would be left with only: ibm/not-wf/misc/432gewf.xml and sun/not-wf/uri01.xml, conformance with which does not seem a 100% necessity, as previously discussed.

Karl

----------------------------------------------------------------------

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=110127&aid=569461&group_id=10127

From noreply at sourceforge.net Mon Dec 12 05:35:23 2005
From: noreply at sourceforge.net (SourceForge.net)
Date: Sun, 11 Dec 2005 20:35:23 -0800
Subject: [Expat-bugs] [ expat-Bugs-1284386 ] Byte count in large XML files fails
Message-ID:

Bugs item #1284386, was opened at 2005-09-07 21:01
Message generated for change (Comment added) made by kwaclaw
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=110127&aid=1284386&group_id=10127

Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update.

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Rolf Ade (pointsman)
Assigned to: Karl Waclawek (kwaclaw)
Summary: Byte count in large XML files fails

Initial Comment:
XML_GetCurrentByteIndex(XML_Parser parser) returns a long, which on most 32-bit systems, at least, is 32 bits wide. That means that for XML input larger than 2 GB, XML_GetCurrentByteIndex() does not return the right number.

Sure, such big XML files will be parsed in chunks, so it is possible to keep track of the number of overflows yourself, but come on. It is surely a limbo dance of its own to introduce long long into a source as portable as expat, but that would be the fix.

If you switch to long long (where available) for this, please also consider XML_GetCurrentLineNumber() and XML_GetCurrentColumnNumber(). They return an int, which on most 32-bit systems is limited to 2 Gig. I have not stumbled over these two limits in real life, though, as I in fact did with XML_GetCurrentByteIndex().

----------------------------------------------------------------------

>Comment By: Karl Waclawek (kwaclaw)
Date: 2005-12-11 23:35

Message:
Logged In: YES
user_id=290026

Rolf,

you are right about XML_GetCurrentLineNumber() and XML_GetCurrentColumnNumber().

Backwards compatibility is an issue for others, as far as I can tell, so I think I will have it off by default.

Attached is a patch I quickly put together today. Basic points: The switch is called "XML_LARGE_SIZE". I modified the type names to XML_Index (for signed large ints) and XML_Size (for unsigned large ints). If you have better names, let me know.

Would you apply the patch and see how it works for you? (I assume long long works everywhere except Windows).

Karl

----------------------------------------------------------------------

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=110127&aid=1284386&group_id=10127

From noreply at sourceforge.net Tue Dec 13 17:25:40 2005
From: noreply at sourceforge.net (SourceForge.net)
Date: Tue, 13 Dec 2005 08:25:40 -0800
Subject: [Expat-bugs] [ expat-Bugs-1379630 ] XML_Char is not always UTF-8
Message-ID:

Bugs item #1379630, was opened at 2005-12-13 08:25
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=110127&aid=1379630&group_id=10127

Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update.

Category: Documentation
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Nobody/Anonymous (nobody)
Assigned to: Fred L. Drake, Jr. (fdrake)
Summary: XML_Char is not always UTF-8

Initial Comment:
Hi!

I use expat 1.95.8. I read the documentation, and everywhere it says that the character data submitted to my application will always be either UTF-8 or UTF-16, depending on how it was compiled. But on page 17, chapter 'Handler Setting', last paragraph, it reads 'Your handlers will be receiving strings in arrays of type XML_Char. This type.....contains bytes encoding UTF-8. .....independent of the original encoding of the document.' It should say UTF-8 or UTF-16 depending on how it was compiled here too, right?

BRs
/Olle

----------------------------------------------------------------------

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=110127&aid=1379630&group_id=10127

From noreply at sourceforge.net Tue Dec 13 18:21:55 2005
From: noreply at sourceforge.net (SourceForge.net)
Date: Tue, 13 Dec 2005 09:21:55 -0800
Subject: [Expat-bugs] [ expat-Bugs-1379630 ] XML_Char is not always UTF-8
Message-ID:

Bugs item #1379630, was opened at 2005-12-13 11:25
Message generated for change (Comment added) made by kwaclaw
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=110127&aid=1379630&group_id=10127

Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update.

Category: Documentation
Group: None
Status: Open
>Resolution: Fixed
Priority: 5
Submitted By: Nobody/Anonymous (nobody)
Assigned to: Fred L. Drake, Jr. (fdrake)
Summary: XML_Char is not always UTF-8

Initial Comment:
Hi!

I use expat 1.95.8. I read the documentation, and everywhere it says that the character data submitted to my application will always be either UTF-8 or UTF-16, depending on how it was compiled. But on page 17, chapter 'Handler Setting', last paragraph, it reads 'Your handlers will be receiving strings in arrays of type XML_Char. This type.....contains bytes encoding UTF-8. .....independent of the original encoding of the document.' It should say UTF-8 or UTF-16 depending on how it was compiled here too, right?

BRs
/Olle

----------------------------------------------------------------------

>Comment By: Karl Waclawek (kwaclaw)
Date: 2005-12-13 12:21

Message:
Logged In: YES
user_id=290026

Correct. Fixed in reference.html 1.67.

Karl

----------------------------------------------------------------------

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=110127&aid=1379630&group_id=10127

From noreply at sourceforge.net Thu Dec 22 22:06:12 2005
From: noreply at sourceforge.net (SourceForge.net)
Date: Thu, 22 Dec 2005 13:06:12 -0800
Subject: [Expat-bugs] [ expat-Bugs-1284386 ] Byte count in large XML files fails
Message-ID:

Bugs item #1284386, was opened at 2005-09-07 21:01
Message generated for change (Comment added) made by kwaclaw
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=110127&aid=1284386&group_id=10127

Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update.

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Rolf Ade (pointsman)
Assigned to: Karl Waclawek (kwaclaw)
Summary: Byte count in large XML files fails

Initial Comment:
XML_GetCurrentByteIndex(XML_Parser parser) returns a long, which on most 32-bit systems, at least, is 32 bits wide. That means that for XML input larger than 2 GB, XML_GetCurrentByteIndex() does not return the right number.

Sure, such big XML files will be parsed in chunks, so it is possible to keep track of the number of overflows yourself, but come on. It is surely a limbo dance of its own to introduce long long into a source as portable as expat, but that would be the fix.

If you switch to long long (where available) for this, please also consider XML_GetCurrentLineNumber() and XML_GetCurrentColumnNumber(). They return an int, which on most 32-bit systems is limited to 2 Gig. I have not stumbled over these two limits in real life, though, as I in fact did with XML_GetCurrentByteIndex().
----------------------------------------------------------------------

>Comment By: Karl Waclawek (kwaclaw)
Date: 2005-12-22 16:06

Message:
Logged In: YES
user_id=290026

Rolf,

I think I'll go for the simple solution: use long long, and if a compiler doesn't support it, too bad, the programmer will have to use 32 bit integers:

#ifdef XML_LARGE_SIZE
typedef long long XML_Index;
typedef unsigned long long XML_Size;
#else
typedef long XML_Index;
typedef unsigned long XML_Size;
#endif

Even if XML_LARGE_SIZE is undefined, the above declarations will change the return types of XML_GetCurrentLine/ColumnNumber from int to unsigned long. Hope that is not a problem (in theory, long >= int).

Anyway, I'll apply the attached patch (LargeInt2.diff) sometime soon and create a pre-release for download.

Karl

----------------------------------------------------------------------

Comment By: Karl Waclawek (kwaclaw)
Date: 2005-12-11 23:35

Message:
Logged In: YES
user_id=290026

Rolf, you are right about XML_GetCurrentLineNumber() and XML_GetCurrentColumnNumber().

The backwards compatibility is an issue for others, as far as I can tell, so I think I'll have it off by default.

Attached is a patch I quickly put together today. Basic points: The switch is called "XML_LARGE_SIZE". I modified the type names to XML_Index (for signed large ints) and XML_Size (for unsigned large ints). If you have better names, let me know.

Would you apply the patch and see how it works for you? (I assume long long works everywhere except Windows).

Karl

----------------------------------------------------------------------

Comment By: Rolf Ade (pointsman)
Date: 2005-12-04 18:40

Message:
Logged In: YES
user_id=13222

Karl,

I followed it. I must confess, I've no clear opinion about the backward compatibility problem. For me, that is not a problem. I distribute the expat sources with my software and link it statically (so no problem even for binary distribution).

One note: I don't think that XML_GetCurrentLineNumber() and XML_GetCurrentColumnNumber() are seldom used calls. Any reasonable XML processing software should be able to deal with the case that it gets fed not well-formed XML input. I'd guess most programmers use those two calls to provide a useful error msg to their users.

I'm happy with any solution. I feel it may be too much bloat (as you said) to have all those functions twice with different return types. So, I'd favour a build flag. (With the default 32 bit changed to 64 bit right after the 2.0 release). But I'm not really sure.

Yes, I still have to provide some bits to check for a long long on unix platforms, I know.

rolf

----------------------------------------------------------------------

Comment By: Karl Waclawek (kwaclaw)
Date: 2005-11-30 09:31

Message:
Logged In: YES
user_id=290026

Rolf, I opened a discussion about this on the discussion list. The main issue - IMO - is backwards compatibility. I don't know how many apps on Linux, for instance, rely on the Expat API staying binary compatible.

Karl

----------------------------------------------------------------------

Comment By: Rolf Ade (pointsman)
Date: 2005-11-27 21:01

Message:
Logged In: YES
user_id=13222

XML_GetCurrentByteIndex() could return -1: Of course! You're right. And it makes sense. A freshly created or reset parser without the first XML_Parse() call returns -1 on XML_GetCurrentByteIndex(), to signal this fact: it is not right at the start of the document, but there isn't any parsing started yet. Nice detail. I should have looked at the implementation before replying.

Note: That detail isn't mentioned in the documentation.

I'm fine with a signed long long. 2^63 should be big enough, for the next few weeks ;-).

Re the defines: Basically yes. It's just that I'm pretty sure we need one round more: some configure check for long long and, depending on that result, defining XML_?Int64 as long long or just long.
I'll look something up (but I'm chasing a deadline, so it may take a bit of time).

----------------------------------------------------------------------

Comment By: Karl Waclawek (kwaclaw)
Date: 2005-11-27 20:00

Message:
Logged In: YES
user_id=290026

On a 32bit CPU, 64bit integer operations are considerably slower than 32 bit operations. On the other hand XMLUpdatePosition isn't called that often - mostly when you actually request the line/column number. So, I agree - no configuration necessary.

For the other point: If you look at the XML_GetCurrentByteIndex() code, it can return -1, and it is calculated using a subtraction. So in practice and theory, it must be a signed integer.

XML_GetCurrentByteCount is derived from a subtraction as well, but we know it will be positive because eventEndPtr should always be larger than eventPtr. So we could risk using an unsigned integer.

Just playing around, I added this to expat_external.h:

#ifdef XML_USE_MSC_EXTENSIONS
typedef __int64 XML_Int64;
typedef unsigned __int64 XML_UInt64;
#else
typedef long long XML_Int64;
typedef unsigned long long XML_UInt64;
#endif

What do you think?

Karl

----------------------------------------------------------------------

Comment By: Rolf Ade (pointsman)
Date: 2005-11-27 19:33

Message:
Logged In: YES
user_id=13222

Configurable: No
There is nearly no overhead: just a few variables (at most) 8 bytes long instead of 4 bytes. Also speedwise: not measurable.

long long acceptable everywhere: Probably no
Some very old or limited embedded systems may not have a long long (or equivalent). Therefore we probably need defines.

Byte index could be negative: don't think so.
How could that happen? Byte index starts at 0 and grows. Or do I miss something?

----------------------------------------------------------------------

Comment By: Karl Waclawek (kwaclaw)
Date: 2005-11-27 18:21

Message:
Logged In: YES
user_id=290026

Just some notes, so that I don't forget:
- should it be configurable?
  some may not want the overhead of a 64bit integer, especially for line number and column number.
- is long long acceptable everywhere else (other than VC++)?
- the byte index could be negative, but not line/column number and byte count, right?

----------------------------------------------------------------------

Comment By: Karl Waclawek (kwaclaw)
Date: 2005-11-27 15:31

Message:
Logged In: YES
user_id=290026

You are right, Rolf, it should be 64 bits even on a 32bit platform. I guess I should make a note in the docs that Expat supports > 2GB files, as long as each chunk passed to the XML_Parse routines is smaller than 2GB.

There are also issues around compiling Expat on a 64bit platform, but at least for VC++, someone has provided a patch (bug # 1105135) which looks like it should work on other platforms as well (just a bunch of type casts).

One issue I have already seen is that VC++ 6.0 does not know about long long.

Thanks for having a look at the cross-platform issue. I am trying to get Expat 2.0 released despite Fred not being active on Expat anymore.

Karl

----------------------------------------------------------------------

Comment By: Rolf Ade (pointsman)
Date: 2005-11-27 15:22

Message:
Logged In: YES
user_id=13222

Karl,

Most reasonable platforms support file sizes > 2 GB these days, even 32bit ones. It was in fact a 32bit platform on which I stumbled over the problem.

So much for your easy question. Much harder is how to solve this in a portable way. I'm afraid that may need platform-dependent #defines (with fallback to long). I'll go out digging what other portable software does in this case and will come back with a more concrete proposal.

rolf

----------------------------------------------------------------------

Comment By: Karl Waclawek (kwaclaw)
Date: 2005-11-27 14:22

Message:
Logged In: YES
user_id=290026

Rolf, should the type be 64 bit integer on all platforms, or 32bit on 32bit platforms and 64bit on 64bit platforms?
I think we are talking about m_parseEndByteIndex, POSITION.lineNumber and POSITION.columnNumber.

Options could be size_t, ptrdiff_t. MS VC++ 6.0 does not know about long long, but it knows about __int64. Is there an ANSI definition for 64 bit ints?

What do you suggest that works on all platforms?

Karl

----------------------------------------------------------------------

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=110127&aid=1284386&group_id=10127

From noreply at sourceforge.net Mon Dec 26 19:59:05 2005
From: noreply at sourceforge.net (SourceForge.net)
Date: Mon, 26 Dec 2005 10:59:05 -0800
Subject: [Expat-bugs] [ expat-Bugs-1150814 ] expat_win32bin StaticLibs are wrong version
Message-ID:

Bugs item #1150814, was opened at 2005-02-24 02:42
Message generated for change (Comment added) made by kwaclaw
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=110127&aid=1150814&group_id=10127

Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update.

Category: None
Group: None
Status: Open
>Resolution: Fixed
Priority: 5
Submitted By: Nobody/Anonymous (nobody)
Assigned to: Fred L. Drake, Jr. (fdrake)
Summary: expat_win32bin StaticLibs are wrong version

Initial Comment:
Installed version of expat: expat_win32bin_1_95_8.exe, which contains both dynamic and static versions of the library.

Linking against StaticLibs/libexpatw.lib and StaticLibs/libexpat.lib produce the following errors

base.lib(XMLReader_Expat.obj) : error LNK2001: unresolved external symbol _XML_ResumeParser
base.lib(XMLReader_Expat.obj) : error LNK2001: unresolved external symbol _XML_StopParser

I suspected that the static libs were out of date, so I investigated a little... the files are dated September 06, 2002, 6:04:54 PM. 2002 is pretty old, so I'm guessing this is not new to 1.95.8.

A temporary fix is to link against the multi-threaded (libexpatwMT.lib) static library, which seems to work just fine.

- Tim Valenzuela
tim _at_ deadlyninja.com

----------------------------------------------------------------------

>Comment By: Karl Waclawek (kwaclaw)
Date: 2005-12-26 13:59

Message:
Logged In: YES
user_id=290026

In later versions of MS VC++, the single-threaded runtime library is not supported anymore. Expat 2.0 will be the last release including the single-threaded static lib in the Windows binary distribution.

Karl

----------------------------------------------------------------------

Comment By: Karl Waclawek (kwaclaw)
Date: 2005-02-24 08:58

Message:
Logged In: YES
user_id=290026

Another one for Fred.

----------------------------------------------------------------------

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=110127&aid=1150814&group_id=10127

From noreply at sourceforge.net Tue Dec 27 19:42:33 2005
From: noreply at sourceforge.net (SourceForge.net)
Date: Tue, 27 Dec 2005 10:42:33 -0800
Subject: [Expat-bugs] [ expat-Bugs-1379630 ] XML_Char is not always UTF-8
Message-ID:

Bugs item #1379630, was opened at 2005-12-13 11:25
Message generated for change (Settings changed) made by kwaclaw
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=110127&aid=1379630&group_id=10127

Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update.

Category: Documentation
Group: None
>Status: Closed
Resolution: Fixed
Priority: 5
Submitted By: Nobody/Anonymous (nobody)
Assigned to: Fred L. Drake, Jr. (fdrake)
Summary: XML_Char is not always UTF-8

Initial Comment:
Hi!

I use expat 1.95.8. I read the documentation and everywhere it says that the character data submitted to my application will always be either UTF-8 or UTF-16 depending on how it was compiled.

But on page 17, chapter 'Handler Setting', last paragraph it reads 'Your handlers will be receiving strings in arrays of type XML_Char. This type.....contains bytes encoding UTF-8. .....independent of the original encoding of the document.'

It should say UTF-8 or UTF-16 depending on how it was compiled, here too, right?

BRs
/Olle

----------------------------------------------------------------------

Comment By: Karl Waclawek (kwaclaw)
Date: 2005-12-13 12:21

Message:
Logged In: YES
user_id=290026

Correct. Fixed in reference.html 1.67.

Karl

----------------------------------------------------------------------

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=110127&aid=1379630&group_id=10127
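
For readers wondering what "UTF-8 or UTF-16 depending on how it was compiled" means for handler code, here is a rough sketch. It does not use Expat's headers: Demo_Char, DEMO_UTF16, demo_len() and demo_bytes() are hypothetical stand-ins chosen for illustration, with Demo_Char playing the role of XML_Char and DEMO_UTF16 the role of the build-time switch. The point is only that string length and byte size have to be computed per code unit, not per byte.

```c
#include <stddef.h>

/* Hypothetical stand-in for a build-dependent character type:
   one code unit is a byte in the UTF-8 build, 16 bits in the UTF-16 build. */
#ifdef DEMO_UTF16
typedef unsigned short Demo_Char;  /* UTF-16 code units */
#else
typedef char Demo_Char;            /* UTF-8 bytes */
#endif

/* Counts code units up to (not including) the terminating zero unit. */
static size_t demo_len(const Demo_Char *s)
{
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}

/* Size in bytes of the same string, excluding the terminator. */
static size_t demo_bytes(const Demo_Char *s)
{
    return demo_len(s) * sizeof(Demo_Char);
}
```

A handler written this way would not need to change when the library is rebuilt with the other character width; using strlen() directly would only be correct in the UTF-8 build.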