[Expat-discuss] Line endings and the default handlers

Karl Waclawek karl@waclawek.net
Tue Jul 23 20:46:01 2002


> Karl Waclawek writes:
>  > I read the XML spec again, and it seems that absolutely every
>  > line break, no matter where, has to be normalized.
> 
> I've sent an email to the Python XML-SIG to see if anyone there thinks
> fixing this will be problematic:
> 
> http://mail.python.org/pipermail/xml-sig/2002-July/008105.html
> 

I also looked at the code again. I am almost sure that
it was intentional that there is no normalization for
the default handler. 

Actually, I just dug out this comment for the default handler
in expat.h, which explains it all:

/* This is called for any characters in the XML document for which
   there is no applicable handler.  This includes both characters that
   are part of markup which is of a kind that is not reported
   (comments, markup declarations), or characters that are part of a
   construct which could be reported but for which no handler has been
   supplied. The characters are passed exactly as they were in the XML
   document except that they will be encoded in UTF-8.  Line
   boundaries are not normalized. Note that a byte order mark
   character is not passed to the default handler. There are no
   guarantees about how characters are divided between calls to the
   default handler: for example, a comment might be split between
   multiple calls.
*/

So, it would be nice to know what the intention was.
Maybe to enable round-tripping? You do your normal processing
in the regular handler, then call XML_DefaultCurrent from
within that handler, so that the exact data from the input
document is passed on to the default handler. Just a guess.
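
To make the difference visible, here is a minimal sketch. It uses
Python's xml.parsers.expat binding rather than the C API, purely
for convenience, and the sample document and the collect() helper
are made up for illustration. It does not call XML_DefaultCurrent;
it only compares what the default handler and the character data
handler receive for the same input.

```python
import xml.parsers.expat

# Hypothetical sample document with a CR LF line break in the content.
DOC = b"<r>a\r\nb</r>"


def collect(handler_name):
    """Parse DOC with exactly one handler installed; return what it received."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate()
    # Install the named handler; every chunk it is called with is recorded.
    setattr(parser, handler_name, chunks.append)
    parser.Parse(DOC, True)
    # Expat may split data across several calls, so join the pieces.
    return "".join(chunks)


# Only the default handler is set: it sees markup and text unmodified.
via_default = collect("DefaultHandler")

# Only the character data handler is set: it sees normalized text.
via_chardata = collect("CharacterDataHandler")

print(repr(via_default))
print(repr(via_chardata))
```

With only the default handler installed, the joined chunks should
match the input byte for byte, CR LF included, while the character
data handler gets the line break as a single LF. That at least fits
the round-tripping guess.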

Karl