Alternatives to XML?

Fri Aug 26 13:48:00 EDT 2016

"Frank Millman" <frank at chagford.com>:

> "Joonas Liik"  wrote in message
> news:CAB1GNpTP0GD4s4kx07r1ujRNuXtOij4vF5uNYE1cFr_Y0xvi1g at mail.gmail.com...
>> I should note, though, that this example is very ad hoc. I'm no XML
>> expert; I just know a bit about XML entities. If you decide to go this
>> route, there are probably much better-tested functions out there to
>> escape text for storage in XML documents.
>
> Thanks very much, Joonas.
>
> I understand now, and it seems to work fine.
>
> As a bonus, I can now include '&' in my attributes in the future if the
> need arises.
>
> Much appreciated.
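
On the "better tested functions" point: a minimal sketch, assuming the
Python standard library is acceptable for your use. quoteattr() from
xml.sax.saxutils does the escaping and the quoting in one go. The helper
make_element and the sample value are made up for illustration; they
are not from your code.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr


def make_element(tag, attr_name, value):
    """Build an empty element with one safely quoted attribute.

    make_element is a hypothetical helper, not something from the thread.
    """
    # quoteattr() escapes &, < and >, chooses the quote character, and
    # writes tab, newline and CR as character references.
    return "<%s %s=%s/>" % (tag, attr_name, quoteattr(value))


original = 'a < b & c == "x"\tdone'
doc = make_element("rule", "expr", original)

# Parse the document back and compare with what we started with.
roundtrip = ET.fromstring(doc).get("expr")
print(doc)
print(roundtrip == original)
```

Writing the tab as a character reference is what lets it survive the
round trip; a literal tab in the attribute would come back as a space.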

XML attribute values are ridiculously complicated. From the standard:

   Before the value of an attribute is passed to the application or
   checked for validity, the XML processor MUST normalize the attribute
   value by applying the algorithm below, or by using some other method
   such that the value passed to the application is the same as that
   produced by the algorithm.

    1. All line breaks MUST have been normalized on input to #xA as
       described in 2.11 End-of-Line Handling, so the rest of this
       algorithm operates on text normalized in this way.

    2. Begin with a normalized value consisting of the empty string.

    3. For each character, entity reference, or character reference in
       the unnormalized attribute value, beginning with the first and
       continuing to the last, do the following:

        * For a character reference, append the referenced character to
          the normalized value.

        * For an entity reference, recursively apply step 3 of this
          algorithm to the replacement text of the entity.

        * For a white space character (#x20, #xD, #xA, #x9), append a
          space character (#x20) to the normalized value.

        * For another character, append the character to the normalized
          value.

   If the attribute type is not CDATA, then the XML processor MUST
   further process the normalized attribute value by discarding any
   leading and trailing space (#x20) characters, and by replacing
   sequences of space (#x20) characters by a single space (#x20)
   character.

   Note that if the unnormalized attribute value contains a character
   reference to a white space character other than space (#x20), the
   normalized value contains the referenced character itself (#xD, #xA
   or #x9). This contrasts with the case where the unnormalized value
   contains a white space character (not a reference), which is replaced
   with a space character (#x20) in the normalized value and also
   contrasts with the case where the unnormalized value contains an
   entity reference whose replacement text contains a white space
   character; being recursively processed, the white space character is
   replaced with a space character (#x20) in the normalized value.

   All attributes for which no declaration has been read SHOULD be
   treated by a non-validating processor as if declared CDATA.

   It is an error if an attribute value contains a reference to an
   entity for which no declaration has been read.

   Following are examples of attribute normalization. Given the
   following declarations:

     <!ENTITY d "&#xD;">
     <!ENTITY a "&#xA;">
     <!ENTITY da "&#xD;&#xA;">

   the attribute specifications in the left column below would be
   normalized to the character sequences of the middle column if the
   attribute a is declared NMTOKENS and to those of the right columns if
   a is declared CDATA.

   =================================================================
   Attribute specification:  a="

                             xyz"
   a is NMTOKENS:            x y z
   a is CDATA:               #x20 #x20 x y z
   =================================================================
   Attribute specification:  a="&d;&d;A&a;&#x20;&a;B&da;"
   a is NMTOKENS:            A #x20 B
   a is CDATA:               #x20 #x20 A #x20 #x20 #x20 B #x20 #x20
   =================================================================
   Attribute specification:  a="&#xd;&#xd;A&#xa;&#xa;B&#xd;&#xa;"
   a is NMTOKENS:            #xD #xD A #xA #xA B #xD #xA
   a is CDATA:               #xD #xD A #xA #xA B #xD #xA
   =================================================================

   Note that the last example is invalid (but well-formed) if a is
   declared to be of type NMTOKENS.

   <URL: https://www.w3.org/TR/REC-xml/#AVNormalize>
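
To make the algorithm concrete, here is a rough Python sketch of the
steps quoted above. It is a simplification under stated assumptions, not
a conforming processor: the function name normalize, the entities dict
(mapping entity names to their replacement text) and the cdata flag are
all my own invention, and well-formedness is not checked.

```python
import re

# One match per step-3 item: hex character reference, decimal character
# reference, entity reference, or any other single character.
TOKEN = re.compile(r"&#x([0-9a-fA-F]+);|&#([0-9]+);|&([^;&\s]+);|(.)",
                   re.DOTALL)

# Simplified replacement text for the predefined entities.
PREDEFINED = {"lt": "<", "gt": ">", "amp": "&", "apos": "'", "quot": '"'}


def normalize(raw, entities=None, cdata=True):
    """Sketch of attribute-value normalization (hypothetical helper)."""
    table = dict(PREDEFINED)
    table.update(entities or {})

    # Step 1: line breaks in the input are normalized to #xA.
    raw = raw.replace("\r\n", "\n").replace("\r", "\n")

    out = []  # Step 2: start with the empty string.

    def step3(text):
        for match in TOKEN.finditer(text):
            hexref, decref, name, char = match.groups()
            if hexref is not None:
                out.append(chr(int(hexref, 16)))  # referenced char as-is
            elif decref is not None:
                out.append(chr(int(decref)))
            elif name is not None:
                # Undeclared entity: KeyError, since this is an error.
                step3(table[name])
            elif char in " \r\n\t":
                out.append(" ")  # literal white space becomes #x20
            else:
                out.append(char)

    step3(raw)
    value = "".join(out)

    if not cdata:
        # Non-CDATA types: trim #x20 at both ends, collapse runs of #x20.
        value = re.sub(" +", " ", value.strip(" "))
    return value


# The entity declarations from the examples, as replacement text.
ents = {"d": "\r", "a": "\n", "da": "\r\n"}
print(repr(normalize("&d;&d;A&a;&#x20;&a;B&da;", ents)))
print(repr(normalize("&d;&d;A&a;&#x20;&a;B&da;", ents, cdata=False)))
print(repr(normalize("&#xd;&#xd;A&#xa;&#xa;B&#xd;&#xa;")))
```

The three calls correspond to the second and third rows of the table
above. Note that only #x20 is trimmed and collapsed in the non-CDATA
case, which is why the character references in the last row survive
untouched.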

Marko