[Expat-discuss] & symbol workaround

Wed Feb 4 22:15:23 CET 2009

Brad Causey wrote:
> I completely agree. Unfortunately, I don't have control over the code that
> generates these XML files.
> If there isn't a better alternative, I'll have to create a duplicate of
> EVERY file and parse each one at a text level to replace non-standard
> characters with a escaped version. (doing this for < is nearly impossible)
> This is something I am trying to avoid for obvious reasons. I don't like
> non-standard XML any more than the next guy. (I've been through 3 different
> python XML parsers trying to resolve this) But I'm running out of options.
> Any ideas?

This is not the world of network protocols. The markup world is very
strict about syntax. An entity is either a well-formed XML document
or it is not; there is no fuss and no doubt about it (believe it or
not, at this basic level all major parsers out there agree, even in
bizarre cases), and there is no Robustness Principle.

Something with a single unescaped ampersand in it isn't an XML
document. Period.

Even worse for you: I don't know of any parser that would let that pass.
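
To see this with expat itself, here is a minimal sketch using Python's
standard xml.parsers.expat module. The helper name is_well_formed and the
two sample documents are made up for illustration; they are not taken from
your files.

```python
import xml.parsers.expat


def is_well_formed(data: bytes) -> bool:
    """Return True if expat accepts `data` as a complete XML document."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        # The second argument marks this as the final chunk of input.
        parser.Parse(data, True)
    except xml.parsers.expat.ExpatError:
        return False
    return True


# Hypothetical sample inputs: one escaped ampersand, one bare ampersand.
good = b"<item>Tom &amp; Jerry</item>"
bad = b"<item>Tom & Jerry</item>"

print(is_well_formed(good))
print(is_well_formed(bad))
```

The second document is rejected with the same kind of "not well-formed
(invalid token)" error you are seeing.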

Raise the problem with whoever generates your input data, if only so that
you have it on record.

If 'they' force you to handle the problem yourself, I'm afraid there is
no other way, if you want to use an XML parser, than to modify your
input data in a preprocessing step, either on a copy or, if the files
are small, in memory.
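
A rough sketch of such an in-memory preprocessing step in Python follows.
The names BARE_AMP, escape_bare_ampersands and parse_lenient are
hypothetical, and the regular expression is only a heuristic: it assumes
that every "&" which does not look like a character or entity reference
should become "&amp;". It would also rewrite ampersands inside CDATA
sections, comments and processing instructions, where a literal "&" is
legal, and it does nothing about a stray "<".

```python
import re
import xml.parsers.expat

# An "&" that is NOT followed by something shaped like a reference:
# "#123;", "#x1F;" or "name;". Heuristic only, see the caveats above.
BARE_AMP = re.compile(
    r"&(?!#[0-9]+;|#x[0-9A-Fa-f]+;|[A-Za-z_:][A-Za-z0-9_.:-]*;)"
)


def escape_bare_ampersands(text: str) -> str:
    """Replace ampersands that do not start a reference with '&amp;'."""
    return BARE_AMP.sub("&amp;", text)


def parse_lenient(text: str) -> str:
    """Escape bare ampersands in memory, parse with expat, and return
    the concatenated character data of the document."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate()
    # expat may deliver character data in several pieces.
    parser.CharacterDataHandler = chunks.append
    parser.Parse(escape_bare_ampersands(text), True)
    return "".join(chunks)


print(parse_lenient("<item>Tom & Jerry</item>"))
```

Because the fix is applied to a string held in memory, the files on disk
stay in their original form.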

I'm sorry I don't have better news.
rolf

> 
> 
> 
> -Brad
> 
> 
> On Wed, Feb 4, 2009 at 2:30 PM, Nick <nickmacd at xxx.com> wrote:
> 
>> & is NOT valid as a standalone character in XML and needs to be
>> escaped as &amp; otherwise you are not parsing standard (and thus
>> valid) XML files, but in fact parsing some other hybrid thing.
>>
>> Referring to the XML standard ( http://www.w3.org/TR/REC-xml/ ):
>>
>> The ampersand character (&) and the left angle bracket (<) MUST NOT
>> appear in their literal form, except when used as markup delimiters,
>> or within a comment, a processing instruction, or a CDATA section. If
>> they are needed elsewhere, they MUST be escaped using either numeric
>> character references or the strings "&amp;" and "&lt;"
>> respectively. The right angle bracket (>) may be represented using the
>> string "&gt;", and MUST, for compatibility, be escaped using either
>> "&gt;" or a character reference when it appears in the string "]]>"
>> in content, when that string is not marking the end of a CDATA
>> section.
>>
>> So I would argue that you NEED to change the source files, in order to
>> bring them into line with the standard.
>>
>> Nick
>>
>>
>> On Wed, Feb 4, 2009 at 2:56 PM, Brad Causey <bradcausey at xxx.com>
>> wrote:
>>> I am working on a Python script that parses around 6800 small xml files.
>>> My code isn't pretty, as I am just testing a PoC at this point, but I have
>>> run into a problem. When the script hits the Ampersand symbol, it quits with
>>> "xml.parsers.expat.ExpatError: not well-formed (invalid token): line 28,
>>> column 41"
>>>
>>> I am trying to figure out a way to work around this without modifying the
>>> XML files themselves as these need to be preserved in the original format.
> _______________________________________________
> Expat-discuss mailing list
> Expat-discuss at libexpat.org
> http://mail.libexpat.org/mailman/listinfo/expat-discuss
> 
>