Newbie XML SAX Parsing: How do I ignore an invalid token?

"Martin v. Löwis" martin at v.loewis.de
Sun Jan 7 14:32:38 EST 2007


scott at crybabymaternity.com wrote:
> Is there a Pythonic way to read the file and identify any illegal XML
> characters so I can strip them out? this would keep my program more
> flexible - if the vendor is going to allow one illegal character in
> their document, there's no way of knowing if another one will pop up
> later.

Notice that you are talking about bytes here, not characters. It is
inherently difficult to determine invalid bytes - you first have to
determine the encoding, then (mentally) decode, and then find out
whether there are any invalid characters.

The invalid XML characters can be found in

http://www.w3.org/TR/2006/REC-xml-20060816/#charsets

So invalid characters are #x0 .. #x8, #xB, #xC, #xE .. #x1F,
#xD800 .. #xDFFF, #xFFFE, #xFFFF.
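
As a rough illustration of the list above, here is a minimal sketch
that reports where invalid characters occur in already-decoded text.
It assumes Python 3, where str holds characters; the function name
find_invalid_xml_chars is made up for this example.

```python
import re

# Character class built from the invalid ranges listed above
# (everything outside the XML 1.0 Char production, up to #xFFFF).
_INVALID_XML_CHARS = re.compile(
    r"[\x00-\x08\x0B\x0C\x0E-\x1F\uD800-\uDFFF\uFFFE\uFFFF]"
)


def find_invalid_xml_chars(text):
    """Return (index, code point) pairs for each invalid character in text."""
    return [(m.start(), ord(m.group()))
            for m in _INVALID_XML_CHARS.finditer(text)]
```

Tab (#x9), line feed (#xA) and carriage return (#xD) are deliberately
not in the class, since XML allows them.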

If you restrict attention to only the invalid characters below
#x20 (i.e. control characters), and also restrict attention to
encodings that are strict ASCII supersets (ASCII, ISO-8859-x,
UTF-8), you can filter out the invalid characters on the byte
level. Otherwise, you have to decode, filter out on the character
level, and then encode again. Neither approach will deal with
bytes that are invalid with respect to the encoding.

To filter out these bytes, I recommend using str.translate.
Pass an identity table as the translation table, and pass the
bytes you want deleted as the deletechars argument.
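
A minimal sketch of that byte-level filter follows. It assumes
Python 3, where the equivalent method is bytes.translate, and an
encoding that is a strict ASCII superset; the names IDENTITY_TABLE,
INVALID_CONTROL_BYTES and strip_control_bytes are made up for this
example.

```python
# Identity table: every byte value maps to itself.
IDENTITY_TABLE = bytes(range(256))

# Bytes below 0x20 that XML forbids: all except tab, LF and CR.
INVALID_CONTROL_BYTES = bytes(
    b for b in range(0x20) if b not in (0x09, 0x0A, 0x0D)
)


def strip_control_bytes(data):
    """Delete invalid control bytes; all other bytes pass through unchanged.

    Only safe for ASCII-superset encodings (ASCII, ISO-8859-x, UTF-8).
    """
    # Passing None instead of IDENTITY_TABLE has the same effect.
    return data.translate(IDENTITY_TABLE, INVALID_CONTROL_BYTES)
```

The cleaned bytes can then be handed to the SAX parser in place of the
original file contents.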

Regards,
Martin


