[XML-SIG] Newbie : Identifying characters that will choke XML parser

James Oakley joakley@solutioninc.com
Tue, 6 May 2003 10:40:32 -0300


On Tuesday 06 May 2003 09:17 am, Ian Sparks wrote:
> Hmm...as I feared. As I discover new XML-chokers I'm building up a library
> like :
>
> #Remove ACK's (I've seen it!)
> w = w.replace(chr(6),'')
> #Remove ... characters (again, I've seen it)
> w = w.replace(chr(133),'')
>
> I was hoping to find some way of identifying everything that will choke my
> XML, some rule to auto-filter out the nastiness..

I had the same trouble with an xmlrpclib.loads() call. Here's what I did:


import re

# These control characters are forbidden in XML 1.0 (tab, LF and CR are
# allowed). We'll just drop them
badchars = (0, 1, 2, 3, 4, 5, 6, 7, 8, 11, 12, 14, 15, 16, 17, 18, 19, 20,
            21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31)

# Remove invalid characters
data = ''.join([c for c in data if ord(c) not in badchars])

# Convert DEL and 8-bit characters to numeric entities
# (this treats each byte value as a Latin-1 code point)
data = re.sub(r'[\x7F-\xFF]',
              lambda m: '&#%d;' % ord(m.group(0)),
              data)
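
For anyone after a single rule rather than a growing list, here is a rough
sketch of the same two steps written against the "Char" production in the
XML 1.0 spec. It assumes a current Python 3 and that the data has already been
decoded to a str. The names strip_invalid_xml_chars and to_ascii_xml are made
up for illustration and are not part of xmlrpclib or any other library.

```python
import re

# Code points outside the XML 1.0 "Char" production: C0 controls other
# than tab, LF and CR, the surrogate range, and U+FFFE / U+FFFF.
_INVALID_XML_CHARS = re.compile(
    r'[\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff]'
)


def strip_invalid_xml_chars(text):
    """Drop every character that XML 1.0 does not allow in a document."""
    return _INVALID_XML_CHARS.sub('', text)


def to_ascii_xml(text):
    """Strip invalid characters, then write non-ASCII ones as &#N; references."""
    cleaned = strip_invalid_xml_chars(text)
    # 'xmlcharrefreplace' is a standard codec error handler that emits
    # numeric character references for anything ASCII cannot encode.
    return cleaned.encode('ascii', 'xmlcharrefreplace').decode('ascii')


if __name__ == '__main__':
    print(to_ascii_xml('caf\xe9 \x06ok'))
```

This only deals with illegal and non-ASCII characters. Markup characters such
as & and < still need escaping separately, for example with
xml.sax.saxutils.escape().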



Hope that helps,

-- 
James Oakley
Engineering - SolutionInc Ltd.
joakley@solutioninc.com
http://www.solutioninc.com