[XML-SIG] checking a string for well-formedness

Paul Tremblay phthenry@earthlink.net
Wed, 7 May 2003 12:04:18 -0400


I need to check a string for well-formedness. I stumbled across the
fact that you can use expat directly, so I devised this code, which
works so long as Unicode and entities aren't used:

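import sys
import xml.parsers.expat

def validate(data):
    # Create a fresh parser on each call: an expat parser object
    # can parse only one document.
    parser = xml.parsers.expat.ParserCreate()
    try:
        # The second argument marks this as the final chunk of input.
        parser.Parse(data, 1)
        return 0
    except xml.parsers.expat.ExpatError:
        sys.stderr.write('tagging text will result in invalid XML\n')
        return 1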

data = '<doc><tag>text</tag><tag>text,</tag></doc>'
validate(data)

The function validate returns 0 in this case. However, if I try this:

data = u'<doc><tag>text</tag><tag>text\u201c</tag></doc>'

I get the following error:


Traceback (most recent call last):
  File "/home/paul/lib/python/paul/xml/expat.py", line 50, in ?
    parser.Parse(data)
UnicodeError: ASCII encoding error: ordinal not in range(128)

Any idea what is going on here? 
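
One plausible reading of the traceback, offered as an assumption rather
than a confirmed diagnosis: Parse converts the unicode object to a byte
string with the default ASCII codec, and \u201c has no ASCII
representation. A minimal sketch of a workaround is to encode the string
to UTF-8 explicitly before parsing, since expat treats input without an
XML declaration as UTF-8. The name validate_unicode is made up for
illustration.

```python
import xml.parsers.expat

def validate_unicode(data):
    """Return 0 if data is well-formed XML, 1 otherwise."""
    # Encode text to UTF-8 bytes ourselves instead of relying on an
    # implicit conversion inside Parse.
    if not isinstance(data, bytes):
        data = data.encode('utf-8')
    # A fresh parser per call; one parser object handles one document.
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(data, True)  # True: this is the final chunk
        return 0
    except xml.parsers.expat.ExpatError:
        return 1

result = validate_unicode(u'<doc><tag>text</tag><tag>text\u201c</tag></doc>')
```

If the string carries an XML declaration that names a different
encoding, the bytes handed to the parser would have to match that
declaration instead.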

I have rewritten the function so that it writes the string to a
file, and then I use SAX to parse the file. If SAX fails, I know I
have ill-formed XML. However, this second solution is a kludge. I
would like to be able to test the string directly.
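
The temporary file may not be needed for the SAX route either. A
sketch, assuming the standard library's xml.sax.parseString is
available and that the string is encoded to UTF-8 first; the name
is_well_formed is hypothetical:

```python
import xml.sax
from xml.sax.handler import ContentHandler

def is_well_formed(data):
    """Return True if data parses as XML, False otherwise."""
    # Hand the parser UTF-8 bytes so no implicit conversion is involved.
    if not isinstance(data, bytes):
        data = data.encode('utf-8')
    try:
        # A bare ContentHandler ignores every event; only the outcome
        # of the parse matters here.
        xml.sax.parseString(data, ContentHandler())
        return True
    except xml.sax.SAXParseException:
        return False

good = is_well_formed(u'<doc><tag>text</tag><tag>text\u201c</tag></doc>')
bad = is_well_formed(u'<doc><tag>text</tag><tag>text</doc>')
```

This only checks well-formedness; nothing is validated against a DTD or
schema.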

Thanks

Paul