[XML-SIG] Re: checking a string for well-formedness
Paul Tremblay
phthenry@earthlink.net
Thu, 8 May 2003 18:08:38 -0400
> the parse function requires an 8-bit string, and Python defaults
> to ASCII when converting Unicode to 8-bit data.
I must be dense when it comes to unicode. So Python converts unicode
to a 7-bit (ASCII) string?
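For reference, the conversion described in the quoted reply does not silently drop to 7 bits. The ASCII codec refuses any character above 127, and that refusal is the UnicodeError in the traceback. A small sketch of the difference between the two codecs (the variable names are illustrative only, and the explicit encode calls stand in for the implicit conversion the parser triggers):

```python
# U+201C (left double quotation mark) is outside the ASCII range.
text = u'text\u201c'

# The ASCII codec cannot represent U+201C, so encoding raises an error
# instead of producing a truncated 7-bit string.
try:
    text.encode('ascii')
    ascii_ok = True
except UnicodeError:
    ascii_ok = False

# UTF-8 can represent every Unicode character; U+201C becomes three bytes.
utf8_bytes = text.encode('utf-8')
```

So the string is never converted to ASCII with losses. The conversion either succeeds unchanged or fails outright.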
Your solution worked, but then I immediately came up with a new problem
when I tried to test the speed of this function:
# assume the exact same function from below, which I cut and pasted
for j in range(10):
    data = u'<doc><tag>text\u201c</tag><tag>thext,</tag></doc>'
    validate(data)
The first time the string is tested, it comes out as valid. But every
single instance afterwards comes out as ill-formed XML.
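One plausible explanation, not confirmed anywhere in this thread: an expat parser object handles exactly one document. Once Parse(data, 1) has marked the input as final, any later call on the same module-level parser raises ExpatError, which validate then reports as ill-formed XML. A minimal sketch, assuming it is acceptable to create a new parser on every call:

```python
import sys
import xml.parsers.expat

def validate(data):
    """Return 0 if data is well-formed XML, 1 otherwise."""
    # Assumption: a fresh parser per call, because an expat parser
    # cannot be reused after it has been told the input is final.
    parser = xml.parsers.expat.ParserCreate()
    try:
        # Encode Unicode input to UTF-8, the XML default encoding.
        if isinstance(data, type(u"")):
            data = data.encode("utf-8")
        # The second argument tells the parser no more data will follow.
        parser.Parse(data, 1)
        return 0
    except xml.parsers.expat.ExpatError:
        sys.stderr.write('tagging text will result in invalid XML\n')
        return 1

# Same loop as above: every iteration now gets its own parser.
data = u'<doc><tag>text\u201c</tag><tag>thext,</tag></doc>'
results = [validate(data) for j in range(10)]
```

Creating a parser per call costs a little time on each iteration, which is worth keeping in mind when measuring speed.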
Thanks
Paul
On Thu, May 08, 2003 at 11:54:57AM +0200, Fredrik Lundh wrote:
> To: xml-sig@python.org
> From: "Fredrik Lundh" <fredrik@pythonware.com>
> Subject: [XML-SIG] Re: checking a string for well-formedness
> Date: Thu, 8 May 2003 11:54:57 +0200
>
> Paul Tremblay wrote:
>
> > import xml.parsers.expat
> > parser = xml.parsers.expat.ParserCreate()
> > import sys
> >
> > def validate(data):
> >     parser.Parse(data)
> >     try:
> >         parser.Parse(data)
> >         return 0
> >     except xml.parsers.expat.ExpatError:
> >         sys.stderr.write('tagging text will result in invalid XML\n')
> >         return 1
> >
> > data = '<doc><tag>text</tag><tag>text,</tag></doc>'
> > validate(data)
> >
> > The function validate returns 0 in this case.
>
> or raise an exception, if you don't remove the first call to
> parser.Parse(data).
>
> unfortunately, even if you remove that line, the function may
> still return 0 for invalid XML snippets, e.g:
>
> > data = '<doc><tag>text</tag><tag>text,</tag>'
>
> to fix this, you have to tell the parser that you won't call
> it again with more data:
>
> parser.Parse(data, 1)
>
> > However, if I try this:
> >
> > data = u'<doc><tag>text</tag><tag>text\u201c</tag></doc>'
> >
> > I get the following error:
> >
> > Traceback (most recent call last):
> >   File "/home/paul/lib/python/paul/xml/expat.py", line 50, in ?
> >     parser.Parse(data)
> > UnicodeError: ASCII encoding error: ordinal not in range(128)
> >
> > Any idea what is going on here?
>
> the parse function requires an 8-bit string, and Python defaults
> to ASCII when converting Unicode to 8-bit data.
>
> the simplest way to work around this is to convert the string to
> the XML default encoding (utf-8) on the way in:
>
> def validate(data):
>     try:
>         if isinstance(data, type(u"")):
>             data = data.encode("utf-8")
>         parser.Parse(data, 1)
>         return 0
>     except xml.parsers.expat.ExpatError:
>         sys.stderr.write('tagging text will result in invalid XML\n')
>         return 1
>
> </F>
--
************************
*Paul Tremblay *
*phthenry@earthlink.net*
************************