[XML-SIG] Re: checking a string for well-formedness

Paul Tremblay phthenry@earthlink.net
Thu, 8 May 2003 18:08:38 -0400


> the parse function requires an 8-bit string, and Python defaults
> to ASCII when converting Unicode to 8-bit data.

I must be dense when it comes to unicode. So Python converts unicode
to a 7-bit (ASCII) string?
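
For illustration only, here is a minimal sketch of the conversion in question. It uses an explicit encode() call, where the Python of this thread performed the same ASCII conversion implicitly. The names text, ascii_ok and utf8_bytes are hypothetical and not from the original messages.

```python
# Sketch: encoding a Unicode string that contains U+201C.
text = u'text\u201c'

try:
    # ASCII only covers code points 0..127, so this is expected to fail.
    text.encode('ascii')
    ascii_ok = True
except UnicodeError:
    ascii_ok = False

# UTF-8 can represent every code point, so this succeeds.
utf8_bytes = text.encode('utf-8')
```

The ASCII attempt fails because U+201C is outside range(128), which matches the error in the quoted traceback, while the UTF-8 encoding yields an 8-bit string the parser can accept.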

Your solution worked, but then I immediately came up with a new problem
when I tried to test the speed of this function:

# assume the exact same function from below, which I cut and pasted
for j in range(10):
    data = u'<doc><tag>text\u201c</tag><tag>thext,</tag></doc>'
    validate(data)

The first time the string is tested, it comes out as valid. But every
single instance afterwards comes out as ill-formed XML.
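
A possible explanation, offered as an assumption rather than something confirmed in this thread: once Parse(data, 1) has marked the document as final, the same expat parser object cannot accept more data, so every later call raises ExpatError. A minimal sketch that creates a fresh parser for each call (the results list is hypothetical, added only to show the loop outcome):

```python
import xml.parsers.expat

def validate(data):
    """Return 0 if data is well-formed XML, 1 otherwise."""
    # Assumption: build a new parser on every call, because a parser
    # that has already seen its final chunk cannot be reused.
    parser = xml.parsers.expat.ParserCreate()
    try:
        if isinstance(data, type(u"")):
            # Hand the parser 8-bit UTF-8 data, as in the quoted reply.
            data = data.encode("utf-8")
        parser.Parse(data, 1)  # 1 means no more data will follow
        return 0
    except xml.parsers.expat.ExpatError:
        return 1

# Repeat the speed-test loop from above and collect the return values.
results = []
for j in range(10):
    data = u'<doc><tag>text\u201c</tag><tag>text,</tag></doc>'
    results.append(validate(data))
```

Creating the parser inside the function costs a little per call, but it keeps each check independent of the ones before it.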

Thanks

Paul


On Thu, May 08, 2003 at 11:54:57AM +0200, Fredrik Lundh wrote:
> To: xml-sig@python.org
> From: "Fredrik Lundh" <fredrik@pythonware.com>
> Subject: [XML-SIG] Re: checking a string for well-formedness
> Date: Thu, 8 May 2003 11:54:57 +0200
> 
> Paul Tremblay wrote:
> 
> > import xml.parsers.expat
> > parser = xml.parsers.expat.ParserCreate()
> > import sys
> >
> > def validate(data):
> >     parser.Parse(data)
> >     try:
> >         parser.Parse(data)
> >         return 0
> >     except xml.parsers.expat.ExpatError:
> >         sys.stderr.write('tagging text will result in invalid XML\n')
> >         return 1
> >
> > data = '<doc><tag>text</tag><tag>text,</tag></doc>'
> > validate(data)
> >
> > The function validate returns 0 in this case.
> 
> or raise an exception, if you don't remove the first call to
> parser.Parse(data).
> 
> unfortunately, even if you remove that line, the function may
> still return 0 for invalid XML snippets, e.g:
> 
> > data = '<doc><tag>text</tag><tag>text,</tag>'
> 
> to fix this, you have to tell the parser that you won't call
> it again with more data:
> 
>     parser.Parse(data, 1)
> 
> > However, if I try this:
> >
> > data = u'<doc><tag>text</tag><tag>text\u201c</tag></doc>'
> >
> > I get the following error:
> >
> > Traceback (most recent call last):
> >   File "/home/paul/lib/python/paul/xml/expat.py", line 50, in ?
> >     parser.Parse(data)
> > UnicodeError: ASCII encoding error: ordinal not in range(128)
> >
> > Any idea what is going on here?
> 
> the parse function requires an 8-bit string, and Python defaults
> to ASCII when converting Unicode to 8-bit data.
> 
> the simplest way to work around this is to convert the string to
> the XML default encoding (utf-8) on the way in:
> 
> def validate(data):
>     try:
>         if isinstance(data, type(u"")):
>             data = data.encode("utf-8")
>         parser.Parse(data, 1)
>         return 0
>     except xml.parsers.expat.ExpatError:
>         sys.stderr.write('tagging text will result in invalid XML\n')
>         return 1
> 
> </F>

-- 

************************
*Paul Tremblay         *
*phthenry@earthlink.net*
************************