[XML-SIG] Re: Re: checking a string for well-formedness

Paul Tremblay phthenry@earthlink.net
Thu, 8 May 2003 21:51:52 -0400


On Fri, May 09, 2003 at 12:16:26AM +0200, Fredrik Lundh wrote:
> 
> (please don't top-post)
> 
> Paul Tremblay wrote:
> 
> > > the parse function requires an 8-bit string, and Python defaults
> > > to ASCII when converting Unicode to 8-bit data.
> >
> > I must be dense when it comes to unicode. So Python converts unicode
> > to a 7-bit (ASCII) string?
> 
> if you're using a Unicode string where Python expects an 8-bit
> string, Python refuses to guess, and raises an exception if the
> Unicode string contains anything that's not plain ASCII.
> 

This makes a bit more sense. I'll have to read up on encoding.
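
For anyone following along, here is a minimal sketch of the point above. It
makes the conversion explicit rather than implicit, and the variable names
(ascii_ok, utf8_data) are only illustrative. Encoding the test string as
ASCII fails because of the U+201C character, while encoding it as UTF-8
works:

```python
# The test string contains U+201C (left double quotation mark), which is
# outside the ASCII range.
data = u'<doc><tag>text\u201c</tag><tag>thext,</tag></doc>'

# An explicit ASCII encode fails on the non-ASCII character.
try:
    data.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Choosing an encoding explicitly avoids the problem.
utf8_data = data.encode("utf-8")
```

This is why the encode("utf-8") call in validate() is needed before the data
is handed to the parser.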

> > Your solution worked, but then I immediately came up with a new problem
> > when I tried to test the speed of this function:
> >
> > # assume the exact same function from below, which I cut and pasted
> > for j in range(10):
> >     data = u'<doc><tag>text\u201c</tag><tag>thext,</tag></doc>'
> >     validate(data)
> >
> > The first time the string is tested, it comes out as valid. But every
> > single instance afterwards comes out as ill-formed XML.
> 
> You have to create a new parser for each run (my mistake; I'd already
> fixed two bugs in your code, and missed the third one ;-)
> 

Thanks! I thought that it would take a lot of time to create a new
instance each time (don't know why). It takes only one second on my
100 MHz machine to test my string 1,000 times. This method is much
faster than my regular expression hack.
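
Putting the pieces together, this is roughly what the function looks like
with the fix applied. It is a sketch assembled from the quoted code below,
not a definitive version; the names results, shared, reuse_failed and
elapsed are my own additions for illustration, and the timeit call is just
one way to repeat the timing test:

```python
import sys
import timeit
import xml.parsers.expat


def validate(data):
    """Return 0 if data is well-formed XML, 1 otherwise."""
    try:
        if isinstance(data, type(u"")):
            data = data.encode("utf-8")
        # An expat parser cannot be reused once it has seen the final
        # chunk, so create a fresh one on every call.
        parser = xml.parsers.expat.ParserCreate()
        parser.Parse(data, True)
        return 0
    except xml.parsers.expat.ExpatError:
        sys.stderr.write('tagging text will result in invalid XML\n')
        return 1


data = u'<doc><tag>text\u201c</tag><tag>thext,</tag></doc>'

# Every run should now report well-formed XML.
results = [validate(data) for j in range(10)]

# For comparison: parsing a second document with the same parser fails.
shared = xml.parsers.expat.ParserCreate()
shared.Parse(data.encode("utf-8"), True)
try:
    shared.Parse(data.encode("utf-8"), True)
    reuse_failed = False
except xml.parsers.expat.ExpatError:
    reuse_failed = True

# Time 1,000 validations; the result depends entirely on the machine.
elapsed = timeit.timeit(lambda: validate(data), number=1000)
```

The shared-parser part reproduces the behaviour I was seeing: the first
Parse succeeds and the second one raises ExpatError.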

Paul

> > > def validate(data):
> > >     try:
> > >         if isinstance(data, type(u"")):
> > >             data = data.encode("utf-8")
> 
> + +         parser = xml.parsers.expat.ParserCreate()
> 
> > >         parser.Parse(data, 1)
> > >         return 0
> > >     except xml.parsers.expat.ExpatError:
> > >         sys.stderr.write('tagging text will result in invalid XML\n')
> > >         return 1
> 
> </F>

-- 

************************
*Paul Tremblay         *
*phthenry@earthlink.net*
************************