Partial victory (was RE: [Python-Dev] RE: test_sax failing (Windows))

uche.ogbuji@fourthought.com uche.ogbuji@fourthought.com
Tue, 23 Jan 2001 10:55:12 -0700


> "M.-A. Lemburg" wrote:
> ...
> > > The codes from 192 to 236, 238-243 produce
> > > "UTF-8 decoding error: invalid data",
> > > the rest gives "not well-formed".
> > >
> > > I would like to know if this happens with your (Tim) modified
> > > version as well. I'm using plain vanilla BeOpen Python 2.0 .
> > 
> > This has nothing to do with Python. UTF-8 marks the codes
> > from 128-191 as illegal prefix. See Object/unicodeobject.c:
> ...
> 
> A pity.
> 
> > Perhaps the parser should catch the UnicodeError and
> > instead return a not-wellformed exception ?!
> 
> I believe it would be better.

Yes, and given there is not much time before the 2.1 release, doing so is an 
acceptable stop-gap.  However, I think the real fix has to lie in expat.
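
To make the proposed stop-gap concrete, here is a minimal sketch of catching a UnicodeError and reporting it as a parse error instead. It is not the actual pyexpat or xml.sax code: feed_as_wellformedness_check and strict_feed are hypothetical names, strict_feed is only a stand-in for a parser's feed step, and the choice of SAXParseException with an empty Locator is an assumption made for illustration.

```python
from xml.sax import SAXParseException
from xml.sax.xmlreader import Locator


def feed_as_wellformedness_check(feed, data, locator=None):
    """Call feed(data); re-raise any UnicodeError as a SAX parse error."""
    if locator is None:
        # The base Locator reports -1 for line/column and None for ids,
        # which is enough when no position information is available.
        locator = Locator()
    try:
        return feed(data)
    except UnicodeError as err:
        # Keep the original exception so the decoding detail is not lost.
        raise SAXParseException("not well-formed (invalid UTF-8)", err, locator)


def strict_feed(data):
    """Hypothetical stand-in for a parser feed step that decodes UTF-8 strictly."""
    return data.decode("utf-8")
```

With this wrapper, a caller that already handles SAXParseException sees a stray byte such as 0x80 as a well-formedness problem, and getException() on the raised error still returns the underlying UnicodeError.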

I just had a *very* quick and dirty perusal of expat 1.2 and 1.95.1, and not 
only do the UTF-8 validity checks (at the top of xmltok.c) seem wrong, but it 
doesn't look as if they're ever invoked.
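
For reference, the kind of check under discussion can be sketched as follows. This is not a transcription of xmltok.c; it is an independent illustration that follows the lead-byte ranges of the current UTF-8 definition (RFC 3629, which postdates this message), and the function names are hypothetical. It checks structure only: lead byte, sequence length, and continuation bytes. It does not reject every overlong or surrogate encoding.

```python
def utf8_sequence_length(lead):
    """Return the sequence length announced by a lead byte, or 0 if invalid."""
    if lead < 0x80:
        return 1  # plain ASCII
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    # 0x80-0xBF are continuation bytes and can never start a sequence;
    # 0xC0, 0xC1 and 0xF5-0xFF are not used by RFC 3629 at all.
    return 0


def has_valid_utf8_structure(data):
    """Check lead bytes, lengths and continuation bytes of a bytes object."""
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        if n == 0 or i + n > len(data):
            return False  # illegal lead byte or truncated sequence
        # Every trailing byte must have the bit pattern 10xxxxxx.
        if any(b & 0xC0 != 0x80 for b in data[i + 1:i + n]):
            return False
        i += n
    return True
```

The first function reflects the point quoted above: codes 128-191 are never legal as a prefix, so a parser that meets one where a character should begin has invalid input regardless of what follows.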

I'll try to find some time to look into this more closely, or perhaps someone 
will straighten me out if I'm on the wrong trail.


-- 
Uche Ogbuji                               Principal Consultant
uche.ogbuji@fourthought.com               +1 303 583 9900 x 101
Fourthought, Inc.                         http://Fourthought.com 
4735 East Walnut St, Ste. C, Boulder, CO 80301-2537, USA
Software-engineering, knowledge-management, XML, CORBA, Linux, Python