[XML-SIG] sgmlop and html parsing

Alexandre Fayolle Alexandre.Fayolle at logilab.fr
Wed Jan 14 09:36:52 EST 2004


On Wed, Jan 14, 2004 at 09:26:17AM -0500, Thomas B. Passin wrote:
> Alexandre Fayolle wrote:
> >>
> >>This should happen only if self->unicode is false. This is XML parsing,
> >>right? If so, you should enable self->unicode, and it will give you
> >>a unicode character (in handle_data).
> >
> >
> >This is netscape bookmark parsing, so this is not well formed XML (lots
> >of tags are not closed). 
> >
> >demo/xbel/ns_parse.py calls sax2exts.SGMLParserFactory.make_parser(), so
> >I expect it to return an SGML parser, and not an XML reader. 
> 
> I took a different approach.  To parse Netscape bookmark files, I just 
> take the default parser and handle the encoding with a few patches in 
> the downstream code. (I have found that setting the encoding to utf-8 
> works reliably in Mozilla-derived browsers on Windows 2000.)

Would you mind committing your changes to CVS so that they can ship
in PyXML 0.8.4? Your patches are likely to be better than mine, since you
seem to be using the tools on a daily basis.
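
For readers following the thread, here is a rough sketch of the "parse
first, decode downstream" idea quoted above. It is not the actual patch
and does not use sax2exts or sgmlop. It assumes Python 3's standard
html.parser module as a stand-in for a lenient parser, and the names
BookmarkCollector, decode_downstream and parse_bookmarks are hypothetical.
The utf-8 default is the assumption reported in the quoted message.

```python
from html.parser import HTMLParser

ASSUMED_ENCODING = "utf-8"  # assumption taken from the quoted message


class BookmarkCollector(HTMLParser):
    """Collect (title, href) pairs from Netscape bookmark markup."""

    def __init__(self):
        # convert_charrefs=False: the parser leaves character data as-is.
        super().__init__(convert_charrefs=False)
        self.bookmarks = []
        self._href = None
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased.
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._chunks = []

    def handle_data(self, data):
        if self._href is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.bookmarks.append(("".join(self._chunks), self._href))
            self._href = None


def decode_downstream(text, encoding=ASSUMED_ENCODING):
    """Recover the original bytes and decode them after parsing."""
    return text.encode("latin-1").decode(encoding, errors="replace")


def parse_bookmarks(raw_bytes):
    """Parse undecoded bookmark bytes; decode titles only afterwards."""
    parser = BookmarkCollector()
    # latin-1 maps every byte to one code point, so no bytes are lost
    # and the parser effectively sees the raw, undecoded input.
    parser.feed(raw_bytes.decode("latin-1"))
    parser.close()
    return [(decode_downstream(title), href)
            for title, href in parser.bookmarks]
```

Unclosed tags such as <DT> and <p> are simply ignored here, which is
why a tolerant parser suits this file format better than an XML reader.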

-- 
Alexandre Fayolle
LOGILAB, Paris (France).
http://www.logilab.com   http://www.logilab.fr  http://www.logilab.org
Advanced software development - Artificial Intelligence - Training
