Is there an HTML parser that can reconstruct the original HTML EXACTLY?

Wed Jan 23 16:47:37 EST 2008

ios... at gmail.com wrote:
> Hi, I am looking for an HTML parser that can parse a given page into
> a DOM tree and can reconstruct the exact original HTML source.
> Strictly speaking, I should be able to retrieve the original
> source at each internal node of the DOM tree.
>     I have tried Beautiful Soup, which is really nice when dealing with
> those god-damned ill-formed documents, but it's a pity to find
> that it cannot retrieve the original source because of its great
> tidying job.
>     Since Beautiful Soup, like most of the other HTML parsers in
> Python, is to some extent a subclass of sgmllib.SGMLParser, I have
> investigated the source code of sgmllib.SGMLParser to see if there is
> anything I can do to tell Beautiful Soup where it can find every tag
> segment in the HTML source, but this will be a time-consuming job.
>     So... any ideas?
>

A while ago I had a similar need, but my solution may not solve your
problem.

I wanted to rewrite the URLs contained in links, images, etc., but not
modify any of the rest of the HTML. I created an HTML parser (based on
sgmllib) that fires callbacks as it encounters tags, attributes, etc.

This makes it easy to process a stream without 'damaging' the beautiful
original structure of crap HTML - but it doesn't provide a DOM.

http://www.voidspace.org.uk/python/recipebook.shtml#scraper
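
A minimal sketch of the same idea in present-day Python, for illustration
only. It is not the code from the recipe above: it assumes Python 3 and uses
html.parser.HTMLParser instead of sgmllib (which was removed in Python 3).
The names UrlRewriter, rewrite_urls and ATTR_RE are made up for this example.
The parser is used only to locate start tags; everything else is copied
verbatim from the source, and only href/src values are touched.

```python
import re
from html.parser import HTMLParser

# Hypothetical helper pattern: an href/src attribute with a double-quoted,
# single-quoted or unquoted value. A sketch, not a full attribute grammar.
ATTR_RE = re.compile(
    r'''(\s(?:href|src)\s*=\s*)(?:"([^"]*)"|'([^']*)'|([^\s>]+))''',
    re.IGNORECASE,
)


class UrlRewriter(HTMLParser):
    """Copy the source through unchanged, rewriting only href/src values."""

    def __init__(self, source, rewrite):
        super().__init__(convert_charrefs=False)
        self.source = source
        self.rewrite = rewrite
        # Absolute offset at which each line starts, so that the
        # (line, column) pair from getpos() can be turned into an index.
        self.line_starts = [0]
        for i, ch in enumerate(source):
            if ch == "\n":
                self.line_starts.append(i + 1)
        self.out = []
        self.cursor = 0  # everything before this index is already emitted

    def _sub(self, match):
        prefix = match.group(1)
        if match.group(2) is not None:
            return prefix + '"' + self.rewrite(match.group(2)) + '"'
        if match.group(3) is not None:
            return prefix + "'" + self.rewrite(match.group(3)) + "'"
        return prefix + self.rewrite(match.group(4))

    def handle_starttag(self, tag, attrs):
        raw = self.get_starttag_text()  # the tag exactly as written
        lineno, col = self.getpos()     # position of the tag's '<'
        start = self.line_starts[lineno - 1] + col
        # Emit the untouched text before the tag, then the patched tag.
        self.out.append(self.source[self.cursor:start])
        self.out.append(ATTR_RE.sub(self._sub, raw))
        self.cursor = start + len(raw)


def rewrite_urls(source, rewrite):
    parser = UrlRewriter(source, rewrite)
    parser.feed(source)
    parser.close()
    parser.out.append(source[parser.cursor:])  # trailing text, verbatim
    return "".join(parser.out)
```

Calling rewrite_urls(page, lambda url: url) should hand back the page
unchanged, odd casing, stray whitespace and comments included. Note that the
callback sees the raw attribute text (entities such as &amp; are not decoded),
and, as with the original approach, there is still no DOM.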

All the best,

Michael Foord
http://www.manning.com/foord

>
> cheers
> kai liu