Is there an HTML parser that can reconstruct the original HTML EXACTLY?
Fuzzyman
fuzzyman at gmail.com
Wed Jan 23 16:47:37 EST 2008
ios... at gmail.com wrote:
> Hi, I am looking for an HTML parser that can parse a given page into
> a DOM tree and can reconstruct the exact original HTML source.
> Strictly speaking, I should be able to retrieve the original
> source at each internal node of the DOM tree.
> I have tried Beautiful Soup, which is really nice when dealing with
> those god damned ill-formed documents, but it's a pity that it
> cannot retrieve the original source because of its thorough tidying.
> Since Beautiful Soup, like most of the other HTML parsers in
> Python, is to some extent a subclass of sgmllib.SGMLParser, I have
> looked through the source code of sgmllib.SGMLParser to see if there is
> anything I can do to tell Beautiful Soup where to find every tag
> segment in the HTML source, but this will be a time-consuming job.
> So... any ideas?
>
A while ago I had a similar need, but my solution may not solve your
problem.
I wanted to rewrite the URLs contained in links, images, etc., but not
modify any of the rest of the HTML. I created an HTML parser (based on
sgmllib) that fires callbacks as it encounters tags, attributes, etc.
This makes it easy to process a stream without 'damaging' the beautiful
original structure of crap HTML, but it doesn't provide a DOM.
http://www.voidspace.org.uk/python/recipebook.shtml#scraper
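To give a rough idea of the approach, here is a minimal sketch. It is not
the code from the recipe above: sgmllib is gone in Python 3, so this uses
html.parser.HTMLParser from the standard library instead. The names
URLRewriter, rewrite_urls and URL_ATTR are made up for the illustration, and
it assumes the whole document is passed in as one string and that only
quoted href/src values need changing. Everything outside a start tag is
copied from the source by position, so it comes out byte for byte as it
went in.

```python
import re
from html.parser import HTMLParser

# Assumption: only quoted href/src values are rewritten; unquoted values and
# other URL-carrying attributes are left untouched. The lookbehind skips
# attributes such as data-src.
URL_ATTR = re.compile(
    r"""((?<![\w-])(?:href|src)\s*=\s*)(["'])(.*?)\2""",
    re.IGNORECASE | re.DOTALL,
)


class URLRewriter(HTMLParser):
    """Copy the source through verbatim, changing only URL attribute values."""

    def __init__(self, source, rewrite):
        # convert_charrefs=False keeps the parser from merging text with
        # character references; we never emit parsed text anyway.
        super().__init__(convert_charrefs=False)
        self.source = source
        self.rewrite = rewrite      # callable: old URL -> new URL
        self.out = []
        self.copied = 0             # source[:copied] is already in self.out
        # Absolute index where each line starts, used to convert getpos().
        self.line_starts = [0] + [m.end() for m in re.finditer("\n", source)]

    def handle_starttag(self, tag, attrs):
        # getpos() reports (line, column) of the start of the current tag.
        line, col = self.getpos()
        start = self.line_starts[line - 1] + col
        raw = self.get_starttag_text()   # the tag exactly as written
        fixed = URL_ATTR.sub(
            lambda m: m.group(1) + m.group(2) + self.rewrite(m.group(3)) + m.group(2),
            raw,
        )
        # Emit the untouched text before the tag, then the (maybe) edited tag.
        self.out.append(self.source[self.copied:start])
        self.out.append(fixed)
        self.copied = start + len(raw)

    def result(self):
        # Whatever follows the last start tag is copied unchanged.
        return "".join(self.out) + self.source[self.copied:]


def rewrite_urls(source, rewrite):
    parser = URLRewriter(source, rewrite)
    parser.feed(source)
    parser.close()
    return parser.result()
```

Calling rewrite_urls(page, lambda url: url) should hand back the page
unchanged, odd capitalisation, stray whitespace and unclosed tags included,
which is a quick way to check that nothing is being 'tidied'. Like the
original, it gives you callbacks and a stream, not a DOM.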
All the best,
Michael Foord
http://www.manning.com/foord
>
> cheers
> kai liu