MSHTML
Robert Amesz
reqhye72zux at mailexpire.com
Mon Aug 20 10:44:52 EDT 2001
Jay Parlar wrote:
> For the application that my colleague and I are working on, it is
> necessary that we be able to take the raw HTML of some document and
> pull out just the text, with all tags removed.
>
> [COM stuff snip'd]
>
> My question: Is there any method to suppress all the other IE stuff
> so I can essentially use MSHTML as a pure replacement to
> HTMLParser?
All that trouble just to get at the plain text of HTML documents? That
sounds a bit over the top to me. The only possible reason I can think of
to justify it is if JavaScript or VBScript issues document.write() calls
or modifies some part of the DOM directly. (That might also have
something to do with why the thing crashes every now and again: perhaps
concurrent access to the DOM tree isn't too reliable. Perhaps there's a
property which you can query to check if the document is fully loaded
and processed.)
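
To illustrate the "is it fully loaded?" idea, here is a minimal sketch.
It assumes the document object you get through COM exposes the DOM's
readyState property and that it reads "complete" once loading has
finished. The helper name wait_until_complete, the timeout values and
the pump callback are all made up for this example, not part of any
library.

```python
import time


def wait_until_complete(document, timeout=10.0, poll_interval=0.05, pump=None):
    """Poll document.readyState until it is "complete" or the timeout expires.

    `document` is any object with a readyState attribute (assumed here to be
    the MSHTML document obtained through COM). `pump` is an optional callable
    run on each pass, e.g. to let pending COM messages be processed.
    Returns True if the document finished loading, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if document.readyState == "complete":
            return True
        if time.monotonic() >= deadline:
            return False
        if pump is not None:
            pump()
        time.sleep(poll_interval)
```

Only touch the DOM after this returns True. If you are using the win32
extensions, pythoncom.PumpWaitingMessages would be a plausible thing to
pass as pump, but I haven't verified that this cures the crashes.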
I'm afraid I don't know how to suppress all that IE stuff, but I
presume that at least some of those nag messages can be switched off by
setting properties, and the same probably goes for other
security-related settings. You really need to get some documentation
about the HTML COM object; I guess searching the MSDN site would be a
good place to start. But I'd be willing to bet you can't get rid of all
unwanted behaviour.
Alternatively, why not run the HTML file through W3C's Tidy first? That
will get rid of most of the errors (provided it doesn't get too
confused), and you can continue to use Python's HTML (or SGML) parser.
That would seem like a much more robust (and portable) scheme.
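
For the parser half of that scheme, a minimal sketch of tag stripping
with the standard library. It assumes a current Python, where the parser
lives in html.parser, and that the input has already been cleaned up
(by Tidy, say). The names TextExtractor and html_to_text and the choice
of which tags count as block-level are mine, purely for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text of a document, dropping tags, scripts and styles."""

    SKIPPED = {"script", "style"}  # contents are code, not text
    BLOCK = {"title", "p", "br", "div", "li", "tr", "td", "th",
             "h1", "h2", "h3", "h4", "h5", "h6"}  # tags that separate words

    def __init__(self):
        super().__init__(convert_charrefs=True)  # turns &amp; into & etc.
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self._chunks.append(" ")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED:
            if self._skip_depth > 0:
                self._skip_depth -= 1
        elif tag in self.BLOCK:
            self._chunks.append(" ")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the pieces and collapse runs of whitespace into single spaces.
        return " ".join("".join(self._chunks).split())


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()
```

Calling html_to_text(raw_html) returns one whitespace-normalised string.
Anything that scripts would have written into the page at load time is
of course not there, which is the one case where the browser route
might still be needed.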
Robert Amesz