MSHTML

Mon Aug 20 10:44:52 EDT 2001

Jay Parlar wrote:

> For the application that my colleague and I are working on, it is
> necessary that we be able to take the raw HTML of some document and
> pull out just the text, with all tags removed. 
> 
> [COM stuff snip'd]
>
> My question: Is there any method to suppress all the other IE stuff
> so I can essentially use MSHTML as a pure replacement to
> HTMLParser? 

All that trouble just to get at the plain text of HTML documents? That 
sounds a bit over the top to me. The only possible reason I can think of 
to justify it is if JavaScript or VBScript issues document.write() calls 
or modifies some part of the DOM directly. (That might also have 
something to do with why the thing crashes every now and again: perhaps 
concurrent access to the DOM tree isn't too reliable. Perhaps there's a 
property you can query to check whether the document is fully loaded 
and processed.)
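
As far as I know, MSHTML document objects expose a readyState property 
that reads "complete" once loading has finished. A minimal polling 
sketch, assuming that property, might look like the following. The 
function name, the timeout values, and the optional pump callback are 
all made up for illustration; with real COM you would probably need to 
process pending messages on each pass (that is what pump is for), and I 
haven't checked whether this cures the crashes.

```python
import time


def wait_until_complete(doc, timeout=10.0, poll_interval=0.05, pump=None):
    """Poll doc.readyState until it is "complete" or the timeout expires.

    `doc` is any object with a `readyState` attribute (assumed here to be
    the MSHTML document obtained through COM). `pump` is an optional,
    hypothetical callable run on every pass, e.g. something that processes
    pending COM messages. Returns True when loaded, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if pump is not None:
            pump()
        if doc.readyState == "complete":
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
```

Only touch the DOM tree after this returns True; if it returns False, 
it is probably safer to give up on that document than to read a 
half-built tree.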

I'm afraid I don't know how to suppress all that IE stuff, but I 
presume that at least some of those nag messages can be switched off 
by setting properties, along with other security-related settings. 
You really need to get some documentation on the HTML COM object; 
searching the MSDN site would be a good place to start. But I'd 
be willing to bet you can't get rid of all the unwanted behaviour.

Alternatively, why not run the HTML file through W3C's Tidy first? That 
will get rid of most of the errors (provided it doesn't get too 
confused), and you can continue to use Python's HTML (or SGML) parser. 
That would seem like a much more robust (and portable) scheme.
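
To give an idea of how little code the pure-Python route needs, here is 
a rough sketch of a tag stripper. It assumes a current Python 3, where 
the parser lives in html.parser; the class and function names are my 
own, and the choice to throw away the contents of script and style 
elements is an assumption about what "just the text" should mean.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text of an HTML document, dropping all tags."""

    # Elements whose contents are code or styling, not readable text.
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside script/style elements.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the pieces and collapse runs of whitespace to one space.
        return " ".join("".join(self._chunks).split())


def html_to_text(html):
    """Return the plain text of an HTML string with tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Something like html_to_text(open("page.html").read()) would then hand 
back the text. Note that nothing a script would have written with 
document.write() shows up this way, which is the one case where MSHTML 
might still be worth the trouble.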

Robert Amesz