MSHTML

Jay Parlar jparlar at home.com
Mon Aug 20 11:27:58 EDT 2001


> >
> > My question: Is there any method to suppress all the other IE stuff
> > so I can essentially use MSHTML as a pure replacement for
> > HTMLParser?
> 
> All that trouble just to get at the plain text of HTML-documents? That 
> sounds a bit over the top to me. The only possible reason I can think of 
> to justify it is if Javascript or VBScript issues document.write()'s or 
> modifies some part of the DOM directly. (That might also have 
> something to do with why the thing crashes every now and again: perhaps 
> concurrent access to the DOM tree isn't too reliable. Perhaps there's a 
> property which you can query to check if the document is fully loaded 
> and processed.)

Hehe, I know, it does sound like a lot of trouble just to get at the plain text. The thing is, while HTMLParser will do 
an adequate job, I don't feel right presenting my employer with an "adequate product". I'm sure that having Dr. David 
Parnas as the head of my department in school has something to do with it, but I just don't want to put my name on 
something until I really believe it's done and ready to go.
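
On the "is the document fully loaded" point quoted above: a minimal sketch of such a check, assuming the document object exposes a readyState property that eventually reads "complete" (as far as I know the MSHTML document does, but treat that as an assumption to verify against the docs). The function name, the timeout values and the pump hook are all made up for illustration; under COM the pump would have to dispatch pending messages, otherwise the loader may never make progress.

```python
import time


def wait_until_complete(doc, timeout=10.0, poll_interval=0.05, pump=None):
    """Poll doc.readyState until it is "complete" or the timeout expires.

    doc  -- any object with a readyState attribute (hypothetically an
            MSHTML document obtained through COM)
    pump -- optional callable run once per pass (hypothetical hook for
            dispatching pending COM/window messages)

    Returns True if the document became complete, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if pump is not None:
            pump()
        if doc.readyState == "complete":
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
```

Only touching the DOM after this returns True might avoid the intermittent crashes, if concurrent access during loading really is the cause. That is a guess, not something I have confirmed.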

> I am afraid I don't know how to suppress all that IE stuff, but I 
> presume that at least for some of those nag-messages properties can be 
> set to get rid of those, along with other security-related settings. 
> You really need to get some documentation about the HTML COM object, I 
> guess searching the MSDN site would be a good place to start. But I'd 
> be willing to bet you can't get rid of all unwanted behaviour.

Well, if you've ever tried searching the MSDN site, you'll know why I tried asking the list first ;-) I will go back and look 
some more though. I think I'll try the Delphi-webbrowser list as well, as this problem is more the domain of those folks 
than a Python problem.
 
> Alternatively, why not run an HTML file through W3C's Tidy first, that 
> will get rid of most of the errors (providing it doesn't get too 
> confused) and continue to use Python's HTML (or SGML) parser. That 
> would seem like a much more robust (and portable) scheme.

That sounds like a good idea, but one of the reasons we tried out MSHTML in the first place was to try to get the 
whole thing running a bit faster. Currently, I have to parse over 400 documents, one right after another, in one go. 
HTMLParser was a bit slow for some of these documents, but MSHTML (when it works) seems to be faster.
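
For comparison, here is roughly what the pure-Python route looks like: a small sketch of plain-text extraction, assuming the input has already been cleaned up (for example by Tidy, which is not invoked here). It uses the html.parser module from the current standard library rather than the parser modules available when this was posted, and the names TextExtractor and html_to_text are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Join chunks with a space, then collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(markup):
    """Return the visible text of an HTML string as a single line."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered
    return parser.text()
```

Note that this never runs any script, so text produced by document.write() is lost, which is the one case where a real browser engine has the edge. Joining chunks with a space also splits a word that has inline markup in the middle of it.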


> 
> Robert Amesz

Thanks for the suggestions,

Jay Parlar
----------------------------------------------------------------
Software Engineering III
McMaster University
Hamilton, Ontario, Canada

"Though there are many paths
At the foot of the mountain
All those who reach the top
See the same moon."




