BeautifulSoup vs. Microsoft

Paul McGuire ptmcg at austin.rr.com
Thu Mar 29 10:06:17 EDT 2007


On Mar 29, 1:50 am, John Nagle <n... at animats.com> wrote:
> Here's a construct with which BeautifulSoup has problems.  It's
> from "http://support.microsoft.com/contactussupport/?ws=support".
>
> This is the original:
>
> <a href="http://www.microsoft.com/usability/enroll.mspx"
>      id="L_75998"
>      title="<!--http://www.microsoft.com/usability/information.mspx->"
>      onclick="return MS_HandleClick(this,'C_32179', true);">
>      Help us improve our products
> </a>
>
<snip>
>
> Strictly speaking, it's Microsoft's fault.
>
>      title="<!--http://www.microsoft.com/usability/information.mspx->"
>
> is supposed to be an HTML comment.  But it's improperly terminated.
> It should end with "-->".  So all that following stuff is from what
> follows the next "-->" which terminates a comment.
>

No, that "comment" is inside a quoted attribute value, so it is just
attribute text and the markup should be ok.
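
To illustrate the point about quoted attribute values, here is a minimal sketch that feeds the anchor tag quoted above to a tokenizer that honors quoting. It assumes Python 3 and the standard library's html.parser, which is neither BeautifulSoup nor pyparsing; the names SAMPLE and AnchorCollector are made up for this example.

```python
from html.parser import HTMLParser

# The anchor tag quoted above, reproduced as an inline string.
SAMPLE = '''<a href="http://www.microsoft.com/usability/enroll.mspx"
     id="L_75998"
     title="<!--http://www.microsoft.com/usability/information.mspx->"
     onclick="return MS_HandleClick(this,'C_32179', true);">
     Help us improve our products
</a>'''


class AnchorCollector(HTMLParser):
    """Hypothetical helper: records the attributes of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.anchors = []   # one dict of attributes per <a> tag
        self.comments = []  # anything the parser reports as a comment

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchors.append(dict(attrs))

    def handle_comment(self, data):
        self.comments.append(data)


collector = AnchorCollector()
collector.feed(SAMPLE)
collector.close()

for a in collector.anchors:
    print("Title:", a.get("title"))
    print("HREF:", a.get("href"))
```

If the quoting is respected, the title comes back as plain attribute text and collector.comments stays empty, which is the behavior you would want from any parser here.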

If you are just trying to extract <a href=...> tags, this pyparsing
scraper gets them, including this problematic one:


import urllib
from pyparsing import makeHTMLTags

pg = urllib.urlopen("http://support.microsoft.com/contactussupport/?ws=support")
htmlSrc = pg.read()
pg.close()

# only take first tag returned from makeHTMLTags, not interested in
# closing </a> tags
anchorTag = makeHTMLTags("A")[0]

for a in anchorTag.searchString(htmlSrc):
    if "title" in a:
        print "Title:", a.title
        print "HREF:", a.href
        # or use this statement to dump the complete tag contents
        # print a.dump()
        print

Prints:
Title: <!--http://www.microsoft.com/usability/information.mspx->
HREF: http://www.microsoft.com/usability/enroll.mspx

Title: Print this page
HREF: /gp/noscript/

Title: Print this page
HREF: /gp/noscript/

Title: E-mail this page
HREF: mailto:?subject=Help%20and%20Support&body=http%3a%2f%2fsupport.microsoft.com%2fdefault.aspx%2fcontactussupport%2f%3fws%3dsupport

Title: E-mail this page
HREF: mailto:?subject=Help%20and%20Support&body=http%3a%2f%2fsupport.microsoft.com%2fdefault.aspx%2fcontactussupport%2f%3fws%3dsupport

Title: Microsoft Worldwide
HREF: /common/international.aspx?rdPath=0

Title: Microsoft Worldwide
HREF: /common/international.aspx?rdPath=0

Title: Save to My Support Favorites
HREF: /gp/noscript/

Title: Save to My Support Favorites
HREF: /gp/noscript/

Title: Go to My Support Favorites
HREF: /gp/noscript/

Title: Go to My Support Favorites
HREF: /gp/noscript/

Title: Send Feedback
HREF: /gp/noscript/

Title: Send Feedback
HREF: /gp/noscript/

-- Paul



