Check URL --> Simply? (fwd)
Julius Welby
jwelby at waitrose.com
Thu Aug 16 00:56:55 EDT 2001
That looks pretty comprehensive!
I thought I should repost the URL for my availability checker with e-mail
alert (alpha version). It checks whether a URL can be opened, and then looks
for a piece of arbitrary text that signifies a failed attempt, so it is
on-topic.
http://www.outwardlynormal.com/python/igor.txt
I've only tested it with a single e-mail recipient, but it works fine for
me across multiple sites (you define a list of sites to check).
I'll rewrite it at some point, but I hope this is of some use.
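The core idea is straightforward; here's a minimal sketch in modern Python
(the site list and failure string are made-up placeholders, not Igor's actual
code or configuration):

```python
import urllib.request
import urllib.error

# Hypothetical site list: (URL, text whose presence signals a failed attempt).
SITES = [
    ("http://www.example.com/", "404 Not Found"),
]

def looks_failed(page, failure_text):
    """True if the page contains the tell-tale failure text (case-insensitive)."""
    return failure_text.lower() in page.lower()

def check_site(url, failure_text):
    """Return (ok, reason) for one site: not ok if the URL cannot be
    opened at all, or if the fetched page contains the failure text."""
    try:
        raw = urllib.request.urlopen(url, timeout=10).read()
    except (urllib.error.URLError, OSError) as exc:
        return False, "open failed: %s" % exc
    if looks_failed(raw.decode("utf-8", errors="replace"), failure_text):
        return False, "failure text found"
    return True, "ok"
```

A real checker would loop over SITES and send the e-mail alert when
check_site returns False; the text-matching step is the part Alex's caveat
below is about.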
"Dr. David Mertz" <mertz at gnosis.cx> wrote in message
news:mailman.997923024.24854.python-list at python.org...
> "Alex Martelli" <aleaxit at yahoo.com> with usual wisdom wrote:
> |So, slashdot doesn't give an error when I try to /GET that URL -- it
> |appears to give a perfectly valid page....
> |You may pepper your checking code with a zillion special cases
> |to try and identify the various "friendly error message pages"
> |returned by sites you're interested in -- and one day, of course,
> |your program will end up considering "not found" a perfectly
> |valid URL to a document with a title such as "404 File Not
> |Found" or something like that.
>
> Ever more opportunity at shameless self-promotion. This zillion special
> cases of 404-ish pages is something I use as an example in my
> forthcoming book _Text Processing in Python_ (a few more months until
> done). Here's the code I present as an attempt at recognizing what only
> humans can:
>
> #---------- error_page.py ----------#
> import re, sys
> page = sys.stdin.read()
>
> # Mapping from patterns to probability contribution of pattern
> err_pats = {r'(?is)<TITLE>.*?(404|403).*?ERROR.*?</TITLE>': 0.95,
>             r'(?is)<TITLE>.*?ERROR.*?(404|403).*?</TITLE>': 0.95,
>             r'(?is)<TITLE>ERROR</TITLE>': 0.30,
>             r'(?is)<TITLE>.*?ERROR.*?</TITLE>': 0.10,
>             r'(?is)<META .*?(404|403).*?ERROR.*?>': 0.80,
>             r'(?is)<META .*?ERROR.*?(404|403).*?>': 0.80,
>             r'(?is)<TITLE>.*?File Not Found.*?</TITLE>': 0.80,
>             r'(?is)<TITLE>.*?Not Found.*?</TITLE>': 0.40,
>             r'(?is)<BODY.*(404|403).*</BODY>': 0.10,
>             r'(?is)<H1>.*?(404|403).*?</H1>': 0.15,
>             r'(?is)<BODY.*not found.*</BODY>': 0.10,
>             r'(?is)<H1>.*?not found.*?</H1>': 0.15,
>             r'(?is)<BODY.*the requested URL.*</BODY>': 0.10,
>             r'(?is)<BODY.*the page you requested.*</BODY>': 0.10,
>             r'(?is)<BODY.*page.{1,50}unavailable.*</BODY>': 0.10,
>             r'(?is)<BODY.*request.{1,50}unavailable.*</BODY>': 0.10,
>             r'(?i)does not exist': 0.10,
>            }
> err_prob = 0
> for pat, prob in err_pats.items():
>     if err_prob > 0.9: break
>     if re.search(pat, page):
>         # print pat, prob
>         err_prob += prob
>
> if err_prob > 0.90:   print 'Page is almost surely an error report'
> elif err_prob > 0.75: print 'It is highly likely page is an error report'
> elif err_prob > 0.50: print 'Better-than-even odds page is error report'
> elif err_prob > 0.25: print 'Fair indication page is an error report'
> else:                 print 'Page is probably real content'
>
> You could play with this to include urllib.urlopen() instead of just
> reading STDIN, of course. The threshold approach, IMO, does pretty
> well. But nothing's perfect... in fact, I've found pages that I have a
> lot of trouble saying for sure whether they are real content or not
> using my own eyes.
>
> Yours, David...
>
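Wiring David's detector to a live URL, as he suggests, takes only a couple
of lines. A sketch in modern Python, where urllib.urlopen() has become
urllib.request.urlopen(); the pattern table below is trimmed to three
entries for brevity, not David's full set:

```python
import re
import urllib.request

# A trimmed version of David's weighted pattern table; same threshold idea.
ERR_PATS = {
    r'(?is)<TITLE>.*?(404|403).*?ERROR.*?</TITLE>': 0.95,
    r'(?is)<TITLE>.*?Not Found.*?</TITLE>': 0.40,
    r'(?i)does not exist': 0.10,
}

def error_probability(page):
    """Sum the weights of matching patterns, stopping early once the
    page is almost surely an error report."""
    err_prob = 0.0
    for pat, prob in ERR_PATS.items():
        if err_prob > 0.9:
            break
        if re.search(pat, page):
            err_prob += prob
    return err_prob

def classify(url):
    """Fetch the page at url and return its error probability."""
    raw = urllib.request.urlopen(url, timeout=10).read()
    return error_probability(raw.decode("utf-8", errors="replace"))
```

The separation into error_probability() and classify() also makes the
scoring logic testable without touching the network.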