Something faster than sgmllib for sucking out URLs
brueckd at tbye.com
Wed Jun 12 18:15:45 EDT 2002
On Wed, 12 Jun 2002, Alex Polite wrote:
> I'm working on a webspider to fit my sick needs. The profiler
> tells me that about 95% of the time is spent in sgmllib. I use sgmllib
> solely for extracting URLs. I'm looking for a faster way of doing
> this. Regular expressions, string searches? What's the way to go? I'm
> not a python purist. Calling some fast C program with the html as
> argument and getting back a list of URLs would be fine by me.
How fast do you need it to be? Using regular expressions + other junk (see
below) I get about 1 MB/sec on a 900 MHz Pentium III - that's a lot of
HTML, and probably faster than you'll be reading it off the 'net. You can
probably simplify it, too: I needed to get nearly all URLs (not just
hrefs), I specifically ignore images, and I uglified it in a few more
places for speed reasons.
The code below returns a list of start/end URL indices, but you can easily
modify it to return the actual URLs (make sure you slice them out of the
original buffer: for speed, this function lowercases the entire web page
first, and some web servers are case-sensitive). Disclaimer: I've added to
this function over time as I've encountered weird cases, but that
obviously doesn't mean it now handles all weird cases. :)
-Dave
import re

# Note that group(1) of validTagRE tells which tag we matched
validTagRE = re.compile(r'<(a|embed|object|param)\s.*?>', re.DOTALL)
urlStartRE = re.compile(r'(href|src|codebase|value)\s*=\s*[\'"]?\s*')
urlEndRE = re.compile(r'\s+|>')

def FindLinks(data):
    'Returns a list of (start, end) url indices'
    data = data.lower() # much faster than using re.I
    cur = 0
    links = []
    while 1:
        # Locate a tag we're interested in
        tagMatch = validTagRE.search(data, cur)
        if not tagMatch:
            break
        tagEnd = tagMatch.end()
        cur = tagEnd + 1
        # Now search for attributes within that tag that contain urls
        urlCur = tagMatch.end(1) + 1 # e.g. for '<embed ...', start=7
        while 1:
            urlMatch = urlStartRE.search(data, urlCur, tagEnd)
            if not urlMatch:
                break
            # Find the end of the url - extra work required because we
            # find HTML like href=foo, href="foo", href=" foo", href='foo',
            # etc.
            urlStart = urlMatch.end()
            urlEnd = -1
            lastChar = urlMatch.group().strip()[-1]
            if lastChar == '=':
                # URL not quoted, end is whitespace or '>'
                endMatch = urlEndRE.search(data, urlStart, tagEnd)
                if not endMatch:
                    break # Bleh! Bad HTML!
                urlEnd = endMatch.start()
                if urlEnd > tagEnd:
                    break # More bad HTML!
                # One last check: in Javascript, might have escaped quotes
                if data[urlStart:urlStart+2] == '\\"' and \
                   data[urlEnd-2:urlEnd] == '\\"':
                    urlStart += 2
                    urlEnd -= 2
            elif lastChar == '"':
                # End will be a double quote
                if data[urlStart] != "'": # Make sure not part of Javascript
                    urlEnd = data.find('"', urlStart)
            elif lastChar == "'":
                # End will be a single quote
                if data[urlStart] != '"': # Make sure not part of Javascript
                    urlEnd = data.find("'", urlStart)
            if urlEnd != -1:
                # Skip not-really-URLs
                ok = 1
                ch = data[urlStart]
                if ch == 'j' or ch == 'm':
                    if data[urlStart:urlStart+11] == 'javascript:' or \
                       data[urlStart:urlStart+7] == 'mailto:':
                        ok = 0
                if ok:
                    links.append((urlStart, urlEnd))
            else:
                urlEnd = tagEnd
            urlCur = urlEnd + 1
            if urlCur >= tagEnd:
                break
    return links
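To illustrate the slice-from-the-original-buffer point above, here is a much-simplified, self-contained sketch (my own illustration, not Dave's function: `find_links_simple` and its regex are assumptions, and it handles only double-quoted href/src attributes). It searches the lowercased copy but slices the matched spans out of the original page, so URL case is preserved:

```python
import re

# Hypothetical simplified matcher: double-quoted href/src values only
attrRE = re.compile(r'(?:href|src)\s*=\s*"([^"]*)"')

def find_links_simple(page):
    # Lowercase once so attribute names match regardless of case...
    data = page.lower()
    # ...but slice the URL text out of the ORIGINAL buffer, since the
    # path portion of a URL can be case-sensitive on some servers.
    return [page[m.start(1):m.end(1)] for m in attrRE.finditer(data)]
```

For example, `find_links_simple('<A HREF="http://Example.COM/X">')` returns `['http://Example.COM/X']`, with the original case intact even though the `HREF` was matched in lowercase.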