(htmllib) How to capture text that includes tags?
Paul Rubin
Wed Nov 5 11:59:18 EST 2003
I've generally found that trying to parse the whole page with
regexps isn't appropriate. Here's a class that I use sometimes.
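Basically you do something like

    b = buf(urllib.urlopen(url).read())

and then search around for patterns you expect to find in the page:

    b.search("name of the product")
    b.rsearch('<a href="')    # point is now at the start of the tag
    b.search('<a href="')     # step past it, to the start of the URL
    href = b.up_to('"')

(rsearch leaves the point at the start of its match, so without the
second search, up_to would return '<a href=' rather than the URL.)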
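To make that sequence concrete, here is a small self-contained sketch run against a literal string instead of a fetched page. The Buf class is a cut-down stand-in for the class posted further down, and the sample markup and product name are invented for illustration. Because rsearch leaves the cursor at the start of its match, the sketch steps forward over the tag opening before calling up_to.

```python
class Buf:
    """Cut-down stand-in for the posted class: text plus a cursor."""

    def __init__(self, text=''):
        self.buf = text
        self.point = 0

    def search(self, s):
        # Forward search from the cursor; leave the cursor just past the match.
        self.point = self.buf.index(s, self.point) + len(s)
        return self.point

    def rsearch(self, s):
        # Backward search; leave the cursor at the start of the match.
        self.point = self.buf.rindex(s, 0, self.point)
        return self.point

    def up_to(self, s):
        # Return the text between the cursor and the next occurrence of s.
        start = self.point
        end = self.search(s)
        return self.buf[start:end - len(s)]


# Hypothetical page fragment, invented for this example.
page = ('<p><a href="/other.html">Other thing</a></p>'
        '<p><a href="/widget.html">Blue Widget</a></p>')

b = Buf(page)
b.search("Blue Widget")    # cursor is just past the product name
b.rsearch('<a href="')     # cursor is at the start of the enclosing tag
b.search('<a href="')      # cursor is at the start of the URL
href = b.up_to('"')
print(href)                # prints /widget.html
```

Anchoring on text you know is on the page and then searching backwards for the enclosing tag picks out the right link even when the page contains many others.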
Note that there's an esearch method that lets you do forward searches
for regexps (it defaults to case-insensitive, since that's usually what
you want for HTML). Unfortunately, the re module only scans forwards,
so there's no simple way to implement backwards regexp searches.
Maybe I'll clean up the interface for this thing sometime.
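One possible workaround for the missing backwards regexp search, sketched here under assumptions and not part of the posted class: scan forwards over the text before the cursor and keep the last match. The function name rsearch_regex and the sample markup are hypothetical. Note that this returns the last of the non-overlapping matches found scanning left to right, which is not always what a true right-to-left search would find, and it costs time proportional to the text before the cursor.

```python
import re


def rsearch_regex(text, pattern, point, flags=re.I):
    """Return the last match of pattern that ends at or before point, or None."""
    last = None
    # Passing point as endpos keeps matches from running past the cursor.
    for m in re.compile(pattern, flags).finditer(text, 0, point):
        last = m
    return last


# Hypothetical page fragment, invented for this example.
page = '<A HREF="/a.html">first</A> <a href="/b.html">second</a> tail'
point = page.index('second')

m = rsearch_regex(page, r'<a\s+href="', point)
start = m.end()                        # just past the opening quote
href = page[start:page.index('"', start)]
print(href)                            # prints /b.html
```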
================================================================
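import re

class buf:
    """A string with a movable cursor ("point") and search methods."""

    def __init__(self, text=''):
        self.buf = text
        self.point = 0
        self.stack = []

    def seek(self, offset, whence='set'):
        # With 'end', offset counts backwards from the end of the text.
        if whence == 'set':
            self.point = offset
        elif whence == 'cur':
            self.point += offset
        elif whence == 'end':
            self.point = len(self.buf) - offset
        else:
            raise ValueError("whence must be one of ('set','cur','end')")

    def save(self):
        # Remember the current point so restore() can return to it.
        self.stack.append(self.point)

    def restore(self):
        self.point = self.stack.pop()

    def search(self, s):
        # Forward literal search; leaves point just past the match.
        # Raises ValueError if s is not found.
        p = self.buf.index(s, self.point)
        self.point = p + len(s)
        return self.point

    def esearch(self, pat, *opts):
        # Forward regexp search; case-insensitive unless flags are given.
        flags = 0
        for o in (opts or [re.I]):
            flags |= o
        g = re.compile(pat, flags).search(self.buf, self.point)
        if g is None:
            raise ValueError("pattern not found")
        self.point = g.end()
        return self.point

    def rsearch(self, s):
        # Backward literal search; leaves point at the start of the match.
        p = self.buf.rindex(s, 0, self.point)
        self.point = p
        return self.point

    def up_to(self, s):
        # Return the text from point up to (not including) the next s.
        a = self.point
        b = self.search(s)
        return self.buf[a:b - len(s)]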