python fast HTML data extraction library

Paul McGuire ptmcg at austin.rr.com
Wed Jul 22 21:53:33 EDT 2009


On Jul 22, 5:43 pm, Filip <pink... at gmail.com> wrote:
>
> My library, rather than parsing the whole input into a tree, processes
> it like a char stream with regular expressions.
>

Filip -

In general, parsing HTML with re's is fraught with easily-overlooked
deviations from the norm.  But since you have stepped up to the task,
here are some comments on your re's:

# You should use raw string literals throughout, as in:
#     blah_re = re.compile(r'sljdflsflds')
# (note the leading r before the string literal).  Raw string literals
# really help keep your re expressions clean, so that you never have to
# double up any '\' characters.
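
To make the raw string point concrete, here is a minimal sketch. The pattern is made up for illustration and is not taken from your module.

```python
import re

# Without a raw string, every backslash meant for the regex engine
# has to be doubled so Python does not treat it as a string escape.
escaped = re.compile('(\\d+)\\s*=\\s*(\\w+)')

# With a raw string, the literal reads exactly as the regex engine sees it.
raw = re.compile(r'(\d+)\s*=\s*(\w+)')

# Both literals produce the same pattern text.
same_pattern = escaped.pattern == raw.pattern

match = raw.search('42 = answer')
```

The two compiled expressions behave identically; the raw form is just easier to read and to edit.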

# Attributes might be enclosed in single quotes, or not enclosed in
# any quotes at all.
attr_re = re.compile('([\da-z]+?)\s*=\s*\"(.*?)\"', re.DOTALL |
re.UNICODE | re.IGNORECASE)
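
One possible way to cover all three quoting styles is sketched below. This is only a suggestion, written for current Python 3, where str patterns are Unicode-aware by default. The helper name find_attrs is hypothetical, and the unquoted branch assumes a value ends at whitespace, a quote, or '>'.

```python
import re

# Three alternatives for the value: "double quoted", 'single quoted',
# or an unquoted run of characters.
attr_re = re.compile(
    r'''([\da-z]+)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))''',
    re.IGNORECASE,
)

def find_attrs(tag_text):
    """Return (name, value) pairs found in the text of a single tag."""
    attrs = []
    for m in attr_re.finditer(tag_text):
        name = m.group(1)
        # Exactly one of the three value groups takes part in a match.
        value = next(g for g in m.groups()[1:] if g is not None)
        attrs.append((name.lower(), value))
    return attrs

found = find_attrs('<td class=\'x\' ID="main" width=50>')
```

Attribute names containing hyphens or colons are still missed, which is the kind of deviation that is easy to overlook.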

# Needs re.IGNORECASE, and can have tag attributes, such as
# <BR CLEAR="ALL">
line_break_re = re.compile('<br\/?>', re.UNICODE)
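
A sketch of a more tolerant version follows. It assumes that no attribute value inside the tag contains a '>' character, which real pages do not guarantee.

```python
import re

# \b keeps tags that merely start with "br" from matching;
# [^>]* absorbs optional attributes and an optional trailing slash.
line_break_re = re.compile(r'<br\b[^>]*>', re.IGNORECASE)

sample = 'a<br>b<BR/>c<br />d<BR CLEAR="ALL">e<break>f'
replaced = line_break_re.sub('\n', sample)
```

The made-up <break> tag in the sample is left alone, while all four <br> spellings are replaced.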

# What about numeric character references, such as &#169; or the hex
# form &#xA9;?  The lookahead only allows letters, so the '&' in those
# gets matched too.
amp_re = re.compile('\&(?![a-z]+?\;)', re.UNICODE | re.IGNORECASE)
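
Assuming amp_re is meant to find bare ampersands that are not already part of an entity (my guess from the lookahead), a sketch that also skips decimal and hex references could look like this. The function name escape_bare_ampersands is hypothetical.

```python
import re

# Skip '&' when it starts a named entity (&nbsp;), a decimal
# reference (&#169;), or a hex reference (&#xA9;).
amp_re = re.compile(
    r'&(?!(?:[a-z][a-z0-9]*|#\d+|#x[0-9a-f]+);)',
    re.IGNORECASE,
)

def escape_bare_ampersands(text):
    """Replace only the ampersands that do not begin an entity."""
    return amp_re.sub('&amp;', text)

fixed = escape_bare_ampersands('AT&T &amp; &#169; &#xA9; &nbsp; & done')
```

This only checks the shape of the reference. It does not verify that a named entity such as &nbsp; is actually defined in HTML.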

How would you extract data from a table?  For instance, how would you
extract the data entries from the table at this URL:
http://tf.nist.gov/tf-cgi/servers.cgi ?  This would be a good example
snippet for your module documentation.
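
For comparison, here is roughly what a table extraction looks like with the standard library's html.parser in current Python 3. This is not your module's API, just a baseline sketch. The class name TableRows and the SAMPLE markup are invented for illustration, not copied from the NIST page, and nested tables are not handled.

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text pieces of the cell being read, or None

    def _flush_cell(self):
        # Store the current cell, if any, in the current row.
        if self._cell is not None and self._row is not None:
            self._row.append(''.join(self._cell).strip())
        self._cell = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        if tag == 'tr':
            self._row = []
            self._cell = None
        elif tag in ('td', 'th') and self._row is not None:
            self._flush_cell()  # tolerate a missing </td>
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ('td', 'th'):
            self._flush_cell()
        elif tag == 'tr' and self._row is not None:
            self._flush_cell()
            self.rows.append(self._row)
            self._row = None

SAMPLE = """
<table>
<TR><TH>Name</TH><TH>Location</TH></TR>
<tr><td>server-a.example</td><td>Boulder, Colorado</td></tr>
</table>
"""

parser = TableRows()
parser.feed(SAMPLE)
parser.close()
rows = parser.rows
```

A documentation snippet for your module that produces the same list of rows would make a useful side-by-side.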

Try extracting all of the <a href=...>sldjlsfjd</a> links from
yahoo.com, and see how much of what you expect actually gets matched.
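
To show the kind of misses I mean, here is a small sketch. The pattern naive_link_re is a deliberately simple one I made up, not one from your module, and the sample anchors are invented variations of the sort real pages contain.

```python
import re

# Expects lower-case tags, href as the first attribute,
# double quotes, and no stray whitespace.
naive_link_re = re.compile(r'<a href="(.*?)">(.*?)</a>')

samples = [
    '<a href="/one">one</a>',                 # the expected form
    "<a href='/two'>two</a>",                 # single quotes
    '<A HREF="/three">three</A>',             # upper case
    '<a class="nav" href="/four">four</a>',   # another attribute first
    '<a href=/five>five</a>',                 # no quotes
    '<a href="/six"\n   >six</a>',            # line break inside the tag
]

matched = [s for s in samples if naive_link_re.search(s)]
```

Only the first sample matches; the other five are all forms a browser accepts without complaint.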

Good luck!

-- Paul


