python fast HTML data extraction library

John Machin sjmachin at lexicon.net
Sun Jul 26 11:51:39 EDT 2009


On Jul 23, 11:53 am, Paul McGuire <pt... at austin.rr.com> wrote:
> On Jul 22, 5:43 pm, Filip <pink... at gmail.com> wrote:

>
> # Needs re.IGNORECASE, and can have tag attributes, such as <BR CLEAR="ALL">
> line_break_re = re.compile('<br\/?>', re.UNICODE)

Just in case somebody actually uses valid XHTML :-) it might be a good
idea to allow for <br />
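
A minimal sketch of a pattern covering the cases raised so far (upper or lower case, optional attributes, and the XHTML form with or without a space before the slash). The name line_break_re is taken from the quoted code; the pattern itself is only an illustration, not what the library uses, and it assumes no attribute value contains a literal ">".

```python
import re

# Illustrative pattern (an assumption, not the library's code):
#   <br      the tag name, matched case-insensitively
#   \b       word boundary, so <bracket> or <brx> do not match
#   [^>]*    anything up to the closing bracket: attributes, spaces, "/"
#   >        end of tag
line_break_re = re.compile(r'<br\b[^>]*>', re.IGNORECASE)

sample = 'one<br>two<BR CLEAR="ALL">three<br />four<br/>five'
converted = line_break_re.sub('\n', sample)
```

Here every variant in sample is replaced by a newline. A real HTML parser is still the safer choice if attribute values may contain ">".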

> # what about HTML entities defined using hex syntax, such as &#xxxx;
> amp_re = re.compile('\&(?![a-z]+?\;)', re.UNICODE | re.IGNORECASE)

What about the decimal syntax ones? E.g. not only &nbsp; and &#xa0;
but also &#160;

Also, entity names can contain digits e.g. &sup1; &frac34;
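
A rough sketch of how the lookahead could be widened to cover all three forms, assuming amp_re is meant to find ampersands that do not start an entity reference. The helper escape_bare_ampersands is hypothetical and only there to show the effect; the pattern checks the shape of a reference, not whether the entity name is actually defined.

```python
import re

# What may legally follow "&" in a reference (shape only):
#   [a-z][a-z0-9]*   named entity, digits allowed after the first letter
#   #[0-9]+          decimal character reference
#   #x[0-9a-f]+      hexadecimal character reference
# each terminated by ";"
entity_tail = r'(?:[a-z][a-z0-9]*|#[0-9]+|#x[0-9a-f]+);'

# Match "&" only when it is NOT followed by an entity tail.
amp_re = re.compile(r'&(?!' + entity_tail + r')', re.IGNORECASE)

def escape_bare_ampersands(text):
    """Hypothetical helper: escape ampersands that do not begin an entity."""
    return amp_re.sub('&amp;', text)

sample = 'AT&T &nbsp; &#xa0; &#160; &sup1; &frac34; & done'
escaped = escape_bare_ampersands(sample)
```

With this pattern only the ampersand in "AT&T" and the standalone one before "done" are escaped; the named, decimal, and hex references are left alone.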


