Question concerning this list [WebCrawler]

Thomas Ploch Thomas.Ploch at gmx.net
Sun Dec 31 06:15:05 EST 2006


Marc 'BlackJack' Rintsch wrote:
> In <mailman.2166.1167535289.32031.python-list at python.org>, Thomas Ploch
> wrote:
> 
>> Alright, my prof said '... to process documents written in structural
>> markup languages using regular expressions is a no-no.' (Because of
>> nested Elements? Can't remember) So I think he wants us to use regexes
>> to learn them. He is pointing to HTMLParser though.
> 
> Problem is that much of the HTML in the wild is written in a structured
> markup language but it's in many cases broken.  If you just search some
> words or patterns that appear somewhere in the documents then regular
> expressions are good enough.  If you want to actually *parse* HTML "from
> the wild" better use the BeautifulSoup_ parser.
> 
> .. _BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/

Yes, I know about BeautifulSoup, but as I said, this should be done with
regexes. I want to extract tags and their attributes as a dictionary of
name/value pairs. I know that most of the HTML out there is *not*
validated and is bollocks.

This is what my regexes look like:

import re

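class Tags:
    def __init__(self, sourceText):
        self.source = sourceText
        self.curPos = 0
        # Tag and attribute names: a letter or underscore, then name chars.
        self.namePattern = "[A-Za-z_][A-Za-z0-9_.:-]*"
        # An opening tag: its name, then everything up to the next '>'.
        self.tagPattern = re.compile("<(?P<name>%s)(?P<attr>[^>]*)>"
                            % self.namePattern)
        # One attribute: name = "value" or name = 'value' (quoted only).
        self.attrPattern = re.compile(
            r"\s+(?P<attrName>%s)\s*=\s*(?P<value>\"[^\"]*\"|'[^']*')"
                % self.namePattern)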
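
To show where this is heading, here is a minimal sketch of how the two
patterns could be combined to get the name/value dictionary. The helper
name iter_tags and the sample string are made up for illustration, and
the sketch assumes quoted attribute values only: unquoted values are
skipped, closing tags are not matched, and a '>' inside a quoted value
cuts the tag short.

```python
import re

NAME = r"[A-Za-z_][A-Za-z0-9_.:-]*"
TAG = re.compile(r"<(?P<name>%s)(?P<attr>[^>]*)>" % NAME)
ATTR = re.compile(
    r"\s+(?P<attrName>%s)\s*=\s*(?P<value>\"[^\"]*\"|'[^']*')" % NAME)


def iter_tags(source):
    """Yield (tag_name, attribute_dict) for each opening tag (hypothetical helper)."""
    for tag in TAG.finditer(source):
        attrs = {}
        # Run the attribute pattern only over the text captured inside the tag.
        for attr in ATTR.finditer(tag.group("attr")):
            # The value group includes its quotes, so strip one char per side.
            attrs[attr.group("attrName")] = attr.group("value")[1:-1]
        yield tag.group("name"), attrs


# Made-up sample input, not taken from a real page.
sample = '<a href="http://example.com/" class=\'ext\'><img src="x.png" alt="">'
tags = list(iter_tags(sample))
```

If the same attribute name appears twice in one tag, the later value
overwrites the earlier one in this sketch.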

>> You are probably right. For me it boils down to these problems:
>> - Implementing a stack for large queues of documents which is faster
>> than list.pop(index) (Is there a lib for this?)
> 
> If you need a queue then use one:  take a look at `collections.deque` or
> the `Queue` module in the standard library.

Which of the two would you recommend for handling large queues with fast
response times?
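
For reference, a rough sketch of how the two differ, written with
Python 3 names (the `Queue` module is called `queue` there). A
collections.deque is a plain container with O(1) appends and pops at
both ends, whereas list.pop(0) has to shift every remaining item.
queue.Queue adds locking on top, which only matters when several threads
share the queue. The URLs are placeholders.

```python
from collections import deque
import queue

# Single-threaded crawl frontier: deque pops from the left in O(1).
pending = deque(["http://example.com/a", "http://example.com/b"])
pending.append("http://example.com/c")   # enqueue at the right end
first = pending.popleft()                # dequeue from the left end

# Thread-safe variant: put()/get() take a lock, so worker threads can share it.
work = queue.Queue()
for url in pending:
    work.put(url)
second = work.get()                      # FIFO order, same as the deque
```

Which one fits depends on whether the crawler fetches documents from
more than one thread.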

Thomas


