WebCrawler (was: 'Question concerning this list')

Thomas Ploch Thomas.Ploch at gmx.net
Sun Dec 31 08:30:58 EST 2006


Marc 'BlackJack' Rintsch wrote:
> In <mailman.2169.1167563637.32031.python-list at python.org>, Thomas Ploch
> wrote:
> 
>> This is what my regexes look like:
>>
>> import re
>>
>> class Tags:
>>     def __init__(self, sourceText):
>>         self.source = sourceText
>>         self.curPos = 0
>>         self.namePattern = "[A-Za-z_][A-Za-z0-9_.:-]*"
>>         self.tagPattern = re.compile("<(?P<name>%s)(?P<attr>[^>]*)>"
>>                             % self.namePattern)
>>         self.attrPattern = re.compile(
>>             r"\s+(?P<attrName>%s)\s*=\s*(?P<value>\"[^\"]*\"|'[^']*')"
>>                 % self.namePattern)
> 
> Have you tested this with tags inside comments?

No, but I already see your point: it will parse _all_ tags, even ones
that are commented out. I am thinking about how to solve this. Probably
I will just take the chunks between comments and feed them to the
regular expressions.

>>>> You are probably right. For me it boils down to these problems:
>>>> - Implementing a stack for large queues of documents which is faster
>>>> than list.pop(index) (Is there a lib for this?)
>>> If you need a queue then use one:  take a look at `collections.deque` or
>>> the `Queue` module in the standard library.
>> Which of the two would you recommend for handling large queues with fast
>> response times?
> 
> `Queue.Queue` builds on `collections.deque` and is thread safe.  Speed-wise
> I don't think this makes a difference, as most of the time is spent on IO
> and parsing.  So if you make your spider multi-threaded to gain some speed,
> go with `Queue.Queue`.

I think I will go for collections.deque (since I have no intention of
making it multi-threaded) and keep several queues, one per server, so
that one server is finished completely before the crawler moves on to
the next one (is this a good approach?).
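
Something like this is what I have in mind (a rough sketch; the names
`hostQueues`, `enqueue` and `drainHost` are just illustrative, and
`urlparse` is the Python 2 module, it lives in `urllib.parse` on
Python 3):

from collections import deque
from urlparse import urlsplit

hostQueues = {}   # maps server name -> deque of URLs still to fetch

def enqueue(url):
    host = urlsplit(url)[1]   # the network location part of the URL
    hostQueues.setdefault(host, deque()).append(url)

def drainHost(host):
    # Yield every queued URL for one server before moving on.
    queue = hostQueues[host]
    while queue:
        yield queue.popleft()   # O(1), unlike list.pop(0)

The popleft() call is the important bit: popping from the left end of a
deque is constant time, while list.pop(0) has to shift the whole list.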

Thanks a lot,
Thomas




