Question concerning this list [WebCrawler]

Marc 'BlackJack' Rintsch bj_666 at gmx.net
Sun Dec 31 07:54:37 EST 2006


In <mailman.2169.1167563637.32031.python-list at python.org>, Thomas Ploch
wrote:

> This is what my regexes look like:
> 
> import re
> 
> class Tags:
>     def __init__(self, sourceText):
>         self.source = sourceText
>         self.curPos = 0
>         self.namePattern = "[A-Za-z_][A-Za-z0-9_.:-]*"
>         self.tagPattern = re.compile("<(?P<name>%s)(?P<attr>[^>]*)>"
>                             % self.namePattern)
>         self.attrPattern = re.compile(
>             r"\s+(?P<attrName>%s)\s*=\s*(?P<value>\"[^\"]*\"|'[^']*')"
>                 % self.namePattern)

Have you tested this with tags inside comments?
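
A quick way to check is the sketch below.  It copies only the tag pattern
from the quoted class; the sample markup and the names `NAME_PATTERN`,
`TAG_PATTERN`, `source` and `found` are made up for illustration.  The
pattern has no notion of `<!-- ... -->`, so a tag inside a comment is
reported like any other:

```python
import re

# Same patterns as in the quoted class, pulled out to module level.
NAME_PATTERN = "[A-Za-z_][A-Za-z0-9_.:-]*"
TAG_PATTERN = re.compile("<(?P<name>%s)(?P<attr>[^>]*)>" % NAME_PATTERN)

# Hypothetical input: one link is commented out and should be ignored.
source = '<p>visible</p><!-- <a href="old.html">gone</a> --><b>x</b>'

found = [match.group("name") for match in TAG_PATTERN.finditer(source)]
print(found)  # the commented-out 'a' shows up between 'p' and 'b'
```

A spider built on this would follow `old.html` even though the page author
commented the link out.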

>>> You are probably right. For me it boils down to these problems:
>>> - Implementing a stack for large queues of documents which is faster
>>> than list.pop(index) (Is there a lib for this?)
>> 
>> If you need a queue then use one:  take a look at `collections.deque` or
>> the `Queue` module in the standard library.
> 
> Which of the two would you recommend for handling large queues with fast
> response times?

`Queue.Queue` builds on `collections.deque` and is thread safe.  Speed-wise
I don't think this makes a difference, as most of the time is spent on I/O
and parsing.  So if you make your spider multi-threaded to gain some speed,
go with `Queue.Queue`.
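
To make the difference concrete, here is a minimal sketch of both, not a
finished spider.  The URLs, the `worker` function and the `upper()` call
standing in for fetching and parsing are placeholders.  The import is
written so it also runs on Python 3, where the `Queue` module is named
`queue`:

```python
import threading
from collections import deque

try:
    from queue import Queue   # Python 3 module name
except ImportError:
    from Queue import Queue   # Python 2 module name, as used in this thread

# Single-threaded: a deque pops from the left in constant time,
# unlike list.pop(0), which has to shift every remaining item.
pending = deque(["http://example.com/a", "http://example.com/b"])
pending.append("http://example.com/c")
first = pending.popleft()

# Multi-threaded: Queue offers the same FIFO behaviour plus locking.
work = Queue()
results = []
results_lock = threading.Lock()

def worker():
    while True:
        url = work.get()
        if url is None:          # sentinel: no more work for this thread
            work.task_done()
            break
        with results_lock:
            results.append(url.upper())  # placeholder for fetch + parse
        work.task_done()

threads = [threading.Thread(target=worker) for _ in range(2)]
for thread in threads:
    thread.start()
for url in pending:
    work.put(url)
for _ in threads:
    work.put(None)               # one sentinel per worker
work.join()                      # wait until every item is processed
for thread in threads:
    thread.join()
```

The order of `results` depends on thread scheduling, so sort it if you need
a stable order.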

Ciao,
	Marc 'BlackJack' Rintsch


