Question concerning this list [WebCrawler]
Thomas Ploch
Thomas.Ploch at gmx.net
Sun Dec 31 06:15:05 EST 2006
Marc 'BlackJack' Rintsch wrote:
> In <mailman.2166.1167535289.32031.python-list at python.org>, Thomas Ploch
> wrote:
>
>> Alright, my prof said '... to process documents written in structural
>> markup languages using regular expressions is a no-no.' (Because of
>> nested Elements? Can't remember) So I think he wants us to use regexes
>> to learn them. He is pointing to HTMLParser though.
>
> Problem is that much of the HTML in the wild is written in a structured
> markup language, but in many cases it's broken. If you just search for
> some words or patterns that appear somewhere in the documents, then
> regular expressions are good enough. If you want to actually *parse*
> HTML "from the wild", better use the BeautifulSoup_ parser.
>
> .. _BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/
Yes, I know about BeautifulSoup. But as I said, it should be done with
regexes. I want to extract tags and their attributes as a dictionary of
name/value pairs. I know that most of the HTML out there is *not*
validated and bollocks.
This is what my regexes look like:
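(For comparison, the parser route my prof pointed to would look roughly
like this -- a minimal sketch with the stdlib HTMLParser, which is spelled
`html.parser` in current Python versions:)

```python
# Sketch of the parser-based alternative (stdlib HTMLParser,
# imported from html.parser in current Python versions).
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Collect (tag name, attribute dict) pairs from an HTML document."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) tuples
        self.tags.append((tag, dict(attrs)))

collector = TagCollector()
collector.feed('<a href="http://example.com" class="ext">link</a>')
print(collector.tags)
# [('a', {'href': 'http://example.com', 'class': 'ext'})]
```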
import re

class Tags:
    def __init__(self, sourceText):
        self.source = sourceText
        self.curPos = 0
        self.namePattern = "[A-Za-z_][A-Za-z0-9_.:-]*"
        self.tagPattern = re.compile("<(?P<name>%s)(?P<attr>[^>]*)>"
                                     % self.namePattern)
        self.attrPattern = re.compile(
            r"\s+(?P<attrName>%s)\s*=\s*(?P<value>\"[^\"]*\"|'[^']*')"
            % self.namePattern)
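To show what these patterns yield, here is a small standalone driver (the
`extract_tags` helper is hypothetical, not part of the class above) that
walks a snippet and builds the name/value dictionaries:

```python
import re

# Same patterns as in the Tags class above, at module level for brevity.
name_pattern = "[A-Za-z_][A-Za-z0-9_.:-]*"
tag_pattern = re.compile("<(?P<name>%s)(?P<attr>[^>]*)>" % name_pattern)
attr_pattern = re.compile(
    r"\s+(?P<attrName>%s)\s*=\s*(?P<value>\"[^\"]*\"|'[^']*')"
    % name_pattern)

def extract_tags(source):
    """Yield (tag name, attribute dict) for every start tag found."""
    for tag in tag_pattern.finditer(source):
        attrs = {m.group('attrName'): m.group('value').strip('"\'')
                 for m in attr_pattern.finditer(tag.group('attr'))}
        yield tag.group('name'), attrs

html = '<a href="/index.html" title=\'home\'>Start</a>'
print(list(extract_tags(html)))
# [('a', {'href': '/index.html', 'title': 'home'})]
```

Note that the name pattern never matches `</a>`, so closing tags are
silently skipped; nesting is not tracked at all.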
>> You are probably right. For me it boils down to these problems:
>> - Implementing a stack for large queues of documents which is faster
>> than list.pop(index) (Is there a lib for this?)
>
> If you need a queue then use one: take a look at `collections.deque` or
> the `Queue` module in the standard library.
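I looked both up -- a quick sketch of each, assuming FIFO order is what I
need for the crawler's document queue:

```python
from collections import deque
from queue import Queue  # the Queue module; lowercase 'queue' in current Python

# collections.deque: O(1) appends and pops at both ends, single-threaded use
d = deque(['doc1', 'doc2', 'doc3'])
d.append('doc4')          # enqueue on the right
first = d.popleft()       # dequeue from the left -- unlike list.pop(0), O(1)

# queue.Queue: same idea plus locking, meant for producer/consumer threads
q = Queue()
q.put('doc1')
q.put('doc2')
item = q.get()            # blocks until an item is available

print(first, item)  # doc1 doc1
```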
Which of the two would you recommend for handling large queues with fast
response times?
Thomas