[Tutor] (no subject)

Daniel Yoo dyoo@hkn.eecs.berkeley.edu
Wed, 18 Apr 2001 14:34:21 -0700 (PDT)


On Wed, 18 Apr 2001, wong chow cheok wrote:

> hello ya all. i have a problem again. still trying to extract url from the 
> web. but now i need to extract multiple url and not just one. i ahve tried 
> using findall() but all i get is the 'http' and nothing else.

Hmm... I have to take a look at the findall() documentation...

(http://python.org/doc/current/lib/Contents_of_Module_re.html)

"findall(pattern, string) Return a list of all non-overlapping matches of
pattern in string. If one or more groups are present in the pattern,
return a list of groups; this will be a list of tuples if the pattern has
more than one group. Empty matches are included in the result."


Ah, ok.  What they mean by "group" is anything surrounded by parentheses.  
In regular expressions, a pair of parentheses forms a group that can be
treated as a single thing.  If we look at the regular expression, we can
see that there are parentheses around the "http|https|ftp|..." stuff.  
findall() is a bit sensitive towards groups: if it sees them, it will
think that the groups are all we are interested in, which explains why
we're getting only the "http" part of a url.
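
Here's a small sketch of that behavior, just to make it concrete.  The
sample text and the simplified patterns are made up for illustration; they
are not the url pattern from this thread.

```python
import re

text = "see http://a.example and ftp://b.example today"

# No groups: findall() returns the full text of each match.
no_groups = re.findall(r"\w+://[\w.]+", text)

# One group: findall() returns only what that group captured.
one_group = re.findall(r"(\w+)://[\w.]+", text)

# Two groups: findall() returns one tuple per match, one item per group.
two_groups = re.findall(r"(\w+)://([\w.]+)", text)

print(no_groups)
print(one_group)
print(two_groups)
```

The middle case is the one that bites here: with a single group around the
protocol, only the protocol comes back.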


We'll need to put the whole url regular expression in one large group, to
get findall() to work properly:


###
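import re

http_url = r'''
(
     (http|https|ftp|wais|telnet|mailto|gopher|file)
     :
     [\w.#@&=\,-_~/;:\n]+
)
     (?=([,.:;\-?!\s]))
'''
# re.X (verbose mode) ignores the whitespace used to lay out the pattern.
http_re = re.compile(http_url, re.X)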
###

Just to clarify: the regular expression above has three groups: the first
is the one that encloses the url itself.  The second surrounds all the
protocol types (http/https...).  Finally, the last sits inside the
lookahead (?=...), which I'll have to skip: I need to think about it a
little more.
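
If only the whole url is wanted, one option is to keep the grouping but
turn off the capturing with (?:...), so findall() sees no groups at all and
hands back plain strings.  This is only a sketch: the pattern is a
cut-down stand-in for the real one, and url_re is a made-up name.

```python
import re

# Simplified pattern for illustration only.  (?:...) groups without
# capturing, so findall() returns whole matches instead of groups.
url_re = re.compile(r'''
    (?:http|https|ftp)      # protocol, grouped but not captured
    :
    [\w./]+                 # rough stand-in for the rest of the url
''', re.X)

urls = url_re.findall("http://www.hotmail.com abnd http://www.my.com")
print(urls)
```

Another option is to leave the groups alone and just take the first item
of each tuple that findall() returns.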

(Also, there are some bugs in the regular expression itself that don't let
it find urls perfectly... whoops!)



Let's see if this sorta works:

###
>>> http_re.findall("http://www.hotmail.com abnd http://www.my.com")
[('http://www.hotmail.com', 'http', ' '), ('http://www.my', 'http', '.')]
###

So it almost works, but there are bugs in the regular expression that need
to be fixed.  I'm not quite certain that the lookahead (?=...) is doing
the right thing; I'll need to look at it later.
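
Two things that look suspicious, for what it's worth.  Inside the square
brackets, ",-_" is read as a range running from ',' to '_' rather than
three separate characters.  And the lookahead insists on one more
character after the url, so a url at the very end of the string has to give
back its tail (that's where the ".com" went).  Below is a sketch of one
possible repair, under those two assumptions; the names http_url_fixed and
http_re_fixed are made up, and this is still not a complete url matcher.

```python
import re

# Sketch of a repaired pattern; http_url_fixed and http_re_fixed are
# hypothetical names, not from the original post.
http_url_fixed = r'''
(
     (?:http|https|ftp|wais|telnet|mailto|gopher|file)   # not captured
     :
     [\w.#@&=,_~/;:\n-]+       # '-' placed last so it is a literal hyphen
)
     (?=[,.:;\-?!\s]|$)        # followed by punctuation, whitespace, or end
'''
http_re_fixed = re.compile(http_url_fixed, re.X)

found = http_re_fixed.findall("http://www.hotmail.com abnd http://www.my.com")
print(found)
```

With only one capturing group left, findall() returns a flat list of
strings, and the second url keeps its ".com".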



> this was very helpful and very confusing. after reading more on it i am only 
> more confused with all the simbols. (?=...) what does this mean. i read it 

Regular expressions are meant to be useful for computers --- they're not
really meant for humans, so don't worry if the symbols are confusing: it
takes a bit of practice to figure out what's happening.
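
As for (?=...): it's a "lookahead".  It checks that the text right after
the current position matches what's inside, but it doesn't make that text
part of the match.  A tiny sketch, with a made-up sample string that has
nothing to do with urls:

```python
import re

text = "price: 100 dollars, 200 euros"

# (?=...) checks that ' dollars' follows, but leaves it out of the match.
with_lookahead = re.findall(r"\d+(?= dollars)", text)

# Without the lookahead, the following text becomes part of the match.
without_lookahead = re.findall(r"\d+ dollars", text)

print(with_lookahead)
print(without_lookahead)
```

The first call gives back just the number; the second gives back the
number with " dollars" attached.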

O'Reilly publishes a book called "Mastering Regular Expressions" which I've
heard is really good if you're trying to learn regular expressions.  Take
a look to see if it's useful for you.

Good luck!