Screen scraper to get all 'a title' elements

ryguy7272 ryanshuell at gmail.com
Wed Nov 25 18:37:15 EST 2015


On Wednesday, November 25, 2015 at 6:34:00 PM UTC-5, Grobu wrote:
> On 25/11/15 23:48, ryguy7272 wrote:
> >> re.findall( r'\<a[^>]+title="(.+?)"', html )
> [ ... ]
> > Thanks!!  Is that regex?  Can you explain exactly what it is doing?
> > Also, it seems to pick up a lot more than just the list I wanted, but that's ok, I can see why it does that.
> >
> > Can you just please explain what it's doing???
> >
> 
> Yes, it's a regular expression. Because regexes use the backslash as an 
> escape character, it is advisable to use the "raw string" prefix (an r 
> before the opening single, double, or triple quote). To illustrate with 
> an example :
> 	>>> print("1\n2")
> 	1
> 	2
> 	>>> print(r"1\n2")
> 	1\n2
> As the backslash escape character is "neutralized" by the raw string, 
> you can use the usual RegEx syntax at leisure :
> 
> \<a[^>]+title="(.+?)"
> 
> \<	was a mistake on my part : < is not a special character, so a plain < is enough
> [^>]	is a class definition, and the caret (^) character indicates 
> negation. Thus it means : any character other than >
> +	indicates repetition : one or more of the previous element
> .	matches any single character except a newline
> .+"	is a _greedy_ pattern : it matches as many characters as possible, 
> so it runs up to the last double quote on the line, not the first
> 
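
To see the pieces above working together, here is a minimal sketch that runs the whole pattern (without the unneeded backslash) over a few lines of markup. The sample HTML is made up for illustration and is not taken from the page being scraped.

```python
import re

# Hypothetical sample markup, invented for this illustration only.
html = '''
<a href="/one" title="First link">one</a>
<a class="nav" href="/two" title="Second link">two</a>
<a href="/three">no title here</a>
'''

# <a       literal start of an anchor tag (no backslash needed)
# [^>]+    one or more characters that are not >, so we stay inside the tag
# title="  literal attribute name and opening quote
# (.+?)    capture group: as few characters as possible before the next quote
titles = re.findall(r'<a[^>]+title="(.+?)"', html)
print(titles)
```

The third link has no title attribute, and [^>]+ cannot cross the closing > of its tag, so it contributes nothing to the result.
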
> The problem with a greedy pattern is that it doesn't stop at the first 
> closing quote it meets. To illustrate :
>  >>> import re
>  >>> a = re.search( r'".+"', 'title="this is a test" class="test"' )
>  >>> a.group()
> '"this is a test" class="test"'
> 
> It matches the first quote up to the last one.
> On the other hand, you can use the "?" modifier to specify a non-greedy 
> pattern :
> 
>  >>> b = re.search( r'".+?"', 'title="this is a test" class="test"' )
>  >>> b.group()
> '"this is a test"'
> 
> It matches from the first quote and stops at the very next quote, 
> instead of running on to the last one.
> 
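
The greedy and non-greedy forms can also be compared side by side. This is only a sketch that reuses the test string from the examples above; the variable names are mine.

```python
import re

text = 'title="this is a test" class="test"'

# Greedy: .+ grabs as much as it can, so the match ends at the LAST quote.
greedy = re.search(r'".+"', text).group()

# Non-greedy: .+? grabs as little as it can, so the match ends at the NEXT quote.
lazy = re.search(r'".+?"', text).group()

# With findall, the non-greedy form returns each quoted value separately.
each = re.findall(r'".+?"', text)

print(greedy)
print(lazy)
print(each)
```
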
> Finally, the parentheses are used to indicate a capture group :
>  >>> a = re.search( r'"this (is) a (.+?)"', 'title="this is a test" class="test"' )
>  >>> a.groups()
> ('is', 'test')
> 
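
Capture groups are also what make re.findall return only the part inside the parentheses. A small sketch, assuming the same test string as above; the attribute-matching pattern here is my own illustration and is not a robust HTML parser.

```python
import re

text = 'title="this is a test" class="test"'

# Two capture groups: the attribute name and the quoted value.
m = re.search(r'(\w+)="(.+?)"', text)
whole = m.group(0)   # the entire match, quotes included
name = m.group(1)    # first group only
value = m.group(2)   # second group only

# With more than one group, findall returns a list of tuples, one per match.
pairs = re.findall(r'(\w+)="(.+?)"', text)

print(whole)
print(name, value)
print(pairs)
```

With a single group, as in the title pattern, findall returns a plain list of strings instead of tuples.
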
> 
> You can find detailed explanations about Python regular expressions at 
> this page : https://docs.python.org/2/howto/regex.html
> 
> HTH,
> 
> -Grobu-



Wow!  Awesome!  I bookmarked that link!  
Thanks for everything!!!


