Use Regular Expressions to extract URL's

Steven D'Aprano steve at REMOVE-THIS-cybersource.com.au
Fri Apr 30 05:03:30 EDT 2010


On Thu, 29 Apr 2010 23:53:06 -0700, Jimbo wrote:

> Hello
> 
> I am using regular expressions to grab URLs from a string (of HTML
> code). I am getting on very well and I seem to be grabbing the full URL,
> but I also get a '"' character at the end of it. Do you know how I can get
> rid of the '"' char at the end of my URL?

Live dangerously and just drop the last character from string s no matter 
what it is:

s = s[:-1]


Or be a little more cautious and test first:

if s.endswith('"'):
    s = s[:-1]


Or fix the problem at the source. Using regexes to parse HTML is always
problematic, because attribute quoting, whitespace and nesting vary more
than a simple pattern can anticipate. You should consider using a proper
HTML parser. Otherwise, try this regex:

r'"(http://(?:www\.)?.*?)"'
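
As a rough sketch of how a pattern of this kind might be applied, here it
is with re.findall, written so that the optional "www." carries its own
dot and hosts without a "www" prefix still match. The sample HTML string
and the names URL_PATTERN, html and urls are made up for illustration,
and the pattern assumes every URL sits inside double quotes and starts
with plain http://. Because the quotes lie outside the capturing group,
the trailing '"' never ends up in the result.

```python
import re

# Assumption: URLs are double-quoted and use the http:// scheme only.
# The quotes are outside the group, so they are matched but not captured.
URL_PATTERN = re.compile(r'"(http://(?:www\.)?.*?)"')

# Hypothetical input, invented for this example.
html = ('<a href="http://www.example.com/a">A</a> '
        '<a href="http://example.org/b?x=1">B</a>')

# With exactly one group in the pattern, findall returns a list of the
# captured strings rather than the full matches.
urls = URL_PATTERN.findall(html)
print(urls)
```

The non-greedy .*? stops at the first closing quote, so each match covers
one attribute value rather than running on to the last quote in the
string.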

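For the "proper HTML parser" route, one possible sketch uses the standard
library's html.parser module. This assumes Python 3 (in Python 2 the
module is named HTMLParser), and the class name LinkCollector and the
sample markup are hypothetical. It only collects href values from <a>
tags; other attributes that carry URLs, such as src, are ignored.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with quotes already removed.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical input mixing double- and single-quoted attributes.
sample = ('<p><a href="http://www.example.com/a">A</a> and '
          "<a href='http://example.org/b'>B</a></p>")

collector = LinkCollector()
collector.feed(sample)
collector.close()
print(collector.links)
```

The parser deals with the quoting itself, so there is no stray '"' to
strip afterwards, whichever quote style the page uses.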


-- 
Steven


