cut strings and parse for images
Steve Holden
steve at holdenweb.com
Mon Dec 6 15:58:37 EST 2004
Andreas Volz wrote:
> Hi,
>
> I used SGMLParser to parse all the hrefs in an HTML file. Now I need to cut
> some strings. For example:
>
> http://www.example.com/dir/example.html
>
> Now I'd like to cut the string so that only the domain and directory are
> left. Expected result:
>
> http://www.example.com/dir/
>
> I know how to do this in bash, but not in Python. How could
> this be done?
>
> The next problem is to extract not only hrefs but also images. An href
> is easy:
>
> <a href="install.php">Install</a>
>
> But an image is a little harder:
>
> <img class="bild" src="images/marine.jpg">
>
> This is my current example code:
>
> from sgmllib import SGMLParser
>
> leach_url = "http://stargus.sourceforge.net/"
>
> class URLLister(SGMLParser):
>     def reset(self):
>         SGMLParser.reset(self)
>         self.urls = []
>
>     def start_a(self, attrs):
>         href = [v for k, v in attrs if k=='href']
>         if href:
>             self.urls.extend(href)
>
> if __name__ == "__main__":
>     import urllib
>     usock = urllib.urlopen(leach_url)
>     parser = URLLister()
>     parser.feed(usock.read())
>     parser.close()
>     usock.close()
>     for url in parser.urls:
>         print url
>
>
> Perhaps you have some tips on how to solve these problems?
>
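For the first question, trimming a URL back to its directory, one option is to split on the last slash and keep the left-hand part. What follows is a minimal sketch, not the only way to do it. It assumes the URL always has a path after the host name, and base_dir is a hypothetical helper name, not part of any library:

```python
def base_dir(url):
    """Return url with everything after the last '/' removed."""
    # rsplit with maxsplit=1 splits once, starting from the right;
    # keep the left part and put the trailing slash back.
    return url.rsplit('/', 1)[0] + '/'

print(base_dir("http://www.example.com/dir/example.html"))
```

For the example above this prints http://www.example.com/dir/. A URL with no path at all (such as http://www.example.com) would be cut at the slashes of the scheme, so that case needs separate handling.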
from sgmllib import SGMLParser

leach_url = "http://stargus.sourceforge.net/"

class URLLister(SGMLParser):
    def reset(self):
        SGMLParser.reset(self)
        self.urls = []
        self.images = []

    def start_a(self, attrs):
        href = [v for k, v in attrs if k=='href']
        if href:
            self.urls.extend(href)

    def do_img(self, attrs):
        "We assume each image *has* a src attribute."
        for k, v in attrs:
            if k == 'src':
                self.images.append(v)
                break

if __name__ == "__main__":
    import urllib
    usock = urllib.urlopen(leach_url)
    parser = URLLister()
    parser.feed(usock.read())
    parser.close()
    usock.close()
    print "URLs:"
    for url in parser.urls:
        print url
    print "IMGs:"
    for img in parser.images:
        print img

$ python sgml1.py
URLs:
about.php
install.php
features.php
http://www.stratagus.org/
http://www.stratagus.org/
http://www.blizzard.com/
http://sourceforge.net/projects/stargus/
IMGs:
images/stargus_banner.jpg
images/marine.jpg
http://sourceforge.net/sflogo.php?group_id=119561&type=1
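
Several of the links in this output are relative (about.php, images/marine.jpg). If absolute URLs are wanted, the standard library's urljoin can resolve them against the page URL. This is a sketch under the assumption that the page sets no <base> tag. absolutize is a hypothetical helper name, and the sample list is copied from the output above:

```python
try:
    from urllib.parse import urljoin   # Python 3
except ImportError:
    from urlparse import urljoin       # Python 2

leach_url = "http://stargus.sourceforge.net/"

def absolutize(base, links):
    """Resolve each link against base; absolute links pass through unchanged."""
    return [urljoin(base, link) for link in links]

links = ["about.php", "images/marine.jpg", "http://www.stratagus.org/"]
for link in absolutize(leach_url, links):
    print(link)
```

The first two entries come back prefixed with http://stargus.sourceforge.net/, while the third is already absolute and is returned as-is.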
regards
Steve
--
http://www.holdenweb.com
http://pydish.holdenweb.com
Holden Web LLC +1 800 494 3119