cut strings and parse for images

Steve Holden steve at holdenweb.com
Mon Dec 6 15:58:37 EST 2004


Andreas Volz wrote:

> Hi,
> 
> I used SGMLParser to parse all the hrefs in an HTML file. Now I need to cut
> some strings. For example:
> 
> http://www.example.com/dir/example.html
> 
> Now I would like to cut the string so that only the domain and directory
> are left over. Expected result:
> 
> http://www.example.com/dir/
> 
> I know how to do this in bash, but not in Python. How could
> this be done?
> 
> The next problem is to extract not only hrefs but also images. An href
> is easy:
> 
> <a href="install.php">Install</a>
> 
> But an image is a little harder:
> 
> <img class="bild" src="images/marine.jpg">
> 
> This is my current example code:
> 
> from sgmllib import SGMLParser
> 
> leach_url = "http://stargus.sourceforge.net/"
> 
> class URLLister(SGMLParser):
> 	def reset(self):
> 		SGMLParser.reset(self)
> 		self.urls = []
> 
> 	def start_a(self, attrs):
> 		href = [v for k, v in attrs if k=='href']
> 		if href:
> 			self.urls.extend(href)
> 
> if __name__ == "__main__":
> 	import urllib
> 	usock = urllib.urlopen(leach_url)
> 	parser = URLLister()
> 	parser.feed(usock.read())
> 	parser.close()
> 	usock.close()
> 	for url in parser.urls: 
> 		print url
> 
> 
> Perhaps you have some tips on how to solve these problems?
> 
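
For the string-cutting question, here is a minimal sketch of two possible approaches. The helper name directory_of is hypothetical, and the import fallback assumes the functions live in urllib.parse on Python 3 and in urlparse on Python 2. The plain slice assumes the URL actually contains a path; the urljoin variant resolves "." against the URL, which yields the enclosing directory.

```python
try:
    from urllib.parse import urljoin   # Python 3
except ImportError:
    from urlparse import urljoin       # Python 2

url = "http://www.example.com/dir/example.html"

# Plain string cutting: keep everything up to and including the last "/".
print(url[:url.rfind("/") + 1])

def directory_of(url):
    """Return the directory part of url (hypothetical helper)."""
    # Resolving "." relative to the URL drops the final path segment.
    return urljoin(url, ".")

print(directory_of(url))
```

Both print statements give http://www.example.com/dir/ for this input. The slice is the closer match to what one would do in bash, while urljoin leaves the URL handling to the library.
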
from sgmllib import SGMLParser

leach_url = "http://stargus.sourceforge.net/"

class URLLister(SGMLParser):
	
	def reset(self):
		SGMLParser.reset(self)
		self.urls = []
		self.images = []

	def start_a(self, attrs):
		href = [v for k, v in attrs if k=='href']
		if href:
			self.urls.extend(href)

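	def do_img(self, attrs):
		"We assume each image *has* a src attribute."
		# <img> has no closing tag, so it is handled by a do_ method
		# rather than a start_/end_ pair.
		for k, v in attrs:
			if k == 'src':
				self.images.append(v)
				# One src per image is enough; stop at the first.
				break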
		
		
if __name__ == "__main__":
	import urllib
	usock = urllib.urlopen(leach_url)
	parser = URLLister()
	parser.feed(usock.read())
	parser.close()
	usock.close()
	print "URLs:"
	for url in parser.urls:
		print url
	print "IMGs:"
	for img in parser.images:
		print img

$ python sgml1.py
URLs:
about.php
install.php
features.php
http://www.stratagus.org/
http://www.stratagus.org/
http://www.blizzard.com/
http://sourceforge.net/projects/stargus/
IMGs:
images/stargus_banner.jpg
images/marine.jpg
http://sourceforge.net/sflogo.php?group_id=119561&type=1
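
Several of the values in the output above are relative (about.php, images/marine.jpg). If absolute URLs are wanted, one possible follow-up step is to join each value with the page URL. This is only a sketch: absolutize is a hypothetical helper name, and the list of links is a hand-copied sample of the output rather than a live fetch.

```python
try:
    from urllib.parse import urljoin   # Python 3
except ImportError:
    from urlparse import urljoin       # Python 2

leach_url = "http://stargus.sourceforge.net/"

def absolutize(base, links):
    """Resolve each link against base; absolute links pass through unchanged."""
    return [urljoin(base, link) for link in links]

# A few of the values printed by the script above.
found = ["about.php", "images/marine.jpg", "http://www.stratagus.org/"]

for link in absolutize(leach_url, found):
    print(link)
```

With this sample the relative entries come out as http://stargus.sourceforge.net/about.php and http://stargus.sourceforge.net/images/marine.jpg, and the stratagus.org link is left as it is.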

regards
  Steve
-- 
http://www.holdenweb.com
http://pydish.holdenweb.com
Holden Web LLC +1 800 494 3119


