Parsing html :: output to comma delimited

William Park opengeometry at yahoo.ca
Sat Jul 16 19:41:33 EDT 2005


samuels <ssweber at gmail.com> wrote:
> Hello All,
> 
> I am a total python newbie, and I need help writing a script.
> 
> This is what I want to do:
> 
> There is a list of links at http://www.rentalhq.com/fulllist.asp.  Each
> link goes to a page like,
> http://www.rentalhq.com/store.asp?id=907%2F272%2D4425, that contains a
> company name, address, phone, and fax.  I want to extract each page, parse
> this information, and export it to a comma delimited text file, or tab
> delimited.  The important information in each page is:
> 
> <table border="0" cellpadding="0" cellspacing="0"
> style="border-collapse: collapse" bordercolor="#111111" width="100%"
> id="AutoNumber1">
>   <tr>
>     <td width="100%" colspan="2">
>     <h2 style="text-align: center; margin-top:2; margin-bottom:2;
> line-height:14px" class="title">
>     <font size="4">United Rentals Inc.</font>
>     </h2>
> 
>     <h3 style="text-align: center; margin-top:4;
> margin-bottom:4">3401 Commercial Dr. 
>     Anchorage AK, 99501-3024
>     </h3>
>     <p style="text-align: center; margin-top:4; margin-bottom:4">
>     <a target="_blank"
> href="http://maps.google.com/maps?q=3401+Commercial+Dr%2E Anchorage AK
> 99501-3024 ">
> <!--    <a target="_blank"
> href="http://www.mapquest.com/maps/map.adp?city=Anchorage&state=AK&address=3401+Commercial+Dr.&zip=99501-3024&country=&zoom=8">-->
>     <img height="15" src="Scraps/Rental_Images/map.gif" width="33"
> border="0"></a>
>     </p>
>     </td>
>   </tr>
>   <tr>
>     <td width="50%" valign="top">
>     <p style="text-align: center; line-height:100%; margin-top:0;
> margin-bottom:0"> 
>     </p>
>     <p style="text-align: center; line-height: 100%; margin-top:0;
> margin-bottom:0">
>     <b>Phone</b> - 907/272-4425<br>
>      <b>Fax</b> - 907/272-9683 </p>
> 
> So from that I want output like :
> 
> United Rentals Inc.,3401 Commercial
> Dr.,Anchorage,AK,"995013024","9072724425","9072729683"
> 
> or
> 
> United Rentals Inc.     3401 Commercial
> Dr.     Anchorage       AK      995013024       9072724425      9072729683
> 
> 
> I have been messing around with beautiful soup
> (http://www.crummy.com/software/BeautifulSoup/index.html) but haven't
> gotten very far. (especially because the HTML is so sloppy)
> 
> Any help would be really appreciated!  Just point me in the right
> direction, what to use, examples...  Thanks!

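I'm sure others will give a proper Python solution.  But the shell is
not a bad tool here: dump the page as plain text with lynx, keep the
block between two marker lines with awk, and print the wanted lines
with sed.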

    lynx -dump 'http://www.rentalhq.com/store.asp?id=907%2F272%2D4425' | \
        awk '/Return to List of Rental Stores/,/To reserve an item/' | \
        sed -n -e '3p;5p;10p;11p'

gives me
    
    United Rentals Inc.
    3401 Commercial Dr.  Anchorage AK, 99501-3024
       Phone - 907/272-4425
       Fax - 907/272-9683
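
To get from those four lines to the comma-delimited row asked for above,
something like the following rough Python sketch might do.  It is only a
sketch under assumptions: the function names (parse_store, to_csv_row,
digits) are made up for illustration, and it assumes the street and city
are always separated by two or more spaces and the state is a two-letter
code followed by a comma, as in this one sample.  Other pages may well
break that pattern.

```python
import csv
import io
import re

# Assumed address layout (taken from the single sample above):
#   "<street>  <city> <ST>, <zip>"   with 2+ spaces between street and city
ADDRESS_RE = re.compile(
    r"^(?P<street>.+?)\s{2,}(?P<city>.+?)\s+(?P<state>[A-Z]{2}),\s*(?P<zip>[\d-]+)$"
)


def digits(text):
    """Keep only the digit characters of text."""
    return re.sub(r"\D", "", text)


def parse_store(lines):
    """Turn the four text lines (name, address, phone, fax) into fields."""
    name, address, phone, fax = [line.strip() for line in lines]
    m = ADDRESS_RE.match(address)
    if m is None:
        raise ValueError("unrecognized address line: %r" % address)
    return [
        name,
        m.group("street"),
        m.group("city"),
        m.group("state"),
        digits(m.group("zip")),
        digits(phone),
        digits(fax),
    ]


def to_csv_row(fields):
    """Format one list of fields as a single CSV line."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerow(fields)
    return buf.getvalue()


# The four lines printed by the lynx/awk/sed pipeline.
sample = [
    "United Rentals Inc.",
    "3401 Commercial Dr.  Anchorage AK, 99501-3024",
    "   Phone - 907/272-4425",
    "   Fax - 907/272-9683",
]
row = parse_store(sample)
print(to_csv_row(row), end="")
```

The csv module only quotes fields that need it, so the numbers come out
unquoted; pass delimiter="\t" to csv.writer for the tab-delimited variant.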

-- 
William Park <opengeometry at yahoo.ca>, Toronto, Canada
ThinFlash: Linux thin-client on USB key (flash) drive
	   http://home.eol.ca/~parkw/thinflash.html
BashDiff: Super Bash shell
	  http://freshmeat.net/projects/bashdiff/


