Web Scraping - Output File

Kiuhnm kiuhnm03 at yahoo.it
Thu Apr 26 14:19:02 EDT 2012


On 4/26/2012 19:54, SMac2347 at comcast.net wrote:
> Hello,
> 
> I am having some difficulty generating the output I want from web
> scraping. Specifically, the script I wrote, while it runs without any
> errors, is not writing to the output file correctly. It runs, and
> creates the output .txt file; however, the file is blank (ideally it
> should be populated with a list of names).
> 
> I took the base of a program that I had before for a different data
> gathering task, which worked beautifully, and edited it for my
> purposes here. Any insight as to what I might be doing wrong would be
> highly appreciated. Code is included below. Thanks!
> 
> import os
> import re
> import urllib2
> 
> outfile = open("Skadden.txt","w")
> 
> A = 1
> Z = 26
> 
> for letter in range(A,Z):
> 
>      for line in urllib2.urlopen("http://www.skadden.com/Index.cfm?contentID=44&alphaSearch="+str(letter)):

The site expects a letter here:
  alphaSearch=a
but your loop sends a number:
  alphaSearch=1
since range(A, Z) with A = 1 and Z = 26 yields the integers 1 through
25, not the letters 'a' through 'z'.
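
One simple fix is to loop over the letters directly (a minimal sketch;
string.ascii_lowercase comes from the standard library):

--->
import string

for letter in string.ascii_lowercase:  # 'a' through 'z'
    url = ("http://www.skadden.com/Index.cfm?contentID=44"
           "&alphaSearch=" + letter)
    # ... fetch and scan url as before
<---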

>              x = line
>              if '"><B>' in line:

You should search for ' ><B>'.

>                      start=x.find('"><B>"')

Ditto.

>                      end= x.find('</B></A></nobr></td>',start)
>                      name=x[start:end]

You should use start+5 to skip ' ><B>'.
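
Hard-coding the 5 works, but using len() of the marker keeps the offset
in sync with the search string (a minimal sketch of the same extraction):

--->
marker = ' ><B>'
start = x.find(marker)
if start != -1:
    end = x.find('</B></A></nobr></td>', start)
    name = x[start + len(marker):end]
<---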

>                      outfile.write(name+"\n")
>                      print name

String searches like these are bound to break whenever the markup changes, so you should do some smarter parsing (see the sketch after the code below). That said, here's a working version:

--->
import urllib2

outfile = open("Skadden.txt", "w")

A = ord('a')
Z = ord('z')

# range() excludes its upper bound, so Z + 1 is needed to include 'z'.
for letter in range(A, Z + 1):
    url = ("http://www.skadden.com/Index.cfm?contentID=44"
           "&alphaSearch=" + chr(letter))
    for line in urllib2.urlopen(url):
        if ' ><B>' in line:
            start = line.find(' ><B>')
            end = line.find('</B></A></nobr></td>', start)
            name = line[start + 5:end]  # +5 skips past ' ><B>'
            outfile.write(name + "\n")
            print name

outfile.close()
<---
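
For something less brittle, parse the HTML instead of searching for raw
strings. Here's a minimal sketch using the standard library's HTMLParser,
assuming the names are the only text inside <B> tags on these pages (an
assumption you'd want to verify):

--->
from HTMLParser import HTMLParser
import urllib2

class NameExtractor(HTMLParser):
    """Collects the text found inside <b>...</b> tags."""
    def __init__(self):
        HTMLParser.__init__(self)  # old-style class, so no super()
        self.in_b = False
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag == 'b':
            self.in_b = True

    def handle_endtag(self, tag):
        if tag == 'b':
            self.in_b = False

    def handle_data(self, data):
        if self.in_b and data.strip():
            self.names.append(data.strip())

parser = NameExtractor()
html = urllib2.urlopen("http://www.skadden.com/Index.cfm"
                       "?contentID=44&alphaSearch=a").read()
parser.feed(html)
for name in parser.names:
    print name
<---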

Kiuhnm


