Web Scraping - Output File

SMac2347 at comcast.net
Thu Apr 26 16:47:00 EDT 2012


On Apr 26, 2:19 pm, Kiuhnm <kiuhnm03.4t.yahoo.it> wrote:
> On 4/26/2012 19:54, SMac2... at comcast.net wrote:
>
> > Hello,
>
> > I am having some difficulty generating the output I want from web
> > scraping. Specifically, the script I wrote, while it runs without any
> > errors, is not writing to the output file correctly. It runs, and
> > creates the output .txt file; however, the file is blank (ideally it
> > should be populated with a list of names).
>
> > I took the base of a program that I had before for a different data
> > gathering task, which worked beautifully, and edited it for my
> > purposes here. Any insight as to what I might be doing wrong would be
> > highly appreciated. Code is included below. Thanks!
>
> > import os
> > import re
> > import urllib2
>
> > outfile = open("Skadden.txt","w")
>
> > A = 1
> > Z = 26
>
> > for letter in range(A,Z):
>
> >      for line in urllib2.urlopen("http://www.skadden.com/Index.cfm?contentID=44&alphaSearch="+str(letter)):
>
> You need
>   alphaSearch=a
> but you're using
>   alphaSearch=1
>
> >              x = line
> >              if '"><B>' in line:
>
> You should search for ' ><B>'.
>
> >                      start=x.find('"><B>"')
>
> Ditto.
>
> >                      end= x.find('</B></A></nobr></td>',start)
> >                      name=x[start:end]
>
> You should use start+5 to skip ' ><B>'.
>
> >                      outfile.write(name+"\n")
> >                      print name
>
> Your code is bound to break over and over (you should do some smarter parsing), but here's a working version:
>
> --->
> import os
> import re
> import urllib2
>
> outfile = open("Skadden.txt","w")
>
> A = ord('a')
> Z = ord('z')
>
> # range() excludes its end value, so use Z + 1 to include 'z'.
> for letter in range(A, Z + 1):
>     for line in urllib2.urlopen("http://www.skadden.com/Index.cfm?contentID=44&alphaSearch="+chr(letter)):
>         if ' ><B>' in line:
>             start = line.find(' ><B>')
>             end = line.find('</B></A></nobr></td>', start)
>             # Skip the 5 characters of ' ><B>' so only the name is kept.
>             name = line[start+5:end]
>             outfile.write(name+"\n")
>             print name
>
> outfile.close()
> <---
>
> Kiuhnm

Great, thanks so much for your help!
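
For anyone finding this thread later, here is a minimal sketch of the find-and-slice step on its own, so it can be checked without hitting the site. The sample line is made up to resemble the markup described above, and extract_name is a hypothetical helper, not something from the original script. It is written for Python 3, whereas the code in the thread is Python 2. The point is that str.find returns -1 when the marker is absent (as with the '"><B>"' string in the first version), so it seems safer to check for that before slicing.

```python
import string

MARKER = ' ><B>'
END_MARKER = '</B></A></nobr></td>'


def extract_name(line, marker=MARKER, end_marker=END_MARKER):
    """Return the text between marker and end_marker, or None if either is missing."""
    start = line.find(marker)
    if start == -1:
        # Marker not present on this line, so there is nothing to extract.
        return None
    start += len(marker)  # skip past the marker itself (5 characters here)
    end = line.find(end_marker, start)
    if end == -1:
        return None
    return line[start:end]


# Hypothetical line, shaped like the markup the thread describes.
sample = '<td><nobr><A HREF="bio.cfm?id=1" ><B>Jane Doe</B></A></nobr></td>'

# All 26 letters for the alphaSearch parameter, with no range() off-by-one.
letters = list(string.ascii_lowercase)

if __name__ == "__main__":
    print(extract_name(sample))
    print(letters[0], letters[-1], len(letters))
```

Iterating over string.ascii_lowercase also avoids the ord/chr arithmetic, which makes it harder to drop the last letter by accident.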

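On the "smarter parsing" suggestion: one possible approach, sketched here under assumptions, is to let an HTML parser find the bold text inside links instead of matching exact strings such as ' ><B>'. This uses html.parser from the Python 3 standard library (the module was named HTMLParser in Python 2). The class name, the helper name and the sample markup are all hypothetical, and the real page may need different rules.

```python
from html.parser import HTMLParser


class BoldLinkTextParser(HTMLParser):
    """Collect the text of every <b> element that sits inside an <a> element."""

    def __init__(self):
        super().__init__()
        self.in_link = False
        self.in_bold = False
        self.names = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <A> and <B> arrive as 'a' and 'b'.
        if tag == 'a':
            self.in_link = True
        elif tag == 'b' and self.in_link:
            self.in_bold = True

    def handle_endtag(self, tag):
        if tag == 'b':
            self.in_bold = False
        elif tag == 'a':
            self.in_link = False

    def handle_data(self, data):
        text = data.strip()
        if self.in_bold and text:
            self.names.append(text)


def extract_names(html):
    """Return the bold link texts found in an HTML string, in document order."""
    parser = BoldLinkTextParser()
    parser.feed(html)
    parser.close()
    return parser.names


# Hypothetical markup, not copied from the real page.
sample_html = (
    '<td><nobr><A HREF="x" ><B>Jane Doe</B></A></nobr></td>\n'
    '<td><B>Not a link</B></td>\n'
    '<td><nobr><A HREF="y"><B>John Roe</B></A></nobr></td>'
)

if __name__ == "__main__":
    for name in extract_names(sample_html):
        print(name)
```

This does not care whether there is a space or a quote before the <B> tag, which is the detail that broke the original string search.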

