[Tutor] Fwd: Re: Parsing/Crawling test College Class Site.
Alan Gauld
alan.gauld at btinternet.com
Tue Jun 2 12:07:46 CEST 2015
On 02/06/15 08:27, Alan Gauld wrote:
>> The following is a sample of the test code, as well as the url/posts
>> of the pages as produced by the Firefox/Firebug process.
I'm not really answering your question but addressing some
issues in your code...
>> execfile('/apps/parseapp2/ascii_strip.py')
>> execfile('dir_defs_inc.py')
I'm not sure what these do, but usually it's better to
import the files as modules and then call their
functions directly.
>> appDir="/apps/parseapp2/"
>>
>> # data output filename
>> datafile="unlvDept.dat"
>>
>>
>> # global var for the parent/child list json
>> plist={}
>>
>>
>> cname="unlv.lwp"
>>
>> #----------------------------------------
>>
>> if __name__ == "__main__":
>> # main app
It makes testing (and reuse) easier if you put the main code
in a function called main() and then just call that here.
Also your code could be broken up into smaller functions
which again will make testing and debugging easier.
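As a rough sketch of that layout (the function names and the trivial bodies here are invented for illustration, they are not taken from your script):

```python
def data_filename(dept):
    """Return the output file name for a department (hypothetical helper)."""
    return dept + "Dept.dat"


def main():
    """Top-level flow; each step lives in its own small, testable function."""
    print(data_filename("unlv"))


if __name__ == "__main__":
    main()
```

With this shape you can import the file from a test script or the interactive prompt and call data_filename() on its own, without running the whole crawl.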
>> #
>> # get the input struct, parse it, determine the level
>> #
>>
>> cmd="echo '' > "+datafile
>> proc=subprocess.Popen(cmd, shell=True,stdout=subprocess.PIPE)
>> res=proc.communicate()[0].strip()
It's easier and more efficient/reliable to create the
file directly from Python. Calling the subprocess module
each time starts up extra processes.
Also you store the result but never use it...
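Something along these lines should do the same job as the echo command. It is only a sketch: reset_file is a name I made up, and I use a temporary directory so the example does not overwrite anything real.

```python
import os
import tempfile


def reset_file(path):
    """Create or truncate path, leaving one newline just as echo '' > path does."""
    with open(path, "w") as f:
        f.write("\n")


# Demonstrated in a scratch directory; in the real script you would pass
# datafile and cname directly.
workdir = tempfile.mkdtemp()
datafile = os.path.join(workdir, "unlvDept.dat")
reset_file(datafile)
```

No shell and no extra process are involved, and there is no unused result to store.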
>> cmd="echo '' > "+cname
>> proc=subprocess.Popen(cmd, shell=True,stdout=subprocess.PIPE)
>> res=proc.communicate()[0].strip()
See above
>>
>> cmd='curl -vvv '
>> cmd=cmd+'-A "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.0.11)
>> Gecko/2009061118 Fedora/3.0.11-1.fc9 Firefox/3.0.11"'
>> cmd=cmd+' --cookie-jar '+cname+' --cookie '+cname+' '
>> cmd=cmd+'-L "http://www.lonestar.edu/class-search.htm"'
You build up strings like this many times, but it's very inefficient
because every + creates a new string.
There are several better options:
1) create a list of substrings then use join() to convert
the list to a string.
2) use a triple quoted string to create the string once only.
And since you are mostly passing them to Popen look at the
docs to see how to pass a list of args instead of one large
string; it's more secure and generally better practice.
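To illustrate the join() option and the list-of-args option side by side, here is a sketch using the values from your first curl command. The variable names user_agent, url, parts and args, and the run() helper, are mine; run() is defined but deliberately not called, so nothing is fetched.

```python
import subprocess

cname = "unlv.lwp"
user_agent = ("Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.0.11) "
              "Gecko/2009061118 Fedora/3.0.11-1.fc9 Firefox/3.0.11")
url = "http://www.lonestar.edu/class-search.htm"

# Option 1: collect the pieces in a list and join them once.
parts = [
    "curl -vvv",
    '-A "%s"' % user_agent,
    "--cookie-jar " + cname,
    "--cookie " + cname,
    '-L "%s"' % url,
]
cmd = " ".join(parts)

# List of args for Popen: one list element per argument, no shell quoting.
args = ["curl", "-vvv", "-A", user_agent,
        "--cookie-jar", cname, "--cookie", cname, "-L", url]


def run(arg_list):
    """Run a command given as a list of args (no shell=True needed)."""
    proc = subprocess.Popen(arg_list, stdout=subprocess.PIPE)
    return proc.communicate()[0].strip()
```

With the list form the user agent needs no embedded quotes, because it is passed to curl as a single argument rather than being re-parsed by a shell.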
>> cmd='curl -vvv '
>> cmd=cmd+'-A "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.0.11)
>> Gecko/2009061118 Fedora/3.0.11-1.fc9 Firefox/3.0.11"'
>> cmd=cmd+' --cookie-jar '+cname+' --cookie '+cname+' '
>> cmd=cmd+'-L "https://campus.lonestar.edu/classsearch.htm"'
>>
>> #initial page
>> cmd='curl -vvv '
>> cmd=cmd+'-A "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.0.11)
>> Gecko/2009061118 Fedora/3.0.11-1.fc9 Firefox/3.0.11"'
>> cmd=cmd+' --cookie-jar '+cname+' --cookie '+cname+' '
>> cmd=cmd+'-L
>> "https://my.unlv.nevada.edu/psc/lvporprd/EMPLOYEE/HRMS/c/COMMUNITY_ACCESS.CLASS_SEARCH.GBL"'
>>
>> proc=subprocess.Popen(cmd, shell=True,stdout=subprocess.PIPE)
>> res2=proc.communicate()[0].strip()
>>
>> print res2
>>
>> sys.exit()
Since this is unconditional you always exit here, so nothing
else ever gets executed. This may be the cause of your problem?
>> # s contains HTML not XML text
>> d = libxml2dom.parseString(res2, html=1)
>>
>> #-----------Form------------
>>
>> selpath="//input[@id='ICSID']//attribute::value"
>>
>> sel_ = d.xpath(selpath)
>>
>>
>> if (len(sel_) == 0):
>> sys.exit()
>>
>> val=""
>> ndx=0
>> for a in sel_:
>> val=a.textContent.strip()
>>
>> print val
>> #sys.exit()
>>
>> if(val==""):
>> sys.exit()
>>
>>
>> #build the 1st post
>>
>> ddd=1
>>
>> post=""
This does nothing since you immediately replace it with the next line.
>> post="ICAJAX=1"
>> post=post+"&ICAPPCLSDATA="
>> post=post+"&ICAction=DERIVED_CLSRCH_SSR_EXPAND_COLLAPS%24149%24%241"
>> post=post+"&ICActionPrompt=false"
>> post=post+"&ICAddCount="
>> post=post+"&ICAutoSave=0"
>> post=post+"&ICBcDomData=undefined"
>> post=post+"&ICChanged=-1"
>> post=post+"&ICElementNum=0"
>> post=post+"&ICFind="
>> post=post+"&ICFocus="
>> post=post+"&ICNAVTYPEDROPDOWN=0"
>> post=post+"&ICResubmit=0"
>> post=post+"&ICSID="+urllib.quote(val)
>> post=post+"&ICSaveWarningFilter=0"
>> post=post+"&ICStateNum="+str(ddd)
>> post=post+"&ICType=Panel"
>> post=post+"&ICXPos=0"
>> post=post+"&ICYPos=114"
>> post=post+"&ResponsetoDiffFrame=-1"
>> post=post+"&SSR_CLSRCH_WRK_SSR_OPEN_ONLY$chk$3=N"
>> post=post+"&SSR_CLSRCH_WRK_SUBJECT$0=ACC"
>> post=post+"&TargetFrameName=None"
Since these are nearly all hard coded strings (only ICSID and
ICStateNum vary) you might as well have hard coded the final
string, with those two values substituted in, and saved a lot
of processing (and code space).
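One way to do that is a single template with placeholders for the values that change. This is a sketch only: I have shortened the field list, build_post and POST_TEMPLATE are made-up names, and I use Python 3's urllib.parse.quote where your script has Python 2's urllib.quote.

```python
from urllib.parse import quote  # Python 2 spelling: from urllib import quote

# Adjacent string literals are joined by the parser, so this is built once.
# str.format() is used rather than % so the literal %24 sequences are left alone.
POST_TEMPLATE = (
    "ICAJAX=1"
    "&ICAPPCLSDATA="
    "&ICAction=DERIVED_CLSRCH_SSR_EXPAND_COLLAPS%24149%24%241"
    "&ICSID={icsid}"
    "&ICStateNum={state}"
    "&ICType=Panel"
)


def build_post(icsid, state):
    """Fill in the two values that vary between requests."""
    return POST_TEMPLATE.format(icsid=quote(icsid), state=state)
```

Calling build_post(val, ddd) would then replace the whole run of post=post+... lines.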
>> cmd='curl -vvv '
>> cmd=cmd+'-A "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.0.11)
>> Gecko/2009061118 Fedora/3.0.11-1.fc9 Firefox/3.0.11"'
>> cmd=cmd+' --cookie-jar '+cname+' --cookie '+cname+' '
>> cmd=cmd+'-e
>> "https://my.unlv.nevada.edu/psc/lvporprd/EMPLOYEE/HRMS/c/COMMUNITY_ACCESS.CLASS_SEARCH.GBL?&"
This looks awfully similar to the code up above. Could you have reused
the command? Maybe with some parameters - check out string formatting
operations, e.g.: 'This string takes %s as a parameter' % 'a string'
I'll stop here, it's all getting a bit repetitive.
Which is, in itself, a sign that you need to create some functions.
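For instance, the repeated curl set-up could live in one function. This is a sketch under assumptions: curl_args and its parameter names are my invention, and I am guessing that the truncated command went on to send the post string with curl's -d option. The -A, -e, -L and cookie options are the ones already in your code.

```python
USER_AGENT = ("Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.0.11) "
              "Gecko/2009061118 Fedora/3.0.11-1.fc9 Firefox/3.0.11")


def curl_args(url, cookie_file, referer=None, post_data=None):
    """Build the curl argument list once, varying only what differs per request."""
    args = ["curl", "-vvv", "-A", USER_AGENT,
            "--cookie-jar", cookie_file, "--cookie", cookie_file]
    if referer is not None:
        args += ["-e", referer]    # referer, as in the second command
    if post_data is not None:
        args += ["-d", post_data]  # assumption: how the post string is sent
    args += ["-L", url]
    return args


first = curl_args("http://www.lonestar.edu/class-search.htm", "unlv.lwp")
```

Each of the four near-identical blocks then becomes a single call, and the list it returns can be handed straight to subprocess.Popen without shell=True.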
--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos