[Tutor] Fwd: Re: Parsing/Crawling test College Class Site.

Alan Gauld alan.gauld at btinternet.com
Tue Jun 2 12:07:46 CEST 2015


On 02/06/15 08:27, Alan Gauld wrote:

>> The following is a sample of the test code, as well as the url/posts
>> of the pages as produced by the Firefox/Firebug process.

I'm not really answering your question but addressing some
issues in your code...

>> execfile('/apps/parseapp2/ascii_strip.py')
>> execfile('dir_defs_inc.py')

I'm not sure what these do but usually it's better to
import the files as modules and then call their
functions directly.

>> appDir="/apps/parseapp2/"
>>
>> # data output filename
>> datafile="unlvDept.dat"
>>
>>
>> # global var for the parent/child list json
>> plist={}
>>
>>
>> cname="unlv.lwp"
>>
>> #----------------------------------------
>>
>> if __name__ == "__main__":
>> # main app

It makes testing (and reuse) easier if you put the main code
in a function called main() and then just call that here.

Also your code could be broken up into smaller functions
which again will make testing and debugging easier.
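
As a rough sketch of the layout I mean (the function names and the
body of main() are placeholders I made up, only the two path values
come from your script):

```python
import os

APP_DIR = "/apps/parseapp2/"   # value taken from the quoted script
DATAFILE = "unlvDept.dat"      # value taken from the quoted script


def output_path(app_dir=APP_DIR, name=DATAFILE):
    """Return the full path of the data output file."""
    return os.path.join(app_dir, name)


def main():
    """Top-level flow; each step is a small function you can test alone."""
    path = output_path()
    print("would write results to " + path)
    return path


if __name__ == "__main__":
    main()
```

Now the module can be imported from a test script or the interactive
prompt and output_path() checked without running the whole program.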

>>   #
>>   # get the input struct, parse it, determine the level
>>   #
>>
>>   cmd="echo '' > "+datafile
>>   proc=subprocess.Popen(cmd, shell=True,stdout=subprocess.PIPE)
>>   res=proc.communicate()[0].strip()

It's easier and more efficient/reliable to create the
file directly from Python. Calling the subprocess module
each time starts up extra processes.

Also you store the result but never use it...
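
Something like this would do it, as a minimal sketch (reset_file is
a name I made up). Note one small difference: echo '' writes a single
newline, whereas this leaves the file completely empty, which I am
assuming is what you actually want.

```python
def reset_file(path):
    """Create path if it is missing, or truncate it to zero length."""
    # Opening in "w" mode creates or empties the file; nothing is written.
    with open(path, "w"):
        pass

# Intended use with the names from the quoted script:
#   reset_file(datafile)
#   reset_file(cname)
```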

>>   cmd="echo '' > "+cname
>>   proc=subprocess.Popen(cmd, shell=True,stdout=subprocess.PIPE)
>>   res=proc.communicate()[0].strip()

See above

>>
>>   cmd='curl -vvv  '
>>   cmd=cmd+'-A  "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.0.11)
>> Gecko/2009061118 Fedora/3.0.11-1.fc9 Firefox/3.0.11"'
>>   cmd=cmd+'   --cookie-jar '+cname+' --cookie '+cname+'    '
>>   cmd=cmd+'-L "http://www.lonestar.edu/class-search.htm"'

You build up strings like this many times but it's very inefficient.
There are several better options:
1) create a list of substrings then use join() to convert
    the list to a string.
2) use a triple quoted string to create the string once only.

And since you are mostly passing them to Popen, look at the
docs to see how to pass a list of args instead of one large
string. It's more secure and generally better practice.
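
A sketch of the list approach, assuming curl is on your PATH
(curl_args and fetch are names I invented, the options are the ones
from your command). With a list there is no shell involved, so no
quoting of the user agent or URL is needed:

```python
import subprocess

USER_AGENT = ("Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.0.11) "
              "Gecko/2009061118 Fedora/3.0.11-1.fc9 Firefox/3.0.11")


def curl_args(url, cookie_file):
    """Return the curl command as a list, one element per argument."""
    return [
        "curl", "-vvv",
        "-A", USER_AGENT,
        "--cookie-jar", cookie_file,
        "--cookie", cookie_file,
        "-L", url,
    ]


def fetch(url, cookie_file):
    """Run curl without a shell and return its stripped output."""
    proc = subprocess.Popen(curl_args(url, cookie_file),
                            stdout=subprocess.PIPE)
    return proc.communicate()[0].strip()
```

If you ever do need the single string, for logging say,
" ".join(curl_args(url, cname)) gives it to you in one step.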

>>   cmd='curl -vvv  '
>>   cmd=cmd+'-A  "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.0.11)
>> Gecko/2009061118 Fedora/3.0.11-1.fc9 Firefox/3.0.11"'
>>   cmd=cmd+'   --cookie-jar '+cname+' --cookie '+cname+'    '
>>   cmd=cmd+'-L "https://campus.lonestar.edu/classsearch.htm"'
>>
>>    #initial page
>>   cmd='curl -vvv  '
>>   cmd=cmd+'-A  "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.0.11)
>> Gecko/2009061118 Fedora/3.0.11-1.fc9 Firefox/3.0.11"'
>>   cmd=cmd+'   --cookie-jar '+cname+' --cookie '+cname+'    '
>>   cmd=cmd+'-L
>> "https://my.unlv.nevada.edu/psc/lvporprd/EMPLOYEE/HRMS/c/COMMUNITY_ACCESS.CLASS_SEARCH.GBL"'
>>
>>   proc=subprocess.Popen(cmd, shell=True,stdout=subprocess.PIPE)
>>   res2=proc.communicate()[0].strip()
>>
>>   print res2
>>
>>   sys.exit()

Since this is unconditional you always exit here, so nothing
else ever gets executed. This may be the cause of your problem?

>>   # s contains HTML not XML text
>>   d = libxml2dom.parseString(res2, html=1)
>>
>>   #-----------Form------------
>>
>>   selpath="//input[@id='ICSID']//attribute::value"
>>
>>   sel_ = d.xpath(selpath)
>>
>>
>>   if (len(sel_) == 0):
>>     sys.exit()
>>
>>   val=""
>>   ndx=0
>>   for a in sel_:
>>     val=a.textContent.strip()
>>
>>   print val
>>   #sys.exit()
>>
>>   if(val==""):
>>     sys.exit()
>>
>>
>>   #build the 1st post
>>
>>   ddd=1
>>
>>   post=""

This does nothing since you immediately replace it with the next line.

>>   post="ICAJAX=1"
>>   post=post+"&ICAPPCLSDATA="
>>   post=post+"&ICAction=DERIVED_CLSRCH_SSR_EXPAND_COLLAPS%24149%24%241"
>>   post=post+"&ICActionPrompt=false"
>>   post=post+"&ICAddCount="
>>   post=post+"&ICAutoSave=0"
>>   post=post+"&ICBcDomData=undefined"
>>   post=post+"&ICChanged=-1"
>>   post=post+"&ICElementNum=0"
>>   post=post+"&ICFind="
>>   post=post+"&ICFocus="
>>   post=post+"&ICNAVTYPEDROPDOWN=0"
>>   post=post+"&ICResubmit=0"
>>   post=post+"&ICSID="+urllib.quote(val)
>>   post=post+"&ICSaveWarningFilter=0"
>>   post=post+"&ICStateNum="+str(ddd)
>>   post=post+"&ICType=Panel"
>>   post=post+"&ICXPos=0"
>>   post=post+"&ICYPos=114"
>>   post=post+"&ResponsetoDiffFrame=-1"
>>   post=post+"&SSR_CLSRCH_WRK_SSR_OPEN_ONLY$chk$3=N"
>>   post=post+"&SSR_CLSRCH_WRK_SUBJECT$0=ACC"
>>   post=post+"&TargetFrameName=None"

Since nearly all of these are hard coded strings (only ICSID and
ICStateNum vary) you might as well have hard coded the final
string, with those two values slotted in, and saved a lot
of processing (and code space).
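
One way that could look, as a sketch only (POST_TEMPLATE and
build_post are my names, and I have used Python 3's urllib.parse.quote
where your Python 2 code has urllib.quote). I use format() rather
than % here because the data already contains literal % signs:

```python
from urllib.parse import quote  # Python 2 equivalent: urllib.quote

# Adjacent string literals are joined at compile time, so this is a
# single constant; only {sid} and {state} are filled in at run time.
POST_TEMPLATE = (
    "ICAJAX=1&ICAPPCLSDATA="
    "&ICAction=DERIVED_CLSRCH_SSR_EXPAND_COLLAPS%24149%24%241"
    "&ICActionPrompt=false&ICAddCount=&ICAutoSave=0"
    "&ICBcDomData=undefined&ICChanged=-1&ICElementNum=0"
    "&ICFind=&ICFocus=&ICNAVTYPEDROPDOWN=0&ICResubmit=0"
    "&ICSID={sid}&ICSaveWarningFilter=0&ICStateNum={state}"
    "&ICType=Panel&ICXPos=0&ICYPos=114&ResponsetoDiffFrame=-1"
    "&SSR_CLSRCH_WRK_SSR_OPEN_ONLY$chk$3=N"
    "&SSR_CLSRCH_WRK_SUBJECT$0=ACC&TargetFrameName=None"
)


def build_post(sid, state):
    """Return the POST body with the session id quoted and inserted."""
    return POST_TEMPLATE.format(sid=quote(sid), state=state)
```

In your code the call would then be post = build_post(val, ddd).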

>>   cmd='curl -vvv  '
>>   cmd=cmd+'-A  "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.0.11)
>> Gecko/2009061118 Fedora/3.0.11-1.fc9 Firefox/3.0.11"'
>>   cmd=cmd+'   --cookie-jar '+cname+' --cookie '+cname+'    '
>>   cmd=cmd+'-e
>> "https://my.unlv.nevada.edu/psc/lvporprd/EMPLOYEE/HRMS/c/COMMUNITY_ACCESS.CLASS_SEARCH.GBL?&"

This looks awfully similar to the code up above. Could you have reused
the command? Maybe with some parameters - check out string formatting
operations. eg: 'This string takes %s as a parameter' % 'a string'
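
Applied to your curl command it might look like this (a sketch,
CURL_TEMPLATE and curl_command are names I made up). It still builds
one shell string, so the earlier point about passing Popen a list
applies if you go further:

```python
# One template, written once; the %s slots are the parts that change.
CURL_TEMPLATE = 'curl -vvv -A "%s" --cookie-jar %s --cookie %s -L "%s"'


def curl_command(url, cookie_file, agent):
    """Return the full curl command line for the given URL."""
    return CURL_TEMPLATE % (agent, cookie_file, cookie_file, url)
```

Each of your repeated four-line blocks then shrinks to a single call
such as cmd = curl_command(some_url, cname, agent_string).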

I'll stop here, it's all getting a bit repetitive.
Which is, in itself, a sign that you need to create some functions.

-- 
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos



