[Tutor] Passing a config file to Python

Prasad, Ramit ramit.prasad at jpmorgan.com
Thu Mar 14 19:59:48 CET 2013


Irina I wrote:
> Hi all,
> 
> I'm new to Python and am trying to pass a config file to my Python script. The config file is so
> simple and has only two URLs.
> 
> The code should take that configuration file as input and generate a single file in HTML format as
> output.
> 
> The program must retrieve each web page in the list and extract all the <a> tag links from each page.
> It is only necessary to extract the <a> tag links from the landing page of the URLs that you have
> placed in your configuration file.
> 
> The program will output an HTML file containing a list of clickable links from the source webpages and
> will be grouped by webpage. This is what I came up with so far, can someone please tell me if it's
> good?
> 
> Thanks in advance.
> 

I would advise you to use a library like Beautiful Soup to parse
the HTML. You will find lots of badly formed pages out there, and
searching for tags by hand is error prone and frustrating. It is
good to see that you are not using regular expressions, though;
that alone will make your solution much easier to debug.
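For illustration, here is a minimal sketch using the standard library's HTML parser (shown in Python 3 syntax; on Python 2 the module is HTMLParser). Beautiful Soup's find_all('a') does the same job with far more tolerance for broken markup, but this shows the idea without an extra install:

```python
from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag seen while parsing."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs already unquoted for us
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


collector = LinkCollector()
collector.feed('<a href="http://www.blahblah.bla">one</a> '
               '<a href="http://www.etcetc.etc">two</a>')
print(collector.links)
```

The parser, not your code, worries about quoting, attribute order, and whitespace.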

> [CODE]
> 
> - - - - - - - - config.txt - - - - - - - -
> http://www.blahblah.bla
> http://www.etcetc.etc
> - - - - - - - - - - - - - - - - - - - - - -
> 
> - - - - - - - - linkscraper.py - - - - - - - -
> import urllib
> 
> def get_seed_links():
> ...."""return dict with seed links, from the config file, as keys -- {seed_link: None, ... }"""
> ....with open("config.txt", "r") as f:
> ........seed_links = f.read().split('\n')

seed_links = f.read().splitlines() # More descriptive, and it drops the trailing empty string.

(Note that f.readlines() is not quite the same: it keeps the '\n'
on the end of each line, so you would have to strip the URLs before
passing them to urlopen.)
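In fact the splitting options differ slightly on a file that ends with a newline, which config.txt almost certainly does. A quick comparison (io.StringIO stands in for the open file):

```python
import io

text = "http://www.blahblah.bla\nhttp://www.etcetc.etc\n"

# split('\n') leaves a trailing empty string behind the final newline
print(text.split('\n'))    # ['http://www.blahblah.bla', 'http://www.etcetc.etc', '']

# splitlines() drops the newlines and the trailing empty entry
print(text.splitlines())   # ['http://www.blahblah.bla', 'http://www.etcetc.etc']

# readlines() keeps the '\n' attached to every line
print(io.StringIO(text).readlines())
```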

> ....return dict([(s_link, None) for s_link in seed_links])
> 
> def get_all_links(seed_link):
> ...."""return list of links from seed_link page"""
> ....all_links = []
> ....source_page = urllib.urlopen(seed_link).read()
> ....start = 0
> ....while True:
> ........start = source_page.find("<a", start)
> ........if start == -1:
> ............return all_links
> ........start = source_page.find("href=", start)
> ........start = source_page.find("=", start) + 1

Why keep calling find? (Ignoring, for the moment, that I think
you should use a true HTML parser.) The two searches collapse into one:

start = source_page.find("href=", start) + 6 # len("href=") + 1 for the opening quote

> ........end = source_page.find(" ", start)

What about links with a space in them? I have seen it before.
What if the link does not have a space between URL and closing 
tag?
    <a href="google.com"/>

> ........link = source_page[start:end]

Does this remove the ending quote?
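One way around both problems (still a sketch, and still no substitute for a real parser): use the quote character that follows href= as the delimiter instead of a space. The helper name here is hypothetical, not from your script:

```python
def extract_href(source_page, start=0):
    """Sketch: slice the href value out by its quote delimiters,
    so URLs containing spaces and tags like <a href="x"/> both work."""
    i = source_page.find("href=", start)
    if i == -1:
        return None
    quote = source_page[i + 5]             # the delimiter actually used: " or '
    begin = i + 6                          # first character of the URL
    end = source_page.find(quote, begin)   # matching closing quote
    return source_page[begin:end]


print(extract_href('<a href="http://a.com/some page"/>'))
# spaces inside the URL and the self-closing tag are both handled
```

This still assumes the attribute is quoted at all, which valid but sloppy HTML need not be. That is exactly the kind of case a parser library handles for you.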

> ........all_links.append(link)
> 
> def build_output_file(data):
> ...."""build and save output file from data. data -- {seed_link:[link, ...], ...}"""
> ....result = ""
> ....for seed_link in data:
> ........result += "<h2>%s</h2>\n<break />" % seed_link
> ........for link in data[seed_link]:

I think this would be better written using .iteritems()
(Python 2) or .items() (Python 3):

for seed_link, links in data.iteritems():
    result += "<h2>%s</h2>\n<break />" % seed_link
    for link in links:

> ............result += '<a href="%s">%s</a>\n' % (link, link.replace("http://", ""))
> ........result += "<html /><html />"
> ....with open("result.htm", "w") as f:
> ........f.write(result)

In general, building a string by repeated concatenation like this
is a bad idea because it is a quadratic process: each += can copy
the whole string built so far, which costs a lot of time and
memory as the result grows.

The Python idiom is to use '<delimiter>'.join(<list of strings>).
Make result a list and then do a `result.append(<some string>)`
instead of `result += <some string>`. When you finally need 
the string, do a `f.write(''.join(result))`.

I have given three examples below to illustrate how the
''.join() idiom works.

>>> '!#$#'.join( [ 'a', 'B', '4' ] )
'a!#$#B!#$#4'
>>> ''.join( [ 'a', 'B', '4' ] )
'aB4' 
>>> '_'.join( [ 'a', 'B', '4' ] )
'a_B_4'
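Applied to build_output_file, the rewrite might look like this. It is only a sketch: I have made it return the string so the file write stays separate, and I left out the <break /> and trailing <html /> tags (<br /> and a single <html> wrapper are probably what you meant there anyway):

```python
def build_output_html(data):
    """Sketch of build_output_file using the list-plus-join idiom.
    data -- {seed_link: [link, ...], ...}"""
    result = []
    for seed_link, links in data.items():  # .iteritems() on Python 2
        result.append("<h2>%s</h2>\n" % seed_link)
        for link in links:
            # collect the pieces; join them once at the end
            result.append('<a href="%s">%s</a>\n'
                          % (link, link.replace("http://", "")))
    return "".join(result)


html = build_output_html({"http://www.etcetc.etc": ["http://www.etcetc.etc/a"]})
print(html)
```

Each append is cheap, and the single join at the end does one pass over the pieces instead of recopying a growing string every iteration.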

> 
> def main():
> ....seed_link_data = get_seed_links()
> ....for seed_link in seed_link_data:
> ........seed_link_data[seed_link] = get_all_links(seed_link)
> ....build_output_file(seed_link_data)
> 
> if __name__ == "__main__":
> ....main()
> 
> [/CODE]


~Ramit


