[Tutor] Passing a config file to Python

Thu Mar 14 23:51:06 CET 2013

On 03/14/2013 02:22 PM, Irina I wrote:
> Hi all,
>
> I'm new to Python and am trying to pass a config file to my Python script. The config file is so simple and has only two URLs.
>
> The code should take that configuration file as input and generate a single file in HTML format as output.
>
> The program must retrieve each web page in the list and extract all the <a> tag links from each page. It is only necessary to extract the <a> tag links from the landing page of the URLs that you have placed in your configuration file.
>
> The program will output an HTML file containing a list of clickable links from the source webpages, grouped by webpage. This is what I came up with so far, can someone please tell me if it's good?
>
> Thanks in advance.
>
> [CODE]
>
> - - - - - - - - config.txt - - - - - - - -
> http://www.blahblah.bla
> http://www.etcetc.etc
> - - - - - - - - - - - - - - - - - - - - - -
>
> - - - - - - - - linkscraper.py - - - - - - - -
> import urllib
>
> def get_seed_links():
> ...."""return dict with seed links, from the config file, as keys -- {seed_link: None, ... }"""
> ....with open("config.txt", "r") as f:
> ........seed_links = f.read().split('\n')

readlines() is much clearer and accomplishes what you want.  Of course 
then you'd have to remove the newline from each line.  But generally when 
you're reading in manually entered data, you want to do a strip() on 
each line anyway.
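
A minimal sketch of that idea, assuming Python 3. The function name
read_seed_links and the choice to skip blank lines are mine, not part of
your script:

```python
def read_seed_links(path="config.txt"):
    """Return the non-empty, whitespace-stripped lines of the config file."""
    with open(path, "r") as f:
        # Iterating over the file yields one line at a time, newline included;
        # strip() removes the newline and any stray spaces around the URL.
        stripped = (line.strip() for line in f)
        # Blank lines (such as a trailing empty line) are dropped.
        return [line for line in stripped if line]
```

Skipping blank lines also avoids the empty string that split('\n')
produces when the file ends with a newline.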

> ....return dict([(s_link, None) for s_link in seed_links])
>
> def get_all_links(seed_link):
> ...."""return list of links from seed_link page"""
> ....all_links = []
> ....source_page = urllib.urlopen(seed_link).read()
> ....start = 0
> ....while True:
> ........start = source_page.find("<a", start)
> ........if start == -1:
> ............return all_links
> ........start = source_page.find("href=", start)
> ........start = source_page.find("=", start) + 1
> ........end = source_page.find(" ", start)
> ........link = source_page[start:end]
> ........all_links.append(link)
>
> def build_output_file(data):
> ...."""build and save output file from data. data -- {seed_link:[link, ...], ...}"""
> ....result = ""
> ....for seed_link in data:
> ........result += "<h2>%s</h2>\n<break />" % seed_link

Perhaps by 'break' you really meant 'br' ??

> ........for link in data[seed_link]:
> ............result += '<a href="%s">%s</a>\n' % (link, link.replace("http://", ""))
> ........result += "<html /><html />"

You have no DOCTYPE declaration in your output file. The html tag pair 
needs to surround the bulk of the file, rather than enclose nothing but 
a single space. You also have no head and body sections.
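
One possible shape for the output, sketched under the assumption of
Python 3 and an HTML5-style DOCTYPE. The function name build_html, the
default title, and the use of a ul list are illustrative choices of
mine, not requirements from your assignment:

```python
from html import escape

def build_html(data, title="Extracted links"):
    """Return a complete HTML document for data = {seed_link: [link, ...]}."""
    parts = [
        "<!DOCTYPE html>",
        "<html>",
        "<head>",
        '<meta charset="utf-8">',
        "<title>%s</title>" % escape(title),
        "</head>",
        "<body>",
    ]
    for seed_link, links in data.items():
        parts.append("<h2>%s</h2>" % escape(seed_link))
        parts.append("<ul>")
        for link in links:
            # escape() also quotes '"', so the value is safe inside href="..."
            parts.append('<li><a href="%s">%s</a></li>' % (escape(link), escape(link)))
        parts.append("</ul>")
    # The html and body tags are closed once, at the very end of the file.
    parts.extend(["</body>", "</html>"])
    return "\n".join(parts) + "\n"
```

The returned string can be written to result.htm with the same
with open(...) block you already have.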

> ....with open("result.htm", "w") as f:
> ........f.write(result)
>
> def main():
> ....seed_link_data = get_seed_links()
> ....for seed_link in seed_link_data:
> ........seed_link_data[seed_link] = get_all_links(seed_link)
> ....build_output_file(seed_link_data)
>
> if __name__ == "__main__":
> ....main()
>
> [/CODE]
>

You never specify which version of Python this is written for, nor what 
constraints there are on either the input html or output html.  Some 
comments are omitted, since they're version dependent.

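Generally, your code is fragile as to which real web pages it would 
work on.  For instance, it keeps the quotes around each href value and 
assumes a space follows it.  Few websites try very hard to have valid 
html, and even much valid html could break your current assumptions.  
Consider parsing the pages with Beautiful Soup instead of searching the 
text by hand; you would still fetch them with urllib or urllib2.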
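
To show what parser-based extraction looks like without requiring a
third-party install, here is a sketch that uses the standard library's
html.parser module as a stand-in for Beautiful Soup. It assumes
Python 3, and the names LinkCollector and extract_links are mine:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names and removes the
        # quotes around attribute values before calling this method.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html_text):
    """Return the href values of all <a> tags in html_text, in order."""
    collector = LinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.links
```

This copes with single quotes, double quotes, unquoted values, and
attributes that appear before href, none of which the find()-based
loop handles. The page text itself would still have to be fetched
and decoded separately before being passed in.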

Your source code would have to be carefully edited to change all those 
leading periods into spaces before it could even compile in Python. 
That stops any of us from actually trying it, or pieces of it.  So we 
can only comment by inspection.
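
For anyone who does want to try the posted code, the edit could be
automated along these lines. This is only a sketch, assuming Python 3
and assuming each period stands for exactly one space; the function
name restore_indentation is hypothetical:

```python
import re

def restore_indentation(quoted_text):
    """Drop a leading '> ' quote marker and turn leading periods into spaces."""
    restored = []
    for line in quoted_text.splitlines():
        # Remove the email quote marker, if present.
        line = re.sub(r"^> ?", "", line)
        # Replace the run of periods at the start of the line with spaces.
        line = re.sub(r"^\.+", lambda m: " " * len(m.group(0)), line)
        restored.append(line)
    return "\n".join(restored)
```

Periods elsewhere in a line (as in f.read()) are left alone, but a
source line that genuinely starts with a period would be altered.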

-- 
DaveA