[Tutor] list iteration question for writing to a file on disk

Fri Sep 14 11:56:57 CEST 2007

Hi

Can someone help with this, please?

I got to this point with help from the list.

from BeautifulSoup import BeautifulSoup

doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '<a href="http://www.google.co.uk"></a>',
       '<a href="http://www.bbc.co.uk"></a>',
       '<a href="http://www.amazon.co.uk"></a>',
       '<a href="http://www.redhat.co.uk"></a>',
       '</html>']
soup = BeautifulSoup(''.join(doc))
alist = soup.findAll('a')

import urlparse
for a in alist:
    href = a['href']
    print urlparse.urlparse(href)[1]

So BeautifulSoup is used to find the <a> tags, urlparse extracts the fully qualified domain name, and print gives a nice list of hosts, one per line. Here it is:
www.google.co.uk
www.bbc.co.uk
www.amazon.co.uk
www.redhat.co.uk
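
For reference, here is a minimal sketch of the same host extraction using only the standard library. It assumes Python 3, where the urlparse module lives at urllib.parse, and the hrefs list is just a stand-in for the values pulled out of the <a> tags. Element 1 of the parse result is the network location, which is also available under the name netloc.

```python
from urllib.parse import urlparse  # Python 3 home of the old urlparse module

# Stand-in for the href values found in the <a> tags
hrefs = [
    "http://www.google.co.uk",
    "http://www.bbc.co.uk",
    "http://www.amazon.co.uk",
    "http://www.redhat.co.uk",
]

# Element 1 of the parse result is the network location (the host name)
hosts = [urlparse(href)[1] for href in hrefs]

for host in hosts:
    print(host)
```

Writing urlparse(href).netloc instead of indexing with [1] should give the same value and is a little easier to read.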

Nice. So next I want to write them out to a file. I changed the program to this, to write to disk and then read the hosts back to see what's been done.

from BeautifulSoup import BeautifulSoup

doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '<a href="http://www.google.co.uk"></a>',
       '<a href="http://www.bbc.co.uk"></a>',
       '<a href="http://www.amazon.co.uk"></a>',
       '<a href="http://www.redhat.co.uk"></a>',
       '</html>']
soup = BeautifulSoup(''.join(doc))
alist = soup.findAll('a')

import urlparse
output = open("fqdns.txt","w")
for a in alist:
    href = a['href']
    output.write(urlparse.urlparse(href)[1])
output.close()

This writes everything out on a single line:

www.google.co.ukwww.bbc.co.ukwww.amazon.co.ukwww.redhat.co.uk

So I looked in Alan's tutor PDF for the issue and read page 120, where it suggests doing this:

    outp.write(line + '\n') # \n is a newline

So I changed my line from this
    output.write(urlparse.urlparse(href)[1])
to this
    output.write(urlparse.urlparse(href)[1] + "\n")

I look at the output file and I get this

www.google.co.uk
www.bbc.co.uk
www.amazon.co.uk
www.redhat.co.uk
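
A minimal sketch of the write step with the newline added, assuming Python 3. The hosts list and the temporary directory are stand-ins so that nothing in the working directory gets overwritten. Printing the repr of the file contents makes the "\n" after each host visible.

```python
import os
import tempfile

# Stand-in for the hosts extracted from the <a> tags
hosts = [
    "www.google.co.uk",
    "www.bbc.co.uk",
    "www.amazon.co.uk",
    "www.redhat.co.uk",
]

# Write into a throwaway directory rather than the current one
path = os.path.join(tempfile.mkdtemp(), "fqdns.txt")

# The with-block closes the file even if a write fails
with open(path, "w") as output:
    for host in hosts:
        output.write(host + "\n")  # write() adds no newline of its own

# Read the whole file back to inspect exactly what was stored
with open(path) as check:
    contents = check.read()

print(repr(contents))
```

So the file itself holds exactly one newline per record, which is what I want.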

Hooray, I think. So then I open the file in the program to read each line and do something with it.
I put this after the last output.close():

input = open("fqdns.txt","r")
for j in input:
    print j
input.close()

But this prints out

www.google.co.uk

www.bbc.co.uk

www.amazon.co.uk

www.redhat.co.uk

Why do I get each record with an extra newline? Am I writing out the records incorrectly, or am I handling them incorrectly when I open the file and print? Do I have to take out the newlines as I process the lines?
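
My best guess at what is going on, as a minimal sketch assuming Python 3: each line read back from a file still carries its trailing "\n", and print then adds a newline of its own, which would explain the blank line after every record. The io.StringIO object is only an in-memory stand-in for fqdns.txt.

```python
import io

# In-memory stand-in for the contents of fqdns.txt
saved = io.StringIO("www.google.co.uk\nwww.bbc.co.uk\n")

raw_lines = []    # lines exactly as iteration yields them
clean_lines = []  # the same lines with the trailing newline removed

for line in saved:
    raw_lines.append(line)
    clean_lines.append(line.rstrip("\n"))

# repr shows that the newline is still attached to what was read
print(repr(raw_lines[0]))

# Printing the stripped lines gives one host per line, no blank lines
for host in clean_lines:
    print(host)
```

If that is right, the records are written correctly and the fix belongs on the reading side. In the Python 2 code above, print j.rstrip() or print j, (with a trailing comma, which suppresses print's own newline) should have the same effect.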

any help would be great

s
