[Tutor] scraping and saving in file SOLVED

Wed Dec 29 13:45:57 CET 2010

Tommy Kaas wrote:

> With Stevens help about writing and Peters help about import codecs - and
> when I used \r\n instead of \r to give me new lines everything worked. I
> just thought that \n would be necessary? Thanks.
> Tommy

Newline handling varies across operating systems. If you are on Windows and 
open a file in text mode, your program sees plain "\n", but the data stored 
on disk is "\r\n". Most other OSes don't translate newlines at all.
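
To make the translation visible on any OS, here is a minimal sketch. It 
assumes io.open (Python 2.6 or later, and Python 3) and its newline argument, 
which this thread did not use; write_with_crlf, read_raw and the temporary 
file name are made up for illustration.

```python
import io
import os
import tempfile


def write_with_crlf(path, text):
    # newline="\r\n" makes the file object translate every "\n" written
    # into "\r\n", whatever operating system the script runs on.
    with io.open(path, "w", newline="\r\n", encoding="utf-8") as f:
        f.write(text)


def read_raw(path):
    # Binary mode returns the bytes exactly as they are stored on disk.
    with open(path, "rb") as f:
        return f.read()


# Hypothetical demo file in a fresh temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), "newline_demo.txt")
write_with_crlf(demo_path, u"first\nsecond\n")
raw = read_raw(demo_path)
print(repr(raw))
```

The program only ever writes "\n"; reading the file back in binary mode shows 
that each line on disk ends in "\r\n".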

If you always want "\r\n" you can rely on the csv module to write your data, 
because its writer ends every row with "\r\n" by default. The drawback is 
that you have to encode the strings manually:

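import csv
import urllib2 
from BeautifulSoup import BeautifulSoup 

# Download the page and parse it.
html = urllib2.urlopen(
    'http://www.kaasogmulvad.dk/unv/python/tabeltest.htm').read()
soup = BeautifulSoup(html)

# Binary mode ("wb") so that the "\r\n" written by csv is not translated again.
with open('tabeltest.txt', "wb") as f:
    writer = csv.writer(f, delimiter="#")
    rows = soup.findAll('tr')
    for tr in rows:
        cols = tr.findAll('td')
        # Python 2's csv module expects byte strings, so convert each cell
        # to plain unicode first and then encode it as UTF-8.
        writer.writerow([unicode(col.string).encode("utf-8")
                         for col in cols])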
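
For comparison, a rough sketch of the same writing step under Python 3, which 
is an assumption since the code above is Python 2. There the file object does 
the encoding and csv still supplies the "\r\n". The rows are made-up sample 
data rather than the scraped table, and write_rows is a hypothetical helper.

```python
import csv
import os
import tempfile


def write_rows(path, rows):
    # newline="" stops the file object from translating line endings, so
    # the csv writer's own "\r\n" terminator reaches the disk unchanged.
    # encoding="utf-8" replaces the manual .encode("utf-8") calls.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="#")
        writer.writerows(rows)


# Made-up sample rows; "\u00e4" is the letter a-umlaut.
sample_rows = [["\u00e4lpha", "1"], ["beta", "2"]]
out_path = os.path.join(tempfile.mkdtemp(), "tabeltest_demo.txt")
write_rows(out_path, sample_rows)

with open(out_path, "rb") as f:
    data = f.read()
print(repr(data))
```

Every row in the resulting file ends in "\r\n" on any OS, and the non-ASCII 
cell is stored as UTF-8 bytes.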

PS: It took me some time to figure out how to deal with BeautifulSoup's 
flavour of unicode:

>>> import BeautifulSoup as bs
>>> s = bs.NavigableString(u"älpha")
>>> s
u'\xe4lpha'
>>> s.encode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/pymodules/python2.6/BeautifulSoup.py", line 430, in encode
    return self.decode().encode(encoding)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 0: ordinal not in range(128)
>>> unicode(s).encode("utf-8") # heureka
'\xc3\xa4lpha'