[Tutor] Problems with encoding in BeautifulSoup

Mal Wanstall m.wanstall at gmail.com
Tue Aug 18 06:18:19 CEST 2009


On Tue, Aug 18, 2009 at 9:00 AM, Eduardo Vieira <eduardo.susan at gmail.com> wrote:
> Hello, I have this sample script from Beautiful Soup, but I keep
> getting an error because of encoding. I have googled for solutions but
> I don't seem to understand them. This is even dealt with in Beautiful
> Soup's docs, but I am not able to understand/apply the solution successfully.
>
> from BeautifulSoup import BeautifulSoup
> import urllib2
> page = urllib2.urlopen('http://www.yellowpages.ca/search/si/1/Signs/QC')
>
> # if I change the url to
> # http://www.yellowpages.ca/search/si/3/Signs/ON, it works because
> # there are no French words...
>
> soup = BeautifulSoup(page)
>
> companies = soup('h2')
>
> print soup.originalEncoding
>
> print companies[:4]
>
> However, if I do this, I don't get errors even when there are accents:
> for title in companies:
>    print title
>
> Here is the Error output:
> utf-8
> Traceback (most recent call last):
>  File "C:\myscripts\encondingproblem.py", line 13, in <module>
>    print companies[:4]
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in
> position 373: ordinal not in range(128)
>
> ===
> Thanks in advance.
>
> Eduardo
> www.expresssignproducts.com

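It's most likely caused by Python falling back to its default ASCII
codec when it has to turn the Unicode text in that list into a byte
string for printing; u'\xe9' (é) has no ASCII encoding. One way to
override this is to create a sitecustomize.py file somewhere on Python's
import path (your /usr/lib/python/ folder on Linux, or the site-packages
folder of your Python install on Windows) and put the following in it: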

import sys
sys.setdefaultencoding("utf-8")

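This will set the default encoding in Python to UTF-8 and you should
stop getting these encoding errors. It has to go in sitecustomize.py
because site.py removes sys.setdefaultencoding once startup finishes,
and note that it changes the default for every script run with that
interpreter. I dealt with this recently when I was playing around with
some international data.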
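
If you'd rather not change the default for every script, a minimal
sketch of the other route is to do the encoding yourself before
printing. The helper name render_items is my own invention, not part of
BeautifulSoup, and I'm assuming each item converts cleanly to Unicode
text (as far as I know BeautifulSoup tags do). Plain strings stand in
for your <h2> tags here:

```python
# -*- coding: utf-8 -*-

def render_items(items, encoding="utf-8"):
    """Join items as text, one per line, and encode the result explicitly.

    render_items is a hypothetical helper, not a BeautifulSoup function.
    """
    # u"%s" % item asks each item for its Unicode form instead of relying
    # on the list's repr, so the only encoding step is the explicit one below.
    text = u"\n".join(u"%s" % item for item in items)
    return text.encode(encoding)


# Plain Unicode strings stand in for the <h2> tags from the page.
sample = [u"Enseignes Caf\xe9", u"Signs Plus"]
encoded = render_items(sample)

# With the script above (Python 2) the call would be:
#     print render_items(companies[:4])
```

Pick whatever encoding your terminal actually uses; utf-8 is only a
guess on my part. Asking for "ascii" here raises the same
UnicodeEncodeError you are seeing.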

-Mal W

