Replacing utf-8 characters

Wed Oct 5 16:14:04 EDT 2005

Mike wrote:

> Hi, I am using Python to scrape web pages and I do not have a problem 
> unless I run into a site that is utf-8.  It seems & is changed to 
> &amp; when the site is utf-8.
>
> 	[...]

> Any ideas?

How about using the Universal Feed Parser from feedparser.org to fetch 
and parse the RSS from Reuters?  That's what I do and it works like a 
charm.

#v+

>>> import feedparser
>>> rss = feedparser.parse('http://today.reuters.com/rss/topNews')
>>> for what in ('link', 'title', 'summary'):
...     print rss.entries[0][what]
...     print
...
http://today.reuters.com/news/newsarticle.aspx?type=topNews&storyid=2005-10-05T193846Z_01_DIT561620_RTRUKOC_0_US-COURT-SUICIDE.xml

Top court seems closely divided on suicide law

During arguments, the justices sharply questioned both sides on whether then-Attorney General John Ashcroft had the power under federal law in 2001 to bar distribution of controlled drugs to assist suicides, regardless of state law.
>>> 

#v-
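
If you would rather keep your own scraper, the leftover &amp; can 
probably be undone after the fact.  A minimal sketch, assuming Python 3.4 
or later (where html.unescape is in the standard library); the helper 
name clean_link and the example.com URL are made up for illustration:

```python
import html


def clean_link(text):
    """Hypothetical helper: decode &amp; and other HTML entities in text."""
    # html.unescape leaves a bare '&' alone, so already-clean text is unchanged.
    return html.unescape(text)


# Made-up URL standing in for a link scraped from an escaped page.
escaped = 'http://example.com/news?type=topNews&amp;storyid=123'
print(clean_link(escaped))
```

This only decodes entities; it does not guess the page's character 
encoding, which you would still have to handle when reading the bytes.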

Cheers,

-- 
Klaus Alexander Seistrup
Magnetic Ink, Copenhagen, Denmark
http://magnetic-ink.dk/