Replacing utf-8 characters
Klaus Alexander Seistrup
klaus at seistrup.dk
Wed Oct 5 16:14:04 EDT 2005
Mike wrote:
> Hi, I am using Python to scrape web pages and I do not have problem
> unless I run into a site that is utf-8. It seems & is changed to
> &amp; when the site is utf-8.
>
> [...]
> Any ideas?
How about using the universal feedparser from feedparser.org to fetch
and parse the RSS from Reuters? That's what I do and it works like a
charm.
#v+
>>> import feedparser
>>> rss = feedparser.parse('http://today.reuters.com/rss/topNews')
>>> for what in ('link', 'title', 'summary'):
...     print rss.entries[0][what]
...     print
...
http://today.reuters.com/news/newsarticle.aspx?type=topNews&storyid=2005-10-05T193846Z_01_DIT561620_RTRUKOC_0_US-COURT-SUICIDE.xml
Top court seems closely divided on suicide law
During arguments, the justices sharply questioned both sides on whether then-Attorney General John Ashcroft had the power under federal law in 2001 to bar distribution of controlled drugs to assist suicides, regardless of state law.
>>>
#v-
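If you would rather keep your own scraper and only undo the entity escaping in text you have already fetched, a minimal sketch using just the standard library might look like the following. It assumes Python 3 (the `html` module is not in the 2.x series used above), and the helper name `clean_link` and the sample URL are made up for illustration; they are not taken from your code.

```python
import html


def clean_link(raw):
    """Decode HTML character references (e.g. &amp;, &lt;) in a scraped string.

    Hypothetical helper for illustration; it performs a single decoding pass.
    """
    return html.unescape(raw)


# Made-up sample resembling a link pulled out of page markup.
sample = "http://example.com/news?type=topNews&amp;storyid=123"
print(clean_link(sample))
```

Note that this only decodes one level of escaping, so text that was escaped twice would need a second pass.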
Cheers,
--
Klaus Alexander Seistrup
Magnetic Ink, Copenhagen, Denmark
http://magnetic-ink.dk/