HTML Parsing and Indexing

Andy Dingley dingbat at codesmiths.com
Mon Nov 13 18:12:15 EST 2006


mailtogops at gmail.com wrote:

>     I am involved in one project which tends to collect news
> information published on selected, known web sites in the format of
> HTML, RSS, etc.

I just can't imagine why anyone would still want to do this.

With RSS, it's an easy (if not trivial) problem.
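
To give a rough idea of how little code the RSS case needs, here is a minimal sketch using only the standard library. The sample feed, the function name parse_rss_items and the choice of fields are my own assumptions for illustration; it assumes a plain RSS 2.0 feed and ignores namespaces, Atom, and malformed XML, which a real collector would have to handle.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, standing in for a document fetched from a site.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>First story</title>
      <link>http://example.com/1</link>
      <pubDate>Mon, 13 Nov 2006 18:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second story</title>
      <link>http://example.com/2</link>
    </item>
  </channel>
</rss>"""


def parse_rss_items(xml_text):
    """Return one dict per <item> element in an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            # findtext returns the default when the child element is absent.
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
        })
    return items


if __name__ == "__main__":
    for entry in parse_rss_items(SAMPLE_RSS):
        print(entry["title"], entry["link"])
```

The feed format does the work here: every site puts titles and links in the same elements, so the same dozen lines apply to all of them. Nothing comparable holds for HTML, where each site needs its own extraction rules.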

With HTML it's hard, it's unstable, and the legality of recycling
others' content like this is far from clear.  Are you _sure_ there's
still a need to do this thoroughly awkward task?  How many sites are
there that are worth scraping, permit scraping, and don't yet offer RSS?



