HTML Parsing and Indexing

Mon Nov 13 18:12:15 EST 2006

mailtogops at gmail.com wrote:

>     I am involved in one project which tends to collect news
> information published on selected, known web sites in the format of
> HTML, RSS, etc.

I just can't imagine why anyone would still want to do this.

With RSS, it's an easy (if not trivial) problem.
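
As a rough illustration of why the RSS case is easy, here is a minimal
sketch using only the standard library. It assumes a plain RSS 2.0 feed
that has already been fetched into a string; the sample feed, the
example.com URLs and the rss_items name are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, standing in for whatever a real site publishes.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>First story</title>
      <link>http://example.com/1</link>
    </item>
    <item>
      <title>Second story</title>
      <link>http://example.com/2</link>
    </item>
  </channel>
</rss>"""


def rss_items(xml_text):
    """Return a list of {'title', 'link'} dicts, one per <item> element."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            # findtext returns None when the child is missing, so fall
            # back to an empty string before stripping whitespace.
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
        })
    return items


if __name__ == "__main__":
    for entry in rss_items(SAMPLE_FEED):
        print(entry["title"], entry["link"])
```

Real feeds vary (Atom, namespaces, malformed XML), so treat this as the
well-behaved case only.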

With HTML it's hard, it's unstable, and the legality of recycling
others' content like this is far from clear.  Are you _sure_ there's
still a need to do this thoroughly awkward task?  How many sites are
there that are worth scraping, permit scraping, and don't yet offer RSS?