HTML Parsing and Indexing

Mon Nov 13 16:36:31 EST 2006

A combination of urllib, urllib2 and BeautifulSoup should do it.
Read BeautifulSoup's documentation to learn how to navigate the
DOM.
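
As a rough illustration of that combination applied to the per-site
rules idea from the question, here is a minimal sketch. It assumes
Python 3 (where urllib2's urlopen lives in urllib.request) and the bs4
package. The SITE_RULES table, the example.com URLs, the CSS selectors
and the function names fetch and extract_links are all hypothetical,
made up for this example; real sites would need their own selectors.

```python
from urllib.request import urlopen  # Python 3 home of the old urllib2.urlopen

from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Hypothetical per-site rules: a CSS selector locating each headline link.
SITE_RULES = {
    "example-news": {"url": "http://news.example.com/", "links": "div.story h2 a"},
    "example-blog": {"url": "http://blog.example.com/", "links": "li.post > a"},
}


def fetch(url):
    """Download a page and return its raw bytes."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def extract_links(html, rule):
    """Return (title, href) pairs for the elements matched by a site's rule."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for a in soup.select(rule["links"]):
        href = a.get("href")
        if href:  # skip anchors that carry no link target
            results.append((a.get_text(strip=True), href))
    return results


if __name__ == "__main__":
    # Parse an inline sample so the sketch runs without network access.
    sample = """
    <html><body>
      <div class="story"><h2><a href="/a1">First story</a></h2></div>
      <div class="story"><h2><a href="/a2">Second story</a></h2></div>
      <div class="ad"><a href="/buy">Buy now</a></div>
    </body></html>
    """
    for title, href in extract_links(sample, SITE_RULES["example-news"]):
        print(title, href)
```

For an unattended run, one would loop over SITE_RULES, call
fetch(rule["url"]) and pass the result to extract_links, so that
supporting a new site mostly means adding one entry to the table
rather than writing a new parser.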

mailtogops at gmail.com wrote:

> Hi All,
>
>     I am involved in a project that collects news published on
> selected, known web sites in formats such as HTML and RSS, shortlists
> the items, and creates a bookmark on our website for each piece of
> news content (we will use Django for web development). Currently
> this project is under heavy development.
>
> I need help with an HTML parser.
>
> I can download the web pages from the target sites. Then I have to
> start parsing. Since they are all HTML pages with different styles
> and tags, it is very hard for me to parse the data. So our plan is to
> have one or more rules for each website and run the parser based on
> those rules. We can even write a small amount of code for each web
> site if required. But the crawler, parser and indexer need to run
> unattended. I don't know how to proceed next.
>
> I saw a couple of Python parsers like pyparsing, yappy, yapps, etc.,
> but they haven't given any example for HTML parsing. Someone
> recommended using "lynx" to convert the page into text and parse
> that. That also looks good, but I would still end up writing a huge
> chunk of code for each web page.
>
> What we need is:
>
> One nice parser which works on an HTML/text file (lynx output),
> works based on certain rules, and returns us a result (do I need
> magic to do this? :-( )
> 
> Sorry about my English.
> 
> Thanks & Regards,
> 
> Krish