[Tutor] HTML Parsing

Alan Gauld alan.gauld at btinternet.com
Wed May 28 13:50:57 CEST 2014


On 28/05/14 11:42, Mitesh H. Budhabhatti wrote:
> Hello Friends,
>
> I am using Python 3.3.3 on Windows 7.  I would like to know what is the
> best method to do HTML parsing?  For example, I want to connect to
> www.yahoo.com <http://www.yahoo.com> and get all the tags and their values.

The standard library contains a parser module:
html.parser

It can do what you want, although it's a non-trivial exercise.
Basically you subclass HTMLParser and define event handler methods
for each type of parser event. In your case you need handlers for
starttag and data, and maybe endtag.

Within the starttag handler you can check the tag name to determine
the tag type (the attributes arrive separately, as a list of
(name, value) pairs), so it typically looks like:

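from html.parser import HTMLParser

class MyParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # tag is the lower-cased tag name, attrs a list of (name, value) pairs
        if tag == 'p':
            pass   # process paragraph tag
        elif tag == 'tr':
            pass   # process table row
        # etc...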
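
To make the handler idea a bit more concrete, here is a rough sketch of a
parser that records every start tag with its attributes, plus the text found
inside each tag. The class name TagCollector, its attribute names and the
sample string are just made up for illustration, and the open-tag stack is a
simplification that will mislabel text following void tags such as <br>.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record each start tag with its attributes, plus the text inside tags."""

    def __init__(self):
        super().__init__()
        self.tags = []    # (tag, attribute dict) per start tag, in document order
        self.text = []    # (enclosing tag, text) for each non-blank text chunk
        self._open = []   # stack of currently open tag names

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.tags.append((tag, dict(attrs)))
        self._open.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching start tag, tolerating unclosed tags.
        if tag in self._open:
            while self._open.pop() != tag:
                pass

    def handle_data(self, data):
        data = data.strip()
        if data:
            parent = self._open[-1] if self._open else None
            self.text.append((parent, data))


# Hypothetical sample document, standing in for a downloaded page.
sample = ('<html><body><p class="intro">Hello</p>'
          '<table><tr><td>1</td></tr></table></body></html>')

collector = TagCollector()
collector.feed(sample)
collector.close()

for tag, attrs in collector.tags:
    print(tag, attrs)
for parent, data in collector.text:
    print(parent, repr(data))
```

For a real page you would feed() the decoded text of the download instead of
the sample string, for example what urllib.request.urlopen(url).read()
returns after calling .decode() with the page's encoding.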


However, you might find it easier to use BeautifulSoup, which is a
third-party package you need to download. Soup tends to handle
badly formed HTML better than the standard parser and works by
reading the whole HTML document into a tree-like structure which
you can access, search or traverse...

HTH
-- 
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.flickr.com/photos/alangauldphotos


