[Tutor] HTML Parsing
Alan Gauld
alan.gauld at btinternet.com
Wed May 28 13:50:57 CEST 2014
On 28/05/14 11:42, Mitesh H. Budhabhatti wrote:
> Hello Friends,
>
> I am using Python 3.3.3 on Windows 7. I would like to know what is the
> best method to do HTML parsing? For example, I want to connect to
> www.yahoo.com and get all the tags and their values.
The standard library contains a parser module:
html.parser
which can do what you want, although it's a non-trivial exercise.
Basically you subclass HTMLParser and define an event handler method
for each type of parser event. In your case you need handle_starttag
and handle_data, and maybe handle_endtag.
Within handle_starttag you check the tag name to determine the tag
type, and you can read the attributes, which arrive as a list of
(name, value) pairs, so it typically looks like:

from html.parser import HTMLParser

class MyParser(HTMLParser):
    def handle_starttag(self, name, attributes):
        if name == 'p':
            pass  # process paragraph tag
        elif name == 'tr':
            pass  # process table row
        # etc...

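To make the handler idea concrete, here is a minimal sketch that collects every tag with its attributes, plus the text found inside each tag. The class name TagCollector, its attributes, and the sample string are my own illustrative choices, not part of html.parser; fetching a live page such as www.yahoo.com is left out so the example runs offline.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record every start tag (with attributes) and every non-blank text run."""

    def __init__(self):
        super().__init__()
        self.tags = []   # (tag name, {attribute: value}) pairs
        self.text = []   # (enclosing tag name, text) pairs
        self._open = []  # stack of currently open tag names

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples
        self.tags.append((tag, dict(attrs)))
        self._open.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray end tags.
        if tag in self._open:
            while self._open.pop() != tag:
                pass

    def handle_data(self, data):
        data = data.strip()
        if data:
            parent = self._open[-1] if self._open else None
            self.text.append((parent, data))


# Hypothetical sample document, standing in for a downloaded page.
sample = '<html><body><p class="intro">Hello <b>world</b></p></body></html>'
collector = TagCollector()
collector.feed(sample)
collector.close()

for tag, attrs in collector.tags:
    print(tag, attrs)
for parent, text in collector.text:
    print(parent, repr(text))
```

Note that the simple stack above does not know about void elements such as br or img, which never get an end tag, so text following them would be attributed to the wrong parent. Handling that properly is part of what makes this a non-trivial exercise.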
However, you might find it easier to use BeautifulSoup, which is a
third-party package you need to download. Soup tends to handle
badly formed HTML better than the standard parser and works by
reading the whole HTML document into a tree-like structure which
you can then access, search or traverse.
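For comparison, a rough sketch of the same job with BeautifulSoup. This assumes the beautifulsoup4 package is installed (it is imported as bs4) and reuses a made-up sample string rather than a real download; the "html.parser" argument just tells Soup to use the standard library parser underneath.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Hypothetical sample document, standing in for a downloaded page.
sample = '<html><body><p class="intro">Hello <b>world</b></p></body></html>'
soup = BeautifulSoup(sample, "html.parser")

# find_all(True) matches every tag in the tree, in document order.
tag_names = [tag.name for tag in soup.find_all(True)]

# Search for specific tags, then read their attributes and text.
paragraphs = soup.find_all("p")
first_class = paragraphs[0].get("class")  # class is multi-valued, so a list
first_text = paragraphs[0].get_text()     # text of the tag and its children

print(tag_names)
print(first_class, repr(first_text))
```

Here there are no handlers to write: the whole document is parsed up front and you query the resulting tree afterwards.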
HTH
--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.flickr.com/photos/alangauldphotos