sgmllib.py

Stefan Behnel stefan_ml at behnel.de
Mon Aug 24 03:08:07 EDT 2009


elsa wrote:
> I'm new to both this forum and Python, and I've got a bit stuck trying
> to learn how to parse HTML...

If what you want to do is *parse* the HTML instead of trying to *learn* how
to parse it, you might want to give the existing (external) HTML parser
libraries a try. There's lxml.html (extremely fast and fixes up broken
HTML), html5lib (very slow, but very browser-like parse results) and
BeautifulSoup (slow, but good encoding detection if you need that).

Here are a couple of (only slightly biased) comparisons:

http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
http://blog.ianbicking.org/2008/12/10/lxml-an-underappreciated-web-scraping-library/


> python sgmllib.py "path/to/my/file.html"  .... example (1)
> 
> this doesn't work for me. I think I have figured out the problem  -
> the error says
> 
> "/System/Library/Frameworks/Python.framework/Versions/2.5/Resources/
> Python.app/Contents/MacOS/Python: can't open file 'sgmllib.py': [Errno
> 2] No such file or directory"
> 
> the problem is that this path is wrong. My sgmllib.py is in:
> 
> "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/
> python2.5/sgmllib.py"

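You can use "python -m sgmllib" to run a module from the stdlib as a script
(or, to be more accurate, any module found on sys.path, which includes the
PYTHONPATH). Python looks the module up by name, so you don't need to know
where sgmllib.py lives.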
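
As a rough sketch of the lookup that "-m" performs, the snippet below asks
Python where it would find a module by name. It assumes Python 3, where
sgmllib no longer exists, so it uses the stdlib module "timeit" as a
stand-in. module_file is a hypothetical helper name, not part of any library,
and it only handles top-level module names.

```python
import importlib.util


def module_file(name):
    """Return the file Python would load for a top-level module name.

    Hypothetical helper for illustration: returns None when the module
    cannot be found on sys.path.
    """
    spec = importlib.util.find_spec(name)
    return spec.origin if spec is not None else None


# "timeit" stands in for sgmllib, which was removed in Python 3.
print(module_file("timeit"))

# A name that is not on sys.path yields None instead of a path.
print(module_file("no_such_module_xyz"))
```

The point is only that the module name is enough: the interpreter searches
sys.path itself, which is why "python -m" works from any directory.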

But note that sgmllib is a particularly cumbersome way to deal with HTML.
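
To give an idea of why: sgmllib is event based, so you subclass the parser
and collect what you need from handler callbacks. The sketch below shows that
style using html.parser.HTMLParser from the Python 3 stdlib as a stand-in
(an assumption on my side, since sgmllib was removed in Python 3).
LinkCollector is a made-up class name for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags (hypothetical example class)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased;
        # attrs is a list of (name, value) pairs.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


parser = LinkCollector()
parser.feed('<p>See <a href="http://example.com/">this</a> '
            'and <A HREF="/other">that</A>.</p>')
parser.close()
print(parser.links)
```

Even for something as simple as extracting links, you have to keep the state
yourself across callbacks, which is where the tree-based libraries mentioned
above save a lot of work.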

Stefan


