Manipulate HTML documents via data structure

C. Barnes connellybarnes at yahoo.com
Fri Oct 1 03:49:09 EDT 2004


Python provides HTML parsing through the
HTMLParser and htmllib modules.

For my application, I needed to search through
an HTML document in a nonlinear fashion and
dynamically change parts of the document.  The
most logical way to do this is to translate HTML
back and forth to a data structure.
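
As a rough illustration of that idea, here is a sketch using only the
standard library. It assumes a modern Python 3, where the parser lives in
html.parser; the names ListBuilder and html_to_list are hypothetical and
are not taken from any existing module. It flattens a document into a
list of (tag, attributes) tuples and text strings:

```python
from html.parser import HTMLParser


class ListBuilder(HTMLParser):
    """Collect tags and text into one flat list (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.items = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        self.items.append((tag, dict(attrs)))

    def handle_endtag(self, tag):
        # Assumption: closing tags are stored as '/tag' with no attributes.
        self.items.append(('/' + tag, {}))

    def handle_data(self, data):
        # Text between tags is stored as a plain string.
        self.items.append(data)


def html_to_list(text):
    """Parse an HTML string into a flat list of tuples and strings."""
    builder = ListBuilder()
    builder.feed(text)
    builder.close()
    return builder.items
```

Because the result is an ordinary list, any element can be reached by
index or by a search, rather than only in document order.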

I wrote a module called htmldata, available from:

http://oregonstate.edu/~barnesc/htmldata/

Example:

>>> from htmldata import dumps, loads
>>> o = loads('<img src=hi.gif alt="blah">foo</body>')
>>> o
[('img', {'src':'hi.gif', 'alt':'blah'}), 'foo',
('/body', {})]
>>> dumps(o)
'<img alt="blah" src="hi.gif">foo</body>'
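
Once the document is in this form, editing it is ordinary list
manipulation. A minimal sketch, assuming the list layout shown in the
session above; set_attr is a hypothetical helper and is not part of
htmldata:

```python
def set_attr(items, tag, name, value):
    """Return a copy of items with attribute `name` set on every `tag`."""
    result = []
    for item in items:
        # Tags are (name, attrs) tuples; text nodes are plain strings.
        if isinstance(item, tuple) and item[0] == tag:
            attrs = dict(item[1])  # copy so the input is left untouched
            attrs[name] = value
            item = (tag, attrs)
        result.append(item)
    return result


# The structure from the session above, written out by hand.
doc = [('img', {'src': 'hi.gif', 'alt': 'blah'}), 'foo', ('/body', {})]
edited = set_attr(doc, 'img', 'src', 'bye.gif')
```

Since the edited list has the same shape as what loads() returned, it
should be possible to pass it to dumps() to get the modified HTML back.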

Pros:
 * More powerful than HTMLParser for editing HTML: any
   part of the document can be found and changed in place.
 * Easy to reproduce the original document (or at least
   a document that is HTML-equivalent to the original).

Cons:
 * Less user-friendly than the HTMLParser module.

I tested it on pages from several popular sites.  Feedback,
bug reports, etc. are appreciated.

 - Connelly Barnes



	
		



More information about the Python-list mailing list