Working with HTML5 documents

Thu Nov 20 12:31:39 EST 2014

On Thursday, November 20, 2014 12:04:09 PM UTC-5, Denis McMahon wrote:
> On Wed, 19 Nov 2014 13:43:17 -0800, Novocastrian_Nomad wrote:
> 
> > On Wednesday, November 19, 2014 2:08:27 PM UTC-7, Denis McMahon wrote:
> >> So what I'm looking for is a method to create an html5 document using
> >> "dom manipulation", ie:
> >> 
> >> doc = new htmldocument(doctype="HTML")
> >> html = new html5element("html")
> >> doc.appendChild(html)
> >> head = new html5element("head")
> >> html.appendChild(head)
> >> body = new html5element("body")
> >> html.appendChild(body)
> >> title = new html5element("title")
> >> txt = new textnode("This Is The Title")
> >> title.appendChild(txt)
> >> head.appendChild(title)
> >> para = new html5element("p")
> >> txt = new textnode("This is some text.")
> >> para.appendChild(txt)
> >> body.appendChild(para)
> >> 
> >> print(doc.serialise())
> >> 
> >> generates:
> >> 
> >> <!doctype HTML><html><head><title>This Is The Title</title></head>
> >> <body><p>This is some text.</p></body></html>
> >> 
> >> I'm finding various mechanisms to generate the structure from an
> >> existing piece of html (eg html5lib, beautifulsoup etc) but I can't
> >> seem to find any mechanism to generate, manipulate and produce html5
> >> documents using this dom manipulation approach. Where should I be
> >> looking?
> 
> > Use a search engine (Google, DuckDuckGo etc) and search for 'python
> > write html'
> 
> Surprise surprise, already tried that, can't find anything that holds the 
> document in the sort of tree structure that I want to manipulate it in.
> 
> Everything there seems to assume I'll be creating a document serially, eg 
> that I won't get to some point in the document and decide that I want to 
> add an element earlier.
> 
> bs4 and html5lib will parse a document into a tree structure, but they're 
> not so hot on manipulating the tree structure, eg adding and moving nodes.
> 
> Actually it looks like bs4 is going to be my best bet; although limited,
> it does have most of what I'm looking for. I just need to start by giving
> it "<html></html>" to parse.
> 
> -- 
> Denis McMahon

I believe lxml should work for this. Here's a snippet that I have used to create an HTML document:

    from lxml import etree
    page = etree.Element('html')
    doc = etree.ElementTree(page)

    head = etree.SubElement(page, 'head')
    body = etree.SubElement(page, 'body')
    table = etree.SubElement(body, 'table')

    # ... etc etc: add further elements the same way ...

    with open('mynewfile.html', 'wb') as f:
        doc.write(f, pretty_print=True, method='html')

(You can leave out the method= option to get XML-style output, i.e. XHTML-like markup with self-closing empty tags.)
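
As a fuller sketch (an illustration, not code from a real project), the pseudocode at the top of the thread could be written with the same Element/SubElement calls roughly as follows. The names build_page and serialise are made up for this example, the doctype is simply prepended as a string, and the import falls back to the standard library's xml.etree.ElementTree (which offers the calls used here) if lxml isn't installed. It also inserts a meta element at the front of head after the body has been built, to show that the tree can be changed out of order:

```python
try:
    from lxml import etree  # assumption: lxml is installed
except ImportError:
    # Fallback: the standard library provides the same calls used below.
    import xml.etree.ElementTree as etree


def build_page():
    """Build the example document as a tree and return the root element."""
    html = etree.Element('html')
    head = etree.SubElement(html, 'head')
    body = etree.SubElement(html, 'body')

    title = etree.SubElement(head, 'title')
    title.text = 'This Is The Title'

    para = etree.SubElement(body, 'p')
    para.text = 'This is some text.'

    # Decide late that an element belongs earlier in the document:
    # put a meta element in front of the existing title.
    meta = etree.Element('meta', charset='utf-8')
    head.insert(0, meta)
    return html


def serialise(root, method='html'):
    """Return the tree as a string, with a doctype prepended by hand."""
    markup = etree.tostring(root, encoding='unicode', method=method)
    return '<!DOCTYPE html>' + markup


page = build_page()
print(serialise(page))                # HTML serialisation
print(serialise(page, method='xml'))  # XML-style serialisation
```

With method='html' the meta element is written as an unclosed <meta charset="utf-8"> tag; with method='xml' it comes out self-closed.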

Hope that helps,
--Tim