Working with HTML5 documents

Thu Nov 20 14:02:06 EST 2014

Tim wrote on 20.11.2014 at 18:31:
> On Thursday, November 20, 2014 12:04:09 PM UTC-5, Denis McMahon wrote:
>>> On Wednesday, November 19, 2014 2:08:27 PM UTC-7, Denis McMahon wrote:
>>>> So what I'm looking for is a method to create an html5 document using
>>>> "dom manipulation", ie:
>>>>
>>>> doc = new htmldocument(doctype="HTML")
>>>> html = new html5element("html")
>>>> doc.appendChild(html)
>>>> head = new html5element("head")
>>>> html.appendChild(head)
>>>> body = new html5element("body")
>>>> html.appendChild(body)
>>>> title = new html5element("title")
>>>> txt = new textnode("This Is The Title")
>>>> title.appendChild(txt)
>>>> head.appendChild(title)
>>>> para = new html5element("p")
>>>> txt = new textnode("This is some text.")
>>>> para.appendChild(txt)
>>>> body.appendChild(para)
>>>>
>>>> print(doc.serialise())
>>>>
>>>> generates:
>>>>
>>>> <!doctype HTML><html><head><title>This Is The Title</title></head><body><p>This is some text.</p></body></html>
>>>>
>>>> I'm finding various mechanisms to generate the structure from an
>>>> existing piece of html (eg html5lib, beautifulsoup etc) but I can't
>>>> seem to find any mechanism to generate, manipulate and produce html5
>>>> documents using this dom manipulation approach. Where should I be
>>>> looking?
>>
>> Everything there seems to assume I'll be creating a document serially, eg 
>> that I won't get to some point in the document and decide that I want to 
>> add an element earlier.
>>
>> bs4 and html5lib will parse a document into a tree structure, but they're 
>> not so hot on manipulating the tree structure, eg adding and moving nodes.
>>
>> Actually it looks like bs4 is going to be my best bet, although limited 
>> it does have most of what I'm looking for. I just need to start by giving 
>> it "<html></html>" to parse.
> 
> I believe lxml should work for this. Here's a snippet that I have used to create an HTML document:
> 
>     from lxml import etree
>     page = etree.Element('html')
>     doc = etree.ElementTree(page)
> 
>     head = etree.SubElement(page, 'head')
>     body = etree.SubElement(page, 'body')
>     table = etree.SubElement(body, 'table')
>     
>     # etc etc
>    
>     with open('mynewfile.html', 'wb') as f:
>         doc.write(f, pretty_print=True, method='html')
> 
> (you can leave out the method= option to get xhtml).
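
On the concern about only being able to build a document serially: the tree can still be changed after the fact. The following is just a rough sketch, assuming lxml is installed; the heading text and the variable names are made up for illustration. It adds a head and a heading in front of elements that already exist:

```python
from lxml import etree

# Start with a body-only document, as if the head had been forgotten.
page = etree.Element("html")
body = etree.SubElement(page, "body")
para = etree.SubElement(body, "p")
para.text = "This is some text."

# Later: insert a <head> as the first child of <html>, before <body>.
head = etree.Element("head")
page.insert(0, head)
title = etree.SubElement(head, "title")
title.text = "This Is The Title"

# Later still: put a heading directly in front of the existing paragraph.
heading = etree.Element("h1")
heading.text = "An illustrative heading"  # hypothetical content
para.addprevious(heading)

output = etree.tostring(page, method="html",
                        doctype="<!DOCTYPE html>", encoding="unicode")
print(output)
```

Elements can also be moved: appending or inserting an element that already sits in a tree takes it out of its old position first.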

There's also the E-factory for creating (sub-)trees in a nicely object-like way:

http://lxml.de/lxmlhtml.html#creating-html-with-the-e-factory
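
To give an idea of what that looks like, here is a minimal sketch using lxml.html.builder. It rebuilds the example document from the original question; the extra h1 and its text are my own addition, only there to show that the result is an ordinary tree you can keep modifying:

```python
from lxml import html
from lxml.html import builder as E

# Build the whole tree in one nested expression.
page = E.HTML(
    E.HEAD(E.TITLE("This Is The Title")),
    E.BODY(E.P("This is some text.")),
)

# The result is a normal element tree, so it can be changed afterwards,
# e.g. by inserting a heading at the start of the body (hypothetical text).
body = page.find("body")
body.insert(0, E.H1("A heading added later"))

output = html.tostring(page, doctype="<!DOCTYPE html>", encoding="unicode")
print(output)
```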

and the just released lxml 3.4.1 has an "htmlfile" context manager that
allows you to generate HTML incrementally:

http://lxml.de/api.html#incremental-xml-generation

Obviously, you can combine both, so you can create a subtree in memory and
write it into an incrementally built HTML stream. Pretty versatile.
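
A rough sketch of the combination, assuming lxml 3.4.1 or later. It writes into an in-memory buffer rather than a real file, and the paragraph contents are invented for the example:

```python
import io

from lxml import etree
from lxml.html import builder as E

buffer = io.BytesIO()  # stands in for a real output file

with etree.htmlfile(buffer, encoding="utf-8") as out:
    out.write_doctype("<!DOCTYPE html>")
    with out.element("html"):
        with out.element("head"):
            # A small subtree built in memory, written in one go.
            out.write(E.TITLE("This Is The Title"))
        with out.element("body"):
            # Each paragraph is built as a subtree and streamed out,
            # so the complete document never has to exist in memory.
            for number in range(3):
                out.write(E.P("Paragraph %d" % number))

result = buffer.getvalue().decode("utf-8")
print(result)
```

The trade-off is that whatever has already been written to the stream cannot be changed any more, so anything that may still need rearranging should stay in an in-memory subtree until it is final.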

Stefan