convert xhtml back to html

Jim Washington jwashin at vt.edu
Fri Apr 25 08:46:53 EDT 2008


Stefan Behnel wrote:
> bryan rasmussen top-posted:
>   
>> On Thu, Apr 24, 2008 at 9:55 PM, Stefan Behnel <stefan_ml at behnel.de> wrote:
>>     
>>>     from lxml import etree
>>>
>>>     tree = etree.parse("thefile.xhtml")
>>>     tree.write("thefile.html", method="html")
>>>
>>>  http://codespeak.net/lxml
>>>       
>> wow, that's pretty nice there.
>>
>>  Just to know: what's the performance like on XML instances of 1 GB?
>>     
>
> That's a pretty big file, although you didn't mention what kind of XML
> language you want to handle and what you want to do with it.
>
> lxml is pretty conservative in terms of memory:
>
> http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
>
> But the exact numbers depend on your data. lxml holds the XML tree in memory,
> which is a lot bigger than the serialised data. So, for example, if you have
> 2GB of RAM and want to parse a serialised 1GB XML file full of little
> one-element integers into an in-memory tree, get prepared for lunch. With a
> lot of long text string content instead, it might still fit.
>
> However, lxml also has a couple of step-by-step and stream parsing APIs:
>
> http://codespeak.net/lxml/parsing.html#the-target-parser-interface
> http://codespeak.net/lxml/parsing.html#the-feed-parser-interface
> http://codespeak.net/lxml/parsing.html#iterparse-and-iterwalk
>   
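The stream-parsing APIs linked above follow a common pattern: handle each element as its end tag is parsed, then clear it so the in-memory tree never grows. Here is a minimal sketch using the stdlib xml.etree.ElementTree (lxml's iterparse offers a very similar interface; the function name and sample data are just for illustration):

```python
import io
import xml.etree.ElementTree as ET

def count_elements(source, tag):
    """Count <tag> elements without building the whole tree in memory."""
    count = 0
    # iterparse yields (event, element) pairs; by default an event fires
    # when an element's end tag has been parsed, so it is complete.
    for event, elem in ET.iterparse(source):
        if elem.tag == tag:
            count += 1
        elem.clear()  # discard processed content to keep memory flat
    return count

# Simulate a file with 1000 small elements.
xml = b"<items>" + b"<item>1</item>" * 1000 + b"</items>"
print(count_elements(io.BytesIO(xml), "item"))  # -> 1000
```

For multi-gigabyte inputs this keeps memory roughly constant, at the cost of only seeing one element at a time.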
If you repeatedly operate on huge XML files (say, larger than available 
RAM), an XML database may also be a good option.

My current favorite in this realm is Sedna (free, Apache 2.0 license).  
Among other features, it has facilities for indexing within documents 
and collections (faster queries) and transactional sub-document updates 
(safely modify parts of a document without rewriting the entire 
document).  I have been working on a Python interface to it recently 
(zif.sedna, on PyPI).

Regarding RAM consumption, a Sedna database uses approximately 100 MB of 
RAM by default, and that does not change much, no matter how much (or 
how little) data is actually stored. 

For a quick idea of Sedna's capabilities, the Sedna folks have put up an 
on-line demo serving and querying (via XQuery) an extract from Wikipedia 
(in the range of 20 GB of data) using a Sedna server, at 
http://wikidb.dyndns.org/ .  Along with the on-line demo, they provide 
instructions for deploying the technology locally. 

- Jim Washington



