ElementTree should parse string and file in the same way

Thu Jan 3 08:50:05 EST 2008

Fredrik Lundh wrote:
> Stefan Behnel wrote:
> 
>>> My take on the API decision in question was always that a file is
>>> inherently an XML *document*, while a string is inherently an XML
>>> *fragment*.
>>
>> Not inherently, no. I know some people who do web processing with an XML
>> document coming in as a string (from an HTTP request)  /.../
> 
> in which case you probably want to stream the raw XML through the parser
> *as it arrives*, to reduce latency (to do that, either parse from a
> file-like object, or feed data directly to a parser instance, via the
> consumer protocol).

It depends on the abstraction the web framework provides. If it allows you to
do that, especially in an event-driven way, that's obviously the most
efficient implementation (and both ElementTree and lxml support this usage
pattern just fine). However, some frameworks just pass the request content
(such as a POSTed document) in a dictionary or as callback parameters, in
which case there's little room for optimisation.
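
For reference, the "feed data directly to a parser instance" approach
mentioned above can be sketched roughly as follows. This is only a minimal
illustration using the standard library's ElementTree; the parse_chunks()
helper and the hard-coded chunk list are made up for the example and stand in
for whatever the framework delivers as data arrives.

```python
import xml.etree.ElementTree as ET

def parse_chunks(chunks):
    """Feed an iterable of text chunks to a parser, one chunk at a time.

    parse_chunks is a hypothetical helper, not part of ElementTree or lxml.
    """
    parser = ET.XMLParser()
    for chunk in chunks:
        # each call hands the parser only the data that has arrived so far
        parser.feed(chunk)
    # close() finishes parsing and returns the root element
    return parser.close()

# chunk boundaries may fall anywhere, even in the middle of a text node
chunks = ["<root><a>te", "st</a><a>mo", "re</a></root>"]
root = parse_chunks(chunks)
print(root.tag, [a.text for a in root])
```

lxml's parsers offer the same feed()/close() interface, so the sketch should
carry over by swapping the import.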

> also, putting large documents in a *single* Python string can be quite
> inefficient.  it's often more efficient to use lists of string fragments.

That's a pretty general statement. Do you mean in terms of reading from that
string (which at least in lxml is a straightforward extraction of a char*/len
pair that is passed into libxml2), constructing that string (possibly from
partial strings, which temporarily *is* expensive), or just keeping the string
in memory?

At least lxml doesn't benefit from iterating over a list of strings and
passing them to libxml2 step by step, compared to reading from a single
in-memory string. Here are some numbers:

$ cat listtest.py
from lxml import etree

# a list of strings is more memory expensive than a straight string
doc_list = ["<root>"] + ["<a>test</a>"] * 2000 + ["</root>"]
# document construction temporarily ~doubles memory size
doc = "".join(doc_list)

def readlist():
    tree = etree.fromstringlist(doc_list)

def readdoc():
    tree = etree.fromstring(doc)

$ python -m timeit -s 'from listtest import readlist,readdoc' 'readdoc()'
1000 loops, best of 3: 1.74 msec per loop

$ python -m timeit -s 'from listtest import readlist,readdoc' 'readlist()'
100 loops, best of 3: 2.46 msec per loop

The performance difference stays somewhere around 20-30% even for larger
documents. So, as expected, there's a trade-off between temporary memory size,
long-term memory size and parser performance here.
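
The long-term memory side of that trade-off can be estimated with a rough
sketch like the one below. It assumes CPython, where sys.getsizeof() reports
per-object sizes, and it uses distinct fragments (an assumption for the
example; the test script above repeats one and the same string object 2000
times, which a list stores as 2000 references to a single string).

```python
import sys

# distinct fragments, so that every list item is a separate string object
fragments = ["<root>"] + ["<a>item %d</a>" % i for i in range(2000)] + ["</root>"]
doc = "".join(fragments)

# the list itself (its array of references) plus every string it refers to
list_size = sys.getsizeof(fragments) + sum(sys.getsizeof(s) for s in fragments)
# one string object holding the whole document
doc_size = sys.getsizeof(doc)

print("list of fragments: %d bytes" % list_size)
print("single string:     %d bytes" % doc_size)
```

The per-object overhead of many small strings is what makes the list the
larger representation here; the exact figures depend on the Python version
and platform.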

Stefan