[Tutor] Trying to parse a HUGE(1gb) xml file in python
Stefan Behnel
stefan_ml at behnel.de
Tue Dec 21 10:10:52 CET 2010
David Hutto, 21.12.2010 09:55:
> On Tue, Dec 21, 2010 at 3:52 AM, Stefan Behnel wrote:
>> Chris Fuller, 21.12.2010 03:27:
>>>
>>> This isn't XML, it's an abomination of XML. Best to not treat it as XML.
>>> Good thing you're only after one class of tags. Here's what I'd do. I'll
>>> give a general solution, but there are two parameters / four cases that
>>> could make the code simpler, I'll just point them out at the end.
>>>
>>> Iterate over the file descriptor, reading in line-by-line. This will be
>>> slow on a huge file, but probably not so bad if you're only doing it once.
>>
>> Note that it's not unlikely that this is actually *slower* than using a real
>> XML parser:
>
> Or a 'real' language like C or C++ maybe to increase, or in Python's
> case, bypass, the interpreter?

While this may be a little faster than Python code (although I suspect that
benchmarking is needed to prove either way), I doubt that it's worth the
overhead in code writing. If I can write a couple of lines of Python code
that are easy to validate and almost as fast as C code, why would I want to
write and debug hundreds of lines of code in C or C++, just to see that I
need to tune my benchmark to notice the difference?
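
To give an idea of what "a couple of lines of Python code" could look like
with a real XML parser, here is a minimal sketch using the standard library's
xml.etree.ElementTree.iterparse(). The helper name iter_tag_text and the tag
name "item" are made up for illustration, the actual file layout from this
thread is not shown here, and no timings are claimed for it.

```python
import io
import xml.etree.ElementTree as ET


def iter_tag_text(source, tag):
    """Yield the text of every <tag> element found in source.

    source may be a file name or a binary file object. The tag name is
    whatever the caller is interested in ("item" below is hypothetical).
    """
    # iterparse() builds the tree incrementally; an "end" event fires
    # once an element has been read completely.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield elem.text
        # Free the element's text, attributes and children so memory
        # does not grow with the file. The emptied element itself
        # still stays attached to its parent.
        elem.clear()


# Small in-memory stand-in for the huge file.
sample = io.BytesIO(b"<root><item>a</item><other>x</other><item>b</item></root>")
print(list(iter_tag_text(sample, "item")))  # ['a', 'b']
```

For a real run, pass the file name instead of the BytesIO object. Whether
this beats a hand-written line-by-line scan on a given file is exactly the
kind of thing that needs a benchmark.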

But then, people even write XML handling code in Java, where neither
performance nor code size is a suitable argument.

Stefan