Issue with xml iterparse

bfrederi brfredericks at gmail.com
Thu Jun 3 16:44:11 EDT 2010


I am using lxml's iterparse and running into a very obscure error. When
I run iterparse on a file, it will occasionally return an element for
which element.text is None, even though the element clearly has text in it.

I copied and pasted the problem XML into a Python string, used StringIO
to create a file-like object out of it, and ran a test using iterparse.
It produced the expected output and ran perfectly fine. So the problem
only happens when I run iterparse on the actual file.
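
A test along those lines can be sketched as follows. This is only an illustration under stated assumptions: the XML fragment and the `collect_text` helper are made up, it is written for Python 3 (`io.BytesIO` rather than `StringIO`), and it uses the standard library's `xml.etree.ElementTree.iterparse` so it runs without lxml installed. It reads `element.text` only on "end" events, because the ElementTree documentation says the text of an element is not guaranteed to be set yet when its "start" event fires.

```python
import io
from xml.etree.ElementTree import iterparse

# Hypothetical sample with diacritics; not the real ABBYY data.
SAMPLE_XML = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<page><line>caf\u00e9</line><line>na\u00efve</line></page>'
).encode("utf-8")


def collect_text(source):
    """Return (tag, text) pairs for <line> elements, read on "end" events."""
    results = []
    for event, element in iterparse(source, events=("start", "end")):
        # On "start" only the tag and attributes are guaranteed to be present.
        if event == "end" and element.tag == "line":
            results.append((element.tag, element.text))
    return results


print(collect_text(io.BytesIO(SAMPLE_XML)))
```

A small in-memory document may behave the same on "start" and "end" events, so a test like this passing does not rule out a difference on a large file that the parser reads in several chunks.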

So then I tried opening the file, reading the data, turning that data
into a file-like object using StringIO, then running iterparse on it,
and the same problem (element.text == None) occurred.

I even tried this:
import codecs
import StringIO
from lxml.etree import iterparse

f = codecs.open(abbyy_filename, 'r', encoding='utf-8')
file_data = f.read()
file_like_object = StringIO.StringIO(file_data)
for event, element in iterparse(file_like_object, events=("start", "end")):

And I got this Traceback:
Traceback (most recent call last):
  File "abbyyParser/parseAbbyy.py", line 391, in <module>
    extension=options.extension,
  File "abbyyParser/parseAbbyy.py", line 103, in __init__
    self.generate_output_files()
  File "abbyyParser/parseAbbyy.py", line 164, in generate_output_files
    AbbyyDocParse(abby_filename, self.extension, self.output_types)
  File "abbyyParser/parseAbbyy.py", line 239, in __init__
    self.parse_doc(abbyy_filename)
  File "abbyyParser/parseAbbyy.py", line 281, in parse_doc
    for event, element in iterparse(file_like_object, events=("start", "end")):
  File "iterparse.pxi", line 484, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:86333)
TypeError: reading file objects must return plain strings

If I do this:
file_data = f.read().encode("utf-8")

iterparse will run on it, but I still get element.text with a value
of None when I should not.
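
Since the parser wants byte strings and handles the declared encoding itself, one variant worth trying is to skip `codecs.open` and hand iterparse the file opened in binary mode. The sketch below is an illustration rather than a confirmed fix: it assumes Python 3 and the standard library's `iterparse` (lxml's `iterparse` accepts the same `source` and `events` arguments used here), and the temporary file, its contents, and the `text_by_event` helper are all hypothetical. It records `element.text` on both events so that any difference between "start" and "end" shows up.

```python
import os
import tempfile
from xml.etree.ElementTree import iterparse


def text_by_event(path):
    """Map each event name to the element.text values seen for <line> tags."""
    seen = {"start": [], "end": []}
    # Binary mode: the parser decodes the bytes using the XML declaration.
    with open(path, "rb") as xml_file:
        for event, element in iterparse(xml_file, events=("start", "end")):
            if element.tag == "line":
                seen[event].append(element.text)
    return seen


# Hypothetical stand-in for the real file on disk.
data = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<doc><line>\u00fcber</line></doc>'
).encode("utf-8")
with tempfile.NamedTemporaryFile(suffix=".xml", delete=False) as tmp:
    tmp.write(data)
try:
    result = text_by_event(tmp.name)
finally:
    os.remove(tmp.name)

# Text is documented as populated at "end"; at "start" it may still be None.
print(result["end"])
```

If the values collected under "start" contain None while those under "end" do not, the missing text comes from reading it too early rather than from the encoding.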

My XML file does have diacritics in it, but I've put the proper
encoding declaration at the head of the XML file (<?xml version="1.0"
encoding="UTF-8"?>). I've also tried using ElementTree's iterparse, and
I get the same problem even more often with the same files. Any idea
what the problem might be?


