[issue43483] Loss of content in simple (but oversize) SAX parsing

Larry Trammell report at bugs.python.org
Wed Mar 17 12:56:50 EDT 2021


Larry Trammell <ridgerat at nwi.net> added the comment:

Sure...  I'll cut and paste some of the text I was organizing to go into a possible new issue page.

The only relevant documentation I could find was in the "xml.sax.handler" page of the Python 3.9.2 documentation for the Python Standard Library (the wording has been the same through many versions):

-----------
ContentHandler.characters(content) -- The Parser will call this method to report each chunk of character data.  SAX parsers may return all contiguous character data in a single chunk, or they may split it into several chunks...
-----------

As an example, here is a typical snippet taken from the Web page

     https://www.tutorialspoint.com/parsing-xml-with-sax-apis-in-python 

The application example records the tag name "type" in the "CurrentData" member, and shortly thereafter, the "type" tag's content is received:

   # Call when a character is read
   def characters(self, content):
      if self.CurrentData == "type":
         self.type = content

Suppose that the parser receives the following text line from the input file.  

<type>SciFi</type>

Though there seems to be no reason for it, the parser could decide to deliver the content text as "Sc" followed by "iFi".  In that case, the second invocation of the "characters" method would overwrite the characters received in the first invocation, and part of the content text would appear to be "lost."
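
For comparison, here is a minimal sketch of a handler that tolerates split delivery by accumulating chunks and joining them when the element closes. The class name "TypeHandler" and the "_chunks" member are made up for illustration; "CurrentData" and "type" follow the tutorial snippet above. The input is fed in two pieces only to make a split plausible. Whether the parser really reports "Sc" and "iFi" separately depends on the underlying parser and its version, so the sketch does not rely on that.

```python
import xml.sax


class TypeHandler(xml.sax.ContentHandler):
    """Hypothetical handler that collects every chunk of <type> content."""

    def __init__(self):
        super().__init__()
        self.CurrentData = ""
        self._chunks = []   # pieces of text seen so far for the current <type>
        self.type = ""

    def startElement(self, tag, attributes):
        self.CurrentData = tag
        if tag == "type":
            self._chunks = []          # start fresh for each <type> element

    def characters(self, content):
        # May be called once or several times per element; append, never assign.
        if self.CurrentData == "type":
            self._chunks.append(content)

    def endElement(self, tag):
        if tag == "type":
            self.type = "".join(self._chunks)   # content is complete only here
        self.CurrentData = ""


handler = TypeHandler()
parser = xml.sax.make_parser()
parser.setContentHandler(handler)

# Feed the document in two pieces to give the parser a chance to split the text.
parser.feed("<type>Sc")
parser.feed("iFi</type>")
parser.close()

print(handler.type)
```

The design point is that the complete content is known only when "endElement" is called, so that is where the joined value is stored.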

Given how rarely it happens, I suspect that when internal processing reaches the end of a block of buffered text from the input file, the easiest thing to do is to report any fragments of text that happen to remain at the end, no matter how tiny, and start fresh with the next internal buffer. Easy for the implementer, but baffling to the application developer.  And rare enough to elude application testing.
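
One way to check this suspicion on a given installation is to record every chunk the parser delivers for a single oversize element. The following is only a sketch: the "ChunkRecorder" class and the 200,000-character size are arbitrary choices for illustration, and the number and sizes of the chunks reported will vary with the parser and its version.

```python
import xml.sax


class ChunkRecorder(xml.sax.ContentHandler):
    """Hypothetical handler that keeps every chunk passed to characters()."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


text = "x" * 200_000    # arbitrary size, chosen to be larger than typical buffers
document = ("<type>" + text + "</type>").encode("utf-8")

recorder = ChunkRecorder()
xml.sax.parseString(document, recorder)

# How many pieces was the contiguous text delivered in, and how small was the smallest?
print(len(recorder.chunks), "chunk(s); smallest has",
      min(len(c) for c in recorder.chunks), "character(s)")

# Joining the chunks recovers the full text no matter how it was split.
print("".join(recorder.chunks) == text)
```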

----------

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue43483>
_______________________________________
