[Expat-discuss] Split UTF-8 sequence possible?

Mon Nov 10 11:38:45 EST 2003

> Having just overcome the newbie problem of not realizing that expat
> feeds UTF-8 sequences to my handlers, I'm now wondering if
> expat ever splits a multi-byte UTF-8 sequence across two calls to my
> character handler callback.
> 
> For example, say there's a non-ASCII accented character
> in its input character data (however it may have been encoded).
> expat will want to send me a two-byte UTF-8 sequence.  If there's
> only one byte left in the output buffer, will it (1) call my character
> data callback with the buffer one short of capacity, and save the two-byte
> sequence for the next callback, or (2) put the first of the two UTF-8
> bytes in the buffer, call my callback, and then put the second at the
> start of the buffer for the NEXT callback?
> 
> I'm really hoping for #1. Can anybody confirm this?

I am pretty sure Expat reports complete characters, as nothing
else makes sense. There are no output buffer boundaries forced on Expat.
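
One way to check this in your own handler is to assert on every buffer
that it ends on a character boundary. The following is only a sketch:
utf8_incomplete_tail is a hypothetical helper, not part of Expat, and it
looks only at the UTF-8 lead and continuation bit patterns without
validating the text. Note that Expat may still deliver one run of text in
several calls to the handler; the check is only about whether a cut falls
inside a character.

```c
#include <assert.h>  /* for the assert in the usage comment below */
#include <stddef.h>

/*
 * Hypothetical helper (not an Expat function).
 * Returns the number of bytes at the end of s[0..len) that belong to an
 * incomplete UTF-8 sequence, or 0 if the buffer ends on a character
 * boundary. A buffer made up only of continuation bytes is reported as
 * entirely incomplete, since its lead byte must be elsewhere.
 */
static size_t utf8_incomplete_tail(const char *s, size_t len)
{
    size_t i = len;
    size_t back = 0;   /* continuation bytes stepped over so far */
    size_t need;       /* sequence length announced by the lead byte */
    unsigned char lead;

    /* Step back over at most 3 continuation bytes (10xxxxxx). */
    while (i > 0 && back < 3 && ((unsigned char)s[i - 1] & 0xC0) == 0x80) {
        i--;
        back++;
    }
    if (i == 0)
        return back;   /* no lead byte in this buffer */

    lead = (unsigned char)s[i - 1];
    if (lead < 0x80)
        need = 1;                      /* ASCII */
    else if ((lead & 0xE0) == 0xC0)
        need = 2;                      /* 110xxxxx */
    else if ((lead & 0xF0) == 0xE0)
        need = 3;                      /* 1110xxxx */
    else if ((lead & 0xF8) == 0xF0)
        need = 4;                      /* 11110xxx */
    else
        return 0;      /* malformed input; not a split we can identify */

    /* Incomplete only if fewer bytes are present than the lead announces. */
    return (back + 1 < need) ? back + 1 : 0;
}

/*
 * Possible use in a character data handler (names are illustrative):
 *
 *   static void XMLCALL on_chars(void *userData, const XML_Char *s, int len)
 *   {
 *       assert(utf8_incomplete_tail(s, (size_t)len) == 0);
 *       ...
 *   }
 */
```

The buffer passed to the handler is not NUL-terminated, so the length has
to come from the len argument, not from strlen.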

Karl