[Expat-bugs] Expat occasionally cropping multibyte character strings

Karl Waclawek karl at waclawek.net
Sat May 15 17:43:24 CEST 2010


On 15/05/2010 7:30 AM, Juraj Ivančić wrote:
> Expat does not handle multibyte characters correctly.
> Steps to reproduce this behaviour:
>
> 1) You need an input XML file in e.g. UTF-8 encoding which
> contains some multibyte characters (e.g. Cyrillic characters).
>
> 2) Create an XML parser and feed it the input file, but ensure
> that the buffer breaks somewhere in the middle of a multibyte string.
> (To make sure, feed the parser one byte at a time.)
>
> Say input file contains:
> '... <element>Соме валуе</element> ...'
>
> and it gets buffered like this:
>
> Buffer1: '... <element>Соме '
> Buffer2: 'валуе</element> ...'
>
> When it finishes parsing Buffer1, the Expat parser will invoke the
> character data handler with only partial ('Соме ') data, instead of
> waiting for the rest of the input. I think this is a bug as it only
> manifests when multibyte characters appear.

The way you describe it, this is not a bug. Expat does not guarantee
that the text between element tags is reported as a single string.
It would be a bug only if Expat split a single multi-byte character
into two or more parts.
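
For illustration, here is a minimal sketch of the usual way to cope with
this: collect the pieces in the character data handler and join them when
the end tag arrives. It assumes Python's standard xml.parsers.expat
binding rather than the C API, and the names parse_in_chunks, chunks and
texts are made up for this example. How many pieces the handler receives
depends on the Expat version and on how the input is split, so the sketch
relies only on the joined result.

```python
import xml.parsers.expat


def parse_in_chunks(data, chunk_size=1):
    """Feed `data` (bytes) to Expat in small pieces.

    Returns (chunks, texts): every piece of character data exactly as the
    handler received it, and the joined text of each element.
    Hypothetical helper for illustration only.
    """
    chunks = []   # raw pieces, one entry per handler call
    texts = []    # one joined string per closed element
    pending = []  # pieces collected since the last tag

    def on_start(name, attrs):
        # Start collecting afresh for the new element.
        pending.clear()

    def on_char_data(text):
        # May be called several times for one run of text.
        chunks.append(text)
        pending.append(text)

    def on_end(name):
        # The element is complete, so the pieces can be joined now.
        texts.append("".join(pending))
        pending.clear()

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = on_start
    parser.CharacterDataHandler = on_char_data
    parser.EndElementHandler = on_end

    # Split the input without regard to character boundaries.
    for i in range(0, len(data), chunk_size):
        parser.Parse(data[i:i + chunk_size], False)
    parser.Parse(b"", True)  # signal end of input
    return chunks, texts


document = "<element>Соме валуе</element>".encode("utf-8")
chunks, texts = parse_in_chunks(document, chunk_size=1)
```

Here texts holds the complete element text no matter how the handler
calls were split up. The Python binding also offers a buffer_text
attribute on the parser that merges adjacent pieces for you; in C the
application has to do the accumulation itself, along the same lines.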

Karl
