[Expat-discuss] Character Encoding 4 bytes Limitation

Karl Waclawek karl at waclawek.net
Mon Aug 7 17:10:30 CEST 2006


chandan kumar wrote:
> Hi All,
>    
>    
>   The expat doc/reference.html mentions these limitations for character encodings:
>   -----
>   Expat places restrictions on character encodings that it can support by filling in the XML_Encoding structure. include file:
>    
>   2. Characters must be encoded in 4 bytes or less.
>   3. All characters encoded must have Unicode scalar values less than or equal to 65535 (0xFFFF). This does not apply to the built-in support for UTF-16 and UTF-8.
>   ------
>    
>   Some Chinese characters fall beyond this range. Does this mean that Expat cannot parse all Chinese characters?
>   

Expat can parse all Chinese characters as long as they are encoded in 
UTF-16 or UTF-8.
These limitations apply only to encodings that are described to Expat 
through the XML_Encoding structure, not to the built-in UTF-8 and 
UTF-16 support.
Someone has supplied an Expat patch to support the GB2312 encoding. See 
patch #888879.
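
As a rough illustration of the point above, here is a minimal sketch that
feeds Expat a character above 0xFFFF in both UTF-8 and UTF-16. It assumes
Python's standard xml.parsers.expat module (a binding to Expat) rather than
the C API, and the helper name parse_text and the sample character U+20000
are chosen only for this example.

```python
import xml.parsers.expat


def parse_text(document: bytes) -> str:
    """Parse an XML document with Expat and return its character data."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate()
    # Expat may deliver character data in several pieces, so collect them.
    parser.CharacterDataHandler = chunks.append
    parser.Parse(document, True)
    return "".join(chunks)


# U+20000 is a CJK ideograph outside the Basic Multilingual Plane,
# i.e. its scalar value is greater than 0xFFFF.
han = "\U00020000"

utf8_doc = (
    '<?xml version="1.0" encoding="UTF-8"?><t>' + han + "</t>"
).encode("utf-8")

# Python's "utf-16" codec writes a byte order mark, which Expat uses
# to detect the encoding.
utf16_doc = (
    '<?xml version="1.0" encoding="UTF-16"?><t>' + han + "</t>"
).encode("utf-16")

if __name__ == "__main__":
    for name, doc in (("UTF-8", utf8_doc), ("UTF-16", utf16_doc)):
        text = parse_text(doc)
        print(name, "round-trips U+%X:" % ord(han), text == han)
```

Both documents go through the built-in decoders, so the XML_Encoding
restrictions never come into play here.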

>    
>   Is there any expat document providing the list of characters supported? 
>   
There are source code comments in expat.h for the XML_Encoding 
structure, but not a list.

Karl

