Newbie question about text encoding

Steven D'Aprano steve+comp.lang.python at pearwood.info
Thu Mar 5 17:33:41 EST 2015


random832 at fastmail.us wrote:

> On Thu, Mar 5, 2015, at 09:06, Steven D'Aprano wrote:
>> I mostly agree with Chris. Supporting *just* the BMP is non-trivial in
>> UTF-8 and UTF-32, since that goes against the grain of the system. You
>> would have to program in artificial restrictions that otherwise don't
>> exist.
> 
> UTF-8 is already restricted from representing values above 0x10FFFF,
> even though its bit layout can "naturally" represent values up to
> 0x1FFFFF in four bytes, up to 0x3FFFFFF in five bytes, and up to
> 0x7FFFFFFF in six bytes. If
> anything, the BMP represents a natural boundary, since it coincides with
> values that can be represented in three bytes. Likewise, UTF-32 can
> obviously represent values up to 0xFFFFFFFF. You're programming in
> artificial restrictions either way, it's just a question of what those
> restrictions are.

Good points, but they don't greatly change my conclusion. If you are
implementing UTF-8 or UTF-32, it is no harder to deal with code points in
the SMP than those in the BMP.
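
To make that concrete, here is a rough Python 3 sketch, not a production
codec. The names max_code_point and utf8_encode are made up for this
illustration. The first function reproduces the per-length limits quoted
above from the bit layout; the second is a hand-rolled encoder in which an
SMP code point takes the same path as a BMP one, just with one more
continuation byte. The 0x10FFFF and surrogate checks are exactly the kind
of "artificial" restriction under discussion.

```python
def max_code_point(nbytes):
    """Largest value the original UTF-8 bit layout can hold in nbytes (1-6)."""
    if nbytes == 1:
        return 0x7F  # a single byte carries 7 payload bits
    # The lead byte carries (7 - nbytes) payload bits;
    # each of the (nbytes - 1) continuation bytes carries 6.
    bits = (7 - nbytes) + 6 * (nbytes - 1)
    return (1 << bits) - 1


def utf8_encode(cp):
    """Encode one code point to UTF-8 by hand (illustrative sketch only)."""
    # The restrictions imposed by the standard, not by the bit layout:
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value: %#x" % cp)
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        nbytes, lead = 2, 0xC0
    elif cp < 0x10000:
        nbytes, lead = 3, 0xE0   # the rest of the BMP
    else:
        nbytes, lead = 4, 0xF0   # supplementary planes: same logic, one more byte
    out = []
    for _ in range(nbytes - 1):
        out.append(0x80 | (cp & 0x3F))  # low 6 bits go in a continuation byte
        cp >>= 6
    out.append(lead | cp)               # remaining bits go in the lead byte
    return bytes(reversed(out))


# One BMP code point (EURO SIGN) and one SMP code point (GRINNING FACE).
for cp in (0x20AC, 0x1F600):
    # The hand-rolled encoder agrees with the built-in codec in both cases.
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
    # UTF-32 is simply the code point as a 4-byte integer, BMP or not.
    assert chr(cp).encode("utf-32-be") == cp.to_bytes(4, "big")
    print(hex(cp), utf8_encode(cp).hex(), chr(cp).encode("utf-32-be").hex())
```

Note that the only BMP-specific line in utf8_encode is the boundary test
that picks three bytes versus four. Restricting the encoder to the BMP
would mean adding a check, not removing code.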


-- 
Steven



