Newbie question about text encoding

random832 at fastmail.us random832 at fastmail.us
Thu Mar 5 14:59:05 EST 2015


On Thu, Mar 5, 2015, at 09:06, Steven D'Aprano wrote:
> I mostly agree with Chris. Supporting *just* the BMP is non-trivial in
> UTF-8 and UTF-32, since that goes against the grain of the system. You
> would have to program in artificial restrictions that otherwise don't
> exist.

The Unicode standard already restricts UTF-8 from representing values
above 0x10FFFF, whereas the encoding scheme itself can "naturally"
represent values up to 0x1FFFFF in four bytes, up to 0x3FFFFFF in five
bytes, and up to 0x7FFFFFFF in six bytes. If anything, the BMP represents
a natural boundary, since it coincides with the values that can be
represented in at most three bytes. Likewise, UTF-32 can obviously
represent values up to 0xFFFFFFFF. You're programming in artificial
restrictions either way; it's just a question of what those restrictions
are.
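
To make the byte-length boundaries concrete, here is a rough sketch that
computes the largest value each sequence length can carry under the
original one-to-six-byte UTF-8 bit layout. The function name
max_code_point is made up for illustration; it is not part of Python or
of any library.

```python
def max_code_point(n_bytes):
    """Largest value that fits in n_bytes under the original
    one-to-six-byte UTF-8 bit layout (hypothetical helper)."""
    if not 1 <= n_bytes <= 6:
        raise ValueError("n_bytes must be between 1 and 6")
    if n_bytes == 1:
        return 0x7F  # 0xxxxxxx: 7 payload bits
    # The lead byte carries (7 - n_bytes) payload bits and each
    # continuation byte (10xxxxxx) carries 6 more.
    payload_bits = (7 - n_bytes) + 6 * (n_bytes - 1)
    return (1 << payload_bits) - 1


for n in range(1, 7):
    print(n, hex(max_code_point(n)))
```

The three-byte limit comes out as 0xFFFF, which is exactly the top of the
BMP. Python's own codec agrees on where that boundary falls:
len(chr(0xFFFF).encode("utf-8")) is 3, and
len(chr(0x10000).encode("utf-8")) is 4.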


