Newbie question about text encoding

random832 at fastmail.us
Fri Mar 6 09:27:42 EST 2015


On Fri, Mar 6, 2015, at 09:11, Chris Angelico wrote:
> To prevent people from putting three paragraphs of lipsum in and
> calling it a username.

Limiting by UTF-8 bytes or UTF-16 code units works just as well for that.
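
As a rough sketch of what such a limit could look like in Python 3: the
helper names and the 64-unit cap below are made up for illustration and do
not come from any real system.

```python
def utf8_len(s: str) -> int:
    """Length of s in UTF-8 bytes."""
    return len(s.encode("utf-8"))


def utf16_len(s: str) -> int:
    """Length of s in UTF-16 code units (two bytes per unit, no BOM)."""
    return len(s.encode("utf-16-le")) // 2


def within_limit(name: str, max_units: int = 64) -> bool:
    """Hypothetical username check: cap the name at max_units UTF-16 units."""
    return utf16_len(name) <= max_units
```

Either measure stops someone from pasting in three paragraphs; the cap is
just a slightly different number of user-visible characters depending on
the script.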

> So you truncate to the desired length, then if the first character of
> the trimmed-off section is a combining mark (based on its Unicode
> character types), you keep trimming until you've removed a character
> which isn't. Then, if you no longer have any content whatsoever,
> reject the name. Simple.
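
For reference, the procedure quoted above might be sketched like this in
Python 3. The function names are invented here, "combining mark" is assumed
to mean any general category starting with "M", and this is not full
grapheme-cluster segmentation (it ignores ZWJ sequences, for example).

```python
import unicodedata


def is_combining_mark(ch: str) -> bool:
    """True for general categories Mn, Mc and Me."""
    return unicodedata.category(ch).startswith("M")


def truncate_name(name: str, limit: int) -> str:
    """Truncate to at most `limit` code points without orphaning marks."""
    cut = min(limit, len(name))
    # If the first trimmed-off character is a combining mark, its base
    # character is still in the kept part: keep trimming until a
    # non-mark character has been removed.
    if cut < len(name) and is_combining_mark(name[cut]):
        while cut > 0:
            cut -= 1
            if not is_combining_mark(name[cut]):
                break
    result = name[:cut]
    if not result:
        raise ValueError("no content left after truncation")
    return result
```

Note that the loop works on code points, so it needs exactly the same
care whether the string is stored as UTF-32 or anything else.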

My entire point was that UTF-32 doesn't save you from that, so it cannot
be called a deficiency of UTF-16. There are very few problems for which
"count of Unicode code points" is the only right answer: problems that
UTF-32 handles adequately but that naive use of UTF-16 meaningfully
breaks, to the point where UTF-16 is something you have to be "safe"
from.
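
A small illustration of the parallel, as a sketch only (both helper names
are made up): cutting by code point splits a combining sequence in the
same way that cutting by UTF-16 code unit splits a surrogate pair.

```python
import unicodedata


def code_point_truncate(s: str, n: int) -> str:
    """Naive truncation to n code points (what UTF-32 indexing gives you)."""
    return s[:n]


def utf16_truncate(s: str, n: int) -> str:
    """Naive truncation to n UTF-16 code units.

    errors="replace" is used so that a split surrogate pair is turned
    into a replacement character instead of raising.
    """
    raw = s.encode("utf-16-le")[: 2 * n]
    return raw.decode("utf-16-le", errors="replace")


accented = "e\u0301"      # 'e' + COMBINING ACUTE ACCENT: two code points
emoji = "\U0001F600"      # one code point, two UTF-16 code units

# Code-point truncation silently drops the accent from the user's "é".
stripped = code_point_truncate(accented, 1)

# UTF-16 truncation damages the emoji; same kind of bug, different input.
damaged = utf16_truncate(emoji, 1)
```

In both cases the fix is to truncate on a boundary that matters to the
user, not to pick a different encoding.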


