Newbie question about text encoding

Fri Mar 6 09:11:20 EST 2015

On Sat, Mar 7, 2015 at 1:03 AM,  <random832 at fastmail.us> wrote:
> On Fri, Mar 6, 2015, at 08:39, Chris Angelico wrote:
>> Number of code points is the most logical way to length-limit
>> something. If you want to allow users to set their display names but
>> not to make arbitrarily long ones, limiting them to X code points is
>> the safest way (and preferably do an NFC or NFD normalization before
>> counting, for consistency);
>
> Why are you length-limiting it? Storage space? Limit it in whatever
> encoding they're stored in. Why are combining marks "pathological" but
> surrogate characters not? Display space? Limit it by columns. If you're
> going to allow a Japanese user's name to be twice as wide, you've got a
> problem when you go to display it.

To prevent people from putting three paragraphs of lipsum in and
calling it a username.

>> this means you disallow pathological cases
>> where every base character has innumerable combining marks added.
>
> No it doesn't. If you limit it to, say, fifty, someone can still post
> two base characters with twenty combining marks each. If you actually
> want to disallow this, you've got to do more work. You've disallowed
> some of the pathological cases, some of the time, by coincidence. And
> limiting the number of UTF-8 bytes, or the number of UTF-16 code points,
> will accomplish this just as well.

They can, but then they're limited to two base characters. They can't
have fifty base characters with twenty combining marks each. That's
the point.

> Now, if you intend to _silently truncate_ it to the desired length, you
> certainly don't want to leave half a character in, of course. But who's
> to say the base character plus first few combining marks aren't also
> "half a character"? If you're _splitting_ a string, rather than merely
> truncating it, you probably don't want those combining marks at the
> beginning of part two.

So you truncate to the desired length, then if the first character of
the trimmed-off section is a combining mark (based on its Unicode
character types), you keep trimming until you've removed a character
which isn't. Then, if you no longer have any content whatsoever,
reject the name. Simple.

ChrisA