How to waste computer memory?

Sat Mar 19 05:31:45 EDT 2016

Steven D'Aprano <steve at pearwood.info>:

> One thing that NTFS gets right is that all path names are guaranteed
> to be well-formed, valid Unicode. I believe that they are stored in
> UTF-16, and unlike the ext file systems used on Linux, they are not
> arbitrary bytes.

<URL: https://msdn.microsoft.com/en-us/library/windows/desktop/dd317748%28v=vs.85%29.aspx>
states that NTFS filenames disallow '\', '/', '.', '?', '*' as well as
'¥'. Apparently the ban on the yen symbol isn't enforced by the FS.

I haven't found a direct statement on whether NTFS internally enforces
well-formed UTF-16 or simply stores arbitrary 16-bit code units, as
UCS-2 would.

<URL: https://msdn.microsoft.com/en-us/library/windows/desktop/dd374069%28v=vs.85%29.aspx>:

   Using the surrogate mechanism, UTF-16 can support all 1,114,112
   potential Unicode characters.

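But Unicode doesn't contain 1,114,112 characters. That figure counts
every code point from U+0000 through U+10FFFF, including the surrogate
code points U+D800 through U+DFFF, which are not characters and
definitely cannot be encoded using UTF-anything.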
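
The arithmetic is easy to check. A minimal Python 3 sketch (the
constant names are mine, chosen for illustration) that counts the code
points and shows how UTF-16 spends a surrogate pair on one
supplementary character:

```python
# Code point arithmetic for the Unicode code space (Python 3).
TOTAL_CODE_POINTS = 0x10FFFF + 1      # U+0000 .. U+10FFFF
SURROGATES = 0xDFFF - 0xD800 + 1      # U+D800 .. U+DFFF
SCALAR_VALUES = TOTAL_CODE_POINTS - SURROGATES

print(TOTAL_CODE_POINTS)   # 1114112
print(SURROGATES)          # 2048
print(SCALAR_VALUES)       # 1112064

# A supplementary character (beyond U+FFFF) becomes two 16-bit code
# units in UTF-16: a high surrogate followed by a low surrogate.
pair = "\U0001F600".encode("utf-16-be")
print(pair.hex())          # d83dde00
```

So the surrogates are the encoding mechanism, not something the
mechanism can itself encode.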

Furthermore, the page notes:

   Note: Windows 2000 introduces support for basic input, output, and
   simple sorting of supplementary characters. However, not all system
   components are compatible with supplementary characters.

(Somewhat related: Python doesn't enforce well-formed Unicode either,
because a Python str may contain lone surrogate code points.)
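
Python's tolerance for surrogates can be seen directly. A small Python
3 sketch, with can_encode as a helper name of my own: the str is
accepted, the strict codecs refuse it, and the "surrogateescape" error
handler (PEP 383) uses lone surrogates to round-trip undecodable bytes
such as a Latin-1 filename on a Linux file system:

```python
lone = "\ud800"    # a lone high surrogate; Python accepts it in a str

def can_encode(text, encoding):
    """Return True if text encodes under the codec's strict rules."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

print(len(lone))                    # 1
print(can_encode(lone, "utf-8"))    # False
print(can_encode(lone, "utf-16"))   # False
print(can_encode(lone, "utf-32"))   # False

# Bytes that are not valid UTF-8 are smuggled through a str as lone
# surrogates and restored unchanged on the way back out.
raw = b"caf\xe9"                                  # Latin-1, not UTF-8
name = raw.decode("utf-8", "surrogateescape")
print(ascii(name))                                # 'caf\udce9'
print(name.encode("utf-8", "surrogateescape") == raw)   # True
```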

> I believe that HFS+ on Apple Macs goes one step further and guarantees
> that paths are always fully normalised, so that it's impossible to
> have (e.g.) two files ã (U+00E3 LATIN SMALL LETTER A WITH TILDE) and ã
> (U+0061 LATIN SMALL LETTER A + U+0303 COMBINING TILDE) in the same
> directory.
>
> Unfortunately, backwards compatibility is holding Linux file systems
> back...
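
The two spellings in the quoted example can be compared with the
standard unicodedata module. This is only a sketch of what
normalization does to the strings; it does not model the rules any
particular file system applies:

```python
import unicodedata

precomposed = "\u00e3"    # LATIN SMALL LETTER A WITH TILDE
decomposed = "a\u0303"    # LATIN SMALL LETTER A + COMBINING TILDE

# Without normalization the two names are different strings and
# different byte sequences.
print(precomposed == decomposed)           # False
print(precomposed.encode("utf-8").hex())   # c3a3
print(decomposed.encode("utf-8").hex())    # 61cc83

# After normalization to a common form they compare equal.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)
print(nfc == precomposed)                  # True
print(nfd == decomposed)                   # True
```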

Linux got lucky by not jumping the gun. We are still waiting for the
dust to settle.

Unicode made several (understandable but grave) mistakes along the way:

   * UCS-2

   * supplementary code points

   * BOM

   * endianness

   * normalization
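
To make the BOM and endianness items concrete, here is a short Python 3
sketch (variable names are illustrative) showing that the same
character has two possible UTF-16 byte orders, that the generic
"utf-16" codec prepends a BOM to say which one was used, and that UTF-8
has no byte order to get wrong:

```python
import codecs

text = "A"

big = text.encode("utf-16-be")        # explicit big-endian, no BOM
little = text.encode("utf-16-le")     # explicit little-endian, no BOM
print(big.hex())                      # 0041
print(little.hex())                   # 4100

# The generic codec writes a BOM followed by the platform's native
# byte order, so the exact bytes depend on the machine.
with_bom = text.encode("utf-16")
print(with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))  # True
print(with_bom.decode("utf-16") == text)                           # True

# UTF-8 is a byte-oriented encoding, so there is only one answer.
print(text.encode("utf-8").hex())     # 41
```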

We still don't know if the final result will be UCS-4 everywhere (with
all 2**32 code points allowed?!) or UTF-8 everywhere.
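
For a rough sense of the trade-off, this Python 3 sketch (sizes is a
helper name of my own) prints the bytes per character under each
encoding, and shows that Python itself already refuses code points
beyond U+10FFFF:

```python
def sizes(ch):
    """Bytes needed for one character in UTF-8, UTF-16 and UTF-32."""
    return (len(ch.encode("utf-8")),
            len(ch.encode("utf-16-le")),
            len(ch.encode("utf-32-le")))

print(sizes("a"))             # (1, 2, 4)
print(sizes("\u00e3"))        # (2, 2, 4)
print(sizes("\u20ac"))        # (3, 2, 4)
print(sizes("\U0001F600"))    # (4, 4, 4)

# chr() accepts U+10FFFF but rejects anything past it.
try:
    chr(0x110000)
    beyond_allowed = True
except ValueError:
    beyond_allowed = False
print(beyond_allowed)         # False
```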

Marko