Text-mode apps (Was: Who are the "spacists"?)

Sun Mar 26 18:09:43 EDT 2017

On Sun, Mar 26, 2017 at 6:57 PM, Chris Angelico <rosuav at gmail.com> wrote:
>
> In actual UCS-2, surrogates are entirely disallowed; in UTF-16, they *must* be
> correctly paired.

Strictly speaking, UCS-2 disallows codes that aren't defined by the
standard, but the kernel couldn't be that restrictive. Unicode was a
moving target in the period that NT was developed (1988-93). The
object manager simply allows any 16-bit code in object names, except
its path separator, backslash. Since a UNICODE_STRING is counted, even
NUL is allowed in object names. But that's uncommon and should be
avoided since the user-mode API uses null-terminated strings.
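
To illustrate why an embedded NUL is legal but awkward, here is a
minimal Python sketch. It only models the idea of a counted string
next to a null-terminated read; make_counted and read_null_terminated
are hypothetical helpers, not part of any Windows API.

```python
import struct

def make_counted(name):
    """Model a counted string: (length in bytes, buffer of 16-bit codes)."""
    data = name.encode("utf-16-le", "surrogatepass")
    return len(data), data

def read_null_terminated(buffer):
    """Read 16-bit codes up to the first NUL, as a null-terminated API would."""
    codes = struct.unpack("<%dH" % (len(buffer) // 2), buffer)
    end = codes.index(0) if 0 in codes else len(codes)
    raw = struct.pack("<%dH" % end, *codes[:end])
    return raw.decode("utf-16-le", "surrogatepass")

length, buf = make_counted("spam\0eggs")
print(length)                     # 18 (9 codes, the NUL included)
print(read_null_terminated(buf))  # spam
```

The counted form keeps all 9 codes, while a caller that stops at the
first NUL only ever sees "spam".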

The file-system runtime library further restricts this by reserving
NUL, ASCII control codes, forward slash, pipe, and the wildcard
characters asterisk, question mark, double quote, less than, and
greater than. The rules are loosened for NTFS named streams, which
only reserve NUL, forward slash, and backslash.
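
As a rough sketch, the two reserved sets described above can be written
down in Python as follows. The names are made up for illustration, and
this covers only the characters listed here. It is not a complete
validator (for example, it ignores the colon that separates a stream
name from a file name).

```python
# Reserved characters, taken from the description above.
# Backslash is included for file names because it is the path separator.
CONTROL = {chr(c) for c in range(32)}      # NUL and ASCII control codes
WILDCARDS = set('*?"<>')
RESERVED_NAME = CONTROL | WILDCARDS | set("/|\\")
RESERVED_STREAM = set("\0/\\")

def reserved_in(name, stream=False):
    """Return the sorted reserved characters found in a name component."""
    reserved = RESERVED_STREAM if stream else RESERVED_NAME
    return sorted(set(name) & reserved)

print(reserved_in("spam?.txt"))               # ['?']
print(reserved_in("spam?", stream=True))      # []
print(reserved_in("spam\ud800.txt"))          # []
```

Note that the lone surrogate in the last call isn't flagged. Nothing in
either set is concerned with whether the name is valid UTF-16.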

>> Windows file systems are also UCS-2. For the most part it's not an
>> issue since the source of text and filenames will be valid UTF-16.
>
> I'm actually not sure on that one. Poking around on both Stack
> Overflow and MSDN suggests that NTFS does actually use UTF-16, which
> implies that lone surrogates should be errors, but I haven't proven
> this. In any case, file system encoding is relatively immaterial; it's
> file system *API* encoding that matters, and that means the
> CreateFileW function and its friends:

Sure, the file system itself can use any encoding, but Microsoft uses a
permissive UCS-2 in its file systems. The API uses 16-bit WCHARs, and
except for a relatively small set of codes (assuming it uses the
FsRtl), the system generally doesn't care about the values. Let's
review the major actors.

CreateFile uses the runtime library in ntdll.dll to fill in an
OBJECT_ATTRIBUTES [1] with a UNICODE_STRING [2]. This is where the
current-directory handle is set as the attributes' RootDirectory handle
for relative paths; where slash is replaced with backslash; and where
weird MS-DOS rules are applied, such as DOS device names and trimming
trailing spaces. Once it has a native object attributes record, it
calls the real system call NtCreateFile [3]. In kernel mode this in
turn calls the I/O manager function IoCreateFile [4], which creates an
open packet and calls the object manager function ObOpenObjectByName.

Now it's time for path parsing. In the normal case the system
traverses several object directories and object symbolic links before
finally arriving at an I/O device (e.g. \??\C: => \Global??\C: =>
\Device\HarddiskVolume2). Parsing the rest of the path is in the hands
of the I/O manager via the Device object's ParseProcedure. The I/O
manager creates a File object and an I/O request packet (IRP) for the
major function IRP_MJ_CREATE [5] and calls the driver for the device
stack via IoCallDriver [6]. If the device is a volume that's managed
by a file-system driver (e.g. ntfs.sys), the file system parses the
remaining path to open or create the directory/file/stream and
complete the IRP. The object manager creates a handle for the File
object in the handle table of the calling process, and this handle
value is finally passed back to the caller.

[1]: https://msdn.microsoft.com/en-us/library/ff557749
[2]: https://msdn.microsoft.com/en-us/library/ff564879
[3]: https://msdn.microsoft.com/en-us/library/ff566424
[4]: https://msdn.microsoft.com/en-us/library/ff548418
[5]: https://msdn.microsoft.com/en-us/library/ff548630
[6]: https://msdn.microsoft.com/en-us/library/ff548336
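
The name traversal described above can be modeled with a toy resolver
in Python. This is only a sketch under loud assumptions: the tables
mirror the single example path given above, treating \?? as a plain
link to \Global?? is a simplification, and resolve is a hypothetical
helper rather than anything the object manager exposes.

```python
# Toy namespace: just enough to follow the example path from the text.
SYMLINKS = {
    "\\??": "\\Global??",                        # simplification
    "\\Global??\\C:": "\\Device\\HarddiskVolume2",
}
DEVICES = {"\\Device\\HarddiskVolume2"}

def resolve(path, max_reparse=32):
    """Return (device, remaining path) for an absolute object path."""
    for _ in range(max_reparse):
        parts = path.split("\\")[1:]             # backslash is the only separator
        prefix = ""
        for i, name in enumerate(parts):
            prefix += "\\" + name
            remaining = "".join("\\" + p for p in parts[i + 1:])
            if prefix in DEVICES:
                # The rest of the path belongs to the device's parser.
                return prefix, remaining
            if prefix in SYMLINKS:
                # Substitute the link target and reparse from the root.
                path = SYMLINKS[prefix] + remaining
                break
        else:
            raise LookupError(path)
    raise RecursionError("too many reparses")

print(resolve("\\??\\C:\\Temp\\spam.txt"))
```

For the example path this returns the device \Device\HarddiskVolume2
and the remaining path \Temp\spam.txt, which is the part that would be
handed to the file system.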

The object manager only cares about its path separator, backslash,
until it arrives at an object type that it doesn't manage, such as a
Device object. If a file system uses the FsRtl, then the remaining
path is subject to Windows file-system rules. It would be ill-advised
to diverge from these rules.

> I *think* it's the naive (and very common) hybrid of UCS-2 and UTF-16

It's just the way the system evolved over time. UTF-16 wasn't
standardized until 1996, circa NT 4.0. Windows started integrating it
around NT 5 (Windows 2000), primarily for the GUI controls in the
windowing system that directly affect text processing for most
applications. It was good enough to leave most of the lower layers of
the system passively naive when it comes to UTF-16.
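
To make "passively naive" concrete from the Python side, here is a
small sketch of a string that is a perfectly good sequence of 16-bit
codes but isn't valid UTF-16. It uses only the standard codecs; the
sample name is arbitrary.

```python
lone = "spam\udc00"     # ends with an unpaired low surrogate

# Strict UTF-16 refuses to encode the lone surrogate.
try:
    lone.encode("utf-16-le")
except UnicodeEncodeError as exc:
    print("strict UTF-16 rejects it:", exc.reason)

# The "surrogatepass" handler passes the 16-bit code through as-is.
raw = lone.encode("utf-16-le", "surrogatepass")
print(raw)
print(raw.decode("utf-16-le", "surrogatepass") == lone)   # True
```

A layer that just stores and compares 16-bit codes behaves like the
second case: it round-trips the name without ever checking surrogate
pairing.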