Python 3.2 has some deadly infection

Rustom Mody rustompmody at gmail.com
Thu Jun 5 23:11:07 EDT 2014


On Friday, June 6, 2014 4:22:22 AM UTC+5:30, Chris Angelico wrote:
> On Fri, Jun 6, 2014 at 8:35 AM, Rustom Mody wrote:
> > And then ask how Linux (in your and Stallman's sense) differs from
> > Windows in how the filesystem handles things like filenames?

> What are you testing of the kernel? Most of the kernel doesn't
> actually work with text at all - it works with integers, buffers of
> memory (which could be seen as streams of bytes, but might be almost
> anything), process tables, open file handles... but not usually text.
> To you, "EAGAIN" might be a bit of text, but to the Linux kernel, it's
> an integer (11 decimal, if I recall correctly). Is that some fancy new
> form of encoding? :)
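
To make the quoted point concrete from the Python side, here is a minimal sketch. It assumes only the standard errno and os modules; the numeric value is platform dependent, so nothing is hard-coded, and the name looked up may be an alias on some systems.

```python
import errno
import os

# The symbolic name is a convenience for humans; the value itself is a plain int.
code = errno.EAGAIN

# Reverse lookup from number to name. On platforms where EAGAIN and
# EWOULDBLOCK share a value, either name may be returned here.
name = errno.errorcode[code]

# The human-readable message is produced in userspace from the integer.
message = os.strerror(code)

print(code, name, message)
```

The text only appears once something outside the kernel maps the integer back to a name or a message.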


| Thanks to the properties of UTF-8 encoding, the Linux kernel, the
| innermost and lowest-level part of the operating system, can
| handle Unicode filenames without even having the user tell it
| that UTF-8 is to be used. All character strings, including
| filenames, are treated by the kernel in such a way that THEY
| APPEAR TO IT ONLY AS STRINGS OF BYTES. Thus, it doesn't care and
| does not need to know whether a pair of consecutive bytes should
| logically be treated as two characters or a single one. The only
| risk of the kernel being fooled would be, for example, for a
| filename to contain a multibyte Unicode character encoded in such
| a way that one of the bytes used to represent it was a slash or
| some other character that has a special meaning in file
| names. Fortunately, as we noted, UTF-8 never uses ASCII
| characters for encoding multibyte characters, so neither the
| slash nor any other special character can appear as part of one
| and therefore there is no risk associated with using Unicode in
| filenames.
|  
| Filesystems found on Microsoft Windows machines (NTFS and FAT)
| are different in that THEY STORE FILENAMES ON DISK IN SOME
| PARTICULAR ENCODING. The kernel must translate this encoding to
| the system encoding, which will be UTF-8 in our case.
|  
| If you have Windows partitions on your system, you will have to
| take care that they are mounted with correct options. For FAT and
| ISO9660 (used by CD-ROMs) partitions, option utf8 makes the
| system translate the filesystem's character encoding to
| UTF-8. For NTFS, nls=utf8 is the recommended option (utf8 should
| also work).

[Emphases mine]

From: http://michal.kosmulski.org/computing/articles/linux-unicode.html
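
The claim that no byte of a multibyte UTF-8 sequence falls in the ASCII range can be checked from Python. The sketch below is illustrative only: the helper names are made up for this example, the character U+012F is chosen because its code point ends in 0x2F (the slash), and UTF-16-LE is used as the contrasting encoding on the assumption that it stands in for the on-disk form the article attributes to NTFS.

```python
def ascii_bytes_in(encoded: bytes) -> list:
    """Return every byte of `encoded` that lies in the ASCII range (< 0x80)."""
    return [b for b in encoded if b < 0x80]


ch = "\u012f"  # LATIN SMALL LETTER I WITH OGONEK; code point ends in 0x2F

utf8 = ch.encode("utf-8")       # two bytes, both >= 0x80
utf16 = ch.encode("utf-16-le")  # two bytes, the first of which is 0x2F

# A byte-oriented path parser would see a separator in one but not the other.
slash_in_utf8 = b"/" in utf8
slash_in_utf16 = b"/" in utf16


def utf8_is_ascii_free(lo: int = 0x80, hi: int = 0x11000) -> bool:
    """Spot-check that non-ASCII code points in [lo, hi) encode without ASCII bytes."""
    for cp in range(lo, hi):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates cannot be encoded as UTF-8
        if ascii_bytes_in(chr(cp).encode("utf-8")):
            return False
    return True


print(utf8, slash_in_utf8)
print(utf16, slash_in_utf16)
print(utf8_is_ascii_free())
```

The default range covers two-, three- and four-byte sequences; it is a spot check, not a proof. The guarantee itself comes from the UTF-8 design, where every lead and continuation byte has its high bit set.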


