[Python-Dev] When should pathlib stop being provisional?

Wed Apr 6 03:14:53 EDT 2016

On Apr 6, 2016 1:26 AM, "Chris Angelico" <rosuav at gmail.com> wrote:
>
> On Wed, Apr 6, 2016 at 3:37 PM, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> > Chris Angelico writes:
> >
> >  > Outside of deliberate tests, we don't create files on our disks
> >  > whose names are strings of random bytes;
> >
> > Wishful thinking.  First, names made of control characters have often
> > been deliberately used by miscreants to conceal their warez.  Second,
> > in some systems it's all too easy to create paths with components in
> > different locales (the place I've seen it most frequently is in NFS
> > mounts).  I think that's much less true today, but perhaps that's only
> > because my employer figured out that it was much less pain if system
> > paths were pure ASCII so that it mostly didn't matter what encoding
> > users chose for their subtrees.
>
> Control characters are still characters, though. You can take a
> bytestring consisting of byte values less than 32, decode it as UTF-8,
> and have a series of codepoints to work with.
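
To make the point above concrete, here is a minimal sketch (my own illustration, not code from the thread) showing that bytes below 32 are valid single-byte UTF-8 and decode to the matching code points:

```python
# Bytes 0x00-0x1F are ASCII control codes, and ASCII is a subset of UTF-8,
# so each byte decodes to exactly one code point with the same value.
raw = bytes(range(32))
text = raw.decode("utf-8")

code_points = [ord(ch) for ch in text]

# The round trip back to bytes is lossless.
round_tripped = text.encode("utf-8")
```

Note that this only shows the bytes are decodable; whether a file system accepts such names (NUL, for one, is not allowed in POSIX file names) is a separate question.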
>
> If your employer has "solved" the problem by restricting system paths
> to ASCII, that's a fine solution for a single system with a single
> ASCII-compatible encoding; a better solution is to mandate UTF-8 as
> the file system encoding, as that's what most people are expecting
> anyway.
>
> > It remains important to be able to handle nearly arbitrary bytestrings
> > in file names as far as I can see.  Please note that 100 million
> > Japanese and 1 billion Chinese by and large still prefer their
> > homegrown encodings (plural!!) to Unicode, while many systems are now
> > defaulting filenames to UTF-8.  There's plenty of room remaining for
> > copying bytestrings to arguments of open and friends.
>
> Why exactly do they prefer these other encodings? Are they
> representing characters that Unicode doesn't contain? If so, we have a
> fundamental problem (no Python program is going to be able to cope
> with these, without a third party library or some stupid mess of local
> code); if not, you can always represent it as Unicode and encode it as
> UTF-8 when it reaches the file system. Re-encoding is something that's
> easy when you treat something as text, and impossible when you treat
> it as bytes.
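
A small sketch of the re-encoding argument, assuming the legacy encoding of the name is known (the file name here is made up for illustration):

```python
# One file name, as text.
name = "日本語.txt"

# The same text yields different bytes under different legacy encodings.
sjis_bytes = name.encode("shift_jis")
eucjp_bytes = name.encode("euc_jp")

# Treated as text, re-encoding is trivial: decode with the known
# encoding, then encode as UTF-8.
utf8_bytes = sjis_bytes.decode("shift_jis").encode("utf-8")

# Treated as opaque bytes, there is nothing to re-encode: the Shift_JIS
# bytes are not even valid UTF-8.
try:
    sjis_bytes.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False
```

The catch, of course, is the "known encoding" assumption; the bytes alone do not say which codec produced them.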
>
> So far, you're still actually agreeing with me: paths are *text*, but
> sometimes we don't know the encoding (and that's a problem to be
> solved).
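
For the "we don't know the encoding" case, Python 3 already has the surrogateescape error handler (PEP 383). A minimal sketch, using an explicit codec so it does not depend on the platform's file system encoding (the byte string is a made-up example):

```python
# Latin-1 encoded "café.txt"; the 0xE9 byte is not valid UTF-8 here.
raw = b"caf\xe9.txt"

# surrogateescape maps each undecodable byte 0xNN to the lone
# surrogate U+DCNN instead of raising.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly,
# so the name can still be handed back to open() and friends.
restored = text.encode("utf-8", errors="surrogateescape")
```

So the path can be handled as str throughout, at the cost of carrying lone surrogates that cannot be printed or encoded without the same handler.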

Re: bytestrings, Unicode, and encodings after e.g. os.path.split / Path.parts:

from "[Python-ideas] Type hints for text/binary data in Python 2+3 code"

https://mail.python.org/pipermail/python-ideas/2016-March/038869.html

>> would/will it be possible to use typing.Text as a base class for
>> even-more abstract string types

https://mail.python.org/pipermail/python-ideas/2016-March/039016.html

>> * Text.encoding
>> * Text.lang (urn:ietf:rfc:3066)
... forgot to CC:
>> * https://tools.ietf.org/html/rfc5646
>>   "Tags for Identifying Languages"
>>   urn:ietf:rfc:5646

Is this (Path) a narrower case of string types (#strypes), in that after
transformations we want to preserve string metadata such as the encoding?

I'd vote for
* adding DirEntry.__path__ as a proxy to DirEntry.path
* standardizing on __path__ (over .path)
  * because this operation *is* fundamentally similar to e.g. __str__
    * possible names for a helper: operator.path, pathify, pathifize
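
A rough sketch of what I mean, by analogy with str() and __str__. Everything here is hypothetical: FakeDirEntry is a stand-in (not os.DirEntry), pathify is just one of the candidate names, and whether __path__ should be a method or a property is an open choice (a method is assumed below):

```python
class FakeDirEntry:
    """Hypothetical stand-in for os.DirEntry, for illustration only."""

    def __init__(self, path):
        self.path = path

    def __path__(self):
        # Proxy to the existing .path attribute.
        return self.path


def pathify(obj):
    """Hypothetical helper: return obj as a str or bytes path.

    str and bytes pass through unchanged; anything else must define
    __path__ on its type, much as str() relies on __str__.
    """
    if isinstance(obj, (str, bytes)):
        return obj
    method = getattr(type(obj), "__path__", None)
    if method is None:
        raise TypeError(
            "expected str, bytes, or an object with __path__, not "
            + type(obj).__name__
        )
    return method(obj)


entry_path = pathify(FakeDirEntry("/tmp/example"))
```

Looking the method up on the type rather than the instance mirrors how other special methods are resolved; that detail is a design guess, not something settled in this thread.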

>
> ChrisA