[Python-ideas] Iterating non-newline-separated files should be easier

Sun Jul 20 05:58:58 CEST 2014

On Saturday, July 19, 2014 6:42 PM, Nick Coghlan <ncoghlan at gmail.com> wrote:

> On 20 July 2014 11:31, Chris Angelico <rosuav at gmail.com> wrote:
>>  On Sun, Jul 20, 2014 at 11:23 AM, Nick Coghlan <ncoghlan at gmail.com> wrote:
>>>  At present, I'm genuinely unclear on
>>>  why someone would ever want to pass the "-0" option to the other UNIX
>>>  utilities, which then makes it very difficult to have a sensible
>>>  discussion on how we should address that use case in Python.
>> 
>>  That one's easy. What happens if you use 'find' to list files, and
>>  those files might have \n in their names? You need another sep.
> 
> Yes, but having a newline in a filename is sufficiently weird that I
> find it hard to imagine a scenario where "fix the filenames" isn't a
> better answer. Hence why I think the PEP needs to explain why the UNIX
> utilities considered this use case sufficiently non-obscure to add
> explicit support for it, rather than just assuming that the
> obviousness of the use case can be taken for granted.

First, why is it so odd to have newlines in filenames? It used to be pretty common on Classic Mac. Sure, they're not too common nowadays, but that's because they're illegal on DOS/Windows, and because the shell on Unix systems makes them a pain to deal with, not because there's something inherently nonsensical about the idea, any more than there is about filenames with spaces, non-ASCII characters, or more than 255 characters.

Second, "fix the filenames" is almost _never_ a better answer. If you're publishing a program for other people to use, you want to document that it won't work on some perfectly good files, and close their bugs as "Not a bug, rename your files if you want to use my software"? If the files are on a read-only filesystem or a slow tape backup, you really want to copy the entire filesystem over just so you can run a script on it?

Also, even if "fix the filenames" were the right answer, you need to write a tool to do that, and why shouldn't it be possible to use Python for that tool? (In fact, one of the scripts I wanted this feature for is a replacement for the traditional rename tool (http://plasmasturm.org/code/rename/). I mainly wanted to let people use regular expressions without letting them run arbitrary Perl code, as rename -e does, but also, I couldn't figure out how to rename "foo" to "Foo" on a case-preserving-but-insensitive filesystem in Perl, and I know how to do it in Python.)

At any rate, there are decades of tradition behind using -print0, and that's not going to change just because Python isn't as good as other languages at dealing with it. The GNU find documentation (http://linux.die.net/man/1/find) explicitly recommends, in multiple places, using -print0 instead of -print whenever possible. (For example, in the summary near the top, "If no expression is given, the expression -print is used (but you should probably consider using -print0 instead, anyway).")

And part of the reason for that is that many other tools, like xargs, split on any whitespace, not just on newlines, unless given the -0 argument. Fortunately, all of those tools know how to handle backslash escapes, but unfortunately, find doesn't know how to emit them. (Actually, frustratingly, both BSD and SysV find have the code to do it, but not in a way you can use here.) So, if you're writing a script that uses find and might get piped to anything that handles input like xargs, you have to use -print0.
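
To make the splitting problem concrete, here is a rough sketch in Python. The filenames are made up for illustration, and str.split() only approximates default xargs tokenizing (real xargs also interprets quotes and backslashes), so treat this as a model of the failure mode rather than of xargs itself.

```python
# Hypothetical filenames: one plain, one with a space, one with a newline.
names = ["plain.txt", "two words.txt", "line\nbreak.txt"]

# What "find -print" and "find -print0" would emit for those names.
print_output = "".join(name + "\n" for name in names)
print0_output = "".join(name + "\0" for name in names)

# Splitting on any whitespace (roughly what xargs does by default)
# breaks both the name with a space and the name with a newline.
whitespace_split = print_output.split()

# Splitting only on newlines still breaks the name with a newline.
newline_split = print_output.split("\n")[:-1]

# Splitting on NUL recovers the original names; the final empty piece
# after the last terminator is dropped.
nul_split = print0_output.split("\0")[:-1]

print(whitespace_split)
print(newline_split)
print(nul_split == names)
```

Only the NUL-separated form round-trips, because NUL is the one byte (besides the path separator) that cannot appear in a Unix filename.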

And that means, if you're writing a tool that might get find piped to it, you have to handle -print0, even if you're pretty sure nobody will ever have newlines for you to deal with, because they're probably going to want to use -print0 anyway, rather than figure out how your tool deals with other whitespace.
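
For reference, this is roughly what a tool has to do by hand today to accept -print0 input. It is only a sketch: iter_records and its parameters are names I made up for illustration, not an existing or proposed stdlib API, and it assumes a blocking binary stream such as sys.stdin.buffer.

```python
import io
import os

def iter_records(stream, sep=b"\0", chunk_size=8192):
    """Yield sep-terminated records from a binary stream, lazily.

    Hypothetical helper for illustration only.
    """
    buf = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buf += chunk
        # Everything before the last separator is complete; the remainder
        # stays buffered until more data (or EOF) arrives.
        *records, buf = buf.split(sep)
        yield from records
    if buf:
        # Final record that was not terminated by a separator.
        yield buf

# Simulated output of "find -print0"; a real tool would pass sys.stdin.buffer.
data = b"plain.txt\0with space.txt\0with\nnewline.txt\0"
stream = io.BytesIO(data)

# A tiny chunk size forces records to span several reads.
names = [os.fsdecode(raw) for raw in iter_records(stream, chunk_size=4)]
print(names)
```

Reading in binary and decoding each record with os.fsdecode keeps the filename bytes intact, which a text-mode file would not do, since universal-newline handling would get in the way.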