[Python-ideas] os.path.isbinary

Thu Aug 1 06:35:02 CEST 2013

From: Alexander Belopolsky <alexander.belopolsky at gmail.com>
Sent: Wednesday, July 31, 2013 7:57 PM

>On Wed, Jul 31, 2013 at 10:11 PM, Steven D'Aprano <steve at pearwood.info> wrote:
>
>>Still can't be done reliably, but even if it could, what's so special about ASCII?

>Lots of things are special about ASCII. It is a 7-bit subset of pretty much every modern encoding scheme. Being 7-bit, it can be fairly reliably distinguished from most binary formats. The same is true of UTF-8. It is very unlikely that a binary dump of a double array would make valid UTF-8 text, and vice versa: UTF-8 text interpreted as a list of doubles is unlikely to produce numbers in a reasonable range.
>
>I would not mind seeing an "istext()" function somewhere in the stdlib that would only recognize ASCII and UTF-8 as text. 

Plenty of files in popular charsets are actually perfectly valid UTF-8, but garbage when read that way. This and its converse are probably the most common causes of mojibake problems people have today. (I don't know if you can search Stack Overflow for problems with "Ã" in the description, but if you can, it'll be illuminating.) Do you really want a function that sorts half your Latin-1 files into "UTF-8 text files" that are unreadable garbage and the other half into "binary files"?
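
To make the two failure modes concrete, here is a small sketch. The sample strings are made up for illustration; one Latin-1 byte string decodes cleanly as UTF-8 but yields the wrong text, and the other fails to decode at all.

```python
# Latin-1 bytes for the two-character text "A-tilde, copyright-ish e" (U+00C3 U+00A9)
# happen to form a valid UTF-8 sequence.
looks_like_utf8 = "\u00c3\u00a9".encode("latin-1")   # b'\xc3\xa9'
misread = looks_like_utf8.decode("utf-8")            # no error, but one character: U+00E9

# Latin-1 bytes for "cafe" with an acute accent are not valid UTF-8:
# 0xE9 starts a multi-byte sequence that never completes.
not_utf8 = "caf\u00e9".encode("latin-1")             # b'caf\xe9'
try:
    not_utf8.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

print(repr(misread), decoded_ok)
```

A UTF-8-only check would call the first file "text" and silently change its content, and call the second one "binary".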

Also, while ASCII is much simpler and more robust to detect, it's not nearly as useful as it used to be. We don't have to deal with 7-bit data channels very often nowadays… and when we do, do we really want to treat pickle protocol 0 or base64 or RTF as "text"? Meanwhile, text-processing code that only handles ASCII is generally considered broken.

Anyway, if you want that "istext()" function, it's trivial to write it yourself:

    def istext(b):
        try:
            b.decode('utf-8')
        except UnicodeDecodeError:
            return False
        else:
            return True

(There's no reason to try 'ascii', because any ASCII-decodable text is also UTF-8-decodable.)
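
A quick way to check both claims, as a rough sketch: the function is repeated so the snippet runs on its own, and the sample inputs (a packed `math.pi` and a packed `2.5`) are assumptions picked for illustration, not anything from the original proposal.

```python
import math
import struct

def istext(b):
    """Return True if the bytes decode as UTF-8 (same check as above)."""
    try:
        b.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True

# Every 7-bit byte value decodes to the same string under both codecs,
# so a separate 'ascii' attempt adds nothing.
all_ascii = bytes(range(128))
same = all_ascii.decode('ascii') == all_ascii.decode('utf-8')

# Little-endian dump of one double. math.pi contains the byte 0xFB,
# which never appears in valid UTF-8, so this is rejected.
pi_dump = struct.pack('<d', math.pi)

# Not every double is rejected: 2.5 packs to bytes that are all below 0x80.
small_dump = struct.pack('<d', 2.5)

print(same, istext(b'plain ASCII'), istext(pi_dump), istext(small_dump))
```

The last case is a reminder that a decode check is a heuristic: short or unlucky binary data can still pass as "text".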

And really, since you're usually going to do something like this:

    if istext(b):
        dotextstuff(b)
    else:
        dobinarystuff(b)

… you're probably better off following EAFP and just doing this:

    try:
        dotextstuff(b)
    except UnicodeDecodeError:
        dobinarystuff(b)
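
For a version that actually runs, here is a minimal sketch. The bodies of `dotextstuff` and `dobinarystuff` and the wrapper `process` are hypothetical stand-ins; the only assumption that matters is that the text path decodes the bytes itself and lets `UnicodeDecodeError` propagate.

```python
def dotextstuff(b):
    # Hypothetical text handler: decoding happens here, so invalid input
    # raises UnicodeDecodeError before any text work is done.
    text = b.decode('utf-8')
    return ('text', len(text.splitlines()))

def dobinarystuff(b):
    # Hypothetical binary handler: just report the size in bytes.
    return ('binary', len(b))

def process(b):
    # EAFP: attempt the text path and fall back only if decoding fails.
    try:
        return dotextstuff(b)
    except UnicodeDecodeError:
        return dobinarystuff(b)

print(process(b'one\ntwo\n'))
print(process(b'\xff\xfe\x00\x01'))
```

This decodes the data once instead of twice, and there is no separate predicate that can disagree with what the text handler really accepts.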