binary or ascii

François Pinard pinard at iro.umontreal.ca
Mon Aug 28 18:04:42 EDT 2000


[Robert Schweikert]

> Is there a way (function or trick) to determine whether a given file is
> in ASCII or binary format?

The usual approach is to try to prove that the file is binary and, failing
that, to declare that it is text.  However, when the proof fails, one is
almost never fully sure.  Still, some algorithms guess rather nicely.

If by ASCII, you really mean 7-bit (ASCII is really 7-bit :-), it is
usually OK to check that no character in the file has its eighth bit set.
In most situations, you may also check that you have no `NUL' in the file,
and maybe none of the control characters which are "unusual" (we consider
here that `HT', `CR', `LF', `BS', `FF' are usual).
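
A minimal sketch of that check in Python, working on a bytes object already
read from the file.  The names `USUAL_CONTROLS' and `looks_like_ascii' are
made up for illustration, and treating `DEL' as unusual is my own assumption,
not something every tool does.

```python
# Control characters considered "usual" in text: BS, HT, LF, FF, CR.
USUAL_CONTROLS = frozenset(b"\b\t\n\f\r")


def looks_like_ascii(data: bytes) -> bool:
    """Return True if `data` could plausibly be plain 7-bit ASCII text."""
    for byte in data:
        if byte >= 0x80:
            # Eighth bit set: not 7-bit ASCII.
            return False
        if byte < 0x20 and byte not in USUAL_CONTROLS:
            # NUL or another unusual control character.
            return False
        if byte == 0x7F:
            # DEL, rejected here by assumption.
            return False
    return True
```

One would typically call it as `looks_like_ascii(open(name, "rb").read())',
or on only the first few kilobytes if the file is large.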

If you did not really mean ASCII, but "textual", then the problem becomes
more difficult.  It might be easier if you know in advance that the text is
English, but it becomes much more difficult if you do not know the language
or script in use, or possibly in use.  You might discover the width
of characters by using correlations (usually easy), get hints about the
natural language by studying entropy and recognising function words (much
less obvious already), and use a battery of specific tests for common
encodings (an art and a science :-).  As for charsets, as they exist behind
encodings, you might have to resort to more sophisticated clustering
techniques.
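
As a very small illustration of such a battery of tests, the sketch below
first looks for a byte order mark, then tries strict decoding as ASCII and
UTF-8.  The function name `guess_encoding' and the choice of candidates are
assumptions of mine; a real detector would test many more encodings and
weigh the evidence statistically.  Single-byte encodings like Latin-1 are
left out on purpose, since any byte sequence decodes under them and so
nothing is proven.

```python
import codecs

# UTF-32 marks are listed before UTF-16 ones, because the UTF-32 LE mark
# begins with the same two bytes as the UTF-16 LE mark.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(data: bytes):
    """Return a codec name that decodes `data` cleanly, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            try:
                data.decode(name)
            except UnicodeDecodeError:
                return None
            return name
    for name in ("ascii", "utf-8"):
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None
```

A `None' result only says that none of these few tests succeeded; it does
not prove that the file is binary.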

A few programs, however, just rely on dumb tests, and have moderate
success.  A common test is to read the first few kilobytes of the file, and
establish the ratio of characters having their eighth bit set to those
having their eighth bit clear.  Over some threshold, the file is declared
binary, and below it, the file is declared text.  Studying idiosyncrasies
of the contents, like the `file' program does with magic signatures, might
also help in making a decision.
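
Such a dumb test might be sketched as follows.  The 4096-byte sample and the
0.30 threshold are arbitrary values I picked for illustration, not those of
any particular program, and the function names are hypothetical.  For
simplicity the code uses the fraction of sampled bytes having the eighth bit
set, which grows and shrinks together with the set-to-clear ratio.

```python
def high_bit_fraction(sample: bytes) -> float:
    """Fraction of bytes in `sample` having their eighth bit set."""
    if not sample:
        return 0.0
    high = sum(1 for byte in sample if byte >= 0x80)
    return high / len(sample)


def sample_looks_binary(sample: bytes, threshold: float = 0.30) -> bool:
    """Declare the sample binary when the fraction exceeds the threshold."""
    return high_bit_fraction(sample) > threshold


def file_looks_binary(path, sample_size=4096, threshold=0.30) -> bool:
    """Apply the test to the first `sample_size` bytes of the file."""
    with open(path, "rb") as handle:
        sample = handle.read(sample_size)
    return sample_looks_binary(sample, threshold)
```

Note that text in an 8-bit or multi-byte encoding may well exceed such a
threshold, which is part of why the success is only moderate.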

Bah!  This is all too complex.  Just do as everybody does, and check the
file extension :-).  Or look at the little icon beside the file name. :-) :-)

-- 
François Pinard   http://www.iro.umontreal.ca/~pinard



