Need to know if a file has only ASCII characters

Scott David Daniels Scott.Daniels at Acm.Org
Tue Jun 16 14:52:01 EDT 2009


norseman wrote:
> Scott David Daniels wrote:
>> Dave Angel wrote:
>>> Jorge wrote: ...
>>>> I'm making an application that reads third-party generated ASCII files,
>>>> but sometimes the files are totally or partially corrupted, and
>>>> I need to know if it's an ASCII file with *nix line terminators.
>>>> On Linux I can run the file command, but the application should run on
>>>> Windows.
> You are looking for a \x0D (the Carriage Return) \x0A (the Line Feed)
> combination. If present you have Microsoft compatibility. If not you
> don't.  If you think high bits might be part of the corruption, filter
> each byte with byte & \x7F  (byte AND'ed with hex 7F or 127 base 10)
> then check for the \x0D \x0A combination.

Well, ASCII defines \x0D as carriage return and \x0A as line feed.
It is unix that is wrong, not Microsoft (don't get me wrong, I know
Microsoft has often redefined what it likes invalidly).  If you
open the file with 'U', Python will return lines w/o the \r character
whether or not they started with it, equally well on both unix and
Microsoft systems.  Many moons ago the high-order bit was used as a
parity bit, but few communication systems do that these days, so
anything with the high bit set is likely corruption.
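
For what it's worth, here is a minimal sketch of the check the original
question asks for.  It assumes Python 3 (where iterating over bytes yields
integers) and that "ASCII with *nix line terminators" means no byte above
\x7F and no \r anywhere.  The function names are made up for illustration.

```python
def is_clean_unix_ascii(data: bytes) -> bool:
    """Return True if data is 7-bit ASCII and contains no carriage returns."""
    # Any byte above 0x7F has the high bit set, so it is not ASCII.
    if any(b > 0x7F for b in data):
        return False
    # A carriage return means DOS-style or old Mac-style line endings.
    return b"\r" not in data


def first_bad_offset(data: bytes) -> int:
    """Return the offset of the first high-bit byte or carriage return, else -1."""
    for offset, b in enumerate(data):
        if b > 0x7F or b == 0x0D:
            return offset
    return -1


def check_file(path: str) -> bool:
    """Read the file in binary mode so no newline translation hides a \\r."""
    with open(path, "rb") as f:
        return is_clean_unix_ascii(f.read())
```

Reading in binary mode matters here: text mode (or 'U') strips the \r before
you ever get a chance to see it, which is handy for processing lines but
useless for detecting them.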

> .... Intel uses one order and the SUN and the internet another.  The
> BIG/Little ending confuses many. Intel reverses the order of multibyte
> numerics.  Thus- Small machine has big ego or largest byte value last.
> Big Ending.  Big machine has small ego.
> Little Ending.  Some coders get the 0D0A backwards, some don't.  You
> might want to test both.
> (2^32)(2^24)(2^16)(2^8)  4 bytes correct math order  little ending
> Intel stores them (2^8)(2^16)(2^24)(2^32)   big ending
> SUN/Internet stores them in correct math order.
> Python will use \r\n (0D0A) and \n\r (0A0D) correctly.

This is the most confused summary of byte sex I've ever read.
There is no such thing as "correct math order" (numbers are numbers).
The '\n\r' vs. '\r\n' has _nothing_ to do with little-endian vs.
big-endian.  By the way, there are great arguments for each order,
and no clear winner.  Network order was defined for sending numbers
across a wire; the idea was that you'd unpack them to native order
as you pulled the data off the wire.
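
To make the byte-order point concrete, here is a small sketch using the
standard struct module.  The value 0x01020304 and the helper name from_wire
are just illustrative choices.  It shows the same number laid out in each
order, and unpacking a network-order field back into an ordinary integer.

```python
import struct

value = 0x01020304

big = struct.pack(">I", value)      # most significant byte first
little = struct.pack("<I", value)   # least significant byte first
network = struct.pack("!I", value)  # network order, same layout as big-endian


def from_wire(payload: bytes) -> int:
    """Unpack a 4-byte network-order unsigned integer into a Python int."""
    (number,) = struct.unpack("!I", payload)
    return number
```

Either layout holds the same number; only the order of the bytes in memory
or on the wire differs.  None of this touches '\r' and '\n', which are two
separate one-byte characters, not halves of a multibyte integer.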

The '\n\r' vs. '\r\n' differences hark back to the days when they were
format effectors (carriage return moved the carriage to the extreme
left, line feed advanced the paper).  You needed both to properly
position the print head.  ASCII used the pair and defined the effect
of each.  As ASCII was being worked out, MIT even defined a "line
starve" character to move up one line just as line feed went down one.
The order of the format effectors most used was '\r\n' because the
carriage return involved the most physical motion on many devices, and
the vertical motion time of the line feed could happen while the
carriage was moving.  After that, you often added padding bytes
(typically ASCII NUL ('\x00') or DEL ('\x7F')) to allow the hardware
time to finish before you did the spacing and printing.

--Scott David Daniels
Scott.Daniels at Acm.Org



