[Tutor] reading binary files

bob gailer bgailer at gmail.com
Wed Feb 4 23:01:31 CET 2009


eShopping wrote:
>
> The file is around 800 Mb but I can't get hold of it until next week 
> so suggest starting a new topic once I have a cut-down copy.
OK will wait with bated breath.
>
>> Well, did you read on? What reactions do you have?
>
> I did (finally) read on and I am still a little confused, though less 
> than before.  I guess the word UNFORMATTED means that the file has no 
> format 
Depends on what you mean by format. When you use % formatting in Python 
it is the same thing as a FORMATTED WRITE in FORTRAN - a set of 
directives that direct the translation of data to human readable text.
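
A minimal sketch of that idea, using a made-up value: the directive "%8.3f" plays roughly the role of an F8.3 edit descriptor in a FORTRAN FORMATTED WRITE, turning a number into 8 printable characters. The name distance and its value are only for illustration.

```python
# Hypothetical value, chosen only for illustration.
distance = 3.14159

# "%8.3f" asks for a field 8 characters wide with 3 digits after
# the decimal point, much like a FORTRAN F8.3 edit descriptor.
text = "%8.3f" % distance

print(repr(text))  # the number as human-readable characters
print(len(text))   # width of the field
```

The result is ordinary text that a person can read, which is the whole point of a "formatted" write.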

Files per se are a sequence of bytes. As such they have no "format". 
When we examine a file we attempt to make sense of the bytes.

Some of the bytes may represent ASCII printable characters - others do 
not. The body of this email is a sequence of ASCII printable characters 
that make sense to you when you read them.

The file written UNFORMATTED has some ASCII printable characters that 
you can read (e.g. DISTANCE), some that you can recognize as letters, 
numbers, etc. but are not English words, and non-printable characters 
that show up as "garbage" symbols or not at all. Those that are not 
"readable" are most likely the internal (binary) representation of 
numbers.
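
Here is a rough sketch of what such a mix of readable and unreadable bytes looks like. It is not the layout of your file (that depends on the FORTRAN program and compiler that wrote it); the 8-character label, the value 1234.5 and the little-endian "<d" layout are all assumptions made for the example.

```python
import struct

# Hypothetical record: an 8-character label followed by one
# double-precision number in its internal (binary) form.
label = b"DISTANCE"
value = 1234.5
record = label + struct.pack("<d", value)  # "<d": little-endian 8-byte float (assumed)

print(record)       # the label is readable, the number's bytes are not
print(len(record))  # 8 label bytes + 8 number bytes

# Making sense of the bytes again requires knowing the layout.
name = record[:8].decode("ascii")
(number,) = struct.unpack("<d", record[8:])
print(name, number)
```

The same 16 bytes are meaningless until you know where the text ends and how the number was stored.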

> .... though it presumably has some structure? One major  hurdle is 
> that I am not really sure about the difference between a Python binary 
> file and a FORTRAN UNFORMATTED file so any pointers would be 
> gratefully received

There is no such thing as a "Python binary file". When you open a file 
with mode 'b' you are asking for the bytes to be read and written 
exactly as they are, with no line-end translation. If you do not specify 
'b' then line-ends are translated into \n when reading, and \n is 
translated back to the OS line-end when writing. The reason for this is 
that different operating systems use different codes for line-ends. By 
translating them to and from \n the Python program becomes OS independent.

Windows uses ctrl-M ctrl-J (carriage return - line feed; \x0d\x0a).
Linux/Unix (including Mac OS X) uses ctrl-J (line feed; \x0a).
Classic Mac OS (before OS X) uses ctrl-M (carriage return; \x0d).
Python uniformly translates these to \n (\x0a).
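
A small sketch of that translation, assuming a current Python 3 where text mode reads with universal newlines by default (in the Python 2 of this thread you would ask for it with mode 'U'). The scratch file name is made up.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "line_ends.txt")  # hypothetical scratch file

# Mode 'wb' stores the three line-end conventions untouched.
with open(path, "wb") as f:
    f.write(b"windows\r\nold mac\runix\n")

# Text mode with the default newline=None translates each of them to "\n".
with open(path, "r") as f:
    text = f.read()

print(repr(text))
```

All three conventions come back as a plain \n, so the rest of the program does not need to know which OS wrote the file.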

When processing files written without line-ends (e.g. UNFORMATTED) there 
may be bytes that happen to look like line-end characters or sequences 
but must NOT be treated as line-ends. Hence mode 'b'.
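
To see why this matters, here is a sketch (Python 3, made-up file name) with an integer whose internal bytes happen to be 0d 0a. The value 2573 and the little-endian 4-byte layout are chosen purely to provoke the problem; latin-1 is used only so that every byte decodes.

```python
import os
import struct
import tempfile

path = os.path.join(tempfile.mkdtemp(), "data.bin")  # hypothetical scratch file

# 2573 == 0x0A0D, so its little-endian bytes are 0d 0a 00 00.
packed = struct.pack("<i", 2573)
with open(path, "wb") as f:
    f.write(packed)

# Binary mode returns the four bytes unchanged.
with open(path, "rb") as f:
    raw = f.read()

# Text mode collapses the 0d 0a pair into a single "\n".
with open(path, "r", encoding="latin-1") as f:
    damaged = f.read()

print(raw == packed, len(raw))
print(len(damaged))
```

Read in text mode, the number has silently lost a byte and can no longer be unpacked correctly.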

Example:

 >>> x=open('x','w') # write "normal", allowing \n to be translated to the OS line end.
 >>> x.write("Hello\n")
 >>> x.close() # make sure the data is written before reopening
 >>> x=open('x','rb') # read binary, avoiding translation.
 >>> x.read() # result shown is for Windows
'Hello\r\n'

where \r = \x0d
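
The same experiment as a script that gives the same answer on any OS, sketched for Python 3 (where binary reads return a bytes object). Passing newline="\r\n" is an assumption made here to force the Windows-style translation; by default the OS's own line end is used. The file path is a made-up scratch location.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "x")  # hypothetical scratch file

# Text mode write: "\n" is translated to "\r\n" because of newline="\r\n".
with open(path, "w", newline="\r\n") as f:
    f.write("Hello\n")

# Binary mode: the bytes come back exactly as stored.
with open(path, "rb") as f:
    raw = f.read()

# Text mode: the stored line end is translated back to "\n".
with open(path, "r") as f:
    text = f.read()

print(raw)
print(repr(text))
```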

-- 
Bob Gailer
Chapel Hill NC
919-636-4239

