'r' vs 'rb' in csv (was Re: Python SHA-1 as a method for unique file identification ? [help!])

Mon Jun 26 20:37:48 EDT 2006

On 27/06/2006 6:39 AM, Mike Orr wrote:
> Tim Peters wrote:
>> [EP <eric.pederson at gmail.com>]
>>> This inquiry may either turn out to be about the suitability of the
>>> SHA-1 (160 bit digest) for file identification, the sha function in
>>> Python ... or about some error in my script
>> It's your script.  Always open binary files in binary mode.  It's a
>> disaster on Windows if you don't (if you open a file in text mode on
>> Windows, the OS pretends that EOF occurs at the first instance of byte
>> chr(26) -- this is an ancient Windows behavior that made an odd kind
>> of sense in the mists of history, and has persisted in worship of
>> Backward Compatibility despite that the original reason for it went
>> away _long_ ago).
> 
> On a semi-related note, I have a database on Linux that imports from a
> Macintosh CSV file.  The 'csv' module says to always open files in
> binary mode, but this didn't work in my case: I had to open it as 'rU'
> (text with universal newlines) or 'csv' misparsed it.  I'd like the
> program to be portable to Windows and Mac.  Is there a way around this?
>  Will I really burn in hell for using 'rU'?

Yes, you will burn in hell for using any old kludge that gets results 
(by accident) instead of reading the manual to find a principled solution:

"""
lineterminator
The string used to terminate lines in the CSV file. It defaults to '\r\n'.
"""

In the case of a Mac CSV file, '\r' is probably required.
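
As a rough illustration of the lineterminator setting quoted above, here is
a minimal sketch. It assumes Python 3, where the csv documentation says the
writer honours lineterminator while the reader recognises '\r' or '\n' on
its own and ignores the setting. The sample rows and variable names are
made up for the example.

```python
import csv
import io

# Write two rows with old-Mac-style line endings ('\r' only).
# newline='' keeps the text layer from touching the line endings.
buf = io.StringIO(newline='')
writer = csv.writer(buf, lineterminator='\r')
writer.writerow(['name', 'qty'])      # hypothetical sample data
writer.writerow(['widget', 3])
mac_text = buf.getvalue()

# Read the same text back. Again newline='' hands the raw '\r'
# terminators to the csv reader, which copes with them itself.
rows = list(csv.reader(io.StringIO(mac_text, newline='')))
```

Here mac_text holds the '\r'-terminated data and rows holds the parsed
fields as strings.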

You will burn in hell for asking questions w/o supplying sufficient 
information, like (a) repr(first few lines of your Mac CSV file) (b) 
what was the result from the csv module ("didn't work" doesn't cut it).

> 
> What was the odd bit of sense?  I know you end console input by typing
> ctrl-Z, but I thought it was just like Unix ctrl-D which ends the input
> but doesn't actually insert that character.
> 

Pace timbot, the "ancient Windows behavior" was inherited via MS-DOS 
from CP/M. Sectors on disk were 128 bytes. File sizes were recorded as 
numbers of sectors, not numbers of bytes, so nothing in the directory 
said where in the last sector the data actually stopped. The convention 
was that the end of a text file was indicated by ^Z.
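
The convention can be sketched in a few lines of Python 3. This is only an
illustration: the function names are hypothetical, and filling the rest of
the last sector with ^Z is an assumption made for the example (real tools
may have left other bytes after the first ^Z).

```python
SECTOR_SIZE = 128   # CP/M sector size in bytes, as described above
CTRL_Z = b'\x1a'    # ^Z, i.e. chr(26)

def pad_to_sectors(text):
    """Pad text with ^Z up to a whole number of sectors (assumed fill byte)."""
    padding = -len(text) % SECTOR_SIZE
    return text + CTRL_Z * padding

def read_text(stored):
    """Return everything before the first ^Z, as a text-mode reader would."""
    end = stored.find(CTRL_Z)
    return stored if end == -1 else stored[:end]

stored = pad_to_sectors(b'hello\r\n')
recovered = read_text(stored)

# The same rule applied to binary data cuts it short at the first byte 26.
truncated = read_text(b'ab\x1acd')
```

The last line is the disaster timbot describes: any binary file that
happens to contain byte 26 appears to end there when read this way.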

You are correct, modern software shouldn't and usually doesn't 
gratuitously write ^Z to files, but there is some software out there 
that still does, hence the preservation of the convention on reading.

More importantly for CSV files, the data may contain *embedded* CRs and 
LFs that the users had in their spreadsheet file. Reading that with "r" 
or "rU" will certainly result in "didn't work".
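
To make the embedded-newline point concrete, here is a minimal sketch,
assuming Python 3, where newline='' plays the role that 'rb' plays for the
csv module in Python 2. The sample bytes and function names are invented
for the example.

```python
import csv
import io

# Hypothetical sample: a header row, then one data row whose second
# field contains an embedded CR LF inside the quotes.
raw = b'id,comment\r\n1,"first line\r\nsecond line"\r\n'

def parse_preserving_newlines(data):
    # newline='' stops the text layer from translating line endings,
    # so the csv reader sees the bytes the spreadsheet wrote.
    text = io.TextIOWrapper(io.BytesIO(data), encoding='ascii', newline='')
    return list(csv.reader(text))

def parse_with_translation(data):
    # Default text mode: universal newlines turns every '\r\n' into '\n'
    # before the csv reader gets a look at it.
    text = io.TextIOWrapper(io.BytesIO(data), encoding='ascii')
    return list(csv.reader(text))

rows_ok = parse_preserving_newlines(raw)
rows_translated = parse_with_translation(raw)
```

In rows_ok the comment field still contains the original '\r\n'; in
rows_translated it has been silently rewritten to '\n', so the data no
longer matches what the user typed.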

HTH,
John