Python3: Reading a text/binary mixed file

Steven D'Aprano steve+comp.lang.python at pearwood.info
Tue Mar 10 01:56:05 EDT 2015


Paulo da Silva wrote:

> Hi!
> 
> What is the best way to read a file that begins with some few text lines
> and whose rest is a binary stream?
> 
> As an example ... files .pnm.
> 
> Thanks for any comments/help on this.


A mixed text/binary file is really a binary file that contains some binary 
data which is meant to be interpreted as text. (Just as other binary data is 
meant to be interpreted as floats, or ints, or pixel colours, or sound 
samples...)

I would open the file in binary mode, then use readline() to extract the 
first few lines. If there is any chance that the lines could use Windows 
line endings, then you'll need to handle that yourself. Chances are you will 
call line.strip() to remove the trailing newline, and that will also remove 
the trailing carriage return, so that isn't hard.
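
As a rough sketch of that line handling (the sample bytes and the 
read_header_lines helper are made up for illustration, they don't come from 
any real file format or library):

```python
import io

def read_header_lines(f, count):
    """Read `count` lines from a binary file object, minus line endings."""
    lines = []
    for _ in range(count):
        raw = f.readline()                 # bytes, line ending included
        lines.append(raw.rstrip(b"\r\n"))  # handles both "\n" and "\r\n"
    return lines

# Hypothetical data: two header lines with Windows line endings, then binary.
f = io.BytesIO(b"P5\r\n2 2\r\n\x00\x01\x02\x03")
header = read_header_lines(f, 2)
rest = f.read()                            # whatever follows the header
print(header)   # [b'P5', b'2 2']
print(rest)     # b'\x00\x01\x02\x03'
```

I've used rstrip(b"\r\n") here rather than a bare strip() so that only the 
line ending is removed. A bare strip() also eats leading whitespace and 
trailing spaces or tabs, which is usually harmless in a header but worth 
knowing about.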

Strictly speaking, the lines you read will be *bytes*, not text, but if they 
are pure ASCII you won't notice any difference: byte strings in Python are 
displayed as if they were ASCII.

If the lines are supposed to be encoded in some encoding, say Latin-1, or 
UTF-8, you can convert to text strings:

line = line.decode('utf-8')

for example. Read the documentation for the file format to learn what 
encoding you should use. If it isn't documented, the answer is probably 
ASCII or Latin-1. Remember that the ASCII encoding in Python is strictly 
7-bit, so you'll get decoding errors if the strings contain bytes with the 
high bit set. If you don't mind the risk of getting mojibake, the "no 
brainer" solution is to use Latin-1 as the encoding, since it maps every 
possible byte to a character and so never fails to decode.

http://en.wikipedia.org/wiki/Mojibake
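
To make the trade-off concrete, here is a small sketch. The byte strings are 
invented examples, not taken from any particular file:

```python
raw = b"caf\xe9"   # hypothetical header bytes; 0xE9 is e-acute in Latin-1

# ASCII is strictly 7-bit, so the high-bit byte raises an error.
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Latin-1 maps every byte 0-255 to a code point, so it cannot fail.
latin1_text = raw.decode("latin-1")
print(ascii_failed)          # True
print(latin1_text)           # café

# The catch: if the bytes were really UTF-8, Latin-1 still "works",
# but gives you mojibake instead of an error.
utf8_bytes = "caf\u00e9".encode("utf-8")    # b'caf\xc3\xa9'
mojibake = utf8_bytes.decode("latin-1")
print(mojibake)              # cafÃ©
```

So Latin-1 trades a loud failure for a possibly wrong answer. Whether that is 
acceptable depends on what you do with the header text afterwards.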


Once you know there are no more lines, just swap to using the read() method 
instead of readline(). Something like this should work:


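# somefile, process_text and process_binary are placeholders for your own
# path and functions.
with open(somefile, "rb") as f:
    # The first three lines are the text header; decode each from bytes.
    process_text(f.readline().decode('latin-1'))
    process_text(f.readline().decode('latin-1'))
    process_text(f.readline().decode('latin-1'))
    # Everything after that is binary; read it in chunks of 10000 bytes.
    data = f.read(10000)
    while data:  # read() returns b"" at end of file
        process_binary(data)
        data = f.read(10000)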
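
If you want something you can run as-is, the same pattern can be sketched 
like this. The read_mixed function, its parameters and the sample file 
contents are all my own inventions for the demonstration, and it assumes a 
fixed number of header lines. Real .pnm headers may contain comment lines, so 
check the format documentation before relying on a fixed count:

```python
import os
import tempfile

def read_mixed(path, header_lines=3, chunk_size=10000):
    """Return (list of header strings, payload bytes) from a mixed file."""
    with open(path, "rb") as f:
        # Text part: read as bytes, decode, drop the line ending.
        header = [f.readline().decode("latin-1").rstrip("\r\n")
                  for _ in range(header_lines)]
        # Binary part: keep reading chunks until read() returns b"".
        chunks = []
        while True:
            data = f.read(chunk_size)
            if not data:
                break
            chunks.append(data)
    return header, b"".join(chunks)

# Build a tiny sample file: three text lines followed by four raw bytes.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"P5\n2 2\n255\n\x00\x7f\x80\xff")
    sample_path = tmp.name

header, payload = read_mixed(sample_path)
os.remove(sample_path)       # clean up the temporary file
print(header)    # ['P5', '2 2', '255']
print(payload)   # b'\x00\x7f\x80\xff'
```

Collecting the chunks into one bytes object is only sensible for small 
files. For anything large you would process each chunk as it arrives, as in 
the loop above.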


-- 
Steve



