Python3: Reading a text/binary mixed file
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Tue Mar 10 01:56:05 EDT 2015
Paulo da Silva wrote:
> Hi!
>
> What is the best way to read a file that begins with some few text lines
> and whose rest is a binary stream?
>
> As an example ... files .pnm.
>
> Thanks for any comments/help on this.
A mixed text/binary file is really a binary file that contains some binary
data which is meant to be interpreted as text. (Just as other binary data is
meant to be interpreted as floats, or ints, or pixel colours, or sound
samples...)
I would open the file in binary mode, then use readline() to extract the
first few lines. If there is any chance that the lines could use Windows
line endings, then you'll need to handle that yourself. Chances are you will
call line.strip() to remove the trailing newline, and that will also remove
the trailing carriage return, so that isn't hard.
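To illustrate the point about line endings, here is a minimal sketch. The sample bytes are made up for the demonstration, and io.BytesIO merely stands in for a real file opened with open(path, "rb"):

```python
import io

# Hypothetical sample: two header lines with Windows (CRLF) line endings,
# followed by four bytes of binary payload.
raw = b"P6\r\n2 2\r\n\x00\x01\x02\x03"
f = io.BytesIO(raw)  # stand-in for open(path, "rb")

first = f.readline()   # readline() in binary mode splits on b"\n" only
second = f.readline()
rest = f.read()        # everything after the two header lines

print(first)           # b'P6\r\n'  (the carriage return is still there)
print(first.strip())   # b'P6'      (strip() removes both \r and \n)
print(second.strip())  # b'2 2'
```

Note that strip() also removes leading whitespace. If that matters for your format, line.rstrip(b"\r\n") removes only the trailing line ending.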
Strictly speaking, the lines you read will be *bytes*, not text, but if they
are pure ASCII you will hardly notice the difference when printing them: byte
strings in Python are displayed as if they were ASCII, with a b prefix. They
never compare equal to text strings, though.
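A short sketch of where bytes and text do behave differently in Python 3, using a made-up header line:

```python
line = b"P6\n"              # what readline() returns in binary mode

shown = repr(line)          # displayed like ASCII text, with a b prefix
same_as_text = (line == "P6\n")    # bytes never compare equal to str
first_item = line[0]        # indexing bytes gives an int, not a character
is_p6 = line.startswith(b"P6")     # compare against bytes literals instead

print(shown)         # b'P6\n'
print(same_as_text)  # False
print(first_item)    # 80
print(is_p6)         # True
```

So if you test header values without decoding, compare against bytes literals such as b"P6" rather than "P6".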
If the lines are supposed to be encoded in some encoding, say Latin-1, or
UTF-8, you can convert to text strings:
    line = line.decode('utf-8')
for example. Read the documentation for the file format to learn what
encoding you should use. If it isn't documented, the answer is probably
ASCII or Latin-1. Remember that the ASCII encoding in Python is strictly
7-bit, so you'll get decoding errors if the strings contain bytes with the
high bit set. If you don't mind the risk of getting mojibake, the
"no-brainer" solution is to use Latin-1 as the encoding: it maps every
possible byte value to a character, so decoding can never fail.
http://en.wikipedia.org/wiki/Mojibake
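The trade-off between the three encodings can be seen on a small example. The byte string below is an assumption chosen for illustration: it is the UTF-8 encoding of "café".

```python
raw = b"caf\xc3\xa9"   # UTF-8 bytes for "café"; two bytes have the high bit set

# ASCII is strictly 7-bit, so the high bytes raise an error.
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Latin-1 never fails, but with the wrong guess you get mojibake.
as_latin1 = raw.decode("latin-1")

# The correct encoding gives the intended text.
as_utf8 = raw.decode("utf-8")

print(ascii_failed)  # True
print(as_latin1)     # cafÃ©
print(as_utf8)       # café
```

One useful property of Latin-1 is that the round trip is lossless: as_latin1.encode("latin-1") gives back the original bytes, so nothing is destroyed even when the guess is wrong.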
Once you know there are no more text lines, just switch to using the read()
method instead of readline(). Something like this should work:
    with open(somefile, "rb") as f:
        process_text(f.readline().decode('latin-1'))
        process_text(f.readline().decode('latin-1'))
        process_text(f.readline().decode('latin-1'))
        data = f.read(10000)
        while data:
            process_binary(data)
            data = f.read(10000)
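For something you can run as-is, here is a self-contained variant of the same idea. The function name read_mixed, its parameters, and the sample data are all hypothetical, and io.BytesIO stands in for a real file. It assumes a fixed number of header lines, which is a simplification: real PNM headers may contain comments and more flexible whitespace.

```python
import io

def read_mixed(f, header_lines=3, chunk_size=10000, encoding="latin-1"):
    """Read header_lines text lines, then the rest of f in binary chunks.

    f must be a file object opened in binary mode.
    Returns (list of decoded header lines, list of bytes chunks).
    """
    header = [f.readline().decode(encoding).strip()
              for _ in range(header_lines)]
    chunks = []
    while True:
        data = f.read(chunk_size)
        if not data:          # b"" means end of file
            break
        chunks.append(data)
    return header, chunks

# Made-up sample: a three-line text header followed by four pixel bytes.
sample = b"P5\n2 2\n255\n" + bytes([0, 64, 128, 255])
header, chunks = read_mixed(io.BytesIO(sample), chunk_size=3)

print(header)  # ['P5', '2 2', '255']
print(chunks)  # [b'\x00@\x80', b'\xff']
```

With a real file you would call it as read_mixed(open(somefile, "rb")), ideally inside a with block. Collecting the chunks in a list is only for the demonstration; for large files you would process each chunk as it arrives, as in the loop above.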
--
Steve