Python3: Reading a text/binary mixed file

Cameron Simpson cs at zip.com.au
Tue Mar 10 21:09:24 EDT 2015


On 10Mar2015 22:38, Paulo da Silva <p_s_d_a_s_i_l_v_a_ns at netcabo.pt> wrote:
>On 10-03-2015 04:14, Cameron Simpson wrote:
>> On 10Mar2015 04:01, Paulo da Silva <p_s_d_a_s_i_l_v_a_ns at netcabo.pt> wrote:
>>> But this is very tricky! I am on linux, but if I ran this program on
>>> windows I needed to change it to "eat" also the '\r'.
>>
>> If you're in Python 3 (recommended!) and you're parsing the headers as
>> text, you should be converting your split binary into strings anyway. So
>> you can just use .strip() or rstrip(); either will remove trailing '\r'
>> and '\n', so it will work in both UNIX and Windows.
>>
>I didn't know strip removes \r.

The documentation for str.strip says it strips "whitespace" by default. In the 
string module doco it says:

  string.whitespace
    A string containing all ASCII characters that are considered
    whitespace.  This includes the characters space, tab, linefeed,
    return, formfeed, and vertical tab.
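
A quick way to convince yourself, as a minimal sketch (the header text
here is made up purely for illustration, it is not from your file):

```python
import string

# Illustrative header lines, as they might look after decoding.
unix_line = "Content-Length: 42\n"
windows_line = "Content-Length: 42\r\n"

# With no argument, rstrip() removes all trailing whitespace,
# which covers both '\n' and '\r'.
unix_clean = unix_line.rstrip()
windows_clean = windows_line.rstrip()

# Both line-ending characters are listed in string.whitespace.
has_cr = "\r" in string.whitespace
has_lf = "\n" in string.whitespace

print(repr(unix_clean), repr(windows_clean), has_cr, has_lf)
```

Both lines come out identical, so the same code copes with either line
ending.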

[...]
>> I presume you're gathering the headers in "binary" mode and decoding
>> each to a string. So you know the consumed length from the binary half;
>> that they're different lengths after decoding to strings is then
>> irrelevant.
>You are right.
>I am still a little confused about python3.

In this context the main point is that python 3 has a nice clean separation of 
str (as text) and bytes (as octet sized small ints). In general that makes it 
easier to work with in contexts like this because you are never confused about 
which you are dealing with.
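
To make the separation concrete, here is a small sketch; the sample
string and the choice of utf-8 are my assumptions for illustration:

```python
text = "café"                    # str: a sequence of characters
data = text.encode("utf-8")      # bytes: a sequence of small ints

# The lengths differ because 'é' takes two bytes in utf-8.
text_len = len(text)
data_len = len(data)

# Indexing bytes gives an int; indexing str gives a one-character str.
first_byte = data[0]
first_char = text[0]

# Python 3 refuses to mix the two silently.
try:
    text + data
    mixing_allowed = True
except TypeError:
    mixing_allowed = False

round_trip = data.decode("utf-8")
```

This is also why the decoded string and the original bytes can have
different lengths: count what you consumed on the bytes side.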

Since binary files (returning bytes from reads) also have a convenient readline 
method looking for byte 10 ('\n'), this makes your current task tractable: read 
"binary" lines, getting bytes objects ending in byte 10, then decode each 
bytes object into a str object based on the text encoding (typically utf-8, or 
iso8859-1 or ascii for some protocols/formats not thinking strongly about bytes 
vs text).

Once decoded, you can then work on them as text without worrying about their 
former binary encoding.
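
Putting that together, an untested-against-your-data sketch. I am
assuming "Name: value" headers ended by a blank line, utf-8 text, and a
"Length" header giving the size of the binary part; read_headers and
the sample bytes are hypothetical, so adjust them to your real format:

```python
import io

def read_headers(fp, encoding="utf-8"):
    """Read header lines from a binary file object up to a blank line.

    Returns the headers as a dict of str and the number of bytes consumed.
    The "Name: value" layout is an assumption made for this sketch.
    """
    headers = {}
    consumed = 0
    while True:
        raw = fp.readline()          # bytes, ending in byte 10 unless at EOF
        consumed += len(raw)         # count on the binary side
        line = raw.decode(encoding).rstrip()   # drops '\n' and any '\r'
        if not line:                 # blank line or EOF ends the headers
            break
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return headers, consumed

# Made-up sample: two text headers with Windows line endings, then raw bytes.
sample = b"Name: caf\xc3\xa9\r\nLength: 4\r\n\r\n\x00\x01\x02\xff"
fp = io.BytesIO(sample)              # stands in for open(path, "rb")

headers, consumed = read_headers(fp)
payload = fp.read(int(headers["Length"]))
```

With a real file you would use open(path, "rb") in place of the BytesIO
object; the file position is left just past the blank line, ready for
the binary reads.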

Cheers,
Cameron Simpson <cs at zip.com.au>

Institutions will try to preserve the problem to which they are the solution.
- Clay Shirky, 2012
