Python 3 how to convert a list of bytes objects to a list of strings?

Fri Aug 28 09:15:19 EDT 2020

On 8/28/20 8:39 AM, Chris Green wrote:
> Richard Damon <Richard at damon-family.org> wrote:
>> On 8/28/20 7:50 AM, Karsten Hilbert wrote:
>>>>> No interpretation required, since parsing failed. Then you can start
>>>>> dealing with these exceptions. _Do not_ write unparsable messages into
>>>>> an mbox!
>>>>>
>>>> Maybe I shouldn't but Python 2 has been managing to do so for several
>>>> years without any issues.
>>> I am inclined to congratulate you on that sheer amount of luck. I don't
>>> believe there were no issues because everything worked just right under
>>> py2 but rather because py2 cared less than py3 does now.
>>>
>>>> Are we saying that Python 3 really can't be made to handle things
>>>> 'tolerantly' like Python 2 used to?
>>> It sure should be possible but it will require *explicit* en/decode()s in
>>> more places than before because AFAICT there's less implicitness as to
>>> which encoding to apply (regardless of whether it applies).
>>>
>>> Karsten
>>>
>>>
>>>
>> This might be one of the cases where Python 2's lax handling of string
>> vs bytes was an advantage.
>>
>> If he was just scanning the message for specific ASCII strings, then not
>> getting the full message decoded right is unlikely to have been causing
>> problems.
>>
>> Python2 handled that sort of case quite easily. Python 3 on the other
>> hand, will have issues converting the byte message to a string, since
>> there isn't a single encoding that you could use for all of it all the
>> time. This being 'fussier' does make sure that the program is handling
>> all the text 'properly', and would be helpful if some of the patterns
>> being checked for contained 'extended' (non-ASCII) characters.
>>
>> One possible solution in Python3 is to decode the byte string using an
>> encoding that allows all 256 byte values, so it won't raise any encoding
>> errors, just give you possibly nonsense characters for non-ASCII text.
>>
> But this will simply get some things quite wrong and produce garbage
> won't it?  Whereas Python 2 would simply scramble the odd character.
>
Yes, when the message has extended characters, it will put the 'wrong'
characters into the message, but if you are only looking for a fixed set
of ASCII strings, especially in the headers of the message, that doesn't
matter; those will still be there. It is a pragmatic shortcut,
something that is 'good enough' to get the job done, even if not 100%
correct.
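
As a rough sketch of what I mean (the sample lines are made up, and
latin-1 is just one encoding that accepts all 256 byte values), this
converts a list of bytes objects to a list of strings without ever
raising a decode error:

```python
# Hypothetical sample data: the Subject line holds UTF-8 bytes for "é".
raw_lines = [
    b"From: someone@example.com",
    b"Subject: caf\xc3\xa9 meeting",
    b"X-Mailing-List: python-list",
]

# latin-1 maps every byte value 0-255 to a code point, so decode() cannot fail.
text_lines = [line.decode("latin-1") for line in raw_lines]

# Non-ASCII text comes out 'wrong' (here "cafÃ©"), but ASCII patterns still match.
subjects = [line for line in text_lines if line.startswith("Subject:")]

# Encoding back with latin-1 restores the original bytes exactly.
round_tripped = [line.encode("latin-1") for line in text_lines]
```

Because the round trip is lossless, the original bytes can still be
written out unchanged after the string matching is done.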

As was mentioned elsewhere, you could also do at least most of the
processing as bytes (this may need converting some of the strings being
used to bytes), but I don't know exactly what they are doing, so I don't
know if there is something that really needs a string.
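
A minimal sketch of the bytes-only approach, assuming the job is just
matching a few header prefixes (the prefixes and sample lines here are
hypothetical, not taken from the actual program):

```python
# Hypothetical sample data, kept as bytes throughout.
raw_lines = [
    b"From: someone@example.com",
    b"List-Id: <python-list.python.org>",
    b"Subject: caf\xc3\xa9 meeting",
]

# Patterns that used to be str literals are converted to bytes once.
wanted = ["List-Id:", "X-Mailing-List:"]
wanted_bytes = tuple(p.encode("ascii") for p in wanted)

# bytes.startswith() accepts a tuple of bytes prefixes, just like str does.
hits = [line for line in raw_lines if line.startswith(wanted_bytes)]
```

Nothing is ever decoded here, so no encoding has to be guessed at all.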

Basically, mail messages are complicated, and to 'properly' convert a
message into a format for proper analysis would be significant work (and
the result would NOT be a simple string), but it sounds like they didn't
need that level of work with the messages. Particularly if they only need
to process the headers, and are working to try to 'whitelist' files to
get them, being close and simple is likely good enough.
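
For the headers-only case, a 'close and simple' version might look
something like this. It is only a sketch: the message and the List-Id
check are invented for illustration, and it deliberately ignores folded
(continued) header lines and encoded words.

```python
# Hypothetical raw message; the body contains bytes that are not valid UTF-8.
raw_message = (
    b"From: someone@example.com\r\n"
    b"List-Id: <python-list.python.org>\r\n"
    b"\r\n"
    b"Body with arbitrary bytes: \xff\xfe\r\n"
)

# The header block ends at the first blank line.
normalized = raw_message.replace(b"\r\n", b"\n")
header_block, _, body = normalized.partition(b"\n\n")

# Decode only the headers, with an encoding that cannot fail.
header_lines = header_block.decode("latin-1").split("\n")

# Header names are case-insensitive, so compare in lower case.
is_list_mail = any(h.lower().startswith("list-id:") for h in header_lines)
```

The body is left as bytes and can be written to the mbox untouched.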

-- 
Richard Damon