[issue40762] Writing bytes using CSV module results in b prefixed strings

Mon May 25 10:05:32 EDT 2020

Rémi Lapeyre <remi.lapeyre at henki.fr> added the comment:

> in real-life that b-prefixed string is just not readable by another program in an easy way

If another program opens this CSV file, it will read the string "b'A'" which is what this field actually contains. Everything that is not a number or a string gets converted to a string:

In [1]: import collections, dataclasses, random, secrets, io, csv 
   ...:  
   ...: Point = collections.namedtuple('Point', 'x y') 
   ...:  
   ...: @dataclasses.dataclass 
   ...: class Valar: 
   ...:     name: str 
   ...:     age: int 
   ...:  
   ...: a = Point(1, 2) 
   ...: b = Valar('Melkor', 2900) 
   ...: c = secrets.token_bytes(4) 
   ...:  
   ...: out = io.StringIO() 
   ...: f = csv.writer(out) 
   ...: f.writerow((a, b, c)) 
   ...:  
   ...: out.seek(0) 
   ...: print(out.read()) 
   ...:                                                                                                                                                                
"Point(x=1, y=2)","Valar(name='Melkor', age=2900)",b'\x95g6\xa2'

Here another would find three fields, all strings: "Point(x=1, y=2)", "Valar(name='Melkor', age=2900)" and "b'\x95g6\xa2'". Would you expect to get actual objects instead of strings when reading the two first fields?

> Incase it fails to decode using that, then it will throw a UnicodeDecodeError

I read your PR, but succeeding to decode it does not mean it's correct:

   In [4]: b'r\xc3\xa9sum\xc3\xa9'.decode('latin')                                                                                                                        
   Out[4]: 'rÃ©sumÃ©'

It worked, but is it the appropriate encoding? Probably not

   In [5]: b'r\xc3\xa9sum\xc3\xa9'.decode('utf8')                                                                                                                         
   Out[5]: 'résumé'

If you want to be able to save bytes, the best way is to use a format that can roundtrip bytes like parquet:

    In [18]: df = pd.DataFrame.from_dict({'a': [b'a']})                                                                                                                    

    In [19]: df.to_parquet('foo.parquet')                                                                                                                                  

    In [20]: type(pd.read_parquet('foo.parquet')['a'][0])                                                                                                                  
    Out[20]: bytes

----------

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue40762>
_______________________________________