Python 3.x stuffing utf-8 into SQLite db

Matthew Ruffalo mmr15 at case.edu
Mon Feb 9 12:54:00 EST 2015


On 02/09/2015 12:30 PM, Skip Montanaro wrote:
> Thanks, Chris. Are you telling me I should have defined the input file
> encoding for my CSV file as CP-1252, or that something got hosed on
> the export from XLSX to CSV? Or something else?
>
> Skip

Hi Skip-

I think it's most likely that the encoding issues happened in the export
from XLSX to CSV (unless the data is malformed in the original XLSX
file, of course). The file you're reading *is* valid UTF-8, and you're
reading it in text mode, so seeing '’' in certain lines implies that
the files might contain '’'.encode('utf-8') ==
b'\xc3\xa2\xe2\x82\xac\xe2\x84\xa2'. You could verify this with a hex
editor, or use something like the following:

"""
#!/usr/bin/env python3
from argparse import ArgumentParser
                                                                                                                       

def
dump_lines_with_high_bytes(filename):                                                                              

    with open(filename, 'rb') as
f:                                                                                    

        for line in
f:                                                                                                 

            if any(byte >= 0x80 for byte in
line):                                                                     
               
print(line)                                                                                            


if __name__ == '__main__':
    p = ArgumentParser()
    p.add_argument('filename')
    args = p.parse_args()
    dump_lines_with_high_bytes(args.filename)
"""

I'm a bit surprised that LibreOffice would mangle the encoding like this
-- the entire point of using a format like XLSX is to avoid encoding
issues. I just created a blank spreadsheet with Excel 2013 on Windows 7,
pasted a U+2019 RIGHT SINGLE QUOTATION MARK into the top-left cell,
saved it to a .xlsx file, and opened it with LibreOffice Calc 4.2.7.2 on
Linux. The quot character displayed correctly, and I was able to save it
to valid UTF-8 CSV that I could read with Python 3.4 without any mojibake.

I wonder whether the encoding problems happened with whatever data was
used to create your .xlsx file -- as Chris mentioned, this would occur
if UTF-8 bytes are incorrectly decoded as Windows cp1252.

MMR...



More information about the Python-list mailing list