Python 3.x stuffing utf-8 into SQLite db

Skip Montanaro skip.montanaro at gmail.com
Sun Feb 8 22:44:57 EST 2015


I am trying to process a CSV file using Python 3.5 (CPython tip as of a
week or so ago). According to chardet[1], the file is encoded as utf-8:

>>> s = open("data/meets-usms.csv", "rb").read()
>>> len(s)
562272
>>> import chardet
>>> chardet.detect(s)
{'encoding': 'utf-8', 'confidence': 0.99}

so I created the reader like so:

        rdr = csv.DictReader(open(csvfile, encoding="utf-8"))

This seems to work. The rows are read and records added to a SQLite3
database. When I go into sqlite3, I get what looks to be raw utf-8 on
output:

% LANG=en_US.UTF-8 sqlite3 topten.db
SQLite version 3.8.5 2014-08-15 22:37:57
Enter ".help" for usage hints.
sqlite> select * from swimmeet where meetname like '%Barracuda%';
sqlite> select count(*) from swimmeet;
0
sqlite> select count(*) from swimmeet;
4171
sqlite> select meetname from swimmeet where meetname like '%Barracuda%Patrick%';
Anderson Barracudas St. Patrick's Day Swim Meet
Anderson Barracuda Masters - 2010 St. Patrick’s Day Swim Meet
Anderson Barracuda Masters 2011 St. Patrick’s Day Swim Meet
Anderson Barracuda Masters St. Patrick's Day Meet
Anderson Barracuda Masters St. Patrick's Day Meet 2014
Anderson Barracuda Masters 2015 St. Patrick’s Day Swim Meet

Note the wacky three bytes where the apostrophe in "St. Patrick's" should
be. The data came to me as an XLSX spreadsheet, which I dumped to CSV using
LibreOffice. That's how the character was encoded at that point.
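One way to check how the character is actually encoded in the CSV is to look at the raw bytes around each non-ASCII run. The sketch below is only an illustration: non_ascii_runs is a hypothetical helper, and the sample bytes are made up rather than taken from the real file. A correctly encoded right single quote (U+2019) is the three bytes e2 80 99. If the text was misread as cp1252 and saved as UTF-8 again at some earlier step, the same spot holds eight bytes instead. Both forms are valid UTF-8, so a detector such as chardet can report utf-8 in either case.

```python
import re

def non_ascii_runs(raw, context=10):
    """Return (offset, run, surrounding bytes) for each run of bytes >= 0x80."""
    found = []
    for m in re.finditer(rb"[\x80-\xff]+", raw):
        start = max(m.start() - context, 0)
        found.append((m.start(), m.group(), raw[start:m.end() + context]))
    return found

# Made-up sample data, not taken from the real file.
clean = "St. Patrick\u2019s Day".encode("utf-8")
# Simulate UTF-8 text that was misread as cp1252 and saved as UTF-8 again.
doubled = clean.decode("cp1252").encode("utf-8")

for label, raw in (("clean", clean), ("doubled", doubled)):
    for offset, run, around in non_ascii_runs(raw):
        print(label, offset, run, around)
```

Against the real file, the same helper could be called on open("data/meets-usms.csv", "rb").read() to see which of the two byte patterns is present.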

I tweaked my CSV-to-SQLite script to print the meet name and id for those
meets with "Barracuda" and "Patrick" in their name:

                if dry_run or verbose:
                    if ("Barracuda" in row["MeetTitle"] and
                        "Patrick" in row["MeetTitle"]):
                        print("Insert", n, row["MeetTitle"], row["MeetID"])

When I run it, I see raw bytes instead of a properly rendered apostrophe:

% LANG=en_US.utf-8 python3.5 src/usmsmeets2db.py -v data/meets-usms.csv topten.db
Insert 1173 Anderson Barracudas St. Patrick's Day Swim Meet 20090321ABMSTPY
Insert 1559 Anderson Barracuda Masters - 2010 St. Patrick’s Day Swim Meet 20100320CUDASY
Insert 1995 Anderson Barracuda Masters 2011 St. Patrick’s Day Swim Meet 20110319ANDERSY
Insert 3012 Anderson Barracuda Masters St. Patrick's Day Meet 20130316AndersY
Insert 3562 Anderson Barracuda Masters St. Patrick's Day Meet 2014 20140315ANDERSY
Insert 4114 Anderson Barracuda Masters 2015 St. Patrick’s Day Swim Meet 20150321AndersY
Read 4962 rows, inserted 4171 records

Why am I not seeing what I believe to be a non-ASCII apostrophe of some
sort properly printed? This is running on a Mac (Yosemite) in its Terminal
app, with its encoding preference set to utf-8. It appears just as shown
above, "a" with a caret, the Euro symbol, then the "TM" symbol. Have I
perhaps lost the properly encoded bytes somewhere, and now it's just
spewing the bogus bytes (mojibake)?
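A quick way to test the mojibake theory from inside Python is to list the code points of the suspicious characters and see whether re-encoding them as cp1252 yields valid UTF-8. This is a minimal sketch, assuming the mix-up involved cp1252 (Windows-1252); describe and undo_cp1252_mojibake are hypothetical helper names, and the sample string is typed in by hand to match what the terminal shows.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name) for each non-ASCII character."""
    return [("U+%04X" % ord(ch), unicodedata.name(ch, "UNKNOWN"))
            for ch in text if ord(ch) > 127]

def undo_cp1252_mojibake(text):
    """Reverse a UTF-8-read-as-cp1252 mix-up; return text unchanged on failure."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

# Hand-typed sample matching the terminal output: a-circumflex, euro, trademark.
sample = "St. Patrick\u00e2\u20ac\u2122s Day"

print(describe(sample))
print(describe(undo_cp1252_mojibake(sample)))
```

If the strings coming out of csv.DictReader contain those three separate code points, the data was already double-encoded before Python read it, and neither the terminal nor SQLite is at fault. If they contain a single U+2019, the damage is happening on output instead.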

Thanks,

Skip

--
[1] https://pypi.python.org/pypi/chardet/2.3.0