Python 3.x stuffing utf-8 into SQLite db

wxjmfauth at gmail.com wxjmfauth at gmail.com
Tue Feb 10 03:23:33 EST 2015


On Tuesday, 10 February 2015 01:37:15 UTC+1, Skip Montanaro wrote:
> On Mon, Feb 9, 2015 at 2:38 PM, Skip Montanaro <skip.mo... at gmail.com> wrote:
> On Mon, Feb 9, 2015 at 2:05 PM, Zachary Ware <zachary.w... at gmail.com> wrote:
> > If all else fails, you can try ftfy to fix things:
> > http://ftfy.readthedocs.org/en/latest/
>
> Thanks for the pointer. I would prefer to not hand-mangle this stuff
> in case I get another database dump from my USMS friends. Something
> like ftfy should help things "just work".
>
> And indeed it did. Thanks Zachary.

%%%%%%

ftfy: a mountain of absurdities. On top of this: ~buggy.

Everything works fine if it's done correctly. There is
nothing to fix. I have the feeling you are destroying a
correct data file, and later you try to correct what you have
destroyed.
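
To illustrate "nothing to fix": a minimal sketch, assuming the text is already a proper Python 3 str and is handed to the standard sqlite3 module as a bound parameter. The table name, column name and sample string are made up for the illustration; they are not from Skip's dump.

```python
import sqlite3

# Hypothetical sample value; U+2019 is the typographic apostrophe.
text = "Patrick\u2019s Day A1"

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (name TEXT)")

# Pass the str as a bound parameter; sqlite3 takes care of the encoding.
conn.execute("INSERT INTO events (name) VALUES (?)", (text,))
conn.commit()

# Read the value back and compare it with what went in.
stored = conn.execute("SELECT name FROM events").fetchone()[0]
conn.close()

roundtrip_ok = (stored == text)
print(roundtrip_ok)
```

If the file was decoded with the right encoding in the first place, the value that comes back is the value that went in.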

Basically the same experiment as Matthew Ruffalo's:
Office suite --> csv file saved as pd.csv.

From my GUI interactive interpreter (py32):

>>> with open('pd.csv', encoding='utf-8') as f:
...     r = f.read()
...     
>>> print(r)
"Patrick’s Day A1","Patrick’s Day B1","Patrick’s Day C1"
"Patrick’s Day A2","Patrick’s Day C2","Patrick’s Day C2"

>>>


Now what may happen is that the terminal (the host system)
cannot display all these chars correctly (Windows, Russian *x, ...).
In that case, one has to encode the output correctly for that host.

Still with the same GUI interpreter:

>>> sys.stdout.sethostencoding('cp850')
>>> outenc = sys.stdout.encoding
>>> print(r.encode(outenc, 'replace').decode(outenc))
"Patrick?s Day A1","Patrick?s Day B1","Patrick?s Day C1"
"Patrick?s Day A2","Patrick?s Day C2","Patrick?s Day C2"


>>> sys.stdout.sethostencoding('iso-8859-5')
>>> outenc = sys.stdout.encoding
>>> print(r.encode(outenc, 'replace').decode(outenc))
"Patrick?s Day A1","Patrick?s Day B1","Patrick?s Day C1"
"Patrick?s Day A2","Patrick?s Day C2","Patrick?s Day C2"

This is exactly what can be observed in a web browser.

Just for fun, and in fact a no-op:

>>> sys.stdout.sethostencoding('utf-32-le')
>>> outenc = sys.stdout.encoding
>>> print(r.encode(outenc, 'replace').decode(outenc))
"Patrick’s Day A1","Patrick’s Day B1","Patrick’s Day C1"
"Patrick’s Day A2","Patrick’s Day C2","Patrick’s Day C2"
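
Note that sethostencoding belongs to the GUI interpreter used above; it is not a method of the standard sys.stdout. The same encode/replace/decode round trip can be sketched in plain Python by naming the target encoding directly. The helper name displayable and the sample string are mine, for illustration only, and the sample assumes the apostrophe in the file is U+2019.

```python
def displayable(text, encoding):
    """Return text as a host limited to `encoding` could show it.

    Characters the encoding cannot represent become '?'.
    """
    return text.encode(encoding, "replace").decode(encoding)


# Hypothetical sample; U+2019 is the typographic apostrophe.
sample = '"Patrick\u2019s Day A1"'

cp850 = displayable(sample, "cp850")          # Western European DOS code page
cyrillic = displayable(sample, "iso-8859-5")  # Cyrillic 8-bit encoding
utf32 = displayable(sample, "utf-32-le")      # covers all of Unicode

for label, value in (("cp850", cp850),
                     ("iso-8859-5", cyrillic),
                     ("utf-32-le", utf32)):
    # ascii() keeps the loop printable on any terminal.
    print(label, ascii(value))
```

Neither cp850 nor iso-8859-5 contains U+2019, so the character is replaced with '?'; utf-32-le can represent it, so the string comes back unchanged. The data itself is never damaged, only its rendering for the limited host.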



