PEP 249 Compliant error handling

Wed Oct 18 15:15:41 EDT 2017

On 17/10/17 19:26, Israel Brewster wrote:
> I have written and maintain a PEP 249 compliant (hopefully) DB API for the 4D database, and I've run into a situation where corrupted string data from the database can cause the module to error out. Specifically, when decoding the string, I get a "UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 86-87: illegal UTF-16 surrogate" error. This makes sense, given that the string data got corrupted somehow, but the question is "what is the proper way to deal with this in the module?" Should I just throw an error on bad data? Or would it be better to set the errors parameter to something like "replace"? The former feels a bit more "proper" to me (there's an error here, so we throw an error), but leaves the end user dead in the water, with no way to retrieve *any* of the data (from that row at least, and perhaps any rows after it as well). The latter option sort of feels like sweeping the problem under the rug, but does at least leave an error character in the string to let them know there was an error, and will allow retrieval of any good data.
> 
> Of course, if this was in my own code I could decide on a case-by-case basis what the proper action is, but since this is a module that has to work in any situation, it's a bit more complicated.

The sqlite3 module falls back to returning bytes if there's a decoding
error. I don't know what the other modules do. It should be easy enough
for you to test this, though!

Python 3.5.3 (default, Jan 19 2017, 14:11:04)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.1.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import sqlite3

In [2]: db = sqlite3.connect("malformed-test1.sqlite3")

In [3]: db.execute("CREATE TABLE test (txt TEXT)")
Out[3]: <sqlite3.Cursor at 0x7f701e0378f0>

In [4]: db.execute("INSERT INTO test VALUES(?)", ("utf-8: é",))
Out[4]: <sqlite3.Cursor at 0x7f701e037b90>

In [5]: db.execute("INSERT INTO test VALUES(?)", ("latin1: é".encode('latin1'),))
Out[5]: <sqlite3.Cursor at 0x7f701e037c70>

In [6]: db.execute("SELECT * FROM test").fetchall()
Out[6]: [('utf-8: é',), (b'latin1: \xe9',)]

In [7]: db.text_factory = bytes # sqlite3 extension to the API

In [8]: db.execute("SELECT * FROM test").fetchall()
Out[8]: [(b'utf-8: \xc3\xa9',), (b'latin1: \xe9',)]

For what it's worth, this is also what os.listdir() does when it
encounters filenames in the wrong encoding on operating systems where
this is possible (e.g. Linux, but not Windows).

If the encoding could be anything, I think you should give the user some
kind of choice between using bytes, raising errors, and escaping.
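To make that choice concrete, here is a rough sketch of what such a switch
could look like. The function name decode_text and the on_error option names
are made up for illustration; they are not part of PEP 249 or of any existing
module, and a real driver would more likely expose this as a connection
setting.

```python
def decode_text(raw, encoding="utf-16-le", on_error="raise"):
    """Decode raw column bytes according to a caller-chosen policy.

    on_error values (hypothetical names, chosen for this sketch):
      "raise"  -> let UnicodeDecodeError propagate to the caller
      "bytes"  -> hand back the undecoded bytes object instead
      "escape" -> keep the good text, show bad bytes as \\xNN escapes
    """
    if on_error not in ("raise", "bytes", "escape"):
        raise ValueError("unknown on_error policy: %r" % (on_error,))
    if on_error == "escape":
        # backslashreplace is usable for decoding in Python 3.5 and later
        return raw.decode(encoding, errors="backslashreplace")
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        if on_error == "bytes":
            return raw
        raise
```

With "bytes" the caller has to be prepared for either str or bytes in a text
column, which is the price of never losing data.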

In the particular case of UTF-16 (especially if the encoding is always
UTF-16), the best solution is almost certainly to use
errors='surrogatepass' in both en- and decoding. I believe this is
fairly common practice when full interoperability with software that
predates UTF-16 (and previously used UCS-2) is required. This should
solve all your problems as long as you don't get strings with an odd
number of bytes.
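
As a small illustration of the surrogatepass suggestion (the sample bytes are
invented for this example, not taken from the 4D data): a lone surrogate
fails a strict decode, passes through with surrogatepass, and encodes back to
the same bytes. An odd number of bytes still fails.

```python
# 'A', a lone high surrogate (U+D800), then 'B', all as UTF-16-LE
raw = b"A\x00\x00\xd8B\x00"

try:
    raw.decode("utf-16-le")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False  # strict decoding rejects the lone surrogate

# surrogatepass keeps the lone surrogate as a code point in the str
text = raw.decode("utf-16-le", errors="surrogatepass")

# the same handler is needed when encoding, to get the original bytes back
round_trip = text.encode("utf-16-le", errors="surrogatepass")

try:
    # three bytes: one complete code unit plus a dangling byte
    b"A\x00B".decode("utf-16-le", errors="surrogatepass")
    odd_ok = True
except UnicodeDecodeError:
    odd_ok = False  # truncated data is still an error
```

Note that a str containing lone surrogates cannot be encoded to UTF-8 with
the default strict handler, so users may trip over it later when printing or
writing the value out.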

See: https://en.wikipedia.org/wiki/UTF-16#U.2BD800_to_U.2BDFFF

-- Thomas