PEP 249 Compliant error handling

MRAB python at mrabarnett.plus.com
Tue Oct 17 16:02:29 EDT 2017


On 2017-10-17 20:25, Israel Brewster wrote:
>
>> On Oct 17, 2017, at 10:35 AM, MRAB <python at mrabarnett.plus.com> wrote:
>>
>> On 2017-10-17 18:26, Israel Brewster wrote:
>>> I have written and maintain a PEP 249 compliant (hopefully) DB API 
>>> for the 4D database, and I've run into a situation where corrupted 
>>> string data from the database can cause the module to error out. 
>>> Specifically, when decoding the string, I get a "UnicodeDecodeError: 
>>> 'utf-16-le' codec can't decode bytes in position 86-87: illegal 
>>> UTF-16 surrogate" error. This makes sense, given that the string 
>>> data got corrupted somehow, but the question is "what is the proper 
>>> way to deal with this in the module?" Should I just throw an error 
>>> on bad data? Or would it be better to set the errors parameter to 
>>> something like "replace"? The former feels a bit more "proper" to me 
>>> (there's an error here, so we throw an error), but leaves the end 
>>> user dead in the water, with no way to retrieve *any* of the data 
>>> (from that row at least, and perhaps any rows after it as well). The 
>>> latter option sort of feels like sweeping the problem under the rug, 
>>> but does at least leave an error character in the string to let
>>> them know there was an error, and will allow retrieval of any
>>> good data.
>>> Of course, if this was in my own code I could decide on a 
>>> case-by-case basis what the proper action is, but since this is a 
>>> module that has to work in any situation, it's a bit more complicated.
>> If a particular text field is corrupted, then raising 
>> UnicodeDecodeError when trying to get the contents of that field as a 
>> Unicode string seems reasonable to me.
>>
>> Is there a way to get the contents as a bytestring, or to get the 
>> contents with a different errors parameter, so that the user has the 
>> means to fix it (if it's fixable)?
>
> That's certainly a possibility, if that behavior conforms to the DB 
> API "standards". My concern on this front is that in my experience 
> working with other PEP 249 modules (specifically psycopg2), I'm pretty 
> sure that columns designated as type VARCHAR or TEXT are returned as 
> strings (unicode in Python 2, although that may have been a setting I 
> used), not bytes. The other complication here is that the 4D database 
> doesn't use the UTF-8 encoding typically found, but rather UTF-16LE, 
> and I don't know how well this is documented. So not only is the bytes 
> representation completely unintelligible for human consumption, I'm 
> not sure the average end-user would know what decoding to use.
>
> In the end though, the main thing in my mind is to maintain 
> "standards" compatibility - I don't want to be returning bytes if all 
> other DB API modules return strings, or vice versa for that matter. 
> There may be some flexibility there, but as much as possible I want to 
> conform to the majority/standard/whatever.
>
The average end-user might not know which encoding is being used, but 
providing a way to read the underlying bytes will give a more 
experienced user the means to investigate and possibly fix it: get the 
bytes, figure out what the string should be, update the field with the 
correctly decoded string using normal DB instructions.
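
As a rough sketch of that workflow (an illustration only, not the 4D
module's actual API): decode_field below is a hypothetical helper, and
the corrupted bytes are simulated rather than read from a database. It
shows the strict decode failing, the offending bytes being located from
the exception, and a lenient decode salvaging the rest of the text.

```python
def decode_field(raw: bytes, errors: str = "strict") -> str:
    """Hypothetical helper: decode a UTF-16LE text field.

    `errors` is passed straight through to bytes.decode(), so a caller
    can choose "strict" (raise) or "replace" (insert U+FFFD).
    """
    return raw.decode("utf-16-le", errors=errors)


# Simulated corrupted field: valid text with a lone surrogate code unit
# (0xDC00) in the middle. Real data would come from the driver as bytes.
raw = "abc".encode("utf-16-le") + b"\x00\xdc" + "def".encode("utf-16-le")

try:
    decode_field(raw)
    strict_failed = False
except UnicodeDecodeError as exc:
    strict_failed = True
    # exc.start and exc.end delimit the bytes the codec rejected.
    bad_span = raw[exc.start:exc.end]

# Lenient decode: the bad bytes become U+FFFD, the good text survives.
salvaged = decode_field(raw, errors="replace")
```

From there the user can inspect bad_span and salvaged, work out what
the string should have been, and write it back with an ordinary UPDATE.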


