String character encoding when converting data from one type/format to another

Peter Otten __peter__ at web.de
Wed Jan 7 07:11:48 EST 2015


Jacob Kruger wrote:

> I'm busy using pyodbc to pull data out of MS Access .mdb files and then
> generating .sql script files to execute against MySQL databases with the
> MySQLdb module. The issue is string values containing characters that
> don't fit inside the 0-127 range - the current one seems to be \xa3,
> which comes out as character number 163 if I pass it through ord().
> 
> Now, yes, I could just run through the hundreds of thousands of
> characters in these resulting strings and strip out any that are not
> within the basic 0-127 range, but I think that could corrupt the data.
> 
> Anyway, if I try something like str('\xa3').encode('utf-8'),
> str('\xa3').encode('ascii'), or

"\xa3" already is a str; str("\xa3") is as redundant as 
str(str(str("\xa3"))) ;)

> str('\xa3').encode('latin7') - that last one is actually our preferred
> encoding for the MySQL database - they all just tell me they can't work
> with a character out of range.

encode() goes from unicode to bytes; you want to convert bytes to unicode
and thus need decode().

In this context it is important that you tell us the Python version. In 
Python 2 str.encode(encoding) is basically 

str.decode("ascii").encode(encoding)

which is why you probably got a UnicodeDecodeError in the traceback:

>>> "\xa3".encode("latin7")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/encodings/iso8859_13.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa3 in position 0: 
ordinal not in range(128)

>>> "\xa3".decode("latin7")
u'\xa3'
>>> print "\xa3".decode("latin7")
£
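
The fix, then, is to decode with the encoding the bytes actually use and
only then encode for MySQL. "cp1252" below is just a guess at what Access
hands back; substitute the real source encoding:

>>> "\xa3".decode("cp1252").encode("latin7")
'\xa3'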

Aside: always include the traceback in your posts -- and always read it 
carefully. The fact that "latin7" is not mentioned might have given you a 
hint that the problem was not what you thought it was.

> Any thoughts on a sort of generic method/means to handle any/all
> characters that might be out of range when having pulled them out of
> something like these MS access databases?

Assuming the data in Access is not broken and that you know the encoding,
decode() will work.
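
A minimal sketch of that conversion, assuming the Access text is cp1252 (a
common Windows default); adjust source_encoding to whatever the .mdb really
uses, and note the function name is made up:

def access_to_latin7(value, source_encoding="cp1252"):
    # pyodbc may hand back unicode or raw bytes depending on the driver;
    # normalize to unicode first, then encode for the generated .sql file.
    if isinstance(value, str):
        value = value.decode(source_encoding)
    return value.encode("latin7")

Run every text column through something like this before writing it into
the .sql script.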

> Another side note: for fields that store binary values, I use something
> like the following to generate hex-based strings, which work alright when
> inserting those binary values into longblob fields. But I don't think
> this would really help for what are most likely just badly chosen
> copy/pasted strings from documents with strange encoding, or something:
> #sample code line for binary encoding into string output
> s_values += "0x" + str(l_data[J][I]).encode("hex").replace("\\", "\\\\") +
> ", "

I would expect that you can feed bytestrings directly into blobs, without 
any preparatory step. Try it, and if you get failures show us the failing 
code and the corresponding traceback.
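
For example, if you execute the inserts through MySQLdb instead of writing
them into a script, a parameterized query lets the driver do the escaping.
The connection details and table/column names below are made up, and
l_data[J][I] is the value your hex-encoding line starts from:

import MySQLdb

conn = MySQLdb.connect(host="localhost", user="user", passwd="secret",
                       db="target")
cur = conn.cursor()
raw = l_data[J][I]  # the raw bytes you currently turn into a hex string
cur.execute("INSERT INTO attachments (payload) VALUES (%s)",
            (MySQLdb.Binary(raw),))
conn.commit()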



