String character encoding when converting data from one type/format to another

Dave Angel davea at davea.name
Wed Jan 7 07:22:03 EST 2015


On 01/07/2015 06:04 AM, Jacob Kruger wrote:
> I'm busy using something like pyodbc to pull data out of MS access
> .mdb files, and then generate .sql script files to execute against
> MySQL databases using MySQLdb module, but, issue is forms of
> characters in string values that don't fit inside the 0-127 range -
> current one seems to be something like \xa3, and if I pass it
> through ord() function, it comes out as character number 163.

First question, of course, is what version of Python.  Clearly, you're 
not using Python 3.x, so I'll assume 2.7.  But you really should specify 
it in your question.

Next question is what OS you're using.  You're reading .mdb files, which 
were most likely created on Windows, but that doesn't guarantee you're 
actually using Windows to do this conversion.


>
> Now issue is, yes, could just run through the hundreds of thousands of characters in these resulting strings, and strip out any that are not within the basic 0-127 range, but, that could result in corrupting data - think so anyway.
>
> Anyway, issue is, for example, if I try something like str('\xa3').encode('utf-8') or str('\xa3').encode('ascii'), or str('\xa3').encode('latin7') - that last one is actually our preferred encoding for the MySQL database - they all just tell me they can't work with a character out of range.
>

That's because your input data isn't ASCII.  So before you encode it, 
you have to decode it.  Any idea what encoding it's already in?  Maybe 
it's latin1, which permits all 256 byte values.  Or utf-8, which can 
represent every Unicode character (over a million code points), but 
uses multiple bytes for any of those over 127.  Or any of hundreds of 
other encodings.
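For example (Python 2.7; the latin1 guess here is only an assumption, 
to illustrate the decode-then-encode order):

    # Python 2.7: a byte string must be decoded to unicode before it
    # can be re-encoded.  Encoding a str directly first runs an
    # implicit .decode('ascii'), which is what's blowing up for you.
    raw = '\xa3'                       # byte string from the .mdb (type str)

    text = raw.decode('latin1')        # assumption: the bytes are latin1
    print repr(text)                   # u'\xa3', i.e. the pound sign

    utf8_bytes = text.encode('utf-8')  # now safe to encode for the target
    print repr(utf8_bytes)             # '\xc2\xa3'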

Does an .mdb file have a field specifying what encoding was used?  Or do 
you have some other external knowledge?

If you don't know what encoding it's currently in, you'll have to guess, 
and the guess you're using so far is ASCII, which you know is false.
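If you do have to guess, the third-party chardet package (pip install 
chardet -- my suggestion, not something you mentioned) can at least 
make the guess an informed one:

    import chardet  # third-party library for guessing encodings

    sample = open('dump.bin', 'rb').read()  # hypothetical file of raw bytes
    guess = chardet.detect(sample)
    # guess is a dict, e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73}
    print guess['encoding'], guess['confidence']

    text = sample.decode(guess['encoding'])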

As for the encoding you should actually use in the database, that almost 
certainly ought to be utf-8, which supports far more international 
characters than latin1.  And make sure the database has a way to tell 
the future user what encoding you picked.
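One place to record that choice is the connection itself; a sketch 
(connection details made up):

    import MySQLdb

    # charset='utf8' makes the connection use utf-8 on the wire, and
    # use_unicode=True hands text columns back as unicode objects
    # instead of byte strings.
    conn = MySQLdb.connect(host='localhost', user='me', passwd='secret',
                           db='mydb', charset='utf8', use_unicode=True)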

> Any thoughts on a sort of generic method/means to handle any/all characters that might be out of range when having pulled them out of something like these MS access databases?

The only invalid characters are those which aren't representable in the 
encoding used.  Those can probably be safely converted to "?" or 
something similar.
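The codecs already support that directly via their errors argument, so 
you don't need to loop over the characters yourself; a small Python 2.7 
sketch:

    raw = 'caf\xe9 \xa3 5'                  # bytes in some unknown encoding

    # 'replace' swaps anything the codec can't handle for U+FFFD when
    # decoding, or '?' when encoding, instead of raising an exception.
    text = raw.decode('ascii', 'replace')   # u'caf\ufffd \ufffd 5'
    back = text.encode('ascii', 'replace')  # 'caf? ? 5'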

>
> Another side note is for binary values that might store binary values, I use something like the following to generate hex-based strings that work alright when then inserting said same binary values into longblob fields, but, don't think this would really help for what are really just most likely badly chosen copy/pasted strings from documents, with strange encoding, or something:
> #sample code line for binary encoding into string output
> s_values += "0x" + str(l_data[J][I]).encode("hex").replace("\\", "\\\\") + ", "
>

Best not to pretend they're text at all.  But if your db doesn't support 
binary blobs, then use an encoding which supports all 256 byte values 
unambiguously while producing printable characters, like uuencode, via 
the stdlib uu module.
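A sketch of that route in Python 2.7 (uu works on file objects, so 
in-memory files stand in here):

    import uu
    from cStringIO import StringIO

    blob = '\x00\x01\xfe\xff' * 4       # arbitrary binary data (type str)

    # Wrap the bytes in file-like objects, since uu.encode reads from
    # and writes to files rather than strings.
    inp, out = StringIO(blob), StringIO()
    uu.encode(inp, out, name='blob')
    printable = out.getvalue()          # starts with 'begin 666 blob'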

You might also look into mime, where you store the encoding of the data 
with the data.  See for example mimetypes.
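For instance, the stdlib mimetypes module maps filenames to MIME types, 
which you could store alongside the data:

    import mimetypes

    # Guess a MIME type from a (hypothetical) filename; returns a
    # (type, encoding) tuple such as ('application/pdf', None).
    print mimetypes.guess_type('report.pdf')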



-- 
DaveA


