String character encoding when converting data from one type/format to another

Wed Jan 7 08:32:00 EST 2015

Thanks.

Yes, I'm using Python 2.7, and everything you said makes sense, so I'll 
check out the talk, and the byte'ing, etc. (yes, bad joke, I know)

The problem with knowing the encoding is that I'm pretty sure at least some 
of the data capture is done via copy/paste from one MS app to another, which 
could bring across a whole range of different character sets. So while we 
can't control the sources of the data, we still need to manipulate it to 
make it useful on our side.
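
For data of unknown origin like this, one possible approach is to try a 
short list of likely encodings in order. This is only a sketch: the function 
name decode_best_effort and the candidate list (UTF-8 first, then Windows 
cp1252, a guess based on the data coming from MS apps) are assumptions, not 
something confirmed for these files. It should run on Python 2.7 and 3.

```python
def decode_best_effort(raw, candidates=("utf-8", "cp1252")):
    """Return (text, encoding_name) for a byte string of unknown encoding.

    The candidate list is an assumption; adjust it to the real sources.
    """
    for name in candidates:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so this never raises,
    # but the resulting characters may not be the intended ones.
    return raw.decode("latin-1"), "latin-1"


text, used = decode_best_effort(b"\xa3")
```

Returning the encoding name alongside the text makes it possible to log 
which values needed a fallback, so they can be checked by hand later.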

Thanks again

Jacob Kruger
Blind Biker
Skype: BlindZA
"Roger Wilco wants to welcome you...to the space janitor's closet..."

----- Original Message ----- 
From: "Ned Batchelder" <ned at nedbatchelder.com>
To: <python-list at python.org>
Sent: Wednesday, January 07, 2015 2:02 PM
Subject: Re: String character encoding when converting data from one 
type/format to another

> On 1/7/15 6:04 AM, Jacob Kruger wrote:
>> I'm busy using something like pyodbc to pull data out of MS access .mdb
>> files, and then generate .sql script files to execute against MySQL
>> databases using MySQLdb module, but, issue is forms of characters in
>> string values that don't fit inside the 0-127 range - current one seems
>> to be something like \xa3, and if I pass it through ord() function, it
>> comes out as character number 163.
>> Now issue is, yes, could just run through the hundreds of thousands of
>> characters in these resulting strings, and strip out any that are not
>> within the basic 0-127 range, but, that could result in corrupting data
>> - think so anyway.
>
> That will definitely corrupt your data, since you will be discarding data.
>
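
To illustrate the point with a made-up value (the sample bytes and the 
cp1252 guess are assumptions, not taken from the real data), stripping 
everything above 127 silently drops the currency symbol:

```python
raw = b"Price: \xa3100"  # hypothetical field value containing byte 163

# Keep only bytes in the 0-127 range; bytearray iterates as integers
# on both Python 2.7 and Python 3.
stripped = bytes(bytearray(b for b in bytearray(raw) if b < 128))

# Decoding with the (assumed) source encoding keeps every character.
decoded = raw.decode("cp1252")
```

Here stripped no longer says which currency the amount is in, while decoded 
still carries the pound sign as U+00A3.
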
>> Anyway, issue is, for example, if I try something like
>> str('\xa3').encode('utf-8') or str('\xa3').encode('ascii'), or
>> str('\xa3').encode('latin7') - that last one is actually our preferred
>> encoding for the MySQL database - they all just tell me they can't work
>> with a character out of range.
>
> Are you using Python 2 or Python 3? This is one area where the two are 
> very different.  I suspect you are on Python 2, in which case these all 
> fail the same way because you are calling encode on a bytestring.  You 
> can't encode a bytestring, you can only encode a Unicode string, so encode 
> is helpfully trying to decode your bytestring first, using the default 
> encoding (ascii), and '\xa3' is not an ascii character.
>
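
The implicit step described above can be made explicit. The sketch below 
first reproduces the hidden ascii decode, then does the decode with a named 
encoding before encoding to the target. The choice of cp1252 as the source 
encoding is an assumption for illustration only.

```python
raw = b"\xa3"

# This is the step Python 2 performs behind the scenes when encode()
# is called on a bytestring; it fails for any byte above 127.
try:
    raw.decode("ascii")
    implicit_decode_failed = False
except UnicodeDecodeError:
    implicit_decode_failed = True

# Explicit version: bytes -> Unicode with a known (here: assumed)
# source encoding, then Unicode -> bytes in the target encoding.
text = raw.decode("cp1252")
utf8_bytes = text.encode("utf-8")
```
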
> If that was confusing, this talk covers these fundamentals: 
> http://bit.ly/unipain .
>
>> Any thoughts on a sort of generic method/means to handle any/all
>> characters that might be out of range when having pulled them out of
>> something like these MS access databases?
>
> The best thing is to know what encoding was used to produce these byte 
> values.  Then you can manipulate them as Unicode if you need to.  The 
> second best thing is to simply pass them through as bytes.
>
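
One way to apply the first option when generating a script file is to decode 
at the input edge, build the statements as Unicode, and encode once when 
writing. This is a sketch only: write_sql_script, the prices table, and the 
cp1252 source encoding are hypothetical, and no SQL escaping is shown.

```python
import io


def write_sql_script(statements, path, encoding="utf-8"):
    """Write Unicode statements to path, encoding once at the output edge."""
    with io.open(path, "w", encoding=encoding, newline="\n") as handle:
        for statement in statements:
            handle.write(statement + u"\n")


raw_value = b"\xa3"                      # bytes as read from the source
text_value = raw_value.decode("cp1252")  # assumption: source was cp1252

# Hypothetical table and column; real code would also need escaping.
statement = u"INSERT INTO prices (symbol) VALUES ('%s');" % text_value
```

The encoding argument would need to match whatever character set the 
target database expects for the script.
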
>> Another side note is for fields that might store binary values, I
>> use something like the following to generate hex-based strings that work
>> alright when then inserting said same binary values into longblob
>> fields, but, don't think this would really help for what are really just
>> most likely badly chosen copy/pasted strings from documents, with
>> strange encoding, or something:
>> #sample code line for binary encoding into string output
>> s_values += "0x" + str(l_data[J][I]).encode("hex").replace("\\", "\\\\") + ", "
>> TIA
>
>
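
On the hex line quoted above: the "hex" codec on str only exists in Python 
2, and hex output never contains a backslash, so the replace() call has no 
effect. A version that should work on both Python 2.7 and 3 could use 
binascii.hexlify instead. The helper name blob_literal is hypothetical, and 
the handling of empty values is an assumption.

```python
import binascii


def blob_literal(data):
    """Return a 0x... hex literal for a byte string."""
    if not data:
        # Assumption: an empty quoted string is the safer spelling
        # for empty binary data than a bare "0x".
        return "''"
    return "0x" + binascii.hexlify(data).decode("ascii")


row = [b"\x00\xa3\xff", b"\\path"]  # made-up sample values
s_values = ", ".join(blob_literal(value) for value in row)
```
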
> -- 
> Ned Batchelder, http://nedbatchelder.com