String character encoding when converting data from one type/format to another

Chris Angelico rosuav at gmail.com
Wed Jan 7 07:20:16 EST 2015


On Wed, Jan 7, 2015 at 11:02 PM, Ned Batchelder <ned at nedbatchelder.com> wrote:
>> Any thoughts on a sort of generic method/means to handle any/all
>> characters that might be out of range when having pulled them out of
>> something like these MS access databases?
>
>
> The best thing is to know what encoding was used to produce these byte
> values.  Then you can manipulate them as Unicode if you need to.  The second
> best thing is to simply pass them through as bytes.

If you can't know for sure, you could hazard a guess. There's a good
chance that an eight-bit encoding from a Microsoft product is CP-1252.
In fact, when I interoperate with Unicode-unaware Windows programs, I
usually attempt a UTF-8 decode, and if that fails, I simply assume
CP-1252; this generally gives correct results for data coming from
US-English Windows users.
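
As a rough sketch of that fallback (the function name decode_guess is
made up for illustration, and the errors="replace" on the second
attempt is my own assumption, added because Python's cp1252 codec
rejects the few byte values that CP-1252 leaves unassigned):

```python
def decode_guess(raw: bytes) -> str:
    """Try UTF-8 first; if that fails, assume CP-1252."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: unassigned CP-1252 bytes (e.g. 0x81, 0x90) become
        # U+FFFD instead of raising a second exception.
        return raw.decode("cp1252", errors="replace")


if __name__ == "__main__":
    print(decode_guess(b"caf\xc3\xa9"))  # valid UTF-8
    print(decode_guess(b"\xa3100"))      # not valid UTF-8, falls back
```

Trying UTF-8 first is fairly safe because text in an eight-bit encoding
that contains non-ASCII bytes is rarely also valid UTF-8, so a
successful UTF-8 decode is a strong hint that the data really is UTF-8.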

Jacob, have a look at your data. Contextually, would the '\xa3' be
likely to be a pound sign, £? Would '\x85' make sense as an ellipsis?
Would '\x91', '\x92', '\x93', and '\x94' seem to be used for quote
marks? If so, CP-1252 would be the encoding to use.
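
One way to run that check is to decode the suspicious byte values as
CP-1252 and print the Unicode name of each result. This is only a
sketch; the helper name describe and the particular list of bytes are
chosen for illustration:

```python
import unicodedata


def describe(byte_value: int) -> str:
    """Return the Unicode name of a byte decoded as CP-1252."""
    try:
        ch = bytes([byte_value]).decode("cp1252")
    except UnicodeDecodeError:
        # Python's cp1252 codec has no mapping for this byte.
        return "(unassigned in cp1252)"
    return unicodedata.name(ch, "(unnamed)")


if __name__ == "__main__":
    for b in (0xA3, 0x85, 0x90, 0x91, 0x92, 0x93, 0x94):
        print(f"0x{b:02x}: {describe(b)}")
```

If the names that come out match what the surrounding text seems to
want (currency amounts, trailing dots, quoted phrases), that supports
the CP-1252 guess.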

ChrisA


