Convert a list with wrong encoding to utf8

MRAB python at mrabarnett.plus.com
Thu Feb 14 13:56:04 EST 2019


On 2019-02-14 18:16, Calvin Spealman wrote:
> If you see something like this
> 
> '\xce\x86\xce\xba\xce\xb7\xcf\x82
> \xce\xa4\xcf\x83\xce\xb9\xce\xac\xce\xbc\xce\xb7\xcf\x82'
> 
> then you don't have a string, you have raw bytes. You don't "encode" bytes,
> you decode them. If you know this is already encoded as UTF-8 then you just
> need the decode('utf8') part and *not* the encode('latin1') step.
> 
> encode() is something that turns text into bytes
> decode() is something that turns bytes into text
> 
> So, if you already have bytes and you need text, you should only want to be
> doing a decode() and you just need to specify the correct encoding.
> 
It doesn't have a 'b' prefix, so either it's Python 2 or it's a Unicode 
string that was decoded wrongly from the bytes.
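
A minimal sketch of the second case, assuming Python 3 and assuming the
wrong codec was latin1 (the thread suggests this but does not confirm
it). fix_names is a hypothetical helper, not something from the
original script:

```python
# The UTF-8 bytes quoted in the thread, written as a real bytes object.
utf8_bytes = (b"\xce\x86\xce\xba\xce\xb7\xcf\x82 "
              b"\xce\xa4\xcf\x83\xce\xb9\xce\xac\xce\xbc\xce\xb7\xcf\x82")

# Decoding with the wrong codec gives a str (no b prefix) holding one
# character per byte; ascii(mojibake) shows them as \xNN escapes.
mojibake = utf8_bytes.decode("latin-1")

# latin1 maps bytes 0-255 to code points 0-255 one-to-one, so encoding
# back to latin1 recovers the bytes, which then decode as UTF-8.
repaired = mojibake.encode("latin-1").decode("utf-8")


def fix_names(names):
    """Repair wrongly decoded items; leave anything else untouched."""
    fixed = []
    for name in names:
        try:
            fixed.append(name.encode("latin-1").decode("utf-8"))
        except (UnicodeEncodeError, UnicodeDecodeError):
            # Not latin1-representable, or not valid UTF-8 underneath:
            # assume the item is already correct text.
            fixed.append(name)
    return fixed
```

This only repairs the symptom. If the driver is decoding with the wrong
charset, fixing the connection settings would be the cleaner route.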

> On Thu, Feb 14, 2019 at 12:15 PM <vergos.nikolas at gmail.com> wrote:
> 
>> On Thursday, 14 February 2019 at 6:45:29 PM UTC+2, Calvin Spealman
>> wrote:
>> > You can only decode FROM the same encoding you've encoded TO. Any
>> > decoding must know the input it receives follows the rules of its
>> > encoding scheme. latin1 is not utf8.
>> >
>> > However, in your case, you aren't seeing a problem with the decoding.
>> > That step is never reached. It is failing to encode the string as
>> > latin1 because it is not compatible with the latin1 scheme. Your string
>> > contains characters which cannot be represented in latin1.
>> >
>> > It really is not clear what you're trying to accomplish here. The string
>> > encoding was already handled when you pulled this out of the database and
>> > you should not need to do anything like this at all. You already have a
>> > decoded string, because in python ALL strings are decoded already.
>> > Encoding is only a process of converting strings to raw bytes for
>> > storage or transmission, which you don't appear to be doing here.
>>
>> The names in the database are stored in utf8.
>> When the script runs it reads them and handles them as utf8, right?
>>
>> If that is so, then why, when I print the 'names' list, do I see bytes
>> in hexadecimal format?
>>
>> '\xce\x86\xce\xba\xce\xb7\xcf\x82
>> \xce\xa4\xcf\x83\xce\xb9\xce\xac\xce\xbc\xce\xb7\xcf\x82'
>>
>> And only if I do
>>
>> for name in names:
>>     print( name.encode('latin1').decode('utf8') )
>>
>> can I see the values of the 'names' list correctly in Greek.
>>
>> But where did the latin-iso encoding come into play? And apart from
>> printing the names like above, how can I store them in proper UTF-8?
>>


