Convert a list with wrong encoding to utf8

Calvin Spealman cspealma at redhat.com
Thu Feb 14 13:16:12 EST 2019


If you see something like this

'\xce\x86\xce\xba\xce\xb7\xcf\x82 \xce\xa4\xcf\x83\xce\xb9\xce\xac\xce\xbc\xce\xb7\xcf\x82'

then you don't have a string, you have raw bytes. You don't "encode" bytes,
you decode them. If you know this is already encoded as UTF-8 then you just
need the decode('utf8') part and *not* the encode('latin1') step.

encode() is something that turns text into bytes
decode() is something that turns bytes into text
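
As a minimal sketch of the two directions (the sample value is made up for
illustration, it is not taken from your database):

```python
# A str holds text; encode() turns it into bytes.
text = "Άκης"
data = text.encode("utf8")
print(type(data).__name__)      # bytes

# bytes hold raw data; decode() turns them back into text.
restored = data.decode("utf8")
print(type(restored).__name__)  # str
print(restored == text)         # True
```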

So, if you already have bytes and you need text, the only step you want is a
decode(), and you just need to specify the correct encoding.
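
For the value shown above, that could look like the sketch below. It assumes
you really do hold a bytes object (note the b prefix) and that the data is
UTF-8; the variable names are only for illustration.

```python
# The bytes from the message, assumed to be UTF-8 encoded.
raw = b'\xce\x86\xce\xba\xce\xb7\xcf\x82 \xce\xa4\xcf\x83\xce\xb9\xce\xac\xce\xbc\xce\xb7\xcf\x82'

# Decoding with the correct encoding gives the Greek text.
name = raw.decode('utf8')
print(name)

# Decoding with the wrong encoding does not raise here, because latin1
# accepts every byte value, but the result is one wrong character per byte.
wrong = raw.decode('latin1')
print(len(raw), len(name), len(wrong))
```

The lengths make the difference visible: UTF-8 uses two bytes for each Greek
letter, so the correctly decoded text is shorter than the byte string.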

On Thu, Feb 14, 2019 at 12:15 PM <vergos.nikolas at gmail.com> wrote:

> On Thursday, 14 February 2019 at 6:45:29 PM UTC+2, Calvin Spealman wrote:
> > You can only decode FROM the same encoding you've encoded TO. Any decoding
> > must know the input it receives follows the rules of its encoding scheme.
> > latin1 is not utf8.
> >
> > However, in your case, you aren't seeing a problem with the decoding. That
> > step is never reached. It is failing to encode the string as latin1 because
> > it is not compatible with the latin1 scheme. Your string contains
> > characters which cannot be represented in latin1.
> >
> > It really is not clear what you're trying to accomplish here. The string
> > encoding was already handled when you pulled this out of the database and
> > you should not need to do anything like this at all. You already have a
> > decoded string, because in python ALL strings are decoded already. Encoding
> > is only a process of converting strings to raw bytes for storage or
> > transmission, which you don't appear to be doing here.
>
> Names in the database are stored in utf8.
> When the script runs, it reads them and handles them as utf8, right?
>
> If so, why do I see bytes in hexadecimal format when I print the 'names'
> list?
>
> '\xce\x86\xce\xba\xce\xb7\xcf\x82 \xce\xa4\xcf\x83\xce\xb9\xce\xac\xce\xbc\xce\xb7\xcf\x82'
>
> And only if I do
>
> for name in names:
>     print( name.encode('latin1').decode('utf8') )
>
> can I see the values of the 'names' list correctly in Greek.
>
> But where did latin1 (ISO-8859-1) come into play? And apart from printing
> the names like above, how can I store them in proper UTF-8?
>
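
One possible explanation for why the encode('latin1').decode('utf8')
workaround prints correct Greek, offered only as an assumption since the
thread does not show how 'names' is read: if UTF-8 bytes were decoded as
latin1 somewhere upstream, each byte became one character, and encoding back
to latin1 recovers the original bytes. A minimal sketch with a made-up value:

```python
# Hypothetical sample value; not read from any real database.
original = "Άκης Τσιάμης"
utf8_bytes = original.encode("utf8")

# Wrong codec upstream: no error, but every byte turns into one character.
mojibake = utf8_bytes.decode("latin1")

# The workaround reverses that mistake: back to the same bytes, then
# decode them with the encoding they were actually written in.
repaired = mojibake.encode("latin1").decode("utf8")
print(repaired == original)  # True
```

If that is what is happening, the lasting fix would be to make the upstream
step decode as UTF-8 rather than to repair each string afterwards.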




