usage of <string>.encode('utf-8','xmlcharrefreplace')?

Tue Feb 19 03:42:37 EST 2008

On Feb 19, 12:15 am, J Peyret <jpey... at gmail.com> wrote:
> On Feb 18, 10:54 pm, 7stud <bbxx789_0... at yahoo.com> wrote:
>
> > One last point: you can't display a unicode string.  The very act of
> > trying to print a unicode string causes it to be converted to a
> > regular string.  If you try to display a unicode string without
> > explicitly encode()'ing it first, i.e. converting it to a regular
> > string using a specified secret code--a so called 'codec', python will
> > implicitly attempt to convert the unicode string to a regular string
> > using the default codec, which is usually set to ascii.
>
> Yes, the string above was obtained by printing, which got it into
> ASCII format, as you picked up.
> Something else to watch out for when posting unicode issues.
>
> The solution I ended up with was
>
> 1) Find out the encoding in the data file.
>
> In Ubuntu's gedit editor, menu 'Save As...' displays the encoding at
> the bottom of the save prompt dialog.
>
> ISO-8859-15 in my case.
>
> 2) Look up encoding corresponding to ISO-8859-15 at
>
> http://docs.python.org/lib/standard-encodings.html
>
> 3) Applying the decode/encode recipe suggested previously, for which I
> do understand the reason now.
>
> #converting rawdescr
> #from ISO-8859-15 (from the file)
> #to UTF-8 (what postgresql wants)
> #no error handler required.
> decodeddescr = rawdescr.decode('iso8859_15').encode('utf-8')
>
> postgresql insert is done using decodeddescr variable.
>
> Postgresql is happy, I'm happy.
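
For anyone following along on Python 3, the recipe quoted above might look roughly like the sketch below. It assumes Python 3, where bytes plays the role of the "regular string" and str is already unicode. The names raw_descr and utf8_descr and the sample bytes are made up for illustration; they are not from the original data file.

```python
# Python 3 sketch of the decode/encode recipe.
# Sample input is hypothetical: 0xA4 is the euro sign in ISO-8859-15.
raw_descr = b'price: 5 \xa4'

# Step 1: bytes in the file's encoding -> unicode text (str).
text = raw_descr.decode('iso8859_15')

# Step 2: unicode text -> bytes in the encoding the database expects.
utf8_descr = text.encode('utf-8')

print(repr(text))
print(repr(utf8_descr))
```

The euro sign is a handy test character because ISO-8859-1 and ISO-8859-15 disagree about byte 0xA4, so picking the wrong codec shows up immediately.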

Or, you can cheat.  If you are reading from a file, you can set it up
so that any string you read from the file automatically gets
converted from its encoding to another encoding.  You don't even have
to be aware of the fact that a regular string has to be converted into
a unicode string before it can be converted to a regular string with a
different encoding.  Check out the codecs module and the EncodedFile()
function:

import codecs

s = 'he Company\xef\xbf\xbds ticker'

f = open('data2.txt', 'w')
f.write(s)
f.close()

f = open('data2.txt')
# arguments: file, new encoding, file's encoding
f_special = codecs.EncodedFile(f, 'utf-8', 'iso8859_15')

# If your display device understands utf-8, you will see the
# troublesome character displayed.
# Are you sure that character is legitimate?
print f_special.read()

f.close()
f_special.close()
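
The same idea can be sketched for Python 3, with the caveat that this is an adaptation rather than the code above. It assumes the underlying file is opened in binary mode, since EncodedFile() works on bytes there. The sample bytes and the temporary file are invented for the example. The second half shows the alternative of letting open() do the decoding via its encoding argument, which hands back unicode text instead of re-encoded bytes.

```python
import codecs
import os
import tempfile

# Hypothetical sample data: "price: 5 EURO SIGN" encoded as ISO-8859-15.
latin9_bytes = b'price: 5 \xa4'

# Write the sample to a temporary file so the sketch is self-contained.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(latin9_bytes)

# Option 1: codecs.EncodedFile(file, new encoding, file's encoding).
# Reads come back as bytes already re-encoded to utf-8.
with open(path, 'rb') as f:
    f_special = codecs.EncodedFile(f, 'utf-8', 'iso8859_15')
    utf8_bytes = f_special.read()

# Option 2: let open() decode the file; reads come back as str.
with open(path, encoding='iso8859_15') as f:
    text = f.read()

os.remove(path)

print(repr(utf8_bytes))
print(repr(text))
```

Option 2 is usually enough if the database driver accepts unicode strings directly; option 1 is closer to the original post, where the goal was utf-8 bytes.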