[Tutor] encodings

Mon Jun 16 13:52:13 2003

Denis,

I run a Debian 3.0 system with Python 2.2 and also need to convert
Cyrillic text between koi8-r and cp1251 to generate some HTML. It works
fine for me, so let me see if I can make it clearer for you.

* Denis Dzyubenko <shad@mail.kubtelecom.ru> [2003-06-16 16:19:01 +0400]:

> On Sat, 14 Jun 2003 23:30:20 +0200,
>  Magnus Lyck(ML) wrote to me:
>=20
> ML> Clearer now?
>=20
> Yes, now it is clear.
>=20
> ML> Use type() to check what type you have in each situation.
>=20
> >> >>> s =3D u"abc=C1=C2=D7"
> >> >>> s.encode('cp1251')
> >>Traceback (most recent call last):
> >>   File "<stdin>", line 1, in ?
> >>   File "/usr/lib/python2.1/encodings/cp1251.py", line 18, in encode
> >>     return codecs.charmap_encode(input,errors,encoding_map)
> >>UnicodeError: charmap encoding error: character maps to <undefined>
>
> ML> That means that your unicode string contains values that CP1251
> ML> can't represent. Does "print s" produce the output you would expect?
>
> no, 'print s' produces error:
> 'UnicodeError: ASCII encoding error: ordinal not in range(128)'

Denis. If you are coming from s =3D u"abc=C1=C2=D7" instead of=20
sk =3D 'abc=C1=C2=D7'; usk =3D sk.decode('koi8-r'); s =3D usd.encode('koi=
8-r') (or
'cp1251'), then use "print s.encode('iso8859_15')"
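
Here is a rough sketch of why that explicit encode helps. It assumes a
modern Python 3 interpreter rather than the 2.2 I use here, and the
helper name write_encoded is made up for illustration. As far as I can
tell, a u"..." literal typed with raw koi8-r bytes ends up holding
Latin-1 code points, so encoding it with a Latin-style codec hands the
original bytes back to the terminal:

```python
import io


def write_encoded(text, stream, encoding):
    """Encode text explicitly and write the bytes to a binary stream."""
    data = text.encode(encoding)
    stream.write(data)
    return data


# Latin-1 code points standing in for the raw koi8-r bytes of the literal
s = "abc\xc1\xc2\xd7"

# The implicit ASCII conversion is what fails in "print s"
try:
    s.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

buf = io.BytesIO()  # stands in for the terminal
write_encoded(s, buf, "iso8859_15")
raw = buf.getvalue()

# A koi8-r terminal would interpret those bytes as Cyrillic letters
shown = raw.decode("koi8-r")
```

The point is only that the encoding is chosen explicitly instead of
leaving it to the ASCII default.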

It is easier, though (at least for me), to start from a koi8-r string and
then do the straightforward conversion through unicode to cp1251, as
Magnus already pointed out for you.

Just to reiterate (tabbed line shows output):

koistr = 'проверка'
koistr
    '\xd0\xd2\xcf\xd7\xc5\xd2\xcb\xc1'
print koistr
    проверка
ukoistr = koistr.decode('koi8-r')
ukoistr
    u'\u043f\u0440\u043e\u0432\u0435\u0440\u043a\u0430'
print ukoistr
    UnicodeError: ASCII encoding error: ordinal not in range(128)
koikoistr = ukoistr.encode('koi8-r')
koikoistr
    '\xd0\xd2\xcf\xd7\xc5\xd2\xcb\xc1'
print koikoistr
    проверка
cpstr = ukoistr.encode('cp1251')
cpstr
    '\xef\xf0\xee\xe2\xe5\xf0\xea\xe0'
print cpstr
    ОПНБЕПЙЮ
cat >> test
ОПНБЕПЙЮ
<Ctrl-D>
konwert cp1251-koi8r test
    проверка
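
The same round trip can be sketched as a small function. This is only
an illustration, written for a modern Python 3 interpreter (where byte
strings carry a b prefix) rather than the 2.2 used above; the name
recode is my own, not something from the standard library:

```python
def recode(data, source, target):
    """Re-encode a byte string from one charset to another via unicode."""
    return data.decode(source).encode(target)


# The koi8-r bytes from the session above
koistr = b'\xd0\xd2\xcf\xd7\xc5\xd2\xcb\xc1'

# koi8-r -> unicode -> cp1251
cpstr = recode(koistr, 'koi8-r', 'cp1251')

# cp1251 -> unicode -> koi8-r, the direction the konwert step checks
back = recode(cpstr, 'cp1251', 'koi8-r')

print(repr(cpstr))
print(back == koistr)
```

If back equals koistr, the conversion lost nothing on the way.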

If you have any problems following this path, please write to me
directly. It is probably not worth the firepower of this list to keep
iterating over such a simple matter.

Regards,

Alex.

> ML> How does it look if you do "print s"? Does it look like cyrillic?
> ML> what about "print repr(s)". Are all values in the correct range?
>
> now, values are not listed in the link you gave me
> (http://www.oasis-open.org/docbook/xmlcharent/0.3/iso-cyr1.ent)
>=20
> >>ML> txt.decode('koi8-r').encode('cp1251')
> >>
> >> >>> txt.decode('koi8-r')
> >>Traceback (most recent call last):
> >>   File "<stdin>", line 1, in ?
> >>AttributeError: decode
>
> ML> That means that txt is not an object of type string. If it's
>
> >>> txt = "абв"
> >>> type(txt)
> <type 'string'>
> >>> txt.decode("koi8-r")
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> AttributeError: decode
>
> and dir(txt) doesn't contain attribute 'decode'
>
> ML> Look here:
>
>  >>>> u = u'\u042F\u042B\u042C'
>
> ML> Now we have a unicode representation with three
> ML> cyrillic letters. You should be able to do
> ML> "print u" and see something reasonable. I start
>
> no, I can't see anything reasonable:
>
> >>> u = u'\u042F\u042B\u042C'
> >>> u
> u'\u042f\u042b\u042c'
> >>> print u
>
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> UnicodeError: ASCII encoding error: ordinal not in range(128)
>
> ML> If you still have problems, look at the error handling issues
> ML> I wrote about.
>
> sorry, I still can't understand source of my problems :(
>
> --
> Denis.