How to get the encoding of a value?

Diez B. Roggisch deets.nospaaam at web.de
Fri Oct 22 12:04:23 EDT 2004


> I have some data that is encoded in some way. I want to find out the
> encoding of that piece of data. For example s = u"somedata"
> I want to do something like
> ThisIsTheEncodingOfS = s.getencoding()
> 
> Is there a method that tells me that it is a unicode value if I provide it
> with a unicode string?

You are confusing unicode with strings with a certain encoding.

Unicode is an abstract specification of a huge number of characters,
hopefully covering everything from the close-to-unknown glyphs of some
ancient Himalayan mountain tribe to the commonly used Latin alphabet. It
assigns each character an abstract number (its code point), but by itself it
does not say which bytes represent that character in a string.

An encoding, on the other hand, maps certain sets of glyphs to actual byte
values - e.g. iso-8859-1, which covers the subset of glyphs common in
western European languages, and many more - including utf-8, an encoding
that's capable of encoding all glyphs specified in unicode, at the cost of
possibly using more than one byte per glyph.
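
To make the difference concrete, here is a small sketch. It assumes Python 3,
where the str type plays the role of the unicode object and bytes plays the
role of an encoded string; the sample text is only an illustration.

```python
# Assumes Python 3: str is the unicode type, bytes is an encoded string.
text = "caf\u00e9"  # "cafe" with an accented e; sample text for illustration

# The same four characters become different byte sequences
# depending on the encoding used.
as_latin1 = text.encode("iso-8859-1")
as_utf8 = text.encode("utf-8")

print(len(text))         # number of characters
print(list(as_latin1))   # one byte per character
print(list(as_utf8))     # the accented character takes two bytes
```

The text itself is the same in both cases; only its byte representation
differs.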

Now if you have a unicode object u, you can _encode_ it in a certain
encoding like this:

u.encode("utf-8")

If you, on the other hand, have a string s of known encoding, you can decode
it to a unicode object like this:

s.decode("latin1")
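
A round trip through both calls might look like the following sketch. It
assumes Python 3, where encode() is a method of str and decode() is a method
of bytes; the sample text is made up.

```python
# Assumes Python 3: str.encode() gives bytes, bytes.decode() gives str.
u = "Gr\u00fc\u00dfe"        # sample unicode text with two non-ASCII characters
s = u.encode("latin1")       # encode: unicode -> bytes
back = s.decode("latin1")    # decode with the SAME encoding: bytes -> unicode

print(back == u)

# Decoding with a different encoding than the one used for encoding
# can fail outright.
try:
    s.decode("utf-8")
    wrong_decode_failed = False
except UnicodeDecodeError:
    wrong_decode_failed = True
print(wrong_decode_failed)
```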

Those are the basics. Now to your actual question: your example makes no
sense, as you have a unicode object - which lacks any encoding whatsoever.
And unfortunately, if you have a string instead of a unicode object, you can
only guess what encoding it has - if you are lucky, that works. But no one
can guarantee that it works out - neither in Python nor in other
programming languages.
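
To illustrate why it can only be a guess, here is a small sketch (assuming
Python 3, with arbitrary sample bytes): the same two bytes decode without any
error under two different encodings, and nothing in the bytes themselves says
which reading was intended.

```python
# Assumes Python 3. The byte values below are arbitrary sample data.
raw = b"\xc3\xa9"

as_utf8 = raw.decode("utf-8")      # one character: U+00E9
as_latin1 = raw.decode("latin1")   # two characters: U+00C3 and U+00A9

# Both decodes succeed, but they produce different text.
print(len(as_utf8), len(as_latin1))
print(as_utf8 == as_latin1)
```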

A common approach to guessing the encoding of said string is to try
something like this:

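s = <some string with unknown encoding>
encodings = ['ascii', 'latin1', 'utf-8', ....] # list of encodings you expect
for e in encodings:
    try:
        # a clean round trip means e is at least a plausible candidate
        if s == s.decode(e).encode(e):
            break
    except UnicodeError:
        pass
else:
    e = None # none of the expected encodings fit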
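
For reference, the same idea as a self-contained sketch, assuming Python 3
(bytes in, encoding name out). The name guess_encoding and the candidate
order are my own choices here, not a library API. latin1 is put last because
it accepts every possible byte sequence, so any candidate listed after it
would never be tried.

```python
# Assumes Python 3; guess_encoding is a hypothetical helper, not a library call.
def guess_encoding(data, candidates=("ascii", "utf-8", "latin1")):
    """Return the first candidate that round-trips data, or None."""
    for name in candidates:
        try:
            # A clean decode/encode round trip makes this candidate plausible.
            if data.decode(name).encode(name) == data:
                return name
        except UnicodeError:
            continue
    return None


print(guess_encoding(b"hello"))
print(guess_encoding("caf\u00e9".encode("utf-8")))
print(guess_encoding(b"caf\xe9"))
```

The result is still only a guess: it tells you which candidate did not
fail, not which encoding the data was actually written in.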


-- 
Regards,

Diez B. Roggisch


