how to find out utf or not

Steven D'Aprano steve+comp.lang.python at pearwood.info
Tue Nov 5 11:39:53 EST 2013


On Tue, 05 Nov 2013 16:32:57 +0330, Mohsen Pahlevanzadeh wrote:

> Dear all,
> 
> Suppose i have a variable such as : myVar = 'x'
> 
> May be it initialized with myVar = u'x' or myVar = 'x'

Can't you just look at the code and tell which it is?


> So i need determine content of myVar that it's utf-8 or not, how can i
> do it?

I think you misunderstand the difference between Unicode and UTF-8. The 
first thing you must understand is that Unicode does not mean UTF-8. They 
are different things. Anyone who has told you that they are the same is 
also mistaken.

Unicode is an abstract collection of characters, a character set. 
(Technically, code points rather than characters, but don't worry about 
that yet.) In Python 2, you normally create a Unicode string with either 
the u"..." literal syntax, or the unicode() function. A Unicode string 
might look like this:

    abc§жπxyz

Each character has an ordinal value, which is the same as its Unicode 
code point:

py> s = u'abc§жπxyz'
py> for char in s:
...     print char, ord(char)
...
a 97
b 98
c 99
§ 167
ж 1078
π 960
x 120
y 121
z 122


Note that ordinal values go *far* beyond 256. They go from 0 to 1114111. 
So a Unicode string is a string of code points, in this example:

97 98 99 167 1078 960 120 121 122

Of course, computers don't understand "code points" any more than they 
understand "sounds" or "movies" or "pictures of cats". Computers only 
understand *bytes*. So how are these code points represented as bytes? By 
using an encoding -- an encoding tells the computer how to represent 
characters like "a", "b" and "ж" as bytes, for storage on disk or in 
memory.

There are at least six different encodings for Unicode strings, and UTF-8 
is only one of them. The others are two varieties each (little-endian and 
big-endian) of UTF-16 and UTF-32, plus UTF-7. Given the unicode string:

u'abc§жπxyz'

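it could be stored in memory as any of these sequences of bytes, shown 
here in hexadecimal (in order: UTF-16 little-endian, UTF-16 big-endian, 
UTF-32 little-endian, UTF-32 big-endian, UTF-8 and UTF-7):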

610062006300A7003604C003780079007A00

00610062006300A7043603C000780079007A

610000006200000063000000A700000036040000C003000078000000790000007A000000

000000610000006200000063000000A700000436000003C000000078000000790000007A

616263C2A7D0B6CF8078797A

6162632B414B63454E6750412D78797A


and likely others as well. Which one will Python use? That depends on the 
version of Python, how the interpreter was built, what operating system 
you are using, and various other factors. Without knowing lots of 
technical detail about your specific Python interpreter, I can't tell 
which encoding it will be using internally. But I can be pretty sure that 
it isn't using UTF-8.
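
If you want to check the byte sequences above for yourself, here is a 
minimal sketch. The helper name hex_bytes is my own invention for this 
example, and I have written the string with escapes so the snippet runs 
unchanged on Python 2.7 and Python 3. It should reproduce the six hex 
strings listed earlier, one per codec:

```python
import binascii

# The example string from above, written with escapes so the source file
# itself is pure ASCII: abc, section sign, Cyrillic zhe, Greek pi, xyz.
s = u'abc\xa7\u0436\u03c0xyz'

def hex_bytes(text, codec):
    """Encode text with the named codec and return the bytes as upper-case hex."""
    return binascii.hexlify(text.encode(codec)).decode('ascii').upper()

for codec in ('utf-16-le', 'utf-16-be', 'utf-32-le', 'utf-32-be',
              'utf-8', 'utf-7'):
    print(codec + ' ' + hex_bytes(s, codec))
```

Note that this only shows what each encoding *would* produce. It tells 
you nothing about which representation the interpreter uses internally.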

So, you have a variable. Perhaps it has been passed to you from another 
function, and you need to find out what it is. In this case, you do the 
same thing you would do for any other type (int, list, dict, str, ...) 
and use isinstance:

if isinstance(myVar, unicode):
    ...
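
If the same code has to run on both Python 2 and Python 3, the type 
names differ (unicode and str in Python 2, str and bytes in Python 3). 
One way around that, as a rough sketch only, is to ask the literals for 
their types. The names text_type, binary_type and kind_of are made up 
for this example:

```python
# type(u'') is unicode on Python 2 and str on Python 3;
# type(b'') is str on Python 2 and bytes on Python 3.
text_type = type(u'')
binary_type = type(b'')

def kind_of(value):
    """Return 'text', 'bytes' or 'other' depending on the type of value."""
    if isinstance(value, text_type):
        return 'text'
    if isinstance(value, binary_type):
        return 'bytes'
    return 'other'

print(kind_of(u'x'))   # text
print(kind_of(b'x'))   # bytes
print(kind_of(42))     # other
```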


If myVar is a Unicode string, you don't need to care about the encoding 
(UTF-8 or otherwise) until you're ready to write it to a file. Then I 
strongly recommend you always use UTF-8, unless you have to interoperate 
with some old, legacy system:

assert isinstance(myVar, unicode)
byte_string = myVar.encode('utf-8')


This gives you a byte-string encoded using UTF-8.
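
When the destination is a file, you can also let the file object do the 
encoding for you. A small sketch using io.open, which exists in both 
Python 2.6+ and Python 3; the temporary path is just a scratch file for 
the demonstration, not anything from your program:

```python
import io
import os
import tempfile

text = u'abc\xa7\u0436\u03c0xyz'

# Hypothetical scratch file, only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), 'example.txt')

# io.open encodes on the way out, so we hand it Unicode text, not bytes.
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(text)

# Reading the file back in binary mode shows the raw UTF-8 bytes on disk.
with io.open(path, 'rb') as f:
    raw = f.read()

print(raw == text.encode('utf-8'))
```

Either way the rule is the same: keep Unicode inside your program and 
encode only at the boundary.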

If myVar is a byte-string, like 'abc' without the u'' prefix, then you 
have a bit of a problem. Think of it like a file without a file 
extension: it could be a JPEG, a WAV, a DLL, anything. There's no real 
way to be sure. You can look inside the file and try to guess, but that's 
not always reliable. Without the extension "myfile.jpg", "myfile.wav", 
etc. you can't tell for sure what "myfile" is (although sometimes you can 
make a good prediction: "my holiday picture" is probably a JPEG).

And so it is with byte-strings. Unless you know where they came from and 
how they were prepared, you can't easily tell what encoding they used, at 
least not without guessing. But if you control the source of the data, 
and make sure you only use the encoding of your choice (let's say UTF-8), 
then it is easy to convert the bytes into Unicode:

assert isinstance(myVar, str)
unicode_string = myVar.decode('utf-8')
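
If you don't control the source, the closest thing to "is this UTF-8?" 
is to try decoding and see whether it fails. Here is a sketch; the 
function name try_decode_utf8 is hypothetical. Keep in mind that a 
successful decode is only evidence, not proof, since the same bytes may 
also be valid in some other encoding:

```python
def try_decode_utf8(data):
    """Return the decoded text if data is valid UTF-8, otherwise None."""
    try:
        return data.decode('utf-8')
    except UnicodeDecodeError:
        return None

utf8_bytes = u'\u0436'.encode('utf-8')      # the two bytes D0 B6
latin1_bytes = u'\xa7'.encode('latin-1')    # the single byte A7, not valid UTF-8

print(try_decode_utf8(utf8_bytes) == u'\u0436')   # True
print(try_decode_utf8(latin1_bytes) is None)      # True

# The same two bytes also decode "successfully" as Latin-1, giving two
# different characters, which is why a successful decode is only evidence.
print(utf8_bytes.decode('latin-1') == u'\xd0\xb6')  # True
```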



-- 
Steven


