String is ASCII or UTF-8?

Tue Mar 9 12:36:48 EST 2010

On 2010-03-09 11:12 AM, Stef Mientki wrote:
> On 09-03-2010 18:02, Alf P. Steinbach wrote:
>> * C. Benson Manica:
>>> Hours of Googling has not helped me resolve a seemingly simple
>>> question - Given a string s, how can I tell whether it's ascii (and
>>> thus 1 byte per character) or UTF-8 (and two bytes per character)?
>>> This is python 2.4.3, so I don't have getsizeof available to me.
>>
>> Generally, if you need 100% certainty then you can't tell the encoding
>> from a sequence of byte values.
>>
>> However, if you know that it's EITHER ascii or utf-8 then the presence
>> of any value above 127 (or, for signed byte values, any negative
>> values), tells you that it can't be ascii,
> AFAIK it's completely impossible.
> UTF-8 characters have 1 to 4 bytes / character.
> I can create ASCII strings containing byte values between 127 and 255.

No, you can't. ASCII strings only have characters in the range 0..127. You could 
create Latin-1 (or any number of the 8-bit encodings out there) strings with 
characters 0..255, yes, but not ASCII.
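
If you know the string is either ASCII or UTF-8, the check can be sketched
roughly as below. This is only an illustration, not a general encoding
detector: the function name classify_bytes is made up for this example, and
the byte literals assume Python 3 syntax (on Python 2.4 a plain str is already
a byte string and has the same .decode() method).

```python
def classify_bytes(data):
    """Return 'ascii', 'utf-8', or 'other' for a byte string.

    Hypothetical helper for illustration only. It tells you which of the
    two decodings succeeds, not what encoding the author actually used.
    """
    try:
        data.decode("ascii")      # succeeds only if every byte is in 0..127
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")      # succeeds only for well-formed UTF-8
        return "utf-8"
    except UnicodeDecodeError:
        return "other"            # e.g. Latin-1 or some other 8-bit encoding


# Pure ASCII is also valid UTF-8, so it is reported as the narrower 'ascii'.
print(classify_bytes(b"hello"))

# U+00E9 encoded as UTF-8 is the two bytes C3 A9.
print(classify_bytes(b"\xc3\xa9"))

# The single byte E9 is Latin-1 for the same character, but it is neither
# ASCII nor well-formed UTF-8.
print(classify_bytes(b"\xe9"))
```

Note that a UTF-8 character is not always two bytes: b"\xc3\xa9" above is one
character in two bytes, while plain ASCII characters stay at one byte each and
others take three or four.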

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
  that is made terrible by our own mad attempt to interpret it as though it had
  an underlying truth."
   -- Umberto Eco