UTF-8 and stdin/stdout?

Ulrich Eckhardt eckhardt at satorlaser.com
Wed May 28 06:25:23 EDT 2008


Chris wrote:
> On May 28, 11:08 am, dave_140... at hotmail.com wrote:
>> Say I have a file, utf8_input, that contains a single character, é,
>> coded as UTF-8:
>>
>> $ hexdump -C utf8_input
>> 00000000  c3 a9
>> 00000002
[...]
> weird thing is 'c3 a9' is é on my side... and copy/pasting the é
> gives me 'e9' with the first script giving a result of zero and second
> script gives me 1

Don't worry, those two can be equivalent. The point is that some characters
can be written in more than one way: as a single precomposed character (e
with accent) or as a base letter followed by a combining mark (e, then a
combining accent).

Looking at http://unicode.org/charts I see that the letter above should have
codepoint 0xe9 (precomposed character) or 0x65 (e) followed by 0x301
(combining acute accent).
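To check the two spellings against each other, something like the following should do. It is only a sketch and assumes Python 3, where str literals are Unicode; the variable names are made up for illustration.

```python
import unicodedata

precomposed = "\u00e9"   # e with acute accent as a single code point
decomposed = "e\u0301"   # plain e (U+0065) followed by combining acute accent

# The two strings render the same but do not compare equal as-is.
print(precomposed == decomposed)   # False

# Normalizing both to the same form makes them comparable.
nfc = unicodedata.normalize("NFC", decomposed)   # compose
nfd = unicodedata.normalize("NFD", precomposed)  # decompose
print(nfc == precomposed)          # True
print(nfd == decomposed)           # True
```

So if two inputs look the same but compare unequal, normalizing both to NFC (or both to NFD) before comparing is worth a try.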

0xe9 = 1110 1001 (codepoint)
0xc3 0xa9 = 1100 0011  1010 1001 (UTF-8)
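The bit patterns above can be reproduced by hand. This is a rough sketch, assuming Python 3, and it only covers the two-byte UTF-8 form (codepoints 0x80 to 0x7ff), which is all that is needed for 0xe9:

```python
cp = 0xE9  # codepoint of the precomposed character

# Two-byte UTF-8 layout: 110xxxxx 10xxxxxx
byte1 = 0b11000000 | (cp >> 6)           # upper bits of the codepoint
byte2 = 0b10000000 | (cp & 0b00111111)   # lower six bits of the codepoint
manual = bytes([byte1, byte2])

# Compare with what the built-in codec produces.
encoded = chr(cp).encode("utf-8")

print(manual.hex())       # c3a9
print(manual == encoded)  # True
```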

Anyhow, looking at this further shows that your editor simply doesn't
interpret the two bytes as UTF-8 but as Latin-1 or a similar encoding, where
they represent the capital A with tilde and the copyright sign.
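The effect is easy to reproduce by decoding the same two bytes both ways. Again just a sketch, assuming Python 3:

```python
raw = b"\xc3\xa9"  # the two bytes from the hexdump

as_utf8 = raw.decode("utf-8")      # one character: e with acute accent
as_latin1 = raw.decode("latin-1")  # two characters: A with tilde, copyright sign

print(len(as_utf8), len(as_latin1))   # 1 2

# Going the other way: the same character encoded as Latin-1 is the single
# byte e9, which would match what a copy/paste from such an editor gives.
print(as_utf8.encode("latin-1").hex())  # e9
```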

Uli

-- 
Sator Laser GmbH
Geschäftsführer: Thorsten Föcking, Amtsgericht Hamburg HR B62 932



