hex dump w/ or w/out utf-8 chars

Chris Angelico rosuav at gmail.com
Mon Jul 8 14:07:22 EDT 2013


On Tue, Jul 9, 2013 at 3:53 AM,  <ferdy.blatsco at gmail.com> wrote:
>>> All characters are UTF-8 characters. "a" is a UTF-8 character. So is "ă".
> Not using Python 3, for me (a programmer who was present at the beginning of
> computer science, interacting badly with many languages, from assembler to
> Fortran and from C to Pascal and so on), it was a hard job to handle the
> abrupt transition from characters simply being bytes to some special
> characters being defined with 2, 3, or even more bytes.

Even back then, bytes and characters were different. 'A' is a
character, 0x41 is a byte. And they correspond 1:1 if and only if you
know that your characters are represented in ASCII. Other encodings
(e.g. EBCDIC) mapped things differently. The only difference now is that
more people are becoming aware that there are more than 256 characters
in the world.
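
To make the distinction concrete, here is a minimal Python 3 sketch. It
assumes the standard cp037 codec as a stand-in for "EBCDIC" (there are
several EBCDIC code pages), and the helper name encoded_hex is made up
for illustration only.

```python
# Python 3: str holds characters, bytes holds bytes.
# encoded_hex is a hypothetical helper, not part of any library.
def encoded_hex(text, encoding):
    """Return the bytes of `text` in `encoding` as space-separated hex."""
    return " ".join(f"{b:02x}" for b in text.encode(encoding))

# The same character maps to different bytes under different encodings.
ascii_a = encoded_hex("A", "ascii")      # 41
ebcdic_a = encoded_hex("A", "cp037")     # c1 (cp037 is one EBCDIC code page)

# In UTF-8, "a" takes one byte and "ă" (U+0103) takes two.
utf8_a = encoded_hex("a", "utf-8")       # 61
utf8_abreve = encoded_hex("ă", "utf-8")  # c4 83

print(ascii_a, ebcdic_a, utf8_a, utf8_abreve)
print(len("ă"), len("ă".encode("utf-8")))  # 1 character, 2 bytes
```

The character 'A' never changes; only the bytes chosen to represent it
do, which is why a hex dump is meaningless without knowing the encoding.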

Like Magic 2014 and its treatment of Slivers, at some point you're
going to have to master the difference between bytes and characters,
or else be eternally hacking around stuff in your code, so now is as
good a time as any.

ChrisA


