hex dump w/ or w/out utf-8 chars

Dave Angel davea at davea.name
Mon Jul 8 16:56:54 EDT 2013


On 07/08/2013 01:53 PM, ferdy.blatsco at gmail.com wrote:
> Hi Steven,
>
> thank you for your reply... I really needed another Python guru who is
> also an English teacher! Sorry, but English is not my mother tongue...
> "uncorrect" instead of "incorrect" (I misapplied the "similarity
> principle" like "unpleasant...>...uncorrect").
>
> Apart from these trifles, you said:
>>> All characters are UTF-8 characters. "a" is a UTF-8 character. So is "ă".
> Not using Python 3, for me (a programmer who was present at the beginning of
> computer science, struggling with many languages from assembler to
> Fortran and from C to Pascal and so on) it was a hard job to handle the
> abrupt transition from characters that were simply bytes to special
> characters defined with 2, 3 or even more bytes.

Characters do not have a width.  They are Unicode code points, an 
abstraction.  It's only when you encode them into byte strings that a code 
point takes on any specific width.  Some encodings map each code point to 
a single byte (and raise errors for characters they can't represent), some 
map each to two bytes, some use a variable number of bytes, and so on.
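
For example, a quick Python 3 sketch using the "ă" from above -- the 
character itself has length 1, and only its encoded forms have a byte width:

    ch = '\u0103'                  # "ă" -- one character, one code point
    print(len(ch))                 # 1
    print(ch.encode('utf-8'))      # b'\xc4\x83'          -> 2 bytes
    print(ch.encode('utf-16-le'))  # b'\x03\x01'          -> 2 bytes
    print(ch.encode('utf-32-le'))  # b'\x03\x01\x00\x00'  -> 4 bytes
    print(ch.encode('latin-1'))    # UnicodeEncodeError -- no one-byte form exists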

> I should have preferred another solution... but i'm not Guido....!

But Unicode has nothing to do with Guido, and it has existed for more than 
two decades (since 1991).  It's only that Python 3 is finally embracing it, 
and making it the default type for characters, as it should be.  As far as 
I'm concerned, the only reason it shouldn't have been done long ago was that 
programs were trying to fit on 640k DOS machines.  Even before Unicode, 
there were multi-byte encodings around (e.g. Microsoft's MBCS), and each was 
thoroughly incompatible with all the others.  And the problem with one-byte 
encodings is that if you need to use a Greek currency symbol in a document 
that's mostly Norwegian (or some such combination of characters), there 
might not be ANY valid way to do it within a single "character set."
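
To make that concrete, a rough Python 3 sketch -- the sample string is 
invented, and latin-1 / iso-8859-7 stand in for a "Western" and a "Greek" 
one-byte charset:

    text = 'smørbrød Ω'            # Norwegian letters plus a Greek letter
    for codec in ('latin-1', 'iso-8859-7', 'utf-8'):
        try:
            print(codec, text.encode(codec))
        except UnicodeEncodeError as e:
            print(codec, 'cannot represent', e.object[e.start])

latin-1 fails on the Greek letter, iso-8859-7 fails on the Norwegian ones, 
and only UTF-8 can hold the whole string.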

Python 2 supports all the same Unicode features as 3;  it's just that it 
defaults to byte strings.  So it's HARDER to get it right.
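
A tiny sketch of what "harder" means in practice.  This is Python 2 code 
(the literals are just an illustration); bytes and unicode mix silently 
until a non-ASCII byte forces an implicit ASCII decode:

    s = 'caf\xc3\xa9'    # byte string: the UTF-8 bytes of "café"
    u = u'caf\xe9'       # unicode string: the code points themselves
    print s == u         # False (plus a UnicodeWarning) -- bytes vs code points
    print u + s          # UnicodeDecodeError: implicit ASCII decode fails on 0xc3

Under Python 3 that concatenation is an immediate TypeError instead of an 
error that depends on which bytes happen to be in the data.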

Except for special purpose programs like a file dumper, it's usually 
unnecessary for a Python 3 programmer to deal with individual bytes from 
a byte string.  Text files are a bunch of bytes, and somebody has to 
interpret them as characters.  If you let open() handle it, and if you 
give it the correct encoding, it just works.  Internally, all strings 
are Unicode, and you don't care where they came from, or what human 
language they may have characters from.  You can combine strings from 
multiple places, without much worry that they might interfere.
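
For example (the file name and encoding here are placeholders for whatever 
your data actually uses):

    # Ordinary case: open() decodes for you; you only ever see characters.
    with open('notes.txt', encoding='utf-8') as f:
        text = f.read()            # str -- a sequence of code points

    # File-dumper case: stay in binary mode and work on the raw bytes.
    with open('notes.txt', 'rb') as f:
        raw = f.read()             # bytes -- no decoding at all

    print(text[:16])                               # first 16 characters
    print(' '.join('%02x' % b for b in raw[:16]))  # first 16 bytes as hex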


Windows NT/2000/XP/Vista/7 has used Unicode for its file system (NTFS) 
from the beginning (around 1993), and has had Unicode versions of each of 
its APIs for nearly as long.

I appreciate you've been around a long time, and worked in a lot of 
languages.  I've programmed professionally in at least 35 languages since 
1967.  But we've come a long way from the 6-bit characters I used in 1968; 
back then, we packed ten characters into each word.

-- 
DaveA



