hex dump w/ or w/out utf-8 chars

Steven D'Aprano steve+comp.lang.python at pearwood.info
Sat Jul 13 05:36:06 EDT 2013


On Sat, 13 Jul 2013 00:56:52 -0700, wxjmfauth wrote:

> I am convinced you are not conceptually understanding utf-8 very well. I
> wrote many times, "utf-8 does not produce bytes, but Unicode Encoding
> Units".

Writing it many times doesn't make it correct. You are simply wrong. 
UTF-8 produces bytes. That is what gets written to files and transmitted 
over networks: bytes, not "Unicode Encoding Units", whatever they are.
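
Anyone can check this at the interpreter. What follows is only a minimal 
sketch for Python 3.3 or later; the names text and data are mine, chosen 
for illustration, and nothing in it comes from the earlier posts:

```python
# Encoding a str with UTF-8 gives back a plain bytes object.
# One character from each range: ASCII, Latin-1, other BMP, supplementary.
text = 'a\u00e1\u03b1\U0001F701'
data = text.encode('utf-8')

print(type(data))             # the built-in bytes type
print(len(text), len(data))   # 4 characters, 1 + 2 + 2 + 4 = 9 bytes
print(list(data))             # every item is an int in range(256)
```

Decoding data with the same codec gives back the original string, so 
nothing beyond those bytes is needed to store or transmit the text.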


> A similar coding scheme: iso-6937 .
> 
> Try to write an editor, a text widget, with a coding scheme like
> the Flexible String Representation. You will quickly notice, it is
> impossible (understand correctly). (You do not need a computer, just a
> sheet of paper and a pencil) Hint: what is the character at the caret
> position?

That is a simple index operation into the buffer. If the caret position 
is 10 characters in, you index buffer[10-1] and it will give you the 
character to the left of the caret. buffer[10] will give you the 
character to the right of the caret. It is simple, trivial, and easy. The 
buffer itself knows whether to look ahead 10 bytes, 10*2 bytes or 10*4 
bytes.

Here is an example of such a tiny buffer, implemented in Python 3.3 with 
the hated Flexible String Representation. In each example, imagine the 
caret is five characters from the left:

12345|more characters here...

It works regardless of whether your characters are ASCII:


py> buffer = '12345ABCD...'
py> buffer[5-1]  # character to the left of the caret
'5'
py> buffer[5]  # character to the right of the caret
'A'


Latin 1:

py> buffer = '12345áßçð...'
py> buffer[5-1]  # character to the left of the caret
'5'
py> buffer[5]  # character to the right of the caret
'á'


Other BMP characters:

py> buffer = '12345αдᚪ∞...'
py> buffer[5-1]  # character to the left of the caret
'5'
py> buffer[5]  # character to the right of the caret
'α'


And Supplementary Plane Characters:

py> buffer = ('12345'
...     '\N{ALCHEMICAL SYMBOL FOR AIR}'
...     '\N{ALCHEMICAL SYMBOL FOR FIRE}'
...     '\N{ALCHEMICAL SYMBOL FOR EARTH}'
...     '\N{ALCHEMICAL SYMBOL FOR WATER}'
...     '...')
py> buffer
'12345🜁🜂🜃🜄...'
py> len(buffer)
12
py> buffer[5-1]  # character to the left of the caret
'5'
py> buffer[5]  # character to the right of the caret
'🜁'
py> import unicodedata
py> unicodedata.name(buffer[5])
'ALCHEMICAL SYMBOL FOR AIR'


And it all Just Works in Python 3.3. So much for "impossible to tell" 
what the character at the caret is. It is *trivial*.
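
For anyone who wants to see why a fixed-width buffer makes the caret lookup 
a plain index operation, here is a rough sketch. It is a toy model, not how 
CPython implements its strings: the class FixedWidthBuffer and its 
attributes are hypothetical names I made up, and I am assuming the text 
contains no lone surrogates. The width is picked once from the largest code 
point, and character i then starts at byte i * width.

```python
class FixedWidthBuffer:
    """Toy fixed-width text buffer (hypothetical, for illustration only)."""

    def __init__(self, text):
        largest = max(map(ord, text)) if text else 0
        # Pick the narrowest width that fits every character in the text.
        if largest < 0x100:
            self.width, self.codec = 1, 'latin-1'
        elif largest < 0x10000:
            self.width, self.codec = 2, 'utf-16-le'
        else:
            self.width, self.codec = 4, 'utf-32-le'
        self.data = text.encode(self.codec)

    def __len__(self):
        return len(self.data) // self.width

    def __getitem__(self, index):
        # Only non-negative indexes are supported in this sketch.
        if not 0 <= index < len(self):
            raise IndexError('buffer index out of range')
        start = index * self.width
        return self.data[start:start + self.width].decode(self.codec)


caret = 5
samples = ('12345ABCD...',
           '12345\u00e1\u00df...',
           '12345\u03b1\u0434...',
           '12345\U0001F701\U0001F702...')
for text in samples:
    buf = FixedWidthBuffer(text)
    # Widths come out as 1, 1, 2 and 4 bytes per character respectively.
    print(buf.width, buf[caret - 1], buf[caret])
```

In every case buf[caret - 1] is '5' and buf[caret] is the first character 
after the digits. Only the byte offset differs, and the buffer works that 
out for itself.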



Ah, but how about Python 3.2? We set up the same buffer:


py> buffer = ('12345'
...     '\N{ALCHEMICAL SYMBOL FOR AIR}'
...     '\N{ALCHEMICAL SYMBOL FOR FIRE}'
...     '\N{ALCHEMICAL SYMBOL FOR EARTH}'
...     '\N{ALCHEMICAL SYMBOL FOR WATER}'
...     '...')
py> buffer
'12345🜁🜂🜃🜄...'
py> len(buffer)
16

Sixteen? Sixteen? Where did the extra four characters come from? They 
came from *surrogate pairs*.


py> buffer[5-1]  # character to the left of the caret
'5'
py> buffer[5]  # character to the right of the caret
'\ud83d'


Funny, that looks different.


py> import unicodedata
py> unicodedata.name(buffer[5])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: no such name


No name?

Because buffer[5] is only *half* of the surrogate pair. It is broken, and 
there is really no way of fixing that breakage in Python 3.2 with a 
narrow build. You can fix it with a wide build, but only at the cost of 
every string, every name, using double the amount of storage (four bytes 
per character instead of two), whether it needs it or not.
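
If you want to check where '\ud83d' comes from, the UTF-16 arithmetic is 
short enough to do by hand. The sketch below is meant to be run on Python 
3.3 or later; surrogate_pair is a helper name I made up for illustration, 
not something from the standard library.

```python
def surrogate_pair(code_point):
    """Split a supplementary-plane code point into UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError('not a supplementary-plane code point')
    offset = code_point - 0x10000        # a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


air = ord('\N{ALCHEMICAL SYMBOL FOR AIR}')
high, low = surrogate_pair(air)
print(hex(air), hex(high), hex(low))     # 0x1f701 0xd83d 0xdf01
```

A narrow build stores each of the four alchemical symbols as two such code 
units, which is why len(buffer) is 12 + 4 = 16 there, and why buffer[5] is 
the high surrogate alone rather than a whole character.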


-- 
Steven



