Re: a little parsing challenge ☺

Tue Jul 19 01:36:33 EDT 2011

On Tue, Jul 19, 2011 at 2:59 PM, rusi <rustompmody at gmail.com> wrote:
> Some evidences of leakiness:
> code point vs character vs byte
> encoding and decoding
> UTF-x and UCS-y
>
> Very important and necessary distinctions? Maybe... But I did not need
> them when my world was built of the 127 bricks of ASCII.

Codepoint vs byte is NOT an abstraction. Unicode consists of
characters, where each character is represented by a number called its
codepoint. Since computers work with bytes, we need a way of encoding
those characters into bytes. It's no different from encoding a piece
of music in bytes, and having it come out as 0x90 0x64 0x40. Are those
bytes an abstraction of the note? No. They're an encoding of a MIDI
message that requests that the note be struck. The note itself is an
abstraction, if you like; but the bytes to create that note could be
delivered in a variety of other ways.
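
To make the codepoint/byte distinction concrete, here is a minimal sketch. It assumes Python 3, and the character and the three codecs are arbitrary picks for illustration. One character has one codepoint, but its bytes depend on the encoding chosen:

```python
# One character, one codepoint, several possible byte encodings.
ch = "\u00e9"        # LATIN SMALL LETTER E WITH ACUTE (arbitrary example)
codepoint = ord(ch)  # the number Unicode assigns to this character

# The same character encoded three different ways.
utf8 = ch.encode("utf-8")
utf16 = ch.encode("utf-16-be")
latin1 = ch.encode("latin-1")

# The byte strings all differ from one another...
assert len({utf8, utf16, latin1}) == 3

# ...but decoding each with its matching codec gives back the same character.
assert utf8.decode("utf-8") == ch
assert utf16.decode("utf-16-be") == ch
assert latin1.decode("latin-1") == ch
```

The bytes are just delivery formats for the character, much as the three MIDI bytes are a delivery format for the note.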

A Python Unicode string, whether it's Python 2's 'unicode' or Python
3's 'str', is a sequence of characters. Since those characters are
stored in memory, they must be encoded somehow, but that's not our
problem. We need only care about encoding when we save those
characters to disk, transmit them across the network, or in some other
way need to store them as bytes. Otherwise, there is no abstraction,
and no leak.
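
A short sketch of where encoding actually comes up, assuming Python 3 and using io.BytesIO as a stand-in for a file or socket (the sample text is made up for illustration). Inside the program the string is handled purely as characters; bytes appear only at the boundary:

```python
import io

text = "na\u00efve \u2603"  # 'naive' with a diaeresis, a space, and a snowman

# Inside the program: a sequence of characters, no encoding in sight.
n_chars = len(text)         # counts characters, not bytes
last_char = text[-1]        # indexing yields a character
shouted = text.upper()      # string methods work on characters

# At the boundary: encode explicitly when handing the text to a byte sink.
sink = io.BytesIO()         # stand-in for a file opened in binary mode
sink.write(text.encode("utf-8"))
raw = sink.getvalue()
n_bytes = len(raw)          # byte count differs from character count

# Reading back: decode the bytes to get characters again.
round_tripped = raw.decode("utf-8")
assert round_tripped == text
```

Only the two lines that call encode() and decode() know or care which encoding is in use.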

Chris Angelico