[Python-ideas] string codes & substring equality

Thu Nov 28 07:52:31 CET 2013

On Nov 27, 2013, at 22:05, Chris Angelico <rosuav at gmail.com> wrote:

> On Thu, Nov 28, 2013 at 4:55 PM, Andrew Barnert 

>> Especially for Unicode, where a character isn't a byte, but an abstract code point that can be represented as at least three different variable-length sequences, taking up to 6 bytes.
> 
> No, a character is simply an integer. How it's represented is
> immaterial. The easiest representation in Python is a straight int,
> the easiest in C is probably also an int (32-bit; if it's 64-bit, you
> waste 40-odd bits, but it's still easiest); the variable length byte
> representations are for transmission/storage, not for manipulation

The easiest representation of a Unicode character is a Unicode string. It's certainly easiest for the person writing and debugging Python code, who can call string methods like isdigit, print out the character or its repr, etc. It's no harder for the person writing the Python implementation. If you mean easiest for the CPU, do you really think creating and dealing with arbitrary-length integers wrapped in structs with PyObject headers is easier than dealing with strings of 1/2/4-byte characters wrapped in structs with PyObject headers?
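
To make the comparison concrete, here is a rough sketch in Python 3 of the same character held both ways. The sample character (U+0663) and the variable names are just picked for illustration; nothing here is specific to any proposal in this thread.

```python
import unicodedata

ch = "\u0663"   # one character held as a one-character str
cp = ord(ch)    # the same character held as a bare int code point

# The str form can answer questions about itself directly.
is_digit_from_str = ch.isdigit()
char_name = unicodedata.name(ch)

# The int form has to be turned back into a str before it can do the same.
is_digit_from_int = chr(cp).isdigit()

# repr() of the str shows the character; repr() of the int shows only a number.
repr_of_str = repr(ch)
repr_of_int = repr(cp)

# The variable-length byte forms only appear when the character is encoded.
encoded_lengths = {
    enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")
}

print(repr_of_str, repr_of_int, char_name, encoded_lengths)
```

The int round-trips through chr() just fine, but every useful question about the character goes through the str anyway.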