how to avoid leading white spaces

Chris Angelico rosuav at gmail.com
Thu Jun 2 23:52:03 EDT 2011


On Fri, Jun 3, 2011 at 1:44 PM, Roy Smith <roy at panix.com> wrote:
> In article <is9ikg083h at news1.newsguy.com>,
>  Chris Torek <nospam at torek.net> wrote:
>
>> Python might be penalized by its use of Unicode here, since a
>> Boyer-Moore table for a full 16-bit Unicode string would need
>> 65536 entries (one per possible ord() value).
>
> I'm not sure what you mean by "full 16-bit Unicode string"?  Isn't
> unicode inherently 32 bit?  Or at least 20-something bit?  Things like
> UTF-16 are just one way to encode it.

The size of a Unicode character is like the size of a number: it's not
inherently defined in terms of a maximum. In practice, though, code
points run from U+0000 to U+10FFFF, which is 17 planes of 2^16 code
points each, and nearly all the assigned printable characters live in
planes 0-2. That kinda makes Unicode 21-bit. UTF-16 stores anything
beyond plane 0 as a surrogate pair: subtract 0x10000 and you have a
20-bit number, stored in two 16-bit numbers. (UCS-2 has no surrogates
and only covers plane 0.) Why do I get the feeling I've met that
before...
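
For what it's worth, here's a rough sketch of that packing in Python.
The function names (to_surrogates, from_surrogates) are made up for
illustration and aren't part of any library; the constants are the
standard UTF-16 surrogate ranges.

```python
def to_surrogates(cp):
    """Split a code point in U+10000..U+10FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = cp - 0x10000               # now a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits -> low surrogate
    return high, low


def from_surrogates(high, low):
    """Recombine a surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


if __name__ == "__main__":
    hi, lo = to_surrogates(0x1F600)
    print("%04X %04X" % (hi, lo))
    print("%X" % from_surrogates(hi, lo))
```

Both halves are 16-bit values, but only 10 bits of each carry data,
which is where the 20 comes from.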

Chris Angelico
136E:0100 CD 20   INT 20


