[Tutor] sort() method and non-ASCII

Cameron Simpson cs at zip.com.au
Sun Feb 5 23:49:44 EST 2017


On 05Feb2017 22:27, boB Stepp <robertvstepp at gmail.com> wrote:
>On Sun, Feb 5, 2017 at 7:23 PM, Steven D'Aprano <steve at pearwood.info> wrote:
>> Alternatively, you can embed it right in the string. For code points
>> between U+0000 and U+FFFF, use the \u escape, and for the rest, use \U
>> escapes:
>>
>> py> 'pi = \u03C0'  # requires exactly four hex digits
>> 'pi = π'
>>
>> py> 'pi = \U000003C0'  # requires exactly eight hex digits
>> 'pi = π'
>>
>>
>> Lastly, you can use the code point's name:
>>
>> py> 'pi = \N{GREEK SMALL LETTER PI}'
>> 'pi = π'
>
>You have surprised me here by using single quotes to enclose the
>entire assignment statements.  I thought this would throw a syntax
>error, but it works just like you show.  What is going on here?

It's not an assignment statement. It's just a string. He's typing a string 
containing a \N{...} sequence and Python's printing that string back at you; 
pi's a printable character and gets displayed directly.

Try with this:

  py> 'here is a string\n\nline 3'
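
A minimal sketch of the same point as a script, assuming Python 3 (the
variable names are only for illustration). The interactive prompt shows you
the repr() of a bare expression, while print() writes the characters
themselves:

```python
s = 'here is a string\n\nline 3'

# What the interactive prompt echoes: quotes included, escapes left visible.
echoed = repr(s)
print(echoed)

# What print() writes: the \n escapes become real line breaks.
print(s)

# A \N{...} escape is resolved when the literal is parsed, so the resulting
# string holds a single character, the same one that \u03C0 produces.
pi = '\N{GREEK SMALL LETTER PI}'
print(len(pi), pi == '\u03c0')
```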

>> One last comment: Random832 said:
>> "Python 3 strings are unicode-unicode, not UTF-8."
>
>If I recall what I originally wrote (and intended) I was merely
>indicating I was happy with Python 3's default UTF-8 encoding.  I do
>not know enough to know what these other UTF encodings offer.

From the outside (i.e. to your code) Python 3 strings are sequences of Unicode 
code points (characters, near enough). How they're _stored_ internally is not 
your problem:-) When you write a string to a file or the terminal etc, the 
string needs to be _encoded_ into a sequence of bytes (a sequence of bytes 
because there are more Unicode code points than can be expressed with one 
byte).
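
As a rough illustration of that encode step, assuming Python 3 and the 'πz'
example string from the quoted text (the variable names are arbitrary):

```python
s = 'πz'

# A str is a sequence of code points; pi is code point 0x3C0, too big for
# a single byte.
print(len(s))        # 2
print(hex(ord(s[0])))  # 0x3c0

# Encoding turns the code points into bytes. In UTF-8, pi needs two bytes
# and z needs one.
data = s.encode('utf-8')
print(len(data))     # 3
print(data.hex())    # cf807a

# Decoding with the same encoding recovers the original string.
round_trip = data.decode('utf-8')
print(round_trip == s)  # True
```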

UTF-8 is by far the commonest such encoding in use. It has several nice 
characteristics: for one, the ASCII code points _are_ stored in a single byte.  
While that's nice for Western almost-only-speaking-English folks like me, it 
also means that the zillions of existing ASCII text files don't need to be 
recoded to work in UTF-8. It has other cool features too.
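
A small sketch of that ASCII compatibility, assuming Python 3; the byte
string below merely stands in for the contents of an old ASCII file:

```python
text = 'plain ASCII text'

# For pure ASCII text the two encodings produce identical bytes.
as_ascii = text.encode('ascii')
as_utf8 = text.encode('utf-8')
same = as_ascii == as_utf8
print(same)  # True

# So bytes written long ago as ASCII decode as UTF-8 without any recoding.
old_file_bytes = b'hello, world\n'
decoded = old_file_bytes.decode('utf-8')
print(decoded == 'hello, world\n')  # True
```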

>> To be pedantic, Unicode strings are sequences of abstract code points
>> ("characters"). UTF-8 is a particular concrete implementation that is
>> used to store or transmit such code strings. Here are examples of three
>> possible encoding forms for the string 'πz':
>>
>> UTF-16: either two, or four, bytes per character: 03C0 007A
>>
>> UTF-32: exactly four bytes per character: 000003C0 0000007A
>>
>> UTF-8: between one and four bytes per character: CF80 7A
>
>I have not tallied up how many code points are actually assigned to
>characters.  Does UTF-8 encoding currently cover all of them?  If yes,
>why is there a need for other encodings?  Or by saying:

UTF-8 is variable length. You can leap into the middle of a UTF-8 string and 
resync (== find the first byte of the next character) thanks to its neat coding 
design, but you can't "seek" directly to the position of an arbitrarily 
numbered character (eg go to character 102345). By contrast, UTF-32 is fixed 
length, so character N always starts at byte offset 4*N.
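
Both properties can be sketched in a few lines of Python 3. The helper names
next_char_start and char_at are made up for this example, and the sketch
assumes well-formed input. The resync trick relies on the fact that UTF-8
continuation bytes always have the bit pattern 10xxxxxx:

```python
def next_char_start(data, pos):
    """Return the first index at or after pos that begins a character."""
    # Skip continuation bytes (0b10xxxxxx); any other byte starts a character.
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos


def char_at(data, n):
    """Return character n from big-endian UTF-32 bytes by direct offset."""
    return data[4 * n:4 * n + 4].decode('utf-32-be')


text = 'aπz€b'
utf8 = text.encode('utf-8')       # bytes: 61 | cf 80 | 7a | e2 82 ac | 62
utf32 = text.encode('utf-32-be')  # four bytes for every character

# Landing in the middle of pi (index 2) resyncs to z at index 3.
print(next_char_start(utf8, 2))   # 3
# Landing in the middle of the euro sign (index 5) resyncs to b at index 7.
print(next_char_start(utf8, 5))   # 7

# With UTF-32, character 3 is found by arithmetic alone.
print(char_at(utf32, 3))          # €
```

The price of the fixed width is size: here the UTF-8 form is 8 bytes and the
UTF-32 form is 20.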

>> (UTF-16 and UTF-32 are hardware-dependent, and the byte order could be
>> reversed, e.g. C003 7A00. UTF-8 is not.)
>
>do you mean that some hardware configurations require UTF-16 or UTF-32?

No, different machines order the bytes in a larger word in different orders.  
"Big endian" machines like SPARCs and M68k etc put the most significant bytes 
first; little endian machines put the least significant bytes first (eg Intel 
architecture machines). (Aside: the Alpha was switchable.)

So the "natural" way to write UTF-16 or UTF-32 might be big or little endian, 
and you need to know what was chosen for a given file.
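
To make the byte order issue concrete, here is a short Python 3 sketch using
the 'πz' string from the quoted example. The variable names are arbitrary;
the codec names are the standard ones Python ships with:

```python
import codecs
import sys

s = 'πz'

# The same two characters, written in each byte order.
be = s.encode('utf-16-be')
le = s.encode('utf-16-le')
print(be.hex())  # 03c0007a
print(le.hex())  # c0037a00

# Guessing the wrong order does not fail here; it quietly gives other text.
misread = le.decode('utf-16-be')
print(misread == s)  # False

# The plain 'utf-16' codec writes in the machine's native order and puts a
# byte order mark (BOM) first, so a reader can tell which order was used.
print(sys.byteorder)
with_bom = s.encode('utf-16')
has_bom = with_bom[:2] in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)
print(has_bom)   # True
print(with_bom.decode('utf-16') == s)  # True
```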

Cheers,
Cameron Simpson <cs at zip.com.au>
