How is unicode implemented behind the scenes?

Mark Lawrence breamoreboy at yahoo.co.uk
Sun Mar 9 10:53:05 EDT 2014


On 09/03/2014 10:32, Rustom Mody wrote:
> On Sunday, March 9, 2014 2:09:32 PM UTC+5:30, wxjm... at gmail.com wrote:
>> On Sunday, 9 March 2014 03:40:28 UTC+1, MRAB wrote:
>>> On 2014-03-09 02:08, Dan Stromberg wrote:
>>>> OK, I know that Unicode data is stored in an encoding on disk.
>>>> But how is it stored in RAM?
>>>> I realize I shouldn't write code that depends on any relevant
>>>> implementation details, but knowing some of the more common
>>>> implementation options would probably help build an intuition for
>>>> what's going on internally.
>>>> I've heard that characters are no longer all c bytes wide internally,
>>>> so is it sometimes utf-8?
>>> No.
>>> From Python 3.3, it's an array of 1, 2 or 4 bytes per codepoint.
>>> In Python terms:
>>> if all(c <= '\xFF' for c in string):
>>>     use 1 byte per codepoint
>>> elif all(c <= '\uFFFF' for c in string):
>>>     use 2 bytes per codepoint
>>> else:
>>>     use 4 bytes per codepoint
>
>> A very, very nice recursive mathematical absurdity.
>
> As a profoundly astute mathematician
> v v n r m a
> can be parsed in 42 different ways (5th Catalan number)
>
> Which parse did you intend?
>
>

Please don't feed this particular troll, it's a complete waste of time.
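
On the original question, the rule MRAB describes above can be checked from
Python itself. This is only a minimal sketch, assuming CPython 3.3 or later;
the helper names bytes_per_codepoint and measured_width are made up for
illustration, and sys.getsizeof reports an implementation detail that other
Python implementations need not share.

```python
import sys


def bytes_per_codepoint(string):
    """Width predicted by the rule quoted above (hypothetical helper)."""
    # The widest codepoint in the string decides the width for all of it.
    widest = max(map(ord, string), default=0)
    if widest <= 0xFF:
        return 1
    elif widest <= 0xFFFF:
        return 2
    return 4


def measured_width(char):
    """Per-character growth in bytes as reported by sys.getsizeof.

    Assumption: CPython 3.3+, where the fixed header cancels out when
    two strings of the same kind are compared.
    """
    return (sys.getsizeof(char * 1001) - sys.getsizeof(char)) // 1000


for sample in ("a", "\xe9", "\u20ac", "\U0001F600"):
    print(ascii(sample), bytes_per_codepoint(sample), measured_width(sample))
```

The two numbers printed on each line should agree on CPython 3.3+. Note that
a single wide character anywhere in a string widens every codepoint in it.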

-- 
My fellow Pythonistas, ask not what our language can do for you, ask 
what you can do for our language.

Mark Lawrence

More information about the Python-list mailing list