How is unicode implemented behind the scenes?
wxjmfauth at gmail.com
Sun Mar 9 04:39:32 EDT 2014
On Sunday, 9 March 2014 03:40:28 UTC+1, MRAB wrote:
> On 2014-03-09 02:08, Dan Stromberg wrote:
> > OK, I know that Unicode data is stored in an encoding on disk.
> >
> > But how is it stored in RAM?
> >
> > I realize I shouldn't write code that depends on any relevant
> > implementation details, but knowing some of the more common
> > implementation options would probably help build an intuition for
> > what's going on internally.
> >
> > I've heard that characters are no longer all c bytes wide internally,
> > so is it sometimes utf-8?
> >
> No.
>
> From Python 3.3, it's an array of 1, 2 or 4 bytes per codepoint.
>
> In Python terms:
>
> if all(c <= '\xFF' for c in string):
>     use 1 byte per codepoint
> elif all(c <= '\uFFFF' for c in string):
>     use 2 bytes per codepoint
> else:
>     use 4 bytes per codepoint
A very, very nice recursive mathematical
absurdity.
jmf
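
For anyone who wants to try the rule quoted above, here is a minimal runnable
sketch. The helper name bytes_per_codepoint is made up for illustration and is
not part of Python; it only mirrors the 1/2/4 decision described in the quoted
message and does not inspect what the interpreter actually allocates.

```python
def bytes_per_codepoint(s):
    """Hypothetical helper: width in bytes per code point under the quoted rule."""
    # The widest code point in the string decides the width for the whole string.
    highest = max(map(ord, s), default=0)
    if highest <= 0xFF:
        return 1
    if highest <= 0xFFFF:
        return 2
    return 4


# One sample per width class, plus a mixed string.
for sample in ("abc", "caf\u00e9", "\u20ac", "a\U0001F600"):
    print(ascii(sample), bytes_per_codepoint(sample))
```

Note that a single wide character is enough to move the whole string to the
wider representation, as the last sample shows.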