[Tutor] how to struct.pack a unicode string?

Steven D'Aprano steve at pearwood.info
Thu Jan 3 18:38:08 CET 2013


On 03/01/13 23:52, eryksun wrote:
> On Tue, Jan 1, 2013 at 1:29 AM, Steven D'Aprano <steve at pearwood.info> wrote:
>>
>> 2 Since "wide builds" use so much extra memory for the average ASCII
>>    string, hardly anyone uses them.
>
> On Windows (and I think OS X, too) a narrow build has been practical
> since the wchar_t type is 16-bit. As to Linux I'm most familiar with
> Debian, which uses a wide build. Do you know off-hand which distros
> release a narrow build?

CentOS does, and presumably therefore Red Hat does too. Fedora did, and
I presume still does.

I didn't actually realize until now that Debian defaults to a wide
build.
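
For anyone who wants to check their own interpreter, here is a minimal
sketch using only the standard sys module. The function name unicode_build
is made up for illustration. sys.maxunicode is 0xFFFF on a narrow build and
0x10FFFF on a wide build; from Python 3.3 onwards the distinction goes away
and it is always 0x10FFFF.

```python
import sys

def unicode_build():
    """Report whether this interpreter stores the full Unicode range directly."""
    # Narrow builds top out at U+FFFF and use surrogate pairs beyond that.
    # Wide builds (and every Python 3.3+) report U+10FFFF.
    return "narrow" if sys.maxunicode == 0xFFFF else "wide"

print(unicode_build(), hex(sys.maxunicode))
```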


>> But more important than the memory savings, it means that for the first
>> time Python's handling of Unicode strings is correct for the entire range
>> of all one million plus characters, not just the first 65 thousand.
>
> Still, be careful not to split 'characters':
>
>      >>> list(normalize('NFC', '\u1ebf'))
>      ['ế']
>      >>> list(normalize('NFD', '\u1ebf'))
>      ['e', '̂', '́']


Yes, but presumably if you are normalizing to decomposed forms (NFD or NFKD
modes), you're doing it for a reason and are taking care not to let the
accents wander away from their base character, unless you want them to.
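
One way to keep the accents attached, as a rough sketch only: group each
base character with the combining marks that follow it before you slice or
iterate. The helper name split_clusters is hypothetical, and this is a
simplification that relies on unicodedata.combining(); it is not full
grapheme cluster segmentation and will not handle every script.

```python
import unicodedata

def split_clusters(text):
    """Group each base character with the combining marks that follow it."""
    clusters = []
    for ch in text:
        # combining() returns a non-zero class for combining marks.
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

decomposed = unicodedata.normalize('NFD', '\u1ebf')
print(len(decomposed))                  # number of code points
print(len(split_clusters(decomposed)))  # number of base-plus-accents groups
```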

By the way, for anyone else trying this, the normalize function above is not
a built-in; it comes from the unicodedata module.
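
A small sketch of the import, and of why the form matters when comparing
strings (Python 3 syntax assumed; the variable names are only for
illustration). The composed and decomposed spellings look the same on
screen but do not compare equal until both are normalized to the same form.

```python
from unicodedata import normalize

composed = normalize('NFC', '\u1ebf')    # single precomposed code point
decomposed = normalize('NFD', '\u1ebf')  # base letter plus two combining marks

# Same visible character, different code point sequences.
print(composed == decomposed)

# Normalizing both sides to one form makes the comparison meaningful.
same_after_nfc = normalize('NFC', decomposed) == composed
print(same_after_nfc)
```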

More on normalization:

https://en.wikipedia.org/wiki/Unicode_equivalence



Doing-a-lot-of-presuming-today-ly y'rs,


-- 
Steven

