a simple unicode question

Nobody nobody at nowhere.com
Wed Oct 21 12:35:11 EDT 2009


On Wed, 21 Oct 2009 05:16:56 -0400, Chris Jones wrote:

>> > Where are the literals (i.e. u'\N{DEGREE SIGN}') defined? 
>> 
>> You can get them from the unicodedata module, e.g.:
>> 
>> 	import unicodedata
>> 	for i in xrange(0x10000):
>> 	  n = unicodedata.name(unichr(i),None)
>> 	  if n is not None:
>> 	    print i, n
> 
> Python rocks!
> 
> Just curious, why did you choose to set the upper boundary at 0xffff?

Characters outside the 16-bit range (i.e. outside the BMP) aren't supported
on all builds. In particular, most Windows builds are "narrow" builds and
won't support them, as Windows uses 16-bit Unicode extensively:

	Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)] on win32
	>>> unichr(0x10000)
	Traceback (most recent call last):
	  File "<stdin>", line 1, in <module>
	ValueError: unichr() arg not in range(0x10000) (narrow Python build)
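
Rather than hard-coding the bound, the loop could take it from
sys.maxunicode, which is 0xFFFF on a narrow build and 0x10FFFF otherwise.
What follows is only a sketch, written for Python 3 (so range() and chr()
stand in for xrange() and unichr()); the name named_code_points is made up
for illustration:

```python
import sys
import unicodedata


def named_code_points(stop=None):
    """Yield (code point, name) for every named character below `stop`."""
    if stop is None:
        # sys.maxunicode is the largest code point chr() accepts on this
        # build, so this bound never asks for an unsupported value.
        stop = sys.maxunicode + 1
    for cp in range(stop):
        # The second argument is returned for unnamed code points
        # instead of raising ValueError.
        name = unicodedata.name(chr(cp), None)
        if name is not None:
            yield cp, name
```

Calling named_code_points(0x100) keeps the output short; with no argument
it walks every code point the build supports.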

Note that narrow builds do understand names outside of the BMP, and
generate surrogate pairs for them:

	>>> u'\N{LINEAR B SYLLABLE B008 A}'
	u'\U00010000'
	>>> len(_)
	2
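
The length of 2 comes from the UTF-16 surrogate encoding: 0x10000 is
subtracted from the code point and the remaining 20 bits are split across
two 16-bit units. A small sketch of that arithmetic, in Python 3, with
helper names (to_surrogate_pair, utf16_units) that are my own:

```python
import struct


def to_surrogate_pair(cp):
    """Return the (high, low) UTF-16 surrogates for a non-BMP code point."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = cp - 0x10000              # 20 significant bits remain
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


def utf16_units(text):
    """Return the 16-bit code units of `text` as encoded by the codec."""
    data = text.encode("utf-16-be")
    return struct.unpack(">%dH" % (len(data) // 2), data)
```

For U+10000 both routes give the pair (0xD800, 0xDC00), which is what a
narrow build stores for the literal above.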

Whether using surrogates in this context is a good idea is open to debate.
Once a character can span more than one wchar, what's the advantage of a
multi-wchar string over a multi-byte string?
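
To make the comparison concrete, here is a rough sketch (Python 3; the
function name code_units is hypothetical) that counts how many units one
string takes as code points, as UTF-8 bytes, and as UTF-16 code units.
Both encodings turn out to be variable-width once non-BMP characters are
involved:

```python
def code_units(text):
    """Return (code points, UTF-8 bytes, UTF-16 code units) for `text`."""
    utf8_units = len(text.encode("utf-8"))
    # Each UTF-16 code unit is two bytes; "-le" avoids a byte-order mark.
    utf16_units = len(text.encode("utf-16-le")) // 2
    # len() counts code points here, assuming Python 3.3 or later.
    return len(text), utf8_units, utf16_units
```

For the degree sign this gives (1, 2, 1), and for U+10000 it gives
(1, 4, 2).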



