How to get a "screen" length of a multibyte string?
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Sun Nov 25 08:30:24 EST 2012
On Sun, 25 Nov 2012 22:12:33 +1100, Chris Angelico wrote:
> On Sun, Nov 25, 2012 at 9:19 PM, kobayashi <pg.koba at gmail.com> wrote:
>> Hello,
>>
>> Under platform that has fixed pitch font, I want to get a "screen"
>> length of a multibyte string
>>
>> --- sample ---
>> s1 = u"abcdef"
>> s2 = u"あいう" # It has same "screen" length as s1's.
>> print len(s1) # Got 6
>> print len(s2) # Got 3, but I want get 6.
>> --------------
>>
>> Above can get a "character" length of a multibyte string. Is there a
>> way to get a "screen" length of a multibyte string?
>
> What do you mean by screen length? Do you mean the length in bytes? That
> depends on your encoding. Do you mean width of the displayed version?
> That depends on your font.
That's what I thought, but on doing some experimentation in my terminal,
and doing some googling, I have come to the understanding that so-called
monospaced (fixed-width) fonts may support *double column* characters as
well as single column.
So the OP's example has:
s1 = u"abcdef"
s2 = u"あいう"
s1 has six single-column ("narrow") characters, while s2 has three double-
column ("wide") characters, and both strings should take up the same
horizontal space on screen.
If you are reading this in a non-monospaced font, the width of each
character is not fixed, the idea of columns doesn't really work, and the
strings may not be the same width.
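To see how Unicode classifies the characters in those two strings, a minimal sketch along these lines should work. It is written with the print() function so it runs under Python 3; the variable names narrow and wide are just for illustration.

```python
from unicodedata import east_asian_width

s1 = u"abcdef"
s2 = u"あいう"

# Look up the East_Asian_Width property of every character.
# 'Na' means narrow (one column), 'W' means wide (two columns).
narrow = [east_asian_width(c) for c in s1]
wide = [east_asian_width(c) for c in s2]

print(narrow)
print(wide)
```

The ASCII letters come back as 'Na' and the hiragana as 'W', which matches what the terminal shows.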
See http://www.unicode.org/reports/tr11/tr11-19.html for more detail.
Interestingly, Unicode supports wide versions of many non-EastAsian
characters (presumably because pre-Unicode EastAsian encodings supported
them). For example, run this code in Python:
print u'\N{FULLWIDTH LATIN CAPITAL LETTER A}'; print u'AA'
which should output:
Ａ
AA
If your font supports this, you should see a single "A" as wide as the
double "AA" beneath it.
Curiously, in the monospaced font I am using to type this, the
"fullwidth" (wide, two-column) A is actually two-thirds the width of the
standard ("halfwidth", narrow, one-column) A. Font designers -- can't
live with them, can't take them out and shoot them.
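Whatever the font does, the Unicode database itself is unambiguous about the two letters. A small sketch, assuming Python 3 and using only the standard unicodedata module (the names fullwidth_a and folded are mine):

```python
import unicodedata

fullwidth_a = u'\N{FULLWIDTH LATIN CAPITAL LETTER A}'

# The fullwidth letter is classed 'F' (fullwidth), the ASCII one 'Na' (narrow).
print(unicodedata.east_asian_width(fullwidth_a))
print(unicodedata.east_asian_width(u'A'))

# Compatibility normalization (NFKC) folds the fullwidth form back to ASCII.
folded = unicodedata.normalize('NFKC', fullwidth_a)
print(folded == u'A')
```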
Hans Mulder's suggestion:
from unicodedata import east_asian_width

def screen_length(s):
    return sum(2 if east_asian_width(c) == 'W' else 1 for c in s)
is almost right. The Unicode document above states:
[quote]
In a broad sense, wide characters include W, F, and A (when in East Asian
context), and narrow characters include N, Na, H, and A (when not in East
Asian context).
[end quote]
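To make the gap concrete, here is a small sketch (Python 3) that runs the 'W'-only function on one character from three different categories. The three sample characters are my own choice for illustration.

```python
from unicodedata import east_asian_width

def screen_length(s):
    # The 'W'-only version: everything else counts as one column.
    return sum(2 if east_asian_width(c) == 'W' else 1 for c in s)

hiragana = u'\N{HIRAGANA LETTER A}'                  # category W
fullwidth = u'\N{FULLWIDTH LATIN CAPITAL LETTER A}'  # category F
alpha = u'\N{GREEK SMALL LETTER ALPHA}'              # category A (ambiguous)

for ch in (hiragana, fullwidth, alpha):
    print(east_asian_width(ch), screen_length(ch))
```

Only the hiragana is counted as two columns. The fullwidth A is reported as one column even though it is meant to be wide, and the ambiguous alpha is always treated as narrow.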
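So this version, which also counts F (and, in an East Asian context, A) as wide:

from unicodedata import east_asian_width

def columns(s, eastasian_context=True):
    # Ambiguous ('A') characters are wide only in an East Asian context.
    if eastasian_context:
        wide = 'WFA'
    else:
        wide = 'WF'
    return sum(2 if east_asian_width(c) in wide else 1 for c in s)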
ought to do it for all but the most sophisticated text layout
applications. For those needing much more sophistication, see here:
http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
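One thing the simple category test gets wrong is combining characters, which occupy no column of their own. The sketch below is not a port of wcwidth.c, only a rough illustration of that one idea under my own assumptions: the function name columns_with_combining is hypothetical, and it skips any character for which unicodedata.combining() is nonzero. Control characters and other zero-width cases are not handled.

```python
import unicodedata

def columns_with_combining(s, eastasian_context=True):
    # Same wide/narrow rule as before, but combining marks add no width.
    wide = 'WFA' if eastasian_context else 'WF'
    total = 0
    for c in s:
        if unicodedata.combining(c):
            # Nonzero combining class: drawn over the previous character.
            continue
        total += 2 if unicodedata.east_asian_width(c) in wide else 1
    return total

# 'e' followed by a combining acute accent: two code points, one column.
decomposed = u'e\N{COMBINING ACUTE ACCENT}'
print(columns_with_combining(decomposed))
print(columns_with_combining(u"abcdef"))
print(columns_with_combining(u"あいう"))
```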
--
Steven