How to get a "screen" length of a multibyte string?

Sun Nov 25 08:30:24 EST 2012

On Sun, 25 Nov 2012 22:12:33 +1100, Chris Angelico wrote:

> On Sun, Nov 25, 2012 at 9:19 PM, kobayashi <pg.koba at gmail.com> wrote:
>> Hello,
>>
>> Under platform that has fixed pitch font, I want to get a "screen"
>> length of a multibyte string
>>
>> --- sample ---
>> s1 = u"abcdef"
>> s2 = u"あいう" # It has same "screen" length as s1's.
>> print len(s1)  # Got 6
>> print len(s2)  # Got 3, but I want get 6.
>> --------------
>>
>> Above can get a "character" length of a multibyte string. Is there a
>> way to get a "screen" length of a multibyte string?
> 
> What do you mean by screen length? Do you mean the length in bytes? That
> depends on your encoding. Do you mean width of the displayed version?
> That depends on your font.

That's what I thought, but on doing some experimentation in my terminal, 
and doing some googling, I have come to the understanding that so-called 
monospaced (fixed-width) fonts may support *double column* characters as 
well as single column.

So the OP's example has:

s1 = u"abcdef"
s2 = u"あいう"

s1 has six single-column ("narrow") characters, while s2 has three double-
column ("wide") characters, and both strings should take up the same 
horizontal space on screen.

If you are reading this in a non-monospaced font, the width of each 
character is not fixed, the idea of columns doesn't really work, and the 
strings may not be the same width.

See http://www.unicode.org/reports/tr11/tr11-19.html for more detail.
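
As a quick check of the narrow/wide distinction, the East Asian Width 
category of each character in the two strings can be looked up with the 
standard library. This is only a small sketch, written for Python 3 (the 
u prefix is kept to match the rest of the thread):

```python
from unicodedata import east_asian_width

s1 = u"abcdef"
s2 = u"あいう"

# One East Asian Width category per character.
print([east_asian_width(c) for c in s1])  # six 'Na' (narrow) entries
print([east_asian_width(c) for c in s2])  # three 'W' (wide) entries
```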

Interestingly, Unicode supports wide versions of many non-EastAsian 
characters (presumably because pre-Unicode EastAsian encodings supported 
them). For example, run this code in Python:

print u'\N{FULLWIDTH LATIN CAPITAL LETTER A}'; print u'AA'

which should output:

Ａ
AA

If your font supports this, you should see a single "A" as wide as the 
double "AA" beneath it. 

Curiously, in the monospaced font I am using to type this, the 
"fullwidth" (wide, two-column) A is actually 2/3rds the width of the 
standard ("halfwidth", narrow, one-column) A. Font designers -- can't 
live with them, can't take them out and shoot them.
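
Whatever a given font does with the glyph, the category Unicode assigns 
to the character can be checked directly. A minimal sketch for Python 3; 
the variable name fullwidth_a is just mine for illustration:

```python
from unicodedata import east_asian_width

fullwidth_a = u"\N{FULLWIDTH LATIN CAPITAL LETTER A}"

# The fullwidth form has its own category, 'F', distinct from 'W'.
print(east_asian_width(fullwidth_a))  # F
print(east_asian_width(u"A"))         # Na
```

Note that the category is 'F' (fullwidth), not 'W' (wide), so any test 
that looks only for 'W' will count this character as one column.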

Hans Mulder's suggestion:

from unicodedata import east_asian_width

def screen_length(s):
    return sum(2 if east_asian_width(c) == 'W' else 1 for c in s)

is almost right, but it treats only category 'W' as wide. The Unicode 
document above states:

[quote]
In a broad sense, wide characters include W, F, and A (when in East Asian 
context), and narrow characters include N, Na, H, and A (when not in East 
Asian context).
[end quote]

So a version that follows that broader definition:

from unicodedata import east_asian_width
def columns(s, eastasian_context=True):
    if eastasian_context:
        wide = 'WFA'
    else:
        wide = 'WF'
    return sum(2 if east_asian_width(c) in wide else 1 for c in s)

ought to do it for all but the most sophisticated text layout 
applications. For those needing much more sophistication, see here:

http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
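
To show what the simple approach gets right and where it falls short, 
here is a rough sketch for Python 3. It repeats columns() so that it 
runs on its own, then adds a hypothetical helper, columns_zw(), that 
gives zero columns to combining marks and format characters. That helper 
and the name ZERO_WIDTH_CATEGORIES are my own assumptions; this is only 
an approximation of one thing wcwidth.c does, not a port of it.

```python
from unicodedata import category, east_asian_width

def columns(s, eastasian_context=True):
    # 'A' (ambiguous) characters count as wide only in East Asian context.
    wide = 'WFA' if eastasian_context else 'WF'
    return sum(2 if east_asian_width(c) in wide else 1 for c in s)

# Assumption: treat non-spacing marks, enclosing marks and format
# characters as taking no column at all.
ZERO_WIDTH_CATEGORIES = ('Mn', 'Me', 'Cf')

def columns_zw(s, eastasian_context=True):
    # Hypothetical helper: drop zero-width characters, then count as before.
    visible = u"".join(c for c in s
                       if category(c) not in ZERO_WIDTH_CATEGORIES)
    return columns(visible, eastasian_context)

print(columns(u"abcdef"))  # 6
print(columns(u"あいう"))   # 6
print(columns(u"\N{FULLWIDTH LATIN CAPITAL LETTER A}"))  # 2

# Greek alpha is category 'A', so the answer depends on the context flag.
alpha = u"\N{GREEK SMALL LETTER ALPHA}"
print(columns(alpha), columns(alpha, eastasian_context=False))  # 2 1

# "e" followed by a combining accent displays as a single column.
decomposed = u"e\N{COMBINING ACUTE ACCENT}"
print(columns(decomposed, eastasian_context=False))     # 2
print(columns_zw(decomposed, eastasian_context=False))  # 1
```

The last two lines show the main gap: columns() charges a column for the 
combining accent, while the display uses none for it.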

-- 
Steven