Python NBSP DWIM

Steven D'Aprano steve at pearwood.info
Wed Jun 10 22:26:05 EDT 2015


On Thu, 11 Jun 2015 10:09 am, Chris Angelico wrote:

> On Thu, Jun 11, 2015 at 3:11 AM, Steven D'Aprano <steve at pearwood.info>
> wrote:
>> (Oh, and for the record, there are at least two non-breaking spaces in
>> Unicode, U+00A0 "NO-BREAK SPACE" and U+202F "NARROW NO-BREAK SPACE".)
>>
>> http://www.unicode.org/charts/PDF/U0080.pdf
>> http://www.unicode.org/charts/PDF/U2000.pdf
> 
> And U+FEFF "ZERO WIDTH NO-BREAK SPACE", 

No. Despite the name, that is not a space character; it is a formatting
character. Due to Unicode's stability policy, the name is stuck forever,
but it should not be treated as a space character:

py> import unicodedata
py> unicodedata.category(' ')
'Zs'
py> unicodedata.category('\u00A0')  # NBSP
'Zs'
py> unicodedata.category('\uFEFF')  # ZWNBSP
'Cf'
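
To make the same check across all the code points mentioned in this thread, here is a rough sketch. The table name CANDIDATES and the helper is_space_separator() are made up for illustration, and "counts as a space" is assumed to mean general category Zs. str.isspace() is printed alongside for comparison.

```python
import unicodedata

# Code points mentioned in this thread; the labels are for display only.
# CANDIDATES and is_space_separator() are hypothetical names for this sketch.
CANDIDATES = {
    "SPACE": "\u0020",
    "NO-BREAK SPACE": "\u00A0",
    "NARROW NO-BREAK SPACE": "\u202F",
    "ZERO WIDTH NO-BREAK SPACE": "\uFEFF",
    "WORD JOINER": "\u2060",
}


def is_space_separator(ch):
    """Return True if ch has Unicode general category Zs (space separator)."""
    return unicodedata.category(ch) == "Zs"


for label, ch in CANDIDATES.items():
    print("U+{:04X} {}: category={} Zs={} isspace={}".format(
        ord(ch), label, unicodedata.category(ch),
        is_space_separator(ch), ch.isspace()))
```

Both no-break spaces come out as Zs, while ZWNBSP and WORD JOINER are both Cf, so neither is a space by this test.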


Ideally, outside of the BOM, you should never come across a ZWNBSP. You
should use U+2060 WORD JOINER instead. But if you do come across one
outside of the BOM, it should be treated as a legitimate non-space
character:

http://www.unicode.org/faq/utf_bom.html#bom6
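
In Python terms, a minimal sketch of that distinction might look like the following. It assumes the input is UTF-8 bytes; decode_keep_interior() and interior_zwnbsp_positions() are hypothetical helper names, not anything from the standard library. The "utf-8-sig" codec strips a single leading BOM and leaves any later U+FEFF in the text as an ordinary character.

```python
ZWNBSP = "\ufeff"  # also serves as the byte-order mark at the start of a stream


def decode_keep_interior(data):
    """Decode UTF-8 bytes, dropping only a leading BOM.

    The 'utf-8-sig' codec removes one BOM at the very start, if present;
    any U+FEFF further in is kept as a legitimate character.
    """
    return data.decode("utf-8-sig")


def interior_zwnbsp_positions(text):
    """Return the indexes of any U+FEFF remaining in already-decoded text."""
    return [i for i, ch in enumerate(text) if ch == ZWNBSP]


raw = "\ufeffab\ufeffcd".encode("utf-8")
text = decode_keep_interior(raw)
print(ascii(text))
print(interior_zwnbsp_positions(text))
```

Whether to then swap an interior U+FEFF for U+2060 is a policy decision for the application; the sketch only locates them.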

Although ZWNBSP is a "default ignorable" code point, I believe that the font
is well within its rights to show it with a visible glyph:

    "Fonts can contain glyphs intended for visible display of 
    default ignorable code points that would otherwise be 
    rendered invisibly when not supported."

http://www.unicode.org/faq/unsup_char.html


> notable because it's also used as 
> the byte-order mark (as its counterpart, U+FFFE, is unallocated). I've 
> been fighting with VLC Media Player over the font it uses for subtitles;
> for some bizarre reason, that font represents U+FEFF not with zero pixels
> of emptiness, but with a box containing the letters "ZWN" "BSP" on two
> lines. Yeah, because that totally takes up zero width and looks like blank
> space.

Why do the subtitles contain ZWNBSP in the first place? Surely they're not
English subtitles?


-- 
Steven



