Python NBSP DWIM

random832 at fastmail.us
Wed Jun 10 21:02:16 EDT 2015


On Wed, Jun 10, 2015, at 20:09, Chris Angelico wrote:
> And U+FEFF "ZERO WIDTH NO-BREAK SPACE", notable because it's also used as
> the byte-order mark (as its counterpart, U+FFFE, is unallocated). I've been
> fighting with VLC Media Player over the font it uses for subtitles; for
> some bizarre reason, that font represents U+FEFF not with zero pixels of
> emptiness, but with a box containing the letters "ZWN" "BSP" on two lines.
> Yeah, because that totally takes up zero width and looks like blank space.

As I understand it, the proper behavior is that a ZWNBSP serving as the
byte order mark should never appear in the in-memory representation of a
BOM-encoded file: not at the start of its first line, and not at the
seam where two BOM-encoded files have been concatenated. It should
"vanish" when the file is opened and first read from. So it shouldn't be
showing up in your subtitles, regardless of how the font renders it.
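
For what it's worth, here is a rough sketch of that "vanishing" in
Python, assuming the files are UTF-8 with a BOM. The helper names
(decode_without_bom, concat_decoded) are made up for illustration; the
only real machinery is the standard "utf-8-sig" codec, which drops one
leading BOM if present.

```python
import codecs


def decode_without_bom(raw: bytes) -> str:
    # "utf-8-sig" strips a single leading BOM and leaves the rest alone.
    return raw.decode("utf-8-sig")


def concat_decoded(*parts: bytes) -> str:
    # Decode each file separately so every BOM is at the start of its
    # own input and gets dropped, then join the resulting text.
    return "".join(decode_without_bom(p) for p in parts)


# Two hypothetical BOM-encoded "files".
first = codecs.BOM_UTF8 + "line one\n".encode("utf-8")
second = codecs.BOM_UTF8 + "line two\n".encode("utf-8")

# Plain "utf-8" keeps both BOMs as U+FEFF characters in the text.
naive = (first + second).decode("utf-8")

# "utf-8-sig" on the byte-level concatenation only removes the first one;
# the second survives in the middle of the text.
half_fixed = (first + second).decode("utf-8-sig")

# Decoding each part on its own removes both.
clean = concat_decoded(first, second)
```

Concatenating at the byte level and decoding once is the case that
leaves a stray U+FEFF in the middle, which is presumably how these end
up in places like subtitle files.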

The real world, needless to say, isn't so nice.

IIRC there's also a font in MS Windows that represents ZWJ, ZWNJ, RLM,
and LRM with glyphs that are zero-width but not blank. That's good for
seeing what is happening, but bad for actually rendering text that's
intended to contain these characters. There's also an argument that,
ideally, a rendering engine should not render any such glyph unless
something like "visible controls" has been selected. The real world,
again, isn't so nice, which is why most symbols intended for
visible-control-style rendering have their own distinct code points
rather than reusing those of the control characters they represent.
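
As a small illustration of the "distinct code points" point, here is a
sketch of a visible-controls mode done in Python rather than in a font.
It maps the C0 controls and DEL onto the Control Pictures block
(U+2400..U+241F and U+2421). The function name show_controls and the
<U+XXXX> notation used for format characters such as ZWJ or the BOM are
my own choices for the example, not any standard.

```python
import unicodedata


def show_controls(text: str) -> str:
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x20:
            # C0 controls have dedicated pictures at U+2400 + code point.
            out.append(chr(0x2400 + cp))
        elif cp == 0x7F:
            # DEL has its own picture, SYMBOL FOR DELETE.
            out.append("\u2421")
        elif unicodedata.category(ch) == "Cf":
            # Format characters (ZWJ, ZWNJ, LRM, RLM, U+FEFF, ...) are
            # shown with an ad hoc escape so they are visible.
            out.append("<U+%04X>" % cp)
        else:
            out.append(ch)
    return "".join(out)


tabbed = show_controls("a\tb")
joined = show_controls("x\u200dy")
bommed = show_controls("\ufeffhi")
```

The original string keeps its real control characters; only the copy
made for display uses the picture code points, so nothing that is meant
to be invisible gets a glyph unless you ask for it.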


