Report on non-breaking spaces in posts

Tue Oct 31 13:55:09 EDT 2017

On 31/10/17 17:23, Stefan Ram wrote:
> Ned Batchelder <ned at nedbatchelder.com> writes:
>> Â Â Â  def wrapped_join(values, sep):
> 
>    Ok, here's a report on me seeing non-breaking spaces in
>    posts in this NG. I have written this report so that you
>    can see that it's not my newsreader that is converting
>    something, because there is no newsreader involved.
> 
>    Here are some relevant lines from Ned's above post:
> 
> |From: Ned Batchelder <ned at nedbatchelder.com>
> |Newsgroups: comp.lang.python
> |Subject: Re: How to join elements at the beginning and end of the list
> |Message-ID: <mailman.95.1509464977.1490.python-list at python.org>

Hm.  That suggests the mail-to-news gateway has a hand in things.

> |Content-Type: text/plain; charset=utf-8; format=flowed
> |Content-Transfer-Encoding: 8bit
> | Â Â Â  def wrapped_join(values, sep):

[snippety snip]

> |od -c tmp.txt
> |...
> |0012620   s   u   l   a   t   e       i   t   :  \n  \n       Â       Â
> |0012640       Â           d   e   f       w   r   a   p   p   e   d   _
> |...
> |
> |od -x tmp.txt
> |...
> |0012620 7573 616c 6574 6920 3a74 0a0a c220 c2a0
> |0012640 c2a0 20a0 6564 2066 7277 7061 6570 5f64
> |...
> 
>    And you can see, there are two octet pairs »c220« and
>    »c2a0« in the post (directly preceding »def wrapped«).
>    (Compare with the Content-Type and Content-Transfer-Encoding
>    given above.) (Read table with a monospaced font:)
> 
>                          corresponding
> Codepoint      UTF-8    ISO-8859-1      interpretation
> 
> U+0020?        c2 20    20?             SPACE?
> U+00A0         c2 a0    a0              NON-BREAKING SPACE
> 
>    This makes it clear that there really are codepoints
>    U+00A0 in what I get from the server, i.e., non-breaking
>    spaces directly in front of »def wrapped«.

And?  Why does that bother you?  A non-breaking space is a perfectly 
valid thing to put into a UTF-8 encoded message.  The 0xc2 0x20 byte 
pair that you misidentify as a space is another matter entirely.

0xc2 0x20 is not a space in UTF-8.  It is an invalid code sequence.  I 
don't know how or where it was generated, but it really shouldn't have 
been.  It might have been Ned's MUA, or some obscure bug in the 
mail-to-news gateway.  Does anyone in a position to know have any opinions?
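
For anyone who wants to check the bytes themselves, here is a minimal
sketch in Python. The helper names (is_valid_utf8, od_x_words_to_bytes)
are made up for illustration. It first confirms that c2 a0 decodes as
strict UTF-8 while c2 20 does not. It then rebuilds the byte stream from
the quoted od -x words, assuming the dump was made on a little-endian
machine, where od -x prints each 16-bit word with its two bytes swapped.
The od -c lines, which read "s u l a t e" against the words 7573 616c
6574, suggest that assumption holds here, but it is an assumption.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def od_x_words_to_bytes(words: str, byteorder: str = "little") -> bytes:
    """Rebuild the byte stream from the 16-bit hex words printed by `od -x`.

    byteorder is the byte order of the machine that ran od (assumed
    little-endian by default; pass "big" to read the words as printed).
    """
    return b"".join(int(w, 16).to_bytes(2, byteorder) for w in words.split())


# The two byte pairs from the table in the quoted post.
print(is_valid_utf8(b"\xc2\xa0"))  # True: U+00A0 NO-BREAK SPACE
print(is_valid_utf8(b"\xc2\x20"))  # False: 0x20 is not a continuation byte

# Hex words copied from the two quoted `od -x` lines (offsets omitted).
dump = (
    "7573 616c 6574 6920 3a74 0a0a c220 c2a0 "
    "c2a0 20a0 6564 2066 7277 7061 6570 5f64"
)
raw = od_x_words_to_bytes(dump)
print(raw)
print(is_valid_utf8(raw))
```

If the little-endian assumption is right, the word c220 stands for the
bytes 20 c2 rather than c2 20, and the stream reads as a space, three
c2 a0 pairs, another space, and then "def wrapped_", which is valid
UTF-8 throughout. Read with byteorder="big" instead, the same words give
"us" rather than "su" at the start, which does not match the od -c
output.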

-- 
Rhodri James *-* Kynesim Ltd