Python NBSP DWIM

Steven D'Aprano steve at pearwood.info
Wed Jun 10 13:11:47 EDT 2015


On Thu, 11 Jun 2015 12:28 am, Skip Montanaro wrote:

> On Wed, Jun 10, 2015 at 8:28 AM, Tim Chase
> <python.list at tim.thechases.com> wrote:
>> Is this a bug?
> 
> Looks like it's been reported a few times with slightly different context:
> 
> https://bugs.python.org/issue6537
> https://bugs.python.org/issue16623
> https://bugs.python.org/issue20491
> https://bugs.python.org/issue1390608
> 
> The couple times it's come up in the context of str.split, it's been
> rejected, since the purpose of that method is to split words.

That reasoning is ... strange. The whole point of the NBSP is specifically
*not* to split on it. If you wanted it to split, you would use a regular
space.

(Oh, and for the record, there are at least two non-breaking spaces in
Unicode, U+00A0 "NO-BREAK SPACE" and U+202F "NARROW NO-BREAK SPACE".)

http://www.unicode.org/charts/PDF/U0080.pdf
http://www.unicode.org/charts/PDF/U2000.pdf
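
To see why str.split() breaks on these characters at all, here is a small check (an illustrative sketch, not taken from the bug reports). Both characters report True from isspace(), and a bare split() cuts the string at the NBSP:

```python
import unicodedata

NBSP = u'\u00a0'
NNBSP = u'\u202f'

# Print each code point, its Unicode name, and whether Python
# classifies it as whitespace.
for ch in (NBSP, NNBSP):
    print(u'U+%04X %s isspace=%s' % (ord(ch), unicodedata.name(ch), ch.isspace()))

# A bare split() treats the NBSP like any other whitespace.
parts = (u'spam' + NBSP + u'eggs').split()
print(parts)
```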


Non-breaking spaces should be used when you want to prevent
word-wrapping, and also for "open form" compound words:

http://grammar.ccc.commnet.edu/grammar/compounds.htm

textwrap should also treat NBSPs as non-spaces for the purposes of wrapping.
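
Until textwrap does that itself, one possible stop-gap is to hide each NBSP behind a placeholder that textwrap will not treat as whitespace, wrap the text, and then swap the NBSPs back in. This is only a rough sketch: the helper name wrap_keep_nbsp and the choice of the private-use code point U+E000 as the placeholder are assumptions made for this example, and the text must not already contain that character.

```python
import textwrap

NBSP = u'\u00a0'
PLACEHOLDER = u'\ue000'  # assumption: never occurs in the input text

def wrap_keep_nbsp(text, width=70):
    """Wrap text without ever breaking a line at an NBSP."""
    # Hide the NBSPs so textwrap sees each NBSP-joined phrase as one word.
    hidden = text.replace(NBSP, PLACEHOLDER)
    lines = textwrap.wrap(hidden, width=width)
    # Restore the NBSPs in the wrapped lines.
    return [line.replace(PLACEHOLDER, NBSP) for line in lines]

lines = wrap_keep_nbsp(u'aaa bbb\u00a0ccc ddd', width=7)
print(lines)
```

The placeholder is a single character, just like the NBSP it stands in for, so line widths are unaffected.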

As a work-around, I think this should work:

- split the string on NBSPs;

- for each substring returned, split normally;

- merge the sub-substrings on either side of each NBSP.


def split(s):
    """Split on whitespace, except NBSP.

    >>> split(u'hello world spam\\u00A0eggs cheese')
    [u'hello', u'world', u'spam\\xa0eggs', u'cheese']

    """
    words = []
    NBSP = u'\u00A0'
    substrings = s.split(NBSP)
    for i, sub in enumerate(substrings):
        parts = sub.split()
        if i > 0:
            # Every substring after the first was preceded by an NBSP, so
            # glue its first word (if any) onto the previous word (if any).
            head = parts.pop(0) if parts else u''
            if words:
                words[-1] += NBSP + head
            else:
                words.append(NBSP + head)
        words.extend(parts)
    return words
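
A shorter alternative, offered as a sketch rather than a drop-in replacement: let the re module split on runs of whitespace that exclude the two no-break spaces mentioned earlier. The name split_keep_nbsp is made up for this example, and it assumes Unicode strings. Unlike split() above, an NBSP that sits next to an ordinary space stays attached to its own word instead of gluing two words together.

```python
import re

# Whitespace minus U+00A0 and U+202F, written as
# "not (non-whitespace or one of the no-break spaces)".
# re.UNICODE makes \S Unicode-aware on Python 2; it is the default on Python 3.
_SEP = re.compile(u'[^\\S\u00a0\u202f]+', re.UNICODE)

def split_keep_nbsp(s):
    """Split on whitespace, but never on a no-break space."""
    # re.split() yields empty strings for leading or trailing separators,
    # so filter those out.
    return [word for word in _SEP.split(s) if word]

words = split_keep_nbsp(u'hello world spam\u00a0eggs cheese')
print(words)
```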
        

-- 
Steven



