Confusing textwrap parameters, and request for RE help

Chris Angelico rosuav at gmail.com
Sun Mar 22 15:00:41 EDT 2020


When using textwrap.fill() or friends, setting break_long_words=False
without also setting break_on_hyphens=False has the very strange
behaviour that a long hyphenated word will still be wrapped. I
discovered this as a very surprising result when trying to wrap a
paragraph that contained a URL, and wanting the URL to be kept
unchanged:

import shutil, re, textwrap
w = textwrap.TextWrapper(width=shutil.get_terminal_size().columns,
    initial_indent=">>>> ", subsequent_indent="     ",
    break_long_words=False)
text = ("Lorem ipsum dolor sit amet, consectetur adipiscing elit. "
    "Praesent facilisis risus sed quam ultricies tempus. Vestibulum "
    "quis tincidunt turpis, id euismod turpis. Etiam pellentesque sem "
    "a magna laoreet sagittis. Proin id convallis odio, vel ornare "
    "ipsum. Vivamus tincidunt sodales orci. Proin egestas sollicitudin "
    "viverra. Etiam egestas, erat ac elementum tincidunt, risus nisl "
    "fermentum ex, ut iaculis lorem quam vel erat. Pellentesque "
    "condimentum ipsum varius ligula volutpat, sit amet pulvinar ipsum "
    "condimentum. Nulla ut orci et mi sollicitudin vehicula. In "
    "consectetur aliquet tortor, sed commodo orci porta malesuada. "
    "Nulla urna sapien, sollicitudin et nunc et, ultrices euismod "
    "lorem. Curabitur tempor est mauris, a ultricies urna mattis id. "
    "Nam efficitur turpis a sem tristique sagittis. "
    "https://example.com/long-url-goes-here/with-multiple-parts/wider-than-your-terminal/should-not-be-wrapped "
    "more words go here")
print(w.fill(text))

Unless you also pass break_on_hyphens=False, this will appear to ignore
the break_long_words setting. Logically, it would seem that
break_long_words=False should take precedence over break_on_hyphens -
if you're not allowed to break a long word, why break it at a hyphen?
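
A smaller reproduction of the behaviour described above may help. This is
only a sketch: the fixed width of 40 and the shortened text are assumptions
chosen so the result does not depend on the terminal size.

```python
import textwrap

url = ("https://example.com/long-url-goes-here/with-multiple-parts/"
       "should-not-be-wrapped")
text = "Lorem ipsum dolor sit amet. " + url + " more words go here"

# Fixed width instead of the terminal size, so the result is reproducible.
hyphens_on = textwrap.TextWrapper(width=40, break_long_words=False)
hyphens_off = textwrap.TextWrapper(width=40, break_long_words=False,
                                   break_on_hyphens=False)

broken = hyphens_on.wrap(text)    # URL gets split at some of its hyphens
intact = hyphens_off.wrap(text)   # URL stays whole on one overlong line

print("\n".join(broken))
print("-" * 40)
print("\n".join(intact))
```

With only break_long_words=False, no output line holds the whole URL; adding
break_on_hyphens=False keeps it on a line of its own.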

Second point, and related to the above. The regex that defines break
points, as found in the source code, is:

wordsep_re = re.compile(r'''
        ( # any whitespace
          %(ws)s+
        | # em-dash between words
          (?<=%(wp)s) -{2,} (?=\w)
        | # word, possibly hyphenated
          %(nws)s+? (?:
            # hyphenated word
              -(?: (?<=%(lt)s{2}-) | (?<=%(lt)s-%(lt)s-))
              (?= %(lt)s -? %(lt)s)
            | # end of word
              (?=%(ws)s|\Z)
            | # em-dash
              (?<=%(wp)s) (?=-{2,}\w)
            )
        )''' % {'wp': word_punct, 'lt': letter,
                'ws': whitespace, 'nws': nowhitespace},
        re.VERBOSE)

It's built primarily out of small matches with long assertions, e.g.
"match a hyphen, as long as it's preceded by two letters, or by a
letter, a hyphen and another letter". What I want to do is create a
*negative* assertion: specifically, to disallow any breaking between
"\b[a-z]+://" and "\b", which will mean that a URL will never be broken
("https://.........." until the next whitespace boundary). Lookbehind
assertions in Python's re module have to be fixed-width, though, so as
described, this isn't possible. Regexperts, any ideas? How can I modify
this to never break inside a URL?
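
One possible way around the lookbehind restriction, offered as a sketch
rather than a definitive answer: instead of forbidding breaks inside a URL,
put a "whole URL" alternative in front of the existing pattern, so the URL is
consumed as a single chunk before the hyphen rule ever sees it. The URL
pattern used here (a lowercase scheme, "://", then everything up to the next
whitespace) and the name url_safe_wordsep_re are assumptions for
illustration. The trick relies on TextWrapper discarding empty and None
chunks after splitting, which is an implementation detail.

```python
import re
import textwrap

# The obstacle: a variable-width lookbehind is rejected at compile time.
lookbehind_error = None
try:
    re.compile(r'(?<=\b[a-z]+://\S*)-')
except re.error as exc:
    lookbehind_error = str(exc)

# Assumption: a URL is "scheme://" followed by non-whitespace characters.
# Alternatives are tried left to right, so the URL group wins at any
# position where a URL starts.
url_safe_wordsep_re = re.compile(
    r'( [a-z][a-z0-9+.-]*://\S+ ) | '
    + textwrap.TextWrapper.wordsep_re.pattern,
    re.VERBOSE)

w = textwrap.TextWrapper(width=40, break_long_words=False)
w.wordsep_re = url_safe_wordsep_re  # instance attribute shadows the class one

url = ("https://example.com/long-url-goes-here/with-multiple-parts/"
       "should-not-be-wrapped")
text = ("A well-known hyphenated-word example, then " + url
        + " and more words go here")
lines = w.wrap(text)
print("\n".join(lines))
```

Ordinary hyphenated words can still break at their hyphens. A URL directly
preceded by punctuation, such as "(https://...", is not protected, because
the match then starts at the punctuation rather than at the scheme.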

Ultimately it would be great to have a break_urls parameter
(defaulting to True, and if False, will change the RE as described),
but in the meantime I can just patch in a different RE, which would
work for current Pythons.
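
For what it's worth, such a parameter could be prototyped in a subclass. The
sketch below is hypothetical throughout: break_urls, URLAwareWrapper and
URL_RE are invented names, not part of the stdlib. It overrides the private
_split method, which is an implementation detail of CPython's textwrap, and
pulls URLs out before handing the rest of the text to the normal splitter.

```python
import re
import textwrap

# Assumption: a URL is "scheme://" followed by non-whitespace characters.
URL_RE = re.compile(r'([a-z][a-z0-9+.-]*://\S+)')


class URLAwareWrapper(textwrap.TextWrapper):
    """TextWrapper with a hypothetical break_urls option."""

    def __init__(self, *args, break_urls=True, **kwargs):
        super().__init__(*args, **kwargs)
        self.break_urls = break_urls

    def _split(self, text):
        if self.break_urls:
            return super()._split(text)
        chunks = []
        # re.split with one capturing group puts the URLs at odd indices.
        for index, piece in enumerate(URL_RE.split(text)):
            if index % 2:
                chunks.append(piece)                  # URL: one chunk
            else:
                chunks.extend(super()._split(piece))  # normal rules
        return chunks


url = ("https://example.com/long-url-goes-here/with-multiple-parts/"
       "should-not-be-wrapped")
text = "Some well-known words, then " + url + " and more words go here"

default_lines = URLAwareWrapper(width=40, break_long_words=False).wrap(text)
safe_lines = URLAwareWrapper(width=40, break_long_words=False,
                             break_urls=False).wrap(text)
print("\n".join(safe_lines))
```

break_long_words=False is still needed here, since an overlong URL chunk
would otherwise be cut by the long-word handling.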

I'd prefer to still be able to break words on hyphens while keeping
URLs intact, but if it's not possible, I'll just deny all hyphen
breaking. Seems unnecessary though.

ChrisA

