Confusing textwrap parameters, and request for RE help

Chris Angelico rosuav at gmail.com
Wed Mar 25 16:11:46 EDT 2020


On Thu, Mar 26, 2020 at 6:34 AM Peter J. Holzer <hjp-python at hjp.at> wrote:
>
> On 2020-03-23 06:00:41 +1100, Chris Angelico wrote:
> > Second point, and related to the above. The regex that defines break
> > points, as found in the source code, is:
> >
> > wordsep_re = re.compile(r'''
> >         ( # any whitespace
> >           %(ws)s+
> >         | # em-dash between words
> >           (?<=%(wp)s) -{2,} (?=\w)
> >         | # word, possibly hyphenated
> >           %(nws)s+? (?:
> >             # hyphenated word
> >               -(?: (?<=%(lt)s{2}-) | (?<=%(lt)s-%(lt)s-))
> >               (?= %(lt)s -? %(lt)s)
> >             | # end of word
> >               (?=%(ws)s|\Z)
> >             | # em-dash
> >               (?<=%(wp)s) (?=-{2,}\w)
> >             )
> >         )''' % {'wp': word_punct, 'lt': letter,
> >                 'ws': whitespace, 'nws': nowhitespace},
> >         re.VERBOSE)
> >
> > It's built primarily out of small matches with long assertions, eg
> > "match a hyphen, as long as it's preceded by two letters or a letter
> > and a hyphen".
>
> Do you need that fancy logic? Could you only break on white-space
> instead? It won't wrap "tetrabromo-phenolsulfonephthalein" in that case
> but since you mentioned it's for a Twitter client, most users probably
> won't mind (and those who do mind will probably insist that the
> algorithm should be able to split it into tetrabromo-phenolsulfone-
> phthalein, if that's where the line end is, as it was here purely by
> lucky accident). A regexp for whitespace is pretty simple.
>

If I *just* want to break on whitespace, I can do that (set both flags,
break_long_words and break_on_hyphens, to False, and off it goes). And
in fact, that's what I've done so far, and it's working reasonably
well. But I was hoping to be more flexible than that.
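
For anyone following along, here is a minimal sketch of that
whitespace-only setup next to the default behaviour. It assumes the two
flags in question are textwrap's break_long_words and break_on_hyphens;
the sample text and the width of 30 are made up purely for illustration
and are not taken from the actual client.

```python
import textwrap

# Hypothetical sample text and width, chosen only for illustration.
SAMPLE = "tetrabromo-phenolsulfonephthalein is a long word"
WIDTH = 30

# Whitespace-only wrapping: with both flags off, a word is never split,
# even when it is longer than the requested width.
whitespace_only = textwrap.TextWrapper(
    width=WIDTH,
    break_long_words=False,
    break_on_hyphens=False,
)

# Default settings for comparison: hyphens inside words are break points.
hyphen_aware = textwrap.TextWrapper(width=WIDTH)

simple_lines = whitespace_only.wrap(SAMPLE)
hyphen_lines = hyphen_aware.wrap(SAMPLE)

if __name__ == "__main__":
    for label, lines in (("whitespace only", simple_lines),
                         ("default", hyphen_lines)):
        print(label)
        for line in lines:
            print("   ", repr(line))
```

With both flags off, the overlong word simply sits on a line of its own
and overflows the width, which is the trade-off Peter describes above.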

ChrisA

