[Python-ideas] New explicit methods to trim strings

Mon Apr 1 21:34:21 EDT 2019

On Mon, Apr 1, 2019, 8:54 PM Steven D'Aprano <steve at pearwood.info> wrote:

> The point I am making is not that we must not ever support multiple
> affixes, but that we shouldn't rush that decision. Let's pick the
> low-hanging fruit, and get some real-world experience with the function
> before deciding how to handle the multiple affix case.
>

There are exactly two string methods that currently deal specifically with
affixes: startswith and endswith. Both of those allow specifying
multiple affixes. That's pretty strong real-world experience, and breaking
the symmetry for no reason is merely confusing, especially since the
multiple-affix form would obviously be just as commonly useful here.
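
For reference, a quick sketch of the tuple form that both existing methods
accept. The sample string is made up for illustration.

```python
# Both str.startswith and str.endswith accept a tuple of candidate affixes
# and return True if any one of them matches.
url = "https://example.org/index.html"  # illustrative value only

has_scheme = url.startswith(("http://", "https://"))
is_page = url.endswith((".html", ".htm"))

print(has_scheme, is_page)
```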

Now look, the sky won't fall if a single-affix-only method is added. For
that matter, it won't if nothing is added. In fact, the single affix
version makes it a little bit easier to write a custom function handling
multiple affixes.

And the sky won't fall if the remove-just-one semantics are used rather
than remove-from-class.
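
To make the two semantics concrete, here is a minimal sketch built on top of
a single-affix helper. All three function names (cut_prefix, cut_one,
cut_all) are hypothetical, chosen for this example, and are not the spelling
proposed in this thread.

```python
def cut_prefix(s, prefix):
    """Single-affix helper (hypothetical name): drop `prefix` once if present."""
    if prefix and s.startswith(prefix):
        return s[len(prefix):]
    return s


def cut_one(s, prefixes):
    """Remove-just-one: drop the first listed prefix that matches, then stop."""
    for p in prefixes:
        if p and s.startswith(p):
            return cut_prefix(s, p)
    return s


def cut_all(s, prefixes):
    """Remove-from-class: keep dropping listed prefixes until none matches."""
    prefixes = [p for p in prefixes if p]  # skip empty strings so the loop ends
    changed = True
    while changed:
        changed = False
        for p in prefixes:
            if s.startswith(p):
                s = s[len(p):]
                changed = True
                break
    return s


print(cut_one("unundo", ("un", "re")))
print(cut_all("unundo", ("un", "re")))
```

The only difference is whether the scan restarts after a successful removal,
which is roughly the distinction between the two behaviours being debated.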

But adding methods with sneakily helpful capabilities often helps users
greatly. A lot of folks in this thread didn't even know about passing a
tuple to str.startswith() a few days ago. I'm pretty sure that capability
was added by Raymond, who has an amazingly good sense of what little tricks
can prove really powerful. Apologies to a different developer if it wasn't
him, but congrats and thanks to you if so.

> Somebody (I won't name names, but they know who they are) wrote to me
> off-list some time ago and accused me of being arrogant and thinking I know
> more than everyone else. Well perhaps I am, but I'm not so arrogant as to
> think that I can choose the right behaviour for clashing affixes for other
> people when my own use-cases don't have clashing affixes.
>

That could be me... Unless it's someone else :-). I think my intent was a
bit different than you characterize, but I'm very guilty of presuming too
much also. So mea culpa.

> > Sure, but I've often wanted to do something like "strip off a prefix
> > of http:// or https://", or something else that doesn't have a
> > semantic that's known to the stdlib.
>
> I presume there's a reason you aren't using urllib.parse and you just need
> a string without the leading scheme. If you're doing further parsing, the
> stdlib has the right batteries for that.
>

I know there are lots of specialized string manipulations in the stdlib.
Yeah, I could use os.path.splitext, and os.path.split, and
urllib.parse.something, and lots of other things I rarely use. A lot of us
like to manipulate strings in generically stringy ways.
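
As a rough illustration of the two styles for the http:// or https:// case,
here is a sketch that puts the generic string approach next to
urllib.parse.urlsplit. The URL is an invented example.

```python
from urllib.parse import urlsplit

url = "https://example.org/docs"  # illustrative value only

# Generic string handling: find the matching scheme prefix and slice it off.
rest = url
for scheme in ("http://", "https://"):
    if url.startswith(scheme):
        rest = url[len(scheme):]
        break

# The stdlib parser, for comparison: it splits the URL into named components.
parts = urlsplit(url)

print(rest)
print(parts.netloc + parts.path)
```

For a simple URL like this one the two agree. The parser earns its keep when
the query, fragment, or port need handling too.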

> But not until we had a couple of releases of experience with them:
>
> https://docs.python.org/2.7/library/stdtypes.html#str.endswith

Ok. Fair point. I used Python 2.4 without the multiple affix option.

> Here's a partial list of English prefixes that somebody doing text
> processing might want to remove to get at the root word:
>
>     a an ante anti auto circum co com con contra contro de dis
>     en ex extra hyper il im in ir inter intra intro macro micro
>     mono non omni post pre pro sub sym syn tele un uni up
>
> I count fourteen clashes:
>
>     a: an ante anti
>     an: ante anti
>     co: com con contra contro
>     ex: extra
>     in: inter intra intro
>     un: uni
>

This seems like a good argument for remove-all-from-class. :-)

    stem = word.lstrip(prefix_tup)

But then we really need 'word.porter_stemmer()' as a built-in method.
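
Note that today's str.lstrip takes a string of characters, not a tuple of
prefixes, so the line above is a hypothetical spelling. A minimal sketch of
what remove-all-from-class could look like with the quoted prefix list
follows. The helper name strip_prefix_class and the longest-match-first rule
are assumptions made for this example, not something settled in the thread.

```python
PREFIXES = tuple("""
    a an ante anti auto circum co com con contra contro de dis
    en ex extra hyper il im in ir inter intra intro macro micro
    mono non omni post pre pro sub sym syn tele un uni up
""".split())


def strip_prefix_class(word, prefixes=PREFIXES):
    """Hypothetical helper: repeatedly drop the longest matching prefix."""
    # Trying longer prefixes first is one way to resolve clashes such as
    # "ex" versus "extra"; it is an assumption, not the only possible rule.
    by_length = sorted(prefixes, key=len, reverse=True)
    while True:
        for p in by_length:
            if p and word.startswith(p):
                word = word[len(p):]
                break
        else:
            # No prefix matched on this pass, so we are done.
            return word


print(strip_prefix_class("antibody"))
print(strip_prefix_class("extraordinary"))
print(strip_prefix_class("uninteresting"))
```

The last call shows how crude this is: "uni" wins over "un" and leaves
"nteresting", which is not a root word at all.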