[Python-ideas] New explicit methods to trim strings

Steven D'Aprano steve at pearwood.info
Mon Apr 1 20:52:52 EDT 2019


On Mon, Apr 01, 2019 at 02:29:44PM +1100, Chris Angelico wrote:

> The multiple affix case has exactly two forms:
> 
> 1) Tearing multiple affixes off (eg stripping "asdf.jpg.png" down to
> just "asdf"), which most people are saying "no, don't do that, it
> doesn't make sense and isn't needed"

Perhaps I've missed something obvious (it's been a long thread, and I'm 
badly distracted by hardware issues that are causing me considerable 
grief), but I haven't seen anyone say "don't do that".

But I have seen David Mertz say that this was the best behaviour:

[quote]

    fname = 'silly.jpg.png.gif.png.jpg.gif.jpg'

    I'm honestly not sure what behavior would be useful most often for 
    this oddball case.  For the suffixes, I think "remove them all" is 
    probably the best

[end quote]

I'd also like to point out that this is not an oddball case. There are 
two popular platforms where file extensions are advisory not mandatory 
(Linux and Mac), but even on Windows it is possible to get files with 
multiple, meaningful, extensions (foo.tar.gz for example) as well as 
periods used in place of spaces (a.funny.cat.video.mp4).


> 2) Removing one of several options, which implies that one option is a
> strict subpiece of another (eg stripping off "test" and "st")

I take it you're only referring to the problematic cases, because 
there's the third option, where none of the affixes to be removed clash:

    spam.cut_suffix(("ed", "ing"))

But that's pretty uninteresting, and a simple loop or repeated calls to 
the method will work fine:

    spam.cut_suffix("ed").cut_suffix("ing")

just as we do with replace:

    spam.replace(",", "").replace(" ", "")


If you only have a few affixes to work with, this is fine. If you have a 
lot, you may want a helper function, but that's okay.
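For instance, such a helper might look like this (a sketch only; the name 
cut_suffixes and the "strip them all, first match wins" semantics are my 
own choices, which is exactly the point: the user picks the behaviour 
they want):

```python
def cut_suffixes(s, suffixes):
    # Repeatedly strip any of the given suffixes from s, at most one
    # per pass, until none of them match.  The order of `suffixes`
    # decides which wins when two of them clash.
    changed = True
    while changed:
        changed = False
        for suf in suffixes:
            if suf and s.endswith(suf):
                s = s[:-len(suf)]
                changed = True
                break
    return s
```

With "remove them all" semantics this turns David's oddball filename back 
into its stem: cut_suffixes('silly.jpg.png.gif.png.jpg.gif.jpg', 
('.jpg', '.png', '.gif')) gives 'silly'.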

 
> If anyone is advocating for #1, I would agree with saying YAGNI.

David Mertz did.


> But #2 is an extremely unlikely edge case, and whatever semantics are
> chosen for it, *normal* usage will not be affected.

Not just unlikely, but "extremely" unlikely?

Presumably you didn't just pluck that statement out of thin air, but 
have based it on an objective and statistically representative review of 
existing code and projections of future uses of these new methods. How 
could I possibly argue with that?

Except to say that I think it is recklessly irresponsible for people 
engaged in language design to dismiss edge cases which will cause users 
real bugs and real pain so easily. We're not designing for our personal 
toolbox, we're designing for hundreds of thousands of other people with 
widely varying needs.

It might be rare for you, but for somebody it will be happening ten 
times a day. And for somebody else, it will only happen once a year, but 
when it does, their code won't raise an exception; it will just silently 
do the wrong thing.

This is why replace does not take a set of multiple targets to replace. 
The user, who knows their own use-case and what behaviour they want, can 
write their own multiple-replace function, and we don't have to guess 
what they want.
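Such a user-written multiple-replace is only a few lines, and crucially 
its semantics are whatever the user decides (here, a sketch that applies 
each replacement left to right, one target at a time; the name 
replace_many is mine):

```python
def replace_many(s, targets, replacement=''):
    # Apply str.replace once per target, in the order given.
    # The user, not the language, chooses the order and hence
    # the behaviour when targets overlap.
    for t in targets:
        s = s.replace(t, replacement)
    return s
```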

The point I am making is not that we must not ever support multiple 
affixes, but that we shouldn't rush that decision. Let's pick the 
low-hanging fruit, and get some real-world experience with the function 
before deciding how to handle the multiple affix case.


[...]
> Or all the behaviours actually do the same thing anyway.

In this thread, I keep hearing this message:

"My own personal use-case will never be affected by clashing affixes, so 
I don't care what behaviour we build into the language, so long as we 
pick something RIGHT NOW and don't give the people actually affected 
time to use the method and decide what works best in practice for them."

As with the str.replace method, the final answer might be "there is no 
best behaviour and we should refuse to choose".

Why are we rushing to permanently enshrine one specific behaviour into 
the builtins before any of the users of the feature have a chance to use 
it and decide for themselves which suits them best?

    Now is better than never.
    Although never is often better than *right* now.

Somebody (I won't name names, but they know who they are) wrote to me 
off-list some time ago and accused me of being arrogant and thinking I 
know more than everyone else. Well perhaps I am, but I'm not so arrogant 
as to think that I can choose the right behaviour for clashing affixes 
for other people when my own use-cases don't have clashing affixes.


[...]
> Sure, but I've often wanted to do something like "strip off a prefix
> of http:// or https://", or something else that doesn't have a
> semantic that's known to the stdlib.

I presume there's a reason you aren't using urllib.parse and you just 
need a string without the leading scheme. If you're doing further 
parsing, the stdlib has the right batteries for that.

(Aside: perhaps urllib.parse.ParseResult should get an attribute to 
return the URL minus the scheme? That seems like it would be useful.)


> Also, this is still fairly
> verbose, and a lot of people are going to reach for a regex, just
> because it can be done in one line of code.

Okay, they will use a regex. Is that a problem? We're not planning on 
banning regexes, are we? If they're happy using regexes, and don't care 
that it will be perhaps three times slower, let them.


> > I posted links to prior art. Unless I missed something, not one of those
> > languages or libraries supports multiple affixes in the one call.
> 
> And they don't support multiple affixes in startswith/endswith either,
> but we're very happy to have that in Python.

But not until we had a couple of releases of experience with them:

https://docs.python.org/2.7/library/stdtypes.html#str.endswith

And .replace still only takes a single target to be replaced.


[...]
> We don't have to worry about edge cases that are
> unlikely to come up in real-world code,

And you are making that pronouncement on the basis of what? Your gut 
feeling? Perhaps you're thinking too narrowly.

Here's a partial list of English prefixes that somebody doing text 
processing might want to remove to get at the root word:

    a an ante anti auto circum co com con contra contro de dis
    en ex extra hyper il im in ir inter intra intro macro micro
    mono non omni post pre pro sub sym syn tele un uni up

I count fourteen clashes:

    a: an ante anti
    an: ante anti
    co: com con contra contro
    ex: extra
    in: inter intra intro
    un: uni

(That's over a third of this admittedly incomplete list of prefixes.)

I can think of at least one pair of English suffixes that clash: -ify, -fy.
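To make the ambiguity concrete, here is one plausible semantics for a 
multi-affix cut (first match in the order given, removed at most once; 
the function is a sketch of mine, not the proposed method), showing how 
clashing prefixes make the result order-dependent:

```python
def cut_prefix_first_match(s, prefixes):
    # Remove the first matching prefix, at most once.
    for pre in prefixes:
        if s.startswith(pre):
            return s[len(pre):]
    return s

cut_prefix_first_match('interact', ('in', 'inter'))   # 'teract'
cut_prefix_first_match('interact', ('inter', 'in'))   # 'act'
```

Two equally reasonable calls, two different "root words". Which one the 
builtin should produce is precisely the question we don't yet have the 
real-world experience to answer.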

How about other languages? How confident are you that nobody doing text 
processing in German or Hindi will need to deal with clashing affixes?



-- 
Steven
