stripping unwanted chars from string

Edward Elliott nobody at 127.0.0.1
Thu May 4 01:54:19 EDT 2006


John Machin wrote:
> [expletives deleted] and it was wrong anyway (according to your
> requirements);
> using \w would keep '_' which is *NOT* alphanumeric.

Actually the Perl is correct; the explanation was the faulty part.  When in
doubt, trust the code.  Plus I explicitly allowed _ further down, so the
mistake should have been fairly obvious.


>  >>> alphabet = 'qwertyuiopasdfghjklzxcvbnm' # Look, Ma, no thought
> required!! Monkey see, monkey type.

I won't dignify that with a response.  The code, that is; I couldn't give a
toss about the comments.  If you enjoy using such verbose, error-prone
representations in your code, God help anyone maintaining it.  Including
you six months later.  Quick, find the difference between these sets at a
glance:

'qwertyuiopasdfghjklzxcvbnm'
'abcdefghijklmnopqrstuvwxyz'
'abcdefghijklmnopprstuvwxyz'
'abcdefghijk1mnopqrstuvwxyz'
'qwertyuopasdfghjklzxcvbnm' # no fair peeking

And I won't even bring up locales.
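
For the record, the standard library already spells the ASCII alphabet as
string.ascii_lowercase, so nobody has to type it by hand.  A minimal sketch
that uses it to check the five strings above; the names candidates and
alphabet_errors are made up for illustration:

```python
import string

# The five hand-typed "alphabets" from the list above.
candidates = [
    'qwertyuiopasdfghjklzxcvbnm',
    'abcdefghijklmnopqrstuvwxyz',
    'abcdefghijklmnopprstuvwxyz',
    'abcdefghijk1mnopqrstuvwxyz',
    'qwertyuopasdfghjklzxcvbnm',
]

reference = set(string.ascii_lowercase)

def alphabet_errors(candidate):
    """Return (missing, extra) characters relative to a-z, each as a sorted string."""
    chars = set(candidate)
    missing = ''.join(sorted(reference - chars))
    extra = ''.join(sorted(chars - reference))
    return missing, extra

for candidate in candidates:
    print(candidate, alphabet_errors(candidate))
```

The last three strings come back with a missing 'q', a missing 'l' plus a
stray '1', and a missing 'i' respectively.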


>  >>> keepchars = set(alphabet + alphabet.upper() + '1234567890-.')
>  >>> fixer = lambda x: ''.join(c for c in x if c in keepchars)

Those darn monkeys, always think they're so clever! ;)
if "you can" == "you should": do(it)
else: do(not)


>> Sadly I can find no such beast.  Anyone have any insight?  As of now,
>> regexes look like the best solution.
> 
> I'll leave it to somebody else to dredge up the standard riposte to your
> last sentence :-)

If the monstrosity above is the best you've got, regexes are clearly the
better solution.  Readable trumps inscrutable any day.
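
A minimal sketch of the regex route in Python, assuming the allowed set is
ASCII letters, digits, '_', '-' and '.' (pieced together from this thread,
not a quoted spec) and assuming Python 3, where re.ASCII is needed to keep
\w from matching non-ASCII word characters.  The name strip_unwanted is
hypothetical:

```python
import re

# Assumption: keep ASCII letters, digits, '_', '-' and '.'; delete the rest.
# re.ASCII restricts \w to [a-zA-Z0-9_]; without it, a Python 3 str pattern
# would also keep Unicode letters and digits.
_UNWANTED = re.compile(r'[^\w.-]', re.ASCII)

def strip_unwanted(name):
    """Delete every character outside the allowed set."""
    return _UNWANTED.sub('', name)

print(strip_unwanted('my file (v2)_final-1.txt'))
```

Whether non-ASCII letters should survive is a locale question; dropping the
re.ASCII flag is the one-line change if they should.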


> One point on your requirements: replacing unwanted characters instead of
> deleting them may be better -- theoretically possible problems with
> deleting are: (1) duplicates (foo and foo_ become the same) (2) '_' 
> becomes '' which is not a valid filename. 

Which is why I perform checks for emptiness and uniqueness after the strip. 
I decided long ago that stripping is preferable to replacement here.
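
A rough sketch of what such post-strip checks could look like.  The function
name check_stripped and the choice to raise ValueError are assumptions for
illustration; the actual handling (rename, skip, prompt) is not described
in this thread:

```python
def check_stripped(stripped, taken):
    """Validate a name after stripping.

    Raises ValueError if the name is empty or already present in `taken`;
    otherwise records it in `taken` and returns it.
    """
    if not stripped:
        raise ValueError('name is empty after stripping')
    if stripped in taken:
        raise ValueError('duplicate name after stripping: %r' % stripped)
    taken.add(stripped)
    return stripped

taken = set()
check_stripped('foo', taken)      # fine, 'foo' is now recorded
# check_stripped('foo', taken)    # would raise: duplicate
# check_stripped('', taken)       # would raise: empty
```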


> And a legibility problem: if 
> you hate '_' and ' ' so much, why not change them to '-'?

_ is allowed.  And I do prefer -, but not for legibility.  It doesn't
require me to hit Shift.

 
> Oh and just in case the fix was accidentally applied to a path:
> 
> keepchars.update(os.sep)
> if os.altsep: keepchars.update(os.altsep)

Nope, like I said, this is strictly a filename.  Stripping out path
components is the first thing I do.  But thanks for pointing out these
common pitfalls for members of our studio audience.  Tell him what he's
won, Johnny! ;)
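
For the studio audience, one way to drop path components before any
character stripping is os.path.basename.  This is a sketch of the idea, not
necessarily the code in question, and filename_only is a hypothetical name:

```python
import os.path

def filename_only(path):
    """Return the final path component, using the current platform's rules."""
    return os.path.basename(path)

example = os.path.join('some', 'dir', 'foo_bar.txt')
print(filename_only(example))
```

Note that os.path.basename returns an empty string for a path ending in a
separator, which is one more reason to check for emptiness afterwards.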



