stripping unwanted chars from string

John Machin sjmachin at lexicon.net
Thu May 4 01:01:06 EDT 2006


On 4/05/2006 1:36 PM, Edward Elliott wrote:
> I'm looking for the "best" way to strip a large set of chars from a filename
> string (my definition of best usually means succinct and readable).   I
> only want to allow alphanumeric chars, dashes, and periods.  This is what I
> would write in **** (bless me father, for I have sinned...):

[expletives deleted] and it was wrong anyway (according to your
requirements): using \w would keep '_', which is *NOT* alphanumeric.

> I could just use re.sub like the second example, but that's a bit overkill. 
> I'm trying to figure out if there's a good way to do the same thing with
> string methods.  string.translate seems to do what I want, the problem is
> specifying the set of chars to remove.  Obviously hardcoding them all is a
> non-starter.
> 
> Working with chars seems to be a bit of a pain.  There's no equivalent of
> the range function, one has to do something like this:
> 
> >>> [chr(x) for x in range(ord('a'), ord('z')+1)]
> ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o',
> 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']

 >>> # Look, Ma, no thought required!! Monkey see, monkey type.
 >>> alphabet = 'qwertyuiopasdfghjklzxcvbnm'
 >>> keepchars = set(alphabet + alphabet.upper() + '1234567890-.')
 >>> fixer = lambda x: ''.join(c for c in x if c in keepchars)
 >>> fixer('qwe!@#456.--Howzat?')
'qwe456.--Howzat'
 >>>
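
If typing the alphabet by hand offends, the string module already carries
the constants. A minimal sketch of the same filter, assuming Python 3; the
name KEEPCHARS is mine, and fixer simply mirrors the lambda above:

```python
import string

# ascii_letters is a-z plus A-Z; digits is 0-9.
KEEPCHARS = frozenset(string.ascii_letters + string.digits + '-.')

def fixer(name):
    """Return name with every character outside KEEPCHARS deleted."""
    return ''.join(c for c in name if c in KEEPCHARS)

print(fixer('qwe!@#456.--Howzat?'))  # qwe456.--Howzat
```

Note that '_' is not in the set, so it is dropped along with the rest.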

> 
> Do that twice for letters, once for numbers, add in a few others, and I get
> the chars I want to keep.  Then I'd invert the set and call translate. 
> It's a mess and not worth the trouble.  Unless there's some way to expand a
> compact representation of a char list and obtain its complement, it looks
> like I'll have to use a regex.
> 
> Ideally, there would be a mythical charset module that works like this:
> 
> >>> keep = charset.expand (r'\w.-') # or r'a-zA-Z0-9_.-'

Where'd that '_' come from?

> >>> toss = charset.invert (keep)
> 
> Sadly I can find no such beast.  Anyone have any insight?  As of now,
> regexes look like the best solution.

I'll leave it to somebody else to dredge up the standard riposte to your 
last sentence :-)
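
For the record, the mythical charset module is not hard to approximate. What
follows is only a sketch under assumptions: it targets Python 3, and the
names expand, KeepOnly and strip_all_but are hypothetical, not part of any
library. Rather than building the complement explicitly, it hands
str.translate a table object that answers "delete" for every character
outside the keep set:

```python
def expand(spec):
    """Expand a compact spec such as 'a-zA-Z0-9.-' into a set of characters.

    A '-' between two characters denotes an inclusive range; a '-' at the
    start or end of the spec is taken literally.
    """
    chars = set()
    i = 0
    while i < len(spec):
        if i + 2 < len(spec) and spec[i + 1] == '-':
            lo, hi = ord(spec[i]), ord(spec[i + 2])
            chars.update(chr(x) for x in range(lo, hi + 1))
            i += 3
        else:
            chars.add(spec[i])
            i += 1
    return chars

class KeepOnly:
    """Table for str.translate: keep the given characters, delete the rest."""

    def __init__(self, keep):
        self._keep = frozenset(ord(c) for c in keep)

    def __getitem__(self, codepoint):
        # Returning the ordinal maps the character to itself;
        # returning None tells str.translate to delete it.
        return codepoint if codepoint in self._keep else None

def strip_all_but(name, spec):
    """Delete every character of name that is not described by spec."""
    return name.translate(KeepOnly(expand(spec)))

print(strip_all_but('qwe!@#456.--Howzat?_', 'a-zA-Z0-9.-'))  # qwe456.--Howzat
```

The spec here deliberately omits '_', in line with the stated requirements.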

One point on your requirements: replacing unwanted characters instead of
deleting them may be better. Theoretically possible problems with
deleting are: (1) duplicates ('foo' and 'foo_' become the same); (2) '_'
becomes '', which is not a valid filename. And a legibility problem: if
you hate '_' and ' ' so much, why not change them to '-'?
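
A minimal sketch of the replace-instead-of-delete idea, again assuming
Python 3; replace_unwanted and its filler parameter are names made up for
illustration:

```python
import string

KEEPCHARS = frozenset(string.ascii_letters + string.digits + '-.')

def replace_unwanted(name, filler='-'):
    """Replace each character outside KEEPCHARS with filler instead of deleting it."""
    return ''.join(c if c in KEEPCHARS else filler for c in name)

print(replace_unwanted('foo_'))         # foo-
print(replace_unwanted('_'))            # -
print(replace_unwanted('my file.txt'))  # my-file.txt
```

This avoids the empty-filename case and keeps 'foo' distinct from 'foo_',
but it does not eliminate duplicates altogether: 'foo_' and 'foo ' still
both come out as 'foo-'.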

Oh and just in case the fix was accidentally applied to a path:

import os
keepchars.update(os.sep)         # keep the path separator
if os.altsep:                    # None where there is no alternate separator
    keepchars.update(os.altsep)

HTH,
John


