Delete all not allowed characters..

Paul Hankin paul.hankin at gmail.com
Thu Oct 25 19:57:56 EDT 2007


On Oct 26, 12:05 am, Steven D'Aprano <st... at REMOVE-THIS-
cybersource.com.au> wrote:
> On Thu, 25 Oct 2007 23:23:37 +0200, Michal Bozon wrote:
> >> Repeatedly adding strings together in this way is about the most
> >> inefficient, slow way of building up a long string. (Although I'm sure
> >> somebody can come up with a worse way if they try hard enough.)
>
> >> Even though recent versions of CPython have a local optimization that
> >> somewhat reduces the performance hit of string concatenation, it is
> >> better to use ''.join() rather than add many strings together:
>
> > String appending is not tragically slower; for strings tens of MB long,
> > the difference I see is a few tens of percent, so it is not several
> > times slower or anything like that.
>
> That is a half-truth.
>
> Because strings are immutable, when you concat two strings Python has to
> duplicate both of them. This leads to quadratic-time behaviour, where the
> time taken is proportional to the SQUARE of the number of characters.
> This rapidly becomes very slow.
>
> *However*, as of Python 2.4, CPython has an optimization that can detect
> some types of string concatenation and do them in-place, giving (almost)
> linear-time performance. But that depends on:
>
> (1) the implementation: it only works for CPython, not Jython or
> IronPython or other Python implementations;
>
> (2) the version: it is an implementation detail introduced in Python 2.4,
> and is not guaranteed to remain in future versions;
>
> (3) the specific details of how you concat strings: s=s+t will get the
> optimization, but s=t+s or s=s+t1+t2 will not.
>
> In other words: while having that optimization in place is a win, you
> cannot rely on it. If you care about portable code, the advice to use
> join() still stands.
>
> [snip]
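
To illustrate the quoted join() advice, here's a rough sketch of the two
patterns (illustrative only):

# Collect the pieces in a list, then do one linear-time join at the end.
pieces = []
for i in range(1000):
    pieces.append(str(i))
result = ''.join(pieces)

# The pattern the quoted text warns about: each concatenation may copy the
# whole string built so far, which is quadratic in general (the CPython
# in-place optimization can help, but isn't portable).
slow = ''
for i in range(1000):
    slow = slow + str(i)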
>
> > Nice, I did not know that string translation exists, but Abandoned has
> > defined the allowed characters, so making a translation table for the
> > disallowed characters, which would cover nearly the complete Unicode
> > character table, would be inefficient.
>
> The cost of building the unicode translation table is minimal: about 1.5
> seconds ONCE, and it is only a couple of megabytes of data:
>
> >>> allowed = u'+0123456789 ŞşÖöÜüÇçİıĞğ' \
> ...     u'ACBEDGFIHKJMLONQPSRUTWVYXZacbedgfihkjmlonqpsrutwvyxz'
> >>> timer = timeit.Timer(
> ...     'not_allowed = [i for i in range(0x110000) if unichr(i) not in allowed]; '
> ...     'TABLE = dict(zip(not_allowed, u" "*len(not_allowed)))',
> ...     'from __main__ import allowed')
> >>> timer.repeat(3, 10)
> [18.267689228057861, 16.495684862136841, 16.785034894943237]
>
> The translate method runs about ten times faster than anything you can
> write in pure Python. If Abandoned has got as much data as he keeps
> saying he has, he will save a lot more than 1.5 seconds by using
> translate compared to relatively slow Python code.

Byte string translate does run about 10 times faster than pure Python,
but unicode translate isn't anywhere near as fast, since it has to look
up each character in the mapping dict:

import timeit

timer = timeit.Timer("a.translate(m)",
    setup = "a = u'abc' * 1000; m = dict((x, x) for x in range(256))")

print timer.repeat(3, 10000)

[2.4009871482849121, 2.4191598892211914, 2.3641388416290283]


timer = timeit.Timer("a.translate(m)",
    setup = "a = 'abc' * 1000; m = ''.join(chr(x) for x in range(256))")

print timer.repeat(3, 10000)

[0.12261486053466797, 0.12225103378295898, 0.12217879295349121]


Also, the unicode translation dict as given doesn't work on
characters that aren't allowed: it should map ints to ints rather
than ints to strings.
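
For instance (illustrative values only), an ordinal-to-ordinal table
looks like this:

# Map a couple of disallowed ordinals to the ordinal of a space.
table = {ord(u'!'): ord(u' '), ord(u'?'): ord(u' ')}
print u'a!b?c'.translate(table)   # -> u'a b c'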

Anyway, there's no need to pay the cost of building a full mapping
dict when most of the entries are the same. Something like this can
work:

from collections import defaultdict

def clear(message):
    allowed = u'abc...'
    # Any ordinal that isn't a key falls back to the default factory, so
    # every disallowed character gets translated to a space.
    clear_translate = defaultdict(lambda: ord(u' '))
    # Allowed characters map to themselves (ordinal -> ordinal).
    clear_translate.update((c, c) for c in map(ord, allowed))
    return message.translate(clear_translate)
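
For example (hypothetical, assuming allowed were just u'abc' plus a
space):

print clear(u'abc-x,y')   # would print u'abc    ' (everything else -> spaces)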

--
Paul Hankin



