Delete all not allowed characters..
Steven D'Aprano
steve at REMOVE-THIS-cybersource.com.au
Thu Oct 25 19:05:54 EDT 2007
On Thu, 25 Oct 2007 23:23:37 +0200, Michal Bozon wrote:
>> Repeatedly adding strings together in this way is about the most
>> inefficient, slow way of building up a long string. (Although I'm sure
>> somebody can come up with a worse way if they try hard enough.)
>>
>> Even though recent versions of CPython have a local optimization that
>> improves the performance hit of string concatenation somewhat, it is
>> better to use ''.join() rather than add many strings together:
>>
>>
> String appending is not tragically slower, for strings long tens of MB,
> the speed makes me a difference in few tens of percents, so it is not
> several times slower, or so
That is a half-truth.
Because strings are immutable, when you concat two strings Python has to
duplicate both of them. This leads to quadratic-time behaviour, where the
time taken is proportional to the SQUARE of the number of characters.
This rapidly becomes very slow.
*However*, as of Python 2.4, CPython has an optimization that can detect
some types of string concatenation and do them in-place, giving (almost)
linear-time performance. But that depends on:
(1) the implementation: it only works for CPython, not Jython or
IronPython or other Python implementations;
(2) the version: it is an implementation detail introduced in Python 2.4,
and is not guaranteed to remain in future versions;
(3) the specific details of how you concat strings: s=s+t will get the
optimization, but s=t+s or s=s+t1+t2 will not.
In other words: while having that optimization in place is a win, you
cannot rely on it. If you care about portable code, the advice to use
join() still stands.
[snip]
> Nice, I did not know that string translation exists, but Abandoned have
> defined allowed characters, so making a translation table for the
> unallowed characters, which would take nearly complete unicode character
> table would be inefficient.
The cost of building the unicode translation table is minimal: about 1.5
seconds ONCE, and it is only a couple of megabytes of data:
>>> allowed = u'+0123456789 ŞşÖöÜüÇçİıĞğ' \
... u'ACBEDGFIHKJMLONQPSRUTWVYXZacbedgfihkjmlonqpsrutwvyxz'
>>>
>>> timer = timeit.Timer('not_allowed = [i for i in range(0x110000) if
unichr(i) not in allowed]; TABLE = dict(zip(not_allowed, u" "*len
(not_allowed)))', 'from __main__ import allowed')
>>>
>>> timer.repeat(3, 10)
[18.267689228057861, 16.495684862136841, 16.785034894943237]
The translate method runs about ten times faster than anything you can
write in pure Python. If Abandoned has got as much data as he keeps
saying he has, he will save a lot more than 1.5 seconds by using
translate compared to relatively slow Python code.
On the other hand, if he is translating only small strings, with
different sets of allowed chars each time, then there is no advantage to
using the translate method.
And on the third hand... I can't help but feel that the *right* solution
to Abandoned's problem is to use encode/decode with the appropriate codec.
--
Steven.
More information about the Python-list
mailing list