Delete all not allowed characters..

Steven D'Aprano steve at REMOVE-THIS-cybersource.com.au
Thu Oct 25 19:05:54 EDT 2007


On Thu, 25 Oct 2007 23:23:37 +0200, Michal Bozon wrote:

>> Repeatedly adding strings together in this way is about the most
>> inefficient, slow way of building up a long string. (Although I'm sure
>> somebody can come up with a worse way if they try hard enough.)
>> 
>> Even though recent versions of CPython have a local optimization that
>> reduces the performance hit of string concatenation somewhat, it is
>> better to use ''.join() rather than add many strings together:
>> 
>> 
> String appending is not tragically slower; for strings tens of MB long,
> the difference I measure is a few tens of percent, so it is not
> several times slower, or so

That is a half-truth.

Because strings are immutable, when you concatenate two strings Python 
has to copy both of them into a new string. Doing that repeatedly leads 
to quadratic-time behaviour, where the time taken is proportional to the 
SQUARE of the number of characters. This rapidly becomes very slow.
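
A quick way to see this for yourself is something like the following 
(the function and sizes are made up for illustration, and the exact 
numbers depend on your machine and Python version):

import timeit

def concat(n):
    # build an n-character string by repeated concatenation; prepending
    # defeats CPython's in-place optimization, so the quadratic cost shows
    s = ''
    for i in range(n):
        s = 'x' + s
    return s

for n in (10000, 20000, 40000):
    t = timeit.Timer('concat(%d)' % n, 'from __main__ import concat')
    print n, min(t.repeat(3, 1))

Each doubling of n should take roughly four times as long, not twice.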

*However*, as of Python 2.4, CPython has an optimization that can detect 
some types of string concatenation and do them in-place, giving (almost) 
linear-time performance. But that depends on:

(1) the implementation: it only works for CPython, not Jython or 
IronPython or other Python implementations;

(2) the version: it is an implementation detail introduced in Python 2.4, 
and is not guaranteed to remain in future versions;

(3) the specific details of how you concatenate strings: s=s+t will get 
the optimization, but s=t+s or s=s+t1+t2 will not.


In other words: while having that optimization in place is a win, you 
cannot rely on it. If you care about portable code, the advice to use 
join() still stands.
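
The portable idiom is to collect the pieces in a list and join them 
once at the end. A minimal sketch (the loop body is just a stand-in for 
however the pieces are really produced):

pieces = []
for i in range(1000):
    pieces.append('piece %d\n' % i)   # stand-in for the real data
result = ''.join(pieces)

That stays (close to) linear on every implementation and every version.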


[snip]

> Nice, I did not know that string translation exists, but Abandoned has
> defined allowed characters, so making a translation table for the
> disallowed characters, which would cover nearly the complete unicode
> character table, would be inefficient.


The cost of building the unicode translation table is minimal: about 1.5 
seconds ONCE, and it is only a couple of megabytes of data:


>>> import timeit
>>> allowed = u'+0123456789 ŞşÖöÜüÇçİıĞğ' \
... u'ACBEDGFIHKJMLONQPSRUTWVYXZacbedgfihkjmlonqpsrutwvyxz'
>>>
>>> timer = timeit.Timer(
...     'not_allowed = [i for i in range(0x110000) if unichr(i) not in allowed]; '
...     'TABLE = dict(zip(not_allowed, u" "*len(not_allowed)))',
...     'from __main__ import allowed')
>>>
>>> timer.repeat(3, 10)
[18.267689228057861, 16.495684862136841, 16.785034894943237]


The translate method runs about ten times faster than anything you can 
write in pure Python. If Abandoned has got as much data as he keeps 
saying he has, he will save a lot more than 1.5 seconds by using 
translate compared to relatively slow Python code.
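
Once TABLE has actually been built (the timeit run above only times the 
work inside its own namespace; you would run the same two statements 
once at top level), applying it is a single method call. The sample 
string here is made up, just to show the effect:

>>> dirty = u'foo! bar? +90 555'
>>> dirty.translate(TABLE)
u'foo  bar  +90 555'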

On the other hand, if he is translating only small strings, with 
different sets of allowed chars each time, then there is no advantage to 
using the translate method.

And on the third hand... I can't help but feel that the *right* solution 
to Abandoned's problem is to use encode/decode with the appropriate codec.
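
What that would look like depends on what the data is for, but as a 
rough sketch: iso-8859-9 (Latin-5, the Turkish codec) covers everything 
in his allowed set, and errors='replace' turns anything the codec 
cannot represent into '?':

>>> text = u'+90 çok iyi \u20ac'             # made-up sample; \u20ac is a euro sign
>>> text.encode('iso-8859-9', 'replace')     # anything outside Latin-5 becomes '?'

followed by .decode('iso-8859-9') if the rest of the code expects 
unicode again. That is not the same thing as stripping down to the 
allowed set above (ordinary punctuation survives, for instance), but it 
may be closer to what he actually needs.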



-- 
Steven.


