Find and Replace Simplification

Sat Jul 20 17:56:22 EDT 2013

On 07/20/2013 02:37 PM, Joshua Landau wrote:
> On 20 July 2013 19:04, Dave Angel <davea at davea.name> wrote:
>> On 07/20/2013 01:03 PM, Joshua Landau wrote:
>>>
>>> Still, it seems to me that it should be optimizable for sensible
>>> builtin types such that .translate is significantly faster, as there's
>>> no theoretical extra work that .translate *has* to do that .replace
>>> does not, and .replace also has to rebuild the string a lot of times.
>>>
>>
>> translate is going to be faster (than replace) for Unicode if it has a
>> "large" table.  For example, to translate from ASCII to EBCDIC, where every
>> character in the string is replaced by a new one.  I have no idea what the
>> cutoff is.  But of course, for a case like ASCII to EBCDIC, it would be very
>> tricky to do it with replaces, probably taking much more than the expected
>> 96 passes.
>
> My timings showed that for ".upper()", doing the full 26 passes "a" ->
> "A", it was *way* slower to use .translate than .replace, unless you
> used a list or equiv. with much faster lookup. Even then, it was
> slower to use .translate.
>
> I agree that for large tables it's obviously going to swing the other
> way, but by the time you're running .replace 26 times you wouldn't (at
> least I wouldn't) expect it still to be screamingly faster than
> .translate.
>
>> translate for byte strings is undoubtedly tons faster.  For byte strings,
>> the translation table is 256 bytes, and the inner loop is a simple lookup.
>
> For my above test, .translate is about 10x faster than iterated .replace.
>
>> But for Unicode, the table is a dict (or something very like it, I looked at
>> the C code, not the Python code).
>>
>> So for every character in the input string, it does a dict-type lookup,
>> before it can even decide if the character is going to change.
>
> The problem can be solved, I'd imagine, for builtin types. Just build
> an internal representation upon calling .translate that's faster. It's
> especially easy in the list case

What "list case"?  list doesn't have a replace() method or translate() 
method.

> -- just build a C array¹ at the start
> mapping int -> int and then have really fast C mapping speeds.

As long as you can afford to have a list with a billion or so entries in 
it.  We are talking about strings and version 3.3, aren't we?  Of 
course, one could always examine the mapping object (table) and see what 
the max value was, and only build a "C array" if it was smaller than say 
50,000.

>
> For dictionaries, you can do the same thing -- you just have to make
> sure you're not breaking any memory barriers.
>
> ¹ I don't do C or other low level languages, so my knowledge in this
> area is embarrassingly bad
>
>> Just for reference, the two files I was looking at were:
>>
>> objects/unicodeobject.c
>> objects/bytesobject.c
>>
>> Extracted from the bz2 downloaded from the page:
>>      http://hg.python.org/cpython
>
> I didn't look at bytes first time, I might take a look later.
>

-- 
DaveA