[Python-bugs-list] [ python-Feature Requests-403100 ] Multicharacter replacements in PyUnicode_TranslateCharmap

Wed, 04 Sep 2002 13:37:11 -0700

Feature Requests item #403100, was opened at 2001-01-04 18:50
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=355470&aid=403100&group_id=5470

Category: None
Group: None
>Status: Closed
>Resolution: Fixed
Priority: 5
Submitted By: Walter Dörwald (doerwalter)
Assigned to: Nobody/Anonymous (nobody)
Summary: Multicharacter replacements in PyUnicode_TranslateCharmap

Initial Comment:
This patch modifies Objects/unicodeobject.c/PyUnicode_TranslateCharmap,
so that the error

   PyErr_SetString(PyExc_NotImplementedError,
        "1-n mappings are currently not implemented");

no longer occurs. I.e.

   u"ab".translate({ord(u"a"): u"bbb", ord(u"b"): u"aaa"})

now works. It does this by growing the string's buffer
exponentially whenever it runs out of available space.
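
For readers on current interpreters, here is a minimal sketch of the same call in Python 3 syntax. It assumes the modern `str.translate`, which accepts replacement strings of any length; the original report targets the Python 2 `unicode` type, and the variable names are illustrative only.

```python
# Python 3 equivalent of the call from the report: str.translate takes a
# dict that maps code points to replacement strings of any length.
table = {ord("a"): "bbb", ord("b"): "aaa"}
result = "ab".translate(table)
print(result)  # bbbaaa
```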

----------------------------------------------------------------------

>Comment By: Walter Dörwald (doerwalter)
Date: 2002-09-04 22:37

Message:
Logged In: YES 
user_id=89016

This is implemented by the PEP 293 patch. Closing the 
request.

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2001-08-09 23:02

Message:
Logged In: YES 
user_id=31435

Changed to Feature Requests, at MvL's request.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2001-06-07 14:32

Message:
Logged In: YES 
user_id=38388

Reopened. This should really be marked as feature request
but for some reason SF won't let me change the Data Type.

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2001-06-07 12:09

Message:
Logged In: YES 
user_id=89016

The patch that was checked in changes 
PyUnicode_DecodeCharmap and PyUnicode_EncodeCharmap, but 
not PyUnicode_TranslateCharmap, where this functionality is 
also useful (e.g. for
u"<foo>".translate({ord("<"): u"&lt;", ord(">"): u"&gt;"})
).
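
The escaping use case can be sketched as follows, again assuming Python 3's `str.translate` rather than the Python 2 call quoted above; `escape_table` is a name chosen here for illustration.

```python
# Map each markup character to its multi-character entity (a 1-n mapping).
escape_table = {ord("<"): "&lt;", ord(">"): "&gt;"}
escaped = "<foo>".translate(escape_table)
print(escaped)  # &lt;foo&gt;
```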

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2001-01-06 16:03

Message:
Checked in a different patch providing the same functionality.
Please see the CVS checkin message for details.

----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2001-01-05 19:45

Message:
I'll check in a patch for this tomorrow which implements what I had 
in mind. The patch doesn't change the performance of the charmap 
codec.

Thanks,
-- Marc-Andre

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2001-01-05 18:07

Message:
The problem is that you can't know beforehand how long
the result string will be, i.e. whether any 1-n
replacements will actually happen.

It would be possible to do a loop through the replacement
strings and see if there are any that are longer than one character,
but even if there are, you don't know if they will really be used.

So you have three choices:
(1) You guess how much space you need and reallocate
when the space is not enough,
(2) you do a dry run of the algorithm to count how much
space you need, then run the algorithm a second time,
this time using the strings, or
(3) you keep the strings in a list and join the list into
one string at the end.

For the case of 1-1 mapping the following will happen:

(1) The first allocation has exactly the right amount of space, so
there won't be any reallocations, but a size check will be done for
every character (which should be only a few assembler instructions).
The mapping will have to be accessed once for every character
in the source string.

(2) There will only be one allocation, but for every character in
the source string, the mapping has to be accessed twice, which
means calls to Python functions, exception handling etc.

(3) You have to make as many memory allocations as there are parts
of the final string that you create, including error handling etc.

I think (1) is clearly the fastest method.
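
The three choices can be modelled in a few lines of Python. This is only a sketch for comparing their shape, not the C patch itself: a list stands in for the preallocated character buffer, the function names are hypothetical, and the mapping is assumed to contain only string values.

```python
def translate_grow(source, mapping):
    """Choice (1): start with len(source) slots, double the capacity when full."""
    capacity = max(len(source), 1)
    buf = [""] * capacity          # stands in for the preallocated buffer
    used = 0
    reallocations = 0
    for ch in source:
        repl = mapping.get(ord(ch), ch)
        needed = used + len(repl)
        if needed > capacity:      # the per-character size check
            while capacity < needed:
                capacity *= 2
            buf.extend([""] * (capacity - len(buf)))
            reallocations += 1
        buf[used:needed] = repl    # copy the replacement characters in place
        used = needed
    return "".join(buf[:used]), reallocations


def translate_two_pass(source, mapping):
    """Choice (2): a dry run counts the space, a second run fills it."""
    total = sum(len(mapping.get(ord(ch), ch)) for ch in source)
    buf = [""] * total             # single allocation of the exact size
    pos = 0
    for ch in source:
        repl = mapping.get(ord(ch), ch)   # second lookup per character
        buf[pos:pos + len(repl)] = repl
        pos += len(repl)
    return "".join(buf)


def translate_join(source, mapping):
    """Choice (3): collect the parts in a list and join them at the end."""
    parts = [mapping.get(ord(ch), ch) for ch in source]
    return "".join(parts)


table = {ord("a"): "bbb", ord("b"): "aaa"}
grown, grow_count = translate_grow("ab", table)
plain, plain_count = translate_grow("abc", {ord("a"): "x"})
print(grown, grow_count)   # bbbaaa 2
print(plain, plain_count)  # xbc 0
```

With a pure 1-1 mapping, `translate_grow` never reallocates, which matches the argument made for choice (1) above.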

----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2001-01-04 19:33

Message:
I like the idea, but the implementation needs some reworking:
the common case is 1-1 mapping so this should be as fast
as possible; extra size checks slow things down too much.

You can take a different approach, though:
leave things as they are and only add a special case for the 1-n
mapping, which resizes depending on how many extra chars are inserted.
Then as a final step, if resizing occurred, call _PyUnicode_Resize()
to cut down the allocated buffer to its true size.
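
One way to read this suggestion, sketched in Python rather than C: the 1-1 path writes without any size check, and only the 1-n path grows the buffer. The function name and the `HEADROOM` over-allocation are assumptions made for illustration, and the final `del` merely plays the role of the `_PyUnicode_Resize()` call; the patch that was actually checked in may differ.

```python
HEADROOM = 16  # hypothetical over-allocation, not taken from the discussion


def translate_special_case(source, mapping):
    buf = [""] * len(source)   # exact size for the pure 1-1 case
    used = 0
    resized = False
    for index, ch in enumerate(source):
        repl = mapping.get(ord(ch), ch)
        if len(repl) == 1:
            # Common 1-1 case: no size check, one slot is always reserved
            # for every source character that has not been processed yet.
            buf[used] = repl
            used += 1
            continue
        # 1-n (or empty) replacement: keep one slot free for each
        # remaining source character so the fast path stays check-free.
        remaining = len(source) - index - 1
        needed = used + len(repl) + remaining
        if needed > len(buf):
            buf.extend([""] * (needed - len(buf) + HEADROOM))
            resized = True
        buf[used:used + len(repl)] = repl
        used += len(repl)
    if resized:
        del buf[used:]         # final trim down to the true size
    return "".join(buf[:used])


print(translate_special_case("ab", {ord("a"): "bbb", ord("b"): "aaa"}))  # bbbaaa
```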

-- Marc-Andre

----------------------------------------------------------------------
