Remove duplicate letters in a word

Sun Jun 22 13:57:12 EDT 2003

Quoth Eliran Gonen:
  [...]
> For example, in the word SECRET, E appears twice, I need to get rid of
> the second instance of E.
> 
> I'm stuck here:
> 
>     for i in range(len(key)):
>         for j in range(len(key)):
>             if key[i] == key[j] :
>                 key[j] = ""

Close!

The immediate problem with this code is that strings are
immutable, so you can't change a character.  One easy solution for
this problem is to turn the string into a list, apply your code,
then turn the list back into a string:

    def withoutduplicates(string):
        chars = list(string)  # create list of characters
        for i in range(len(chars)):
            for j in range(len(chars)):
                if chars[i] == chars[j]:
                    chars[j] = ''
        return ''.join(chars)  # join characters back into a string

This code will run, but it has another problem:

    >>> withoutduplicates('repeat')
    ''

It removes *all* letters.  One way to discover why is to make the
code babble about what it's doing:

    def withoutduplicates(string):
        chars = list(string)
        for i in range(len(chars)):
            for j in range(len(chars)):
                if chars[i] == chars[j]:
                    ### added:
                    print 'characters at %d and %d are both %s' \
                          % (i, j, chars[i])
                    ###
                    chars[j] = ''
        return ''.join(chars)

    >>> withoutduplicates('repeat')
    characters at 0 and 0 are both r
    characters at 1 and 1 are both e
    characters at 2 and 2 are both p
    characters at 3 and 3 are both e
    characters at 4 and 4 are both a
    characters at 5 and 5 are both t
    ''

As this output shows, the function is comparing characters to
themselves, then, since they're equal to themselves, deleting them
as duplicates.  So: rather than loop j over all the characters,
loop only over the characters *after* the one pointed to by i:

    def withoutduplicates(string):
        chars = list(string)
        for i in range(len(chars)):
            for j in range(i+1, len(chars)):  # changed
                if chars[i] == chars[j]:
                    print 'characters at %d and %d are both %s' \
                          % (i, j, chars[i])
                    chars[j] = ''
        return ''.join(chars)

    >>> withoutduplicates('repeat')
    characters at 1 and 3 are both e
    'repat'

That looks right.  Getting rid of the print statement, and adding
documentation:

    def withoutduplicates(string):
        """Returns a copy of the string with duplicate characters removed.

        >>> withoutduplicates('repeat')
        'repat'

        """
        chars = list(string)
        for i in range(len(chars)):
            for j in range(i+1, len(chars)):
                if chars[i] == chars[j]:
                    chars[j] = ''
        return ''.join(chars)

(You can use the doctest module to turn the example in the
docstring into an automated test, which has many advantages.  See
the doctest docs.)

(This algorithm is O(n**2), meaning that if the string's length is
doubled, the function will take about four times as much time to
execute; if the string's length is tripled, the function will take
about nine times as long; and so forth.  For small strings this is
probably not a big deal; but if you want to perform this operation
on big strings, something like

    def withoutduplicates(string):
        seen = {}
        chars = []
        for c in string:
            if c not in seen:
                chars.append(c)
                seen[c] = 1
        return ''.join(chars)

(untested) might be preferable.  This is O(n); if the string's
length is tripled, it takes only about three times as long.)

-- 
Steven Taschuk                            staschuk at telusplanet.net
"Our analysis begins with two outrageous benchmarks."
  -- "Implementation strategies for continuations", Clinger et al.