split string into multi-character "letters"

Vlastimil Brom vlastimil.brom at gmail.com
Wed Aug 25 16:08:19 EDT 2010


2010/8/25 Jed <jedmeltzer at gmail.com>:
> Hi, I'm seeking help with a fairly simple string processing task.
> I've simplified what I'm actually doing into a hypothetical
> equivalent.
> Suppose I want to take a word in Spanish, and divide it into
> individual letters.  The problem is that there are a few 2-character
> combinations that are considered single letters in Spanish - for
> example 'ch', 'll', 'rr'.
> Suppose I have:
>
> # this would include the whole alphabet, but I shortened it here
> alphabet = ['a','b','c','ch','d','u','r','rr','o']
> theword = 'churro'
>
> I would like to split the string 'churro' into a list containing:
>
> 'ch','u','rr','o'
>
> So at each letter I want to look ahead and see if it can be combined
> with the next letter to make a single 'letter' of the Spanish
> alphabet.  I think this could be done with a regular expression
> passing the list called "alphabet" to re.match() for example, but I'm
> not sure how to use the contents of a whole list as a search string in
> a regular expression, or if it's even possible.  My real application
> is a bit more complex than the Spanish alphabet so I'm looking for a
> fairly general solution.
> Thanks,
> Jed

Hi,
I am not sure whether it can be generalised enough for your needs,
but you can try something like:
>>> import re
>>> re.findall(r"rr|ll|ch|[a-z]", "asdasdallasdrrcvb")
['a', 's', 'd', 'a', 's', 'd', 'a', 'll', 'a', 's', 'd', 'rr', 'c', 'v', 'b']

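Of course, the pattern has to be adjusted precisely so as not to
lose characters: re.findall() silently skips anything that matches none
of the alternatives, and the multi-character "letters" must come before
the single-character class, since alternatives are tried left to right.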
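
If you want to build the pattern from a list such as your "alphabet"
instead of typing it by hand, something along these lines might work. It
is only a sketch: build_splitter and split_letters are names I made up
for illustration, and I am assuming that every entry of the list is a
non-empty literal string and that an input containing anything outside
the list should be reported as an error.

```python
import re

def build_splitter(alphabet):
    # Hypothetical helper: put the longest "letters" first so that
    # 'ch' is tried before 'c', and escape each entry in case it
    # contains regex metacharacters.
    ordered = sorted(alphabet, key=len, reverse=True)
    return re.compile("|".join(re.escape(letter) for letter in ordered))

def split_letters(word, alphabet):
    # Hypothetical helper: split word into entries of alphabet.
    letters = build_splitter(alphabet).findall(word)
    # findall() skips text that matches no alternative, so check
    # that the pieces add up to the original word again.
    if "".join(letters) != word:
        raise ValueError("characters outside the alphabet in %r" % word)
    return letters

alphabet = ['a', 'b', 'c', 'ch', 'd', 'u', 'r', 'rr', 'o']
print(split_letters('churro', alphabet))  # ['ch', 'u', 'rr', 'o']
```

Sorting by length is the simplest way to get the "look ahead" behaviour
you describe; if your real alphabet has overlapping entries where the
longest match is not always the one you want, this would need more care.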

hth,
  vbr


