returning regex matches as lists

John Machin sjmachin at lexicon.net
Sat Feb 16 06:44:11 EST 2008


On Feb 16, 8:25 am, Jonathan Lukens <jonathan.luk... at gmail.com> wrote:
> > What would you like to see instead?
>
> I had mostly just expected that there was some method that would
> return each entire match as an item on a list.  I have this pattern:
>
> >>> import re
> >>> corporate_names = re.compile(u'(?u)\\b([á-ñ]{2,}\\s+)([<<"][Á-Ñá-ñ]+)(\\s*-?[Á-Ñá-ñ]+)*([>>"])')
> >>> terms = corporate_names.findall(sourcetext)
>
> Which matches a specific way that Russian company names are
> formatted.  I was expecting a method that would return this:
>
> >>> terms
>
> [u'string one', u'string two', u'string three']

What is the point of having parenthesised groups in the regex if you
are interested only in the whole match?
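
To illustrate the question above, here is a minimal sketch of how capturing groups change what findall() returns. The sample text and the ASCII pattern are hypothetical stand-ins, not the poster's regex, and the code uses Python 3 syntax.

```python
import re

# Hypothetical sample text and simplified pattern (assumptions for illustration).
text = 'firm "Alpha-Beta" and firm "Gamma"'

# With capturing groups, findall() returns one tuple of groups per match.
grouped = re.compile(r'\b([a-z]{2,}\s+)(")([A-Za-z]+(?:-[A-Za-z]+)*)(")')
tuples = grouped.findall(text)

# Option 1: keep the groups, but ask each match object for group(0),
# which is always the entire match.
whole = [m.group(0) for m in grouped.finditer(text)]

# Option 2: make the groups non-capturing with (?:...); with no capturing
# groups in the pattern, findall() returns the whole matches as strings.
ungrouped = re.compile(r'\b(?:[a-z]{2,}\s+)"(?:[A-Za-z]+(?:-[A-Za-z]+)*)"')
whole_again = ungrouped.findall(text)

print(tuples)
print(whole)
print(whole_again)
```

Either option yields a flat list of strings, so no tuple-to-list conversion or join is needed afterwards.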

Other comments:
(1) use a raw string for improved legibility:
ur'(?u)\b([á-ñ]{2,}\s+)([<<"][Á-Ñá-ñ]+)(\s*-?[Á-Ñá-ñ]+)*([>>"])'
(2) consider not including whitespace at the end of a group:
ur'(?u)\b([á-ñ]{2,})\s+([<<"][Á-Ñá-ñ]+)\s*(-?[Á-Ñá-ñ]+)*([>>"])'
(3) what appears between [] is a set of characters, so [<<"] is the
same as [<"] and probably isn't doing what you expect; have you tested
this regex for correctness?
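
Point (3) can be checked directly. The sketch below uses a made-up sample string (an assumption, not the poster's data) and Python 3 syntax. It shows that a repeated character inside [] changes nothing, and that a set always matches exactly one character.

```python
import re

doubled = re.compile(r'[<<"]')   # the set as written in the pattern
single = re.compile(r'[<"]')     # the same set without the duplicate

sample = 'a << b "c" < d'        # hypothetical test string

# Both sets find exactly the same single characters.
same = doubled.findall(sample) == single.findall(sample)

# A character set matches one character, never the two-character run '<<'.
one_char = doubled.match('<<').group(0)

# To accept either the two-character sequence or a double quote,
# alternation is one possibility.
opener = re.compile(r'(?:<<|")')
two_char = opener.match('<<').group(0)

print(same, repr(one_char), repr(two_char))
```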

>
> ...mostly because I was working it this way in Java and haven't
> learned to do things the Python way yet.  At the suggestion from
> someone on the list, I just used list() on all the tuples like so:
>
> >>> detupled_terms = [list(term_tuple) for term_tuple in terms]
> >>> delisted_terms = [''.join(term_list) for term_list in detupled_terms]
>
> which achieves the desired result, but I am not a programmer and so I
> would still be interested to know if there is a more elegant way of
> doing this.

I can't imagine how "not a programmer" implies "interested to know if
there is a more elegant way". In any case, explore the correctness
axis first.
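
One concrete correctness check, sketched under assumptions: a repeated group such as (...)* keeps only its last repetition, so joining the groups that findall() returns can silently drop text. The pattern below is a hypothetical ASCII analogue of the poster's regex, the sample text is invented, and the code uses Python 3 syntax.

```python
import re

# Hypothetical ASCII analogue of the quoted pattern (assumption for illustration).
pattern = re.compile(r'\b([a-z]{2,}\s+)("[A-Za-z]+)(\s*-?[A-Za-z]+)*(")')
text = 'firm "Alpha Beta Gamma"'

# Joining the groups: str.join accepts tuples directly, so the list() step
# is not needed. The third group repeats, and only its last repetition is kept.
joined = [''.join(groups) for groups in pattern.findall(text)]

# Asking for group(0) returns the text that was actually matched.
whole = [m.group(0) for m in pattern.finditer(text)]

print(joined)
print(whole)
```

Here the joined result loses the middle word, while group(0) preserves the full match.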

Cheers,
John


