[Tutor] about Regular Expression

Sun May 11 10:06:00 2003

On Sun, 11 May 2003, Abdirizak abdi wrote:

[text reflowed]

> Hi, I was working on a program that processes XML files, which
> contain the following:
>
> <W>The</W> <W>most</W> <W>likely</W> <W>analysis</W>
>
> I want to extract the words in between <W>..</W>, and I
> set up a reg. expression which gives me those words. That is o.k.: ['The',
> 'most', 'likely', 'analysis', ......] but the files also contain
> non-words tagged with <W>..</W>
>
> I was trying to set up a reg. expression
> that eliminates those tagged non-words, which are giving me a bit of
> a problem....
>
> <W>.</W> <W>,</W> <W>(</W> <W>)</W> <W>'</W>
>
> Can anyone
> help me get this reg. expression right? Thanks in advance.

Hello Abdirizak,

Instead of making the regexp more and more complicated, also consider
suppressing the non-words afterwards:

words = [x for x in mixed_words_non_words if x not in tuple('\'"(),.;:!?')]

Complicated regexps have the disadvantage of poor readability (and are
therefore hard to maintain). Perhaps it's easier to do it in two steps
within your program.
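
A minimal sketch of the two steps together might look like the following.
The sample string and the names text, tagged and non_words are only
assumptions for illustration; the pattern is a guess at what your existing
regexp does, so substitute your own.

```python
import re

# Sample input, modelled on the lines quoted above (an assumption, not
# your real file).
text = ("<W>The</W> <W>most</W> <W>likely</W> <W>analysis</W> "
        "<W>.</W> <W>,</W> <W>(</W> <W>)</W> <W>'</W>")

# Step 1: a simple, readable regexp that grabs everything between
# <W> and </W>, words and non-words alike.
tagged = re.findall(r"<W>(.*?)</W>", text)

# Step 2: suppress the single-character punctuation tokens afterwards.
non_words = tuple('\'"(),.;:!?')
words = [x for x in tagged if x not in non_words]

print(words)  # ['The', 'most', 'likely', 'analysis']
```

If further non-word tokens turn up later, only the string passed to
tuple() has to grow; the regexp can stay as it is.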

Michael