[Tutor] regexp

Mon Nov 7 11:09:28 CET 2011

Nice solution indeed! Will it also work with accented characters? And how should one incorporate the collating sequence into the solution? By explicitly setting the locale? It would be nice if the outcome were always the same, wherever you are in the world.
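
To make the question concrete, here is a minimal sketch, not a tested answer. It swaps the regexp for a plain comparison of adjacent characters so that the sort key can be chosen. The names in_collation_order and locale_key are made up for illustration, and "fr_FR.UTF-8" is only an assumed locale name that may not be installed on a given machine. Comparing single characters with locale.strxfrm is also a simplification, since collation is really defined on whole strings.

```python
import locale


def in_collation_order(word, key=None):
    """Return True if every character sorts at or after the one before it.

    key maps a character to something comparable; by default the
    character itself is used, which means plain code point order.
    """
    if key is None:
        key = lambda ch: ch
    keys = [key(ch) for ch in word]
    return all(a <= b for a, b in zip(keys, keys[1:]))


def locale_key(name):
    """Try to set LC_COLLATE to the named locale (hypothetical helper).

    Returns locale.strxfrm on success, or None if the locale is not
    available. Note that setlocale changes state for the whole process.
    """
    try:
        locale.setlocale(locale.LC_COLLATE, name)
    except locale.Error:
        return None
    return locale.strxfrm


print(in_collation_order("almost"))      # True
print(in_collation_order("\u00e9z"))     # False: U+00E9 sorts after "z" by code point

key = locale_key("fr_FR.UTF-8")  # assumed locale name; may not be installed
if key is not None:
    print(in_collation_order("\u00e9z", key))
```

With the default key the result depends only on code points, so it is the same everywhere, but accented letters all sort after "z". With a locale key the answer follows the local collating sequence, at the price of depending on which locales the machine has installed.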

Cheers!!
Albert-Jan

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All right, but apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, a fresh water system, and public health, what have the Romans ever done for us?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

>________________________________
>From: Terry Carroll <carroll at tjc.com>
>To: tutor at python.org
>Sent: Sunday, November 6, 2011 8:21 PM
>Subject: Re: [Tutor] regexp
>
>On Sat, 5 Nov 2011, Dinara Vakhitova wrote:
>
>> I need to find the words in a corpus whose letters are in alphabetical
>> order ("almost", "my", etc.).
>> I started by matching two consecutive letters in a word that are in
>> alphabetical order, and tried to use this expression: ([a-z])[\1-z], but
>> it doesn't work; it matches any sequence of two letters. I can't figure out
>> why... Evidently I can't refer to a group like this, can I? But how, in this
>> case, can I achieve what I need?
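
A short sketch of the likely reason the pattern above matches almost anything, assuming Python's re module: inside a character class a numeric escape such as \1 is read as an octal character code, not as a backreference. So [\1-z] is simply the range from chr(1) up to "z", which covers digits, upper-case and lower-case letters alike.

```python
import re

# The pattern from the question. Inside [...] the \1 is the character
# chr(1), so the class is the range chr(1)..."z", not "group 1 up to z".
pattern = re.compile(r"([a-z])[\1-z]")

print(bool(pattern.fullmatch("ba")))   # True, although "a" comes before "b"
print(bool(pattern.fullmatch("bA")))   # True, upper case is in the range too

# The class on its own accepts chr(1) and rejects "{", the first
# character after "z" in ASCII.
print(bool(re.fullmatch(r"[\1-z]", "\x01")))  # True
print(bool(re.fullmatch(r"[\1-z]", "{")))     # False
```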
>
>First, I agree with the others that this is a lousy task for regular expressions.  It's not the tool I would use.  But, I do think it's doable, provided the requirement is not to check with a single regular expression. For simplicity's sake, I'll construe the problem as determining whether a given string consists entirely of lower-case alphabetic characters, arranged in alphabetical order.
>
>What I would do is set a variable to the lowest permissible character, i.e., "a", and another to the highest permissible character, i.e., "z" (actually, you could just use a constant for the highest, but I like the symmetry).
>
>Then construct a regex to see if a character is within the lowest-permissible to highest-permissible range.
>
>Now, iterate through the string, processing one character at a time.  On each iteration:
>
>- Test whether your character matches the regexp; if not, your answer is
>   "false"; on pass one, this means it's not lower-case alphabetic; on
>   subsequent passes, it means either that, or that it's not in sorted
>   order.
>- If it matches, update your lowest permissible character with the
>   character you just processed.
>- Regenerate your regexp using the updated lowest permissible character.
>- Iterate.
>
>I assumed lower case alphabetic for simplicity, but you could modify this basic approach with mixed case (e.g., first transforming to all-lower-case copy) or other complications.
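
The steps above can be sketched roughly as follows. This is one possible reading of the description, not necessarily the code the poster had in mind. The function name is_alphabetical and the ignore_case flag are made up for illustration. Repeated letters count as being in order here, and an empty string passes trivially.

```python
import re


def is_alphabetical(word, ignore_case=False):
    """Return True if word is all a-z with letters in non-decreasing order."""
    if ignore_case:
        word = word.lower()  # work on an all-lower-case copy
    lowest = "a"   # lowest permissible character so far
    highest = "z"  # highest permissible character (never changes)
    for ch in word:
        # Regenerate the one-character class from the current bounds.
        pattern = "[%s-%s]" % (lowest, highest)
        if not re.match(pattern, ch):
            # Either not lower-case alphabetic, or out of sorted order.
            return False
        lowest = ch  # the next character must not sort before this one
    return True


for w in ("almost", "my", "regexp", "Almost"):
    print(w, is_alphabetical(w))
```

For the four sample words this prints True, True, False and False. The last one fails only because of the capital "A"; is_alphabetical("Almost", ignore_case=True) returns True.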
>
>I don't think there's a problem with asking for help with homework on this list; but you should identify it as homework, so the responders know not to just give you a solution to your homework, but instead provide you with hints to help you solve it.