a RegEx puzzle

Fri Mar 11 10:05:22 EST 2005

> Charles Hartman <charles.hartman at conncoll.edu> wrote:
>> I'm still shaky on some of sre's syntax. Here's the task: I've got
>> strings (never longer than about a dozen characters) that are
>> guaranteed to be made only of characters 'x' and '/'.
>
> One possibility is to cheat completely, and depending on memory 
> constraints
> and how often you need to do this, it might even make sense to do so.

Interesting -- I hadn't thought of that. It would, realistically, 
probably be 2 ** 14 or so, but what the hell. Trouble is, this would 
require me to pick out by hand/eye -- once only, admittedly -- 
something like 16,384 "longests," which I do not care to do. If I can 
compute them, instead, there's no reason not to do it on the fly; it 
doesn 't take *that* long.

> There's only two possible values for each character, so you're really
> dealing with binary numbers.  The max length is 12 characters (I'm
> deliberately being oblivious to where you said "about" :-)), so there 
> are
> only 2^12 or 4096 possible strings.  You could probably afford to
> pre-compute all 4096 strings, use them as dictionary keys, and store a
> (start, end) tuple for each key.  Computation then becomes a simple
> dictionary lookup.
>
>> In each string I
>> want to find the longest continuous stretch of pairs whose first
>> character is 'x' and the second is either mark. So if marks =
>> '/xx/xxx///', the "winning" stretch begins at position 2 and is 6
>> characters long ('x/xxx/'), which requires finding a second match that
>> overlaps the first match (which is just 'xx' in position 1). (When
>> there are multiple "winning" stretches, I might want to adjudicate
>> among them, but that's a separate problem.) I hope this is clear
>> enough.
>
> Unfortunately, no, it's not very clear.  In fact, I have no clue what
> you're trying to do.  Could you try explaining it again?

See below.

>> Charles Hartman
>> Professor of English, Poet in Residence
>
> Hmmm.  Are you, by any chance, looking for meter patterns in verse?

Yes, of course. My program that scans iambic pentameters (Scandroid, 
ver. 0.2a) is available (GPL) on the "Programs" page of my site listed 
below. It pretty much works, which is interesting. (People who know how 
to do this generally don't believe a program can do it.) I'm a week or 
two (knock on wood) from version 1.0, which handles anapestic meters as 
well, and non-pentameter lengths, and do other stuff.

I'm not sure which part of my explanation wasn't clear. The string ('/' 
and 'x' only) will ultimately be divided into groups of one, two, 
three, or four marks, by many complex operations. One approach to the 
whole task (an approach I just discovered a month or two ago, which 
supplements a more chunk-by-chunk method) is to look for the longest 
substring within that string of (perhaps) 10 or twelve mrks, of which 
the following is true: if the substring is divided into pairs, each 
pair consists of a 'x' followed by either a 'x' or a '/'. The general 
problem is to the find the longest possible substring matching an sre 
specfication -- and a general solution to that doesn't seem to be 
obvious, though the need for it must arise often enough.

What I want is code (a line, a function . . .) that returns the 
beginning-point and length of that longest substring. (I could ask for 
the substring itself; it doesn't matter; the information sought will be 
some subset of the information supplied with every sre match object.) 
For my current "best" solution, see previous message. Hope this is any 
clearer.

Charles Hartman
Professor of English, Poet in Residence
http://cherry.conncoll.edu/cohar
http://villex.blogspot.com