Whittle it on down

Stephen Hansen me+python at ixokai.io
Thu May 5 14:55:11 EDT 2016


On Thu, May 5, 2016, at 10:43 AM, Steven D'Aprano wrote:
> On Thu, 5 May 2016 11:32 pm, Stephen Hansen wrote:
> 
> > On Thu, May 5, 2016, at 12:36 AM, Steven D'Aprano wrote:
> >> Oh, a further thought...
> >> 
> >> On Thursday 05 May 2016 16:46, Stephen Hansen wrote:
> >> > I don't even care about faster: It's overly complicated. Sometimes a
> >> > regular expression really is the clearest way to solve a problem.
> >> 
> >> Putting non-ASCII letters aside for the moment, how would you match these
> >> specs as a regular expression?
> > 
> > I don't know, but mostly because I wouldn't even try. 
> 
> Really? Peter Otten seems to have found a solution, and Random832 almost
> found it too.
> 
> 
> > The requirements 
> > are over-specified. If you look at the OP's data (and based on previous
> > conversation), he's doing web scraping and trying to pull out good data.
> 
> I'm not talking about the OP's data. I'm talking about *my* requirements.
> 
> I thought that this was a friendly discussion about regexes, but perhaps
> I was mistaken. Because I sure am feeling a lot of hostility to the ideas
> that regexes are not necessarily the only way to solve this, and that
> data validation is a good thing.

Umm, what? Hostility? I have no idea where you're getting that.

I didn't say that regexes are the only way to solve problems; in fact,
they're something I avoid in most cases. In the OP's case, though, I did
say I thought one was a natural fit. Usually I'd go for
startswith/endswith, "in", slicing, and similar string primitives before
reaching for a regular expression.

"Find all upper cased phrases that may have &'s in them" is something
just specific enough that the built in string primitives are awkward
tools.

In my experience, most of the problems with regexes come from people
treating them as a hammer and every problem as a nail: they end up with
ever more convoluted expressions that grow brittle. Making a regular
expression more specific is not necessarily a virtue. In fact, it's
often exactly the opposite.
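
A contrived illustration (both patterns invented for this post): the
"precise" version encodes one guess about ampersand spacing and
silently misses the other.

    import re

    loose  = re.compile(r'^[A-Z &]+$')
    strict = re.compile(r'^[A-Z]+(?: & [A-Z]+)*$')  # insists on " & "

    for line in ["NEWS & WEATHER", "Q&A"]:
        print(line, bool(loose.match(line)), bool(strict.match(line)))
    # NEWS & WEATHER True True
    # Q&A True False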

> > There's no absolutely perfect way to do that because the system he's
> > scraping isn't meant for data processing. The data isn't cleanly
> > articulated.
> 
> Right. Which makes it *more*, not less, important to be sure that your
> regex doesn't match too much, because your data is likely to be
> contaminated by junk strings that don't belong in the data and shouldn't
> be accepted. I've done enough web scraping to realise just how easy it is
> to start grabbing data from the wrong part of the file.

I have nothing against data validation; I just don't think it belongs in
the regular expression itself. That can be a step done afterwards.

> > Instead, he wants a heuristic to pull out what look like section titles.
> 
> Good for him. I asked a different question. Does my question not count?

Sure it counts, but I don't want to engage in your theoretical exercise.
That's not being hostile; it's me not wanting to think through a complex
set of constraints for a regular expression for purely intellectual
reasons.

> I was trying to teach DFS a generic programming technique, not solve his
> stupid web scraping problem for him. What happens next time when he's
> trying to filter a list of floats, or Widgets? Should he convert them to
> strings so he can use a regex to match them, or should he learn about
> general filtering techniques?

Come on. This is a bit presumptuous, don't you think?

> > This translates naturally into a simple regular expression: an uppercase
> > string with spaces and &'s. Now, that expression doesn't 100% encode
> > every detail of that rule-- it allows both Q&A and Q & A-- but on my own
> > looking at the data, I suspect it's good enough. The titles are clearly
> > separate from the other data scraped by their being upper cased. We just
> > need to expand our allowed character range into spaces and &'s.
> > 
> > Nothing in the OP's request demands the kind of rigorous matching that
> > your scenario does. It's a practical problem with a simple, practical
> > answer.
> 
> Yes, and that practical answer needs to reject:
> 
> - the empty string, because it is easy to mistakenly get empty strings
> when scraping data, especially if you post-process the data;
> 
> - strings that are all spaces, because "       " cannot possibly be a
> title;
> 
> - strings that are all ampersands, because "&&&&&" is not a title, and it
> almost surely indicates that your scraping has gone wrong and you're
> reading junk from somewhere;
> 
> - even leading and trailing spaces are suspect: "  FOO  " doesn't match
> any of the examples given, and it seems unlikely to be a title.
> Presumably the strings have already been filtered or post-processed to
> have leading and trailing spaces removed, in which case "  FOO  "
> reveals a bug.

We're going to have to agree to disagree. I find all of that
unnecessary. Any validation can easily be done before or after matching;
you don't need to over-complicate the regular expression itself. The
urge to find an ever more perfect regular expression, one that
encapsulates precisely what is correct and what is not, lends itself to
over-complicated expressions.

And regular expressions are ugly. I'd rather keep them simple and
straightforward and deal with the rest in Python.
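
Concretely, reusing the toy pattern from above (still just a sketch):

    import re

    # Keep the pattern dumb...
    TITLE = re.compile(r'^[A-Z &]+$')

    def looks_like_title(line):
        # ...and do the fussy checks in plain Python:
        if line != line.strip():                 # suspicious padding
            return False
        if not any(c.isalpha() for c in line):   # "", "   ", "&&&&&"
            return False
        return TITLE.match(line) is not None

    candidates = ["NEWS & WEATHER", "", "&&&&&", "  FOO  ", "Q & A"]
    print([c for c in candidates if looks_like_title(c)])
    # -> ['NEWS & WEATHER', 'Q & A']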

-- 
Stephen Hansen
  m e @ i x o k a i . i o


