Whittle it on down

Steven D'Aprano steve at pearwood.info
Thu May 5 13:43:40 EDT 2016


On Thu, 5 May 2016 11:32 pm, Stephen Hansen wrote:

> On Thu, May 5, 2016, at 12:36 AM, Steven D'Aprano wrote:
>> Oh, a further thought...
>> 
>> On Thursday 05 May 2016 16:46, Stephen Hansen wrote:
>> > I don't even care about faster: It's overly complicated. Sometimes a
>> > regular expression really is the clearest way to solve a problem.
>> 
>> Putting non-ASCII letters aside for the moment, how would you match these
>> specs as a regular expression?
> 
> I don't know, but mostly because I wouldn't even try. 

Really? Peter Otten seems to have found a solution, and Random832 almost
found it too.


> The requirements 
> are over-specified. If you look at the OP's data (and based on previous
> conversation), he's doing web scraping and trying to pull out good data.

I'm not talking about the OP's data. I'm talking about *my* requirements.

I thought that this was a friendly discussion about regexes, but perhaps I
was mistaken. Because I sure am feeling a lot of hostility to the ideas
that regexes are not necessarily the only way to solve this, and that data
validation is a good thing.


> There's no absolutely perfect way to do that because the system he's
> scraping isn't meant for data processing. The data isn't cleanly
> articulated.

Right. Which makes it *more*, not less, important to be sure that your regex
doesn't match too much, because your data is likely to be contaminated by
junk strings that don't belong in the data and shouldn't be accepted. I've
done enough web scraping to realise just how easy it is to start grabbing
data from the wrong part of the file.


> Instead, he wants a heuristic to pull out what look like section titles.

Good for him. I asked a different question. Does my question not count?


> The OP looked at the data and came up with a simple set of rules that
> identify these section titles:
> 
>>> Want to keep all elements containing only upper case letters or upper
>>> case letters and ampersand (where ampersand is surrounded by spaces)

That simple rule doesn't match his examples, as I know too well because I
made the silly mistake of coding to the spec as written without reading
the examples as well. As I already admitted. That was a silly mistake
because I know very well that people are really bad at writing detailed
specs that match neither too much nor too little.

But you know, I was more focused on the rest of his question, namely whether
it was better to extract the matching strings into a new list, or delete the
non-matches from the existing list, and just got carried away writing the
match function. I didn't actually expect anyone to use it. It was untested,
and I hinted that a regex would probably be better.

I was trying to teach DFS a generic programming technique, not solve his
stupid web scraping problem for him. What happens next time when he's
trying to filter a list of floats, or Widgets? Should he convert them to
strings so he can use a regex to match them, or should he learn about
general filtering techniques?
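
For illustration, the general technique can be sketched roughly like this.
It is a minimal example, not code from this thread: the names
keep_matching, discard_in_place and is_valid_reading are hypothetical, and
the rule for what counts as a "good" float is an arbitrary assumption. The
point is only that the filtering is separate from the predicate, so the
same two functions work for strings, floats or Widgets.

```python
def keep_matching(items, predicate):
    """Return a new list holding only the items the predicate accepts."""
    return [item for item in items if predicate(item)]


def discard_in_place(items, predicate):
    """Drop rejected items from the existing list object.

    Slice assignment keeps the same list, so other references to it
    see the filtered contents too.
    """
    items[:] = [item for item in items if predicate(item)]


def is_valid_reading(x):
    # Assumed rule for this example only: not NaN and not negative.
    return x == x and x >= 0


readings = [0.5, -1.0, 2.25, float("nan"), 3.0]
alias = readings

kept = keep_matching(readings, is_valid_reading)  # new list, original untouched
discard_in_place(readings, is_valid_reading)      # same list, now filtered
```

Whether to build a new list or filter in place mostly depends on whether
anything else holds a reference to the original list.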


> This translates naturally into a simple regular expression: an uppercase
> string with spaces and &'s. Now, that expression doesn't 100% encode
> every detail of that rule-- it allows both Q&A and Q & A-- but on my own
> looking at the data, I suspect it's good enough. The titles are clearly
> separate from the other data scraped by their being upper cased. We just
> need to expand our allowed character range into spaces and &'s.
> 
> Nothing in the OP's request demands the kind of rigorous matching that
> your scenario does. It's a practical problem with a simple, practical
> answer.

Yes, and that practical answer needs to reject:

- the empty string, because it is easy to mistakenly get empty strings when
scraping data, especially if you post-process the data;

- strings that are all spaces, because "       " cannot possibly be a title;

- strings that are all ampersands, because "&&&&&" is not a title, and it
almost surely indicates that your scraping has gone wrong and you're
reading junk from somewhere;

- even leading and trailing spaces are suspect: "  FOO  " doesn't match any
of the examples given, and it seems unlikely to be a title. Presumably the
strings have already been filtered or post-processed to have leading and
trailing spaces removed, in which case "  FOO  " reveals a bug.
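
As a rough sketch, one regex that rejects all four cases above could look
like the following. It assumes ASCII-only titles made of upper case words
separated either by a single space or by " & ". That is only one reading
of the requirements, not necessarily the pattern anyone posted in this
thread, and the names TITLE and is_title are hypothetical.

```python
import re

# One or more upper case words; each later word is preceded by a single
# space, optionally with "& " after it (so "Q & A" is accepted but
# "Q&A" is not).
TITLE = re.compile(r"[A-Z]+(?: (?:& )?[A-Z]+)*")


def is_title(s):
    """Return True if the whole string looks like a section title."""
    # fullmatch anchors at both ends, so stray leading or trailing
    # spaces (or a trailing newline) cause a rejection.
    return TITLE.fullmatch(s) is not None


candidates = ["", "       ", "&&&&&", "  FOO  ", "FOO", "Q & A", "FOO BAR & BAZ"]
titles = [s for s in candidates if is_title(s)]
```

Using fullmatch rather than match with a trailing "$" matters here, since
"$" would also accept a string that ends in a newline.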

 

-- 
Steven



