Whittle it on down

DFS nospam at dfs.com
Thu May 5 10:36:27 EDT 2016


On 5/5/2016 9:32 AM, Stephen Hansen wrote:
> On Thu, May 5, 2016, at 12:36 AM, Steven D'Aprano wrote:
>> Oh, a further thought...
>>
>> On Thursday 05 May 2016 16:46, Stephen Hansen wrote:
>>> I don't even care about faster: It's overly complicated. Sometimes a
>>> regular expression really is the clearest way to solve a problem.
>>
>> Putting non-ASCII letters aside for the moment, how would you match these
>> specs as a regular expression?
>
> I don't know, but mostly because I wouldn't even try. The requirements
> are over-specified. If you look at the OP's data (and based on previous
> conversation), he's doing web scraping and trying to pull out good data.
> There's no absolutely perfect way to do that because the system he's
> scraping isn't meant for data processing. The data isn't cleanly
> articulated.
>
> Instead, he wants a heuristic to pull out what look like section titles.


Assigned by a company named localeze, apparently.

http://www.usdirectory.com/cat/g0

https://www.neustarlocaleze.biz/welcome/



> The OP looked at the data and came up with a simple set of rules that
> identify these section titles:
>
>>> Want to keep all elements containing only upper case letters or upper
>>> case letters and ampersand (where ampersand is surrounded by spaces)
>
> This translates naturally into a simple regular expression: an uppercase
> string with spaces and &'s. Now, that expression doesn't 100% encode
> every detail of that rule-- it allows both Q&A and Q & A-- but on my own
> looking at the data, I suspect it's good enough. The titles are clearly
> separate from the other data scraped by their being upper cased. We just
> need to expand our allowed character range into spaces and &'s.
>
> Nothing in the OP's request demands the kind of rigorous matching that
> your scenario does. It's a practical problem with a simple, practical
> answer.


Yes.  And simplicity + practicality = successfulality.
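
For anyone following along, here is a rough sketch of the kind of 
expression described above. The exact pattern was not posted in this 
message, so the regex, the function name and the sample strings are all 
assumptions made for illustration. Like the quoted description, it lets 
both Q&A and Q & A through.

```python
import re

# Assumed pattern: one uppercase ASCII letter, then any mix of uppercase
# letters, spaces and ampersands. Non-ASCII letters are deliberately ignored.
TITLE_RE = re.compile(r"[A-Z][A-Z &]*")


def looks_like_title(text):
    """Return True if the whole string is uppercase letters, spaces and &."""
    return TITLE_RE.fullmatch(text) is not None


# Hypothetical sample elements, not taken from the scraped site.
elements = ["ACCOUNTING & BOOKKEEPING", "Joe's Diner", "Q&A",
            "PLUMBERS", "123 MAIN ST"]
titles = [e for e in elements if looks_like_title(e)]
```

Anything with lower case letters, digits or punctuation other than & is 
dropped, which is the "good enough" heuristic rather than a strict 
encoding of the rule.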

And I do a sanity check before using the data anyway: after parse and 
cleanup and regex matching, I make sure all lists have the same number 
of elements:

lenData = [len(title), len(names), len(addr), len(street),
           len(city), len(state), len(zip)]

if len(set(lenData)) != 1:  alert the media
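
A runnable version of that check might look roughly like the following. 
This is only a sketch: the helper name check_same_length and the sample 
lists are made up for illustration, and raising ValueError stands in for 
"alert the media".

```python
def check_same_length(**columns):
    """Raise ValueError unless every list passed in has the same length.

    Returns the common length when the check passes.
    """
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) != 1:
        raise ValueError("parsed lists differ in length: %r" % lengths)
    return next(iter(lengths.values()))


# Hypothetical parsed data, two listings with three fields each.
names = ["ACME PLUMBING", "BEST ROOFING"]
city = ["ATLANTA", "MACON"]
state = ["GA", "GA"]

count = check_same_length(names=names, city=city, state=state)
```

Passing the lists by keyword means the error message shows which list 
came up short, which is handy when one field fails to parse.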




