Whittle it on down

DFS nospam at dfs.com
Thu May 5 19:31:33 EDT 2016


On 5/5/2016 1:39 AM, Stephen Hansen wrote:

> Given:
>
>>>> input = [u'Espa\xf1ol', 'Health & Fitness Clubs (36)', 'Health Clubs & Gymnasiums (42)', 'Health Fitness Clubs', 'Name', 'Atlanta city guide', 'edit address', 'Tweet', 'PHYSICAL FITNESS CONSULTANTS & TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'HEALTH CLUBS & GYMNASIUMS', 'www.custombuiltpt.com/', 'RACQUETBALL COURTS PRIVATE', 'www.lafitness.com', 'GYMNASIUMS', 'HEALTH & FITNESS CLUBS', 'www.lafitness.com', 'HEALTH & FITNESS CLUBS', 'www.lafitness.com', 'PERSONAL FITNESS TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'EXERCISE & PHYSICAL FITNESS PROGRAMS', 'FITNESS CENTERS', 'HEALTH CLUBS & GYMNASIUMS', 'HEALTH CLUBS & GYMNASIUMS', 'PERSONAL FITNESS TRAINERS', '5', '4', '3', '2', '1', 'Yellow Pages', 'About Us', 'Contact Us', 'Support', 'Terms of Use', 'Privacy Policy', 'Advertise With Us', 'Add/Update Listing', 'Business Profile Login', 'F.A.Q.']
>
> Then:
>
>>>> pattern = re.compile(r"^[A-Z\s&]+$")
>>>> output = [x for x in input if pattern.match(x)]
>>>> output

> ['PHYSICAL FITNESS CONSULTANTS & TRAINERS', 'HEALTH CLUBS & GYMNASIUMS',
> 'HEALTH CLUBS & GYMNASIUMS', 'RACQUETBALL COURTS PRIVATE', 'GYMNASIUMS',
> 'HEALTH & FITNESS CLUBS', 'HEALTH & FITNESS CLUBS', 'PERSONAL FITNESS
> TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'EXERCISE & PHYSICAL FITNESS
> PROGRAMS', 'FITNESS CENTERS', 'HEALTH CLUBS & GYMNASIUMS', 'HEALTH CLUBS
> & GYMNASIUMS', 'PERSONAL FITNESS TRAINERS']


Should've looked earlier.  Their master list of categories 
http://www.usdirectory.com/cat/g0 shows a few commas, a bunch of dashes, 
and the ampersands we talked about.

"OFFICE SERVICES, SUPPLIES & EQUIPMENT" gets removed because of the comma.

"AUTOMOBILE - DEALERS" gets removed because of the dash.

I updated your regex and it seems to have fixed it.

orig: (r"^[A-Z\s&]+$")
new : (r"^[A-Z\s&,-]+$")
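
For anyone following along, here is a small sketch comparing the two patterns. The sample list is made up for illustration: it is just the two category names above plus a few strings from the earlier input. Because the dash is the last character inside the brackets, it is treated as a literal dash rather than as a range.

```python
import re

# The two patterns from this thread.
orig = re.compile(r"^[A-Z\s&]+$")
new = re.compile(r"^[A-Z\s&,-]+$")  # trailing "-" in the class is a literal dash

# Illustrative sample only, not the full scraped list.
samples = [
    "OFFICE SERVICES, SUPPLIES & EQUIPMENT",
    "AUTOMOBILE - DEALERS",
    "HEALTH CLUBS & GYMNASIUMS",
    "Yellow Pages",
    "www.lafitness.com",
    "5",
]

kept_orig = [s for s in samples if orig.match(s)]
kept_new = [s for s in samples if new.match(s)]

print(kept_orig)  # only the ampersand category survives
print(kept_new)   # the comma and dash categories survive as well
```

The mixed-case and numeric strings are still dropped by both patterns, which is the behaviour wanted here.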


Thanks again.




