Whittle it on down
Peter Otten
__peter__ at web.de
Fri May 6 03:45:18 EDT 2016
DFS wrote:
> On 5/5/2016 1:39 AM, Stephen Hansen wrote:
>
>> Given:
>>
>>>>> input = [u'Espa\xf1ol', 'Health & Fitness Clubs (36)', 'Health Clubs &
>>>>> Gymnasiums (42)', 'Health Fitness Clubs', 'Name', 'Atlanta city
>>>>> guide', 'edit address', 'Tweet', 'PHYSICAL FITNESS CONSULTANTS &
>>>>> TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'HEALTH CLUBS & GYMNASIUMS',
>>>>> 'www.custombuiltpt.com/', 'RACQUETBALL COURTS PRIVATE',
>>>>> 'www.lafitness.com', 'GYMNASIUMS', 'HEALTH & FITNESS CLUBS',
>>>>> 'www.lafitness.com', 'HEALTH & FITNESS CLUBS', 'www.lafitness.com',
>>>>> 'PERSONAL FITNESS TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'EXERCISE &
>>>>> PHYSICAL FITNESS PROGRAMS', 'FITNESS CENTERS', 'HEALTH CLUBS &
>>>>> GYMNASIUMS', 'HEALTH CLUBS & GYMNASIUMS', 'PERSONAL FITNESS TRAINERS',
>>>>> '5', '4', '3', '2', '1', 'Yellow Pages', 'About Us', 'Contact Us',
>>>>> 'Support', 'Terms of Use', 'Privacy Policy', 'Advertise With Us', 'Add
> /Update Listing', 'Business Profile Login', 'F.A.Q.']
>>
>> Then:
>>
>>>>> pattern = re.compile(r"^[A-Z\s&]+$")
>>>>> output = [x for x in list if pattern.match(x)]
>>>>> output
>
>> ['PHYSICAL FITNESS CONSULTANTS & TRAINERS', 'HEALTH CLUBS & GYMNASIUMS',
>> 'HEALTH CLUBS & GYMNASIUMS', 'RACQUETBALL COURTS PRIVATE', 'GYMNASIUMS',
>> 'HEALTH & FITNESS CLUBS', 'HEALTH & FITNESS CLUBS', 'PERSONAL FITNESS
>> TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'EXERCISE & PHYSICAL FITNESS
>> PROGRAMS', 'FITNESS CENTERS', 'HEALTH CLUBS & GYMNASIUMS', 'HEALTH CLUBS
>> & GYMNASIUMS', 'PERSONAL FITNESS TRAINERS']
>
>
> Should've looked earlier. Their master list of categories
> http://www.usdirectory.com/cat/g0 shows a few commas, a bunch of dashes,
> and the ampersands we talked about.
>
> "OFFICE SERVICES, SUPPLIES & EQUIPMENT" gets removed because of the comma.
>
> "AUTOMOBILE - DEALERS" gets removed because of the dash.
>
> I updated your regex and it seems to have fixed it.
>
> orig: (r"^[A-Z\s&]+$")
> new : (r"^[A-Z\s&,-]+$")
>
>
> Thanks again.
If there is a "master list" compare your candidates against it instead of
using a heuristic, i. e.
categories = set(master_list)
output = [category for category in input if category in categories]
You can find the categories with
>>> import urllib.request
>>> import bs4
>>> soup =
bs4.BeautifulSoup(urllib.request.urlopen("http://www.usdirectory.com/cat/g0").read())
>>> categories = set()
>>> for li in soup.find_all("li"):
... assert li.parent.parent["class"][0].startswith("category_items")
... categories.add(li.text)
...
>>> print("\n".join(sorted(categories)[:10]))
Accounting & Bookkeeping Services
Adoption Services
Adult Entertainment
Advertising
Agricultural Equipment & Supplies
Agricultural Production
Agricultural Services
Aids Resources
Aircraft Charters & Rentals
Aircraft Dealers & Services
More information about the Python-list
mailing list