Whittle it on down

DFS nospam at dfs.com
Fri May 6 09:58:34 EDT 2016


On 5/6/2016 3:45 AM, Peter Otten wrote:
> DFS wrote:

>> Should've looked earlier.  Their master list of categories
>> http://www.usdirectory.com/cat/g0 shows a few commas, a bunch of dashes,
>> and the ampersands we talked about.
>>
>> "OFFICE SERVICES, SUPPLIES & EQUIPMENT" gets removed because of the comma.
>>
>> "AUTOMOBILE - DEALERS" gets removed because of the dash.
>>
>> I updated your regex and it seems to have fixed it.
>>
>> orig: (r"^[A-Z\s&]+$")
>> new : (r"^[A-Z\s&,-]+$")
>>
>>
>> Thanks again.
>
> If there is a "master list" compare your candidates against it instead of
> using a heuristic, i. e.
>
> categories = set(master_list)
> output = [category for category in input if category in categories]
>
> You can find the categories with
>
>>>> import urllib.request
>>>> import bs4
>>>> soup = bs4.BeautifulSoup(urllib.request.urlopen("http://www.usdirectory.com/cat/g0").read())
>>>> categories = set()
>>>> for li in soup.find_all("li"):
> ...     assert li.parent.parent["class"][0].startswith("category_items")
> ...     categories.add(li.text)
> ...
>>>> print("\n".join(sorted(categories)[:10]))



"import urllib.request
ImportError: No module named request"


I'm on python 2.7.11
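
For anyone else hitting this on 2.7: urllib.request only exists in Python 3. In Python 2 the same urlopen function lives in urllib2. A minimal sketch of an import that should work on both (fetch_html is just a hypothetical helper name, not something from Peter's code):

```python
try:
    # Python 3: urlopen lives in urllib.request
    from urllib.request import urlopen
except ImportError:
    # Python 2.7: no urllib.request module; urlopen is in urllib2
    from urllib2 import urlopen


def fetch_html(url):
    """Return the raw bytes of the page at url (hypothetical helper)."""
    return urlopen(url).read()
```

With that in place, fetch_html("http://www.usdirectory.com/cat/g0") would stand in for the urllib.request.urlopen(...).read() call in the quoted snippet.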





> Accounting & Bookkeeping Services
> Adoption Services
> Adult Entertainment
> Advertising
> Agricultural Equipment & Supplies
> Agricultural Production
> Agricultural Services
> Aids Resources
> Aircraft Charters & Rentals
> Aircraft Dealers & Services




Yeah, I actually did something like that last night.  Was trying to get
their full tree structure, which goes 4 levels deep, e.g.:

Arts & Entertainment
   Newspapers
    News Dealers
     Prepress Services


What I referred to as their 'master list' is actually just 2 levels 
deep.  My bad.
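
A rough sketch of how the full tree could feed the set lookup Peter suggested, assuming the tree has already been scraped into nested dicts (the dict layout and the all_names helper are my assumptions, and "Not A Category" is a made-up candidate):

```python
def all_names(tree):
    """Collect category names from every level of a nested dict."""
    names = set()
    for name, children in tree.items():
        names.add(name)
        names.update(all_names(children))
    return names


# One 4-level branch, using the example above; leaves map to empty dicts.
tree = {
    "Arts & Entertainment": {
        "Newspapers": {
            "News Dealers": {
                "Prepress Services": {},
            },
        },
    },
}

categories = all_names(tree)

# Keep only candidates that are known categories at any depth.
candidates = ["News Dealers", "Not A Category"]
output = [c for c in candidates if c in categories]
```

Flattening all levels into one set keeps the membership test cheap, at the cost of losing which level a name came from.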

So far I haven't come across one that had anything in it but letters, 
dashes, commas or ampersands.
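
For reference, a quick check of the updated regex against the two categories that were getting dropped. The third sample is made up, just to show a line that should not match:

```python
import re

# Uppercase letters, whitespace, ampersands, commas and dashes only.
# The dash is last in the character class, so it is taken literally.
CATEGORY_RE = re.compile(r"^[A-Z\s&,-]+$")

samples = [
    "OFFICE SERVICES, SUPPLIES & EQUIPMENT",
    "AUTOMOBILE - DEALERS",
    "Joe's Diner #2",  # hypothetical non-category line
]

matches = [s for s in samples if CATEGORY_RE.match(s)]
```

Here matches should end up holding only the first two strings.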

Thanks


