Whittle it on down
DFS
nospam at dfs.com
Fri May 6 10:41:30 EDT 2016
On 5/6/2016 9:58 AM, DFS wrote:
> On 5/6/2016 3:45 AM, Peter Otten wrote:
>> DFS wrote:
>
>>> Should've looked earlier. Their master list of categories
>>> http://www.usdirectory.com/cat/g0 shows a few commas, a bunch of dashes,
>>> and the ampersands we talked about.
>>>
>>> "OFFICE SERVICES, SUPPLIES & EQUIPMENT" gets removed because of the
>>> comma.
>>>
>>> "AUTOMOBILE - DEALERS" gets removed because of the dash.
>>>
>>> I updated your regex and it seems to have fixed it.
>>>
>>> orig: (r"^[A-Z\s&]+$")
>>> new : (r"^[A-Z\s&,-]+$")
>>>
>>>
>>> Thanks again.
>>
>> If there is a "master list", compare your candidates against it instead of
>> using a heuristic, i.e.
>>
>> categories = set(master_list)
>> output = [category for category in input if category in categories]
>>
>> You can find the categories with
>>
>>>>> import urllib.request
>>>>> import bs4
>>>>> soup = bs4.BeautifulSoup(urllib.request.urlopen("http://www.usdirectory.com/cat/g0").read())
>>
>>>>> categories = set()
>>>>> for li in soup.find_all("li"):
>> ...     assert li.parent.parent["class"][0].startswith("category_items")
>> ...     categories.add(li.text)
>> ...
>>>>> print("\n".join(sorted(categories)[:10]))
>
>
>
> "import urllib.request
> ImportError: No module named request"
Figured it out: urllib.request is Python 3-only, so on Python 2 I used
urllib2 instead. Your code returns 411 categories from that first page.
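For anyone hitting the same ImportError, a small compatibility shim lets the same script run under either interpreter (just a sketch; the shim is mine, not from Peter's code):

```python
# Import urlopen from whichever module the running interpreter provides.
try:
    from urllib.request import urlopen   # Python 3: urllib was split into submodules
except ImportError:
    from urllib2 import urlopen          # Python 2: the old flat module

# Either way, urlopen(url) now returns a file-like response object
# whose .read() gives the page bytes to feed into BeautifulSoup.
```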
There are up to 4 levels of categorization:
Level 1: Arts & Entertainment
Level 2: Newspapers
Level 3: Newspaper Brokers
Level 3: Newspaper Dealers Back Number
Level 3: Newspaper Delivery
Level 3: Newspaper Distributors
Level 3: Newsracks
Level 3: Printers Newspapers
Level 3: Newspaper Dealers
Level 3: News Dealers
Level 4: News Dealers Wholesale
Level 4: Shoppers News Publications
Level 3: News Service
Level 4: Newspaper Feature Syndicates
Level 4: Prepress Services
http://www.usdirectory.com/cat/g0 shows 21 Level 1 categories and 390
Level 2 categories. To get Levels 3 and 4 you have to drill down through
the hyperlinks.
How to do it in Python is beyond my skills at this point. Get the hrefs,
load and parse each page, then get the next level and repeat?
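That is essentially the right idea: a depth-limited recursive crawl. A Python 3 sketch using only the standard library (html.parser instead of bs4, so it runs anywhere); the "/cat/" href filter and the max depth of 4 are my assumptions about the site, and the fetch function is injected so the logic can be tried without hitting usdirectory.com:

```python
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    """Collect (href, link text) pairs for <a> tags that look like category links."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> we are currently inside, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if "/cat/" in href:          # assumption: category pages live under /cat/
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

def crawl(url, fetch, depth=1, max_depth=4, seen=None):
    """Depth-first walk of category pages.

    `fetch` maps a URL to its HTML text; `seen` stops revisiting pages.
    Returns a list of (depth, category name) pairs in discovery order.
    """
    if seen is None:
        seen = set()
    seen.add(url)
    parser = LinkParser()
    parser.feed(fetch(url))
    results = []
    for href, text in parser.links:
        if href in seen:
            continue
        seen.add(href)
        results.append((depth, text))
        if depth < max_depth:
            results.extend(crawl(href, fetch, depth + 1, max_depth, seen))
    return results
```

Real use would pass something like `lambda u: urllib.request.urlopen(u).read().decode("utf-8", "replace")` as `fetch`, and would need `urllib.parse.urljoin` to resolve relative hrefs; whether the site's category links actually all contain "/cat/" is untested guesswork about its markup.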
More information about the Python-list mailing list