Whittle it on down
DFS
nospam at dfs.com
Fri May 6 10:41:30 EDT 2016
On 5/6/2016 9:58 AM, DFS wrote:
> On 5/6/2016 3:45 AM, Peter Otten wrote:
>> DFS wrote:
>
>>> Should've looked earlier. Their master list of categories
>>> http://www.usdirectory.com/cat/g0 shows a few commas, a bunch of dashes,
>>> and the ampersands we talked about.
>>>
>>> "OFFICE SERVICES, SUPPLIES & EQUIPMENT" gets removed because of the
>>> comma.
>>>
>>> "AUTOMOBILE - DEALERS" gets removed because of the dash.
>>>
>>> I updated your regex and it seems to have fixed it.
>>>
>>> orig: (r"^[A-Z\s&]+$")
>>> new : (r"^[A-Z\s&,-]+$")
>>>
>>>
>>> Thanks again.
>>
>> If there is a "master list", compare your candidates against it instead of
>> using a heuristic, i.e.
>>
>> categories = set(master_list)
>> output = [category for category in input if category in categories]
>>
>> You can find the categories with
>>
>>>>> import urllib.request
>>>>> import bs4
>>>>> soup = bs4.BeautifulSoup(urllib.request.urlopen("http://www.usdirectory.com/cat/g0").read())
>>
>>>>> categories = set()
>>>>> for li in soup.find_all("li"):
>> ...     assert li.parent.parent["class"][0].startswith("category_items")
>> ...     categories.add(li.text)
>> ...
>>>>> print("\n".join(sorted(categories)[:10]))
>
>
>
> "import urllib.request
> ImportError: No module named request"
Figured it out: urllib.request is Python 3-only, so on Python 2 I used
urllib2 instead. Your code returns 411 categories from that first page.
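For anyone hitting the same ImportError, a small compatibility shim lets the same script run under either interpreter (just a sketch; the shim is mine, not from Peter's code):

```python
# Import urlopen from whichever module the running interpreter provides.
try:
    from urllib.request import urlopen   # Python 3: urllib was split into submodules
except ImportError:
    from urllib2 import urlopen          # Python 2: the old flat module

# Either way, urlopen(url) now returns a file-like response object
# whose .read() gives the page bytes to feed into BeautifulSoup.
```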
There are up to 4 levels of categorization:
Level 1: Arts & Entertainment
Level 2: Newspapers
Level 3: Newspaper Brokers
Level 3: Newspaper Dealers Back Number
Level 3: Newspaper Delivery
Level 3: Newspaper Distributors
Level 3: Newsracks
Level 3: Printers Newspapers
Level 3: Newspaper Dealers
Level 3: News Dealers
Level 4: News Dealers Wholesale
Level 4: Shoppers News Publications
Level 3: News Service
Level 4: Newspaper Feature Syndicates
Level 4: Prepress Services
http://www.usdirectory.com/cat/g0 shows 21 Level 1 categories and 390
Level 2 categories. To get Levels 3 and 4 you have to drill down through
the hyperlinks.
How to do it in Python is beyond my skills at this point. Get the hrefs,
load and parse each page, then get the next level and repeat?
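That is essentially the right idea: a depth-limited recursive crawl. A Python 3 sketch using only the standard library (html.parser instead of bs4, so it runs anywhere); the "/cat/" href filter and the max depth of 4 are my assumptions about the site, and the fetch function is injected so the logic can be tried without hitting usdirectory.com:

```python
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    """Collect (href, link text) pairs for <a> tags that look like category links."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> we are currently inside, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if "/cat/" in href:          # assumption: category pages live under /cat/
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

def crawl(url, fetch, depth=1, max_depth=4, seen=None):
    """Depth-first walk of category pages.

    `fetch` maps a URL to its HTML text; `seen` stops revisiting pages.
    Returns a list of (depth, category name) pairs in discovery order.
    """
    if seen is None:
        seen = set()
    seen.add(url)
    parser = LinkParser()
    parser.feed(fetch(url))
    results = []
    for href, text in parser.links:
        if href in seen:
            continue
        seen.add(href)
        results.append((depth, text))
        if depth < max_depth:
            results.extend(crawl(href, fetch, depth + 1, max_depth, seen))
    return results
```

Real use would pass something like `lambda u: urllib.request.urlopen(u).read().decode("utf-8", "replace")` as `fetch`, and would need `urllib.parse.urljoin` to resolve relative hrefs; whether the site's category links actually all contain "/cat/" is untested guesswork about its markup.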
More information about the Python-list mailing list