Whittle it on down
DFS
nospam at dfs.com
Fri May 6 09:58:34 EDT 2016
On 5/6/2016 3:45 AM, Peter Otten wrote:
> DFS wrote:
>> Should've looked earlier. Their master list of categories
>> http://www.usdirectory.com/cat/g0 shows a few commas, a bunch of dashes,
>> and the ampersands we talked about.
>>
>> "OFFICE SERVICES, SUPPLIES & EQUIPMENT" gets removed because of the comma.
>>
>> "AUTOMOBILE - DEALERS" gets removed because of the dash.
>>
>> I updated your regex and it seems to have fixed it.
>>
>> orig: (r"^[A-Z\s&]+$")
>> new : (r"^[A-Z\s&,-]+$")
>>
>>
>> Thanks again.
>
> If there is a "master list" compare your candidates against it instead of
> using a heuristic, i.e.
>
> categories = set(master_list)
> output = [category for category in input if category in categories]
>
> You can find the categories with
>
>>>> import urllib.request
>>>> import bs4
>>>> soup = bs4.BeautifulSoup(urllib.request.urlopen("http://www.usdirectory.com/cat/g0").read())
>>>> categories = set()
>>>> for li in soup.find_all("li"):
> ...     assert li.parent.parent["class"][0].startswith("category_items")
> ...     categories.add(li.text)
> ...
>>>> print("\n".join(sorted(categories)[:10]))
"import urllib.request
ImportError: No module named request"
I'm on Python 2.7.11.
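
For anyone else hitting this: urllib.request only exists in Python 3; on Python 2.7 urlopen lives in urllib2. A minimal sketch of an import that works on both, assuming bs4 is installed. The helper name fetch_categories is made up for illustration, and the <li> selection simply mirrors the quoted snippet rather than anything verified against the site:

```python
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2.7


def fetch_categories(url="http://www.usdirectory.com/cat/g0"):
    """Hypothetical helper: return the set of <li> texts found on the page."""
    import bs4  # third-party; imported lazily so the fallback above stands alone
    # Naming the parser explicitly avoids bs4's "no parser specified" warning.
    soup = bs4.BeautifulSoup(urlopen(url).read(), "html.parser")
    return set(li.text for li in soup.find_all("li"))
```

Nothing is fetched until fetch_categories() is actually called.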
> Accounting & Bookkeeping Services
> Adoption Services
> Adult Entertainment
> Advertising
> Agricultural Equipment & Supplies
> Agricultural Production
> Agricultural Services
> Aids Resources
> Aircraft Charters & Rentals
> Aircraft Dealers & Services
Yeah, I actually did something like that last night. I was trying to get
their full tree structure, which goes 4 levels deep, e.g.:

Arts & Entertainment
  Newspapers
    News Dealers
      Prepress Services

What I referred to as their 'master list' is actually just 2 levels
deep. My bad.
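
One way to hold a tree like that is a nested dict, walked recursively. This is only a sketch: the dict layout and the name walk are assumptions for illustration, and the single branch is modeled on the example path above, not scraped from the site.

```python
# Hypothetical nested-dict model of the category tree. The one branch
# below is illustrative only; real data would come from the site.
tree = {
    "Arts & Entertainment": {
        "Newspapers": {
            "News Dealers": {
                "Prepress Services": {},
            },
        },
    },
}


def walk(node, path=()):
    """Yield every category as a tuple of names from the root down."""
    for name, children in sorted(node.items()):
        current = path + (name,)
        yield current
        # Plain loop instead of "yield from" so this also runs on Python 2.7.
        for sub in walk(children, current):
            yield sub


paths = list(walk(tree))
depth = max(len(p) for p in paths)  # number of levels in the deepest branch
```

Each tuple in paths is the full route to one category, so len() of a tuple gives that category's level.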
So far I haven't come across one that had anything in it but letters,
dashes, commas or ampersands.
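
To make the difference between the two approaches concrete, here is a small sketch comparing the updated regex with a lookup against a master list. The sample lines (including "ACME PLUMBING") and the three-entry list are invented stand-ins, and upper-casing the list is an assumption based on the input being all caps while the site shows mixed case:

```python
import re

# The updated pattern from earlier in the thread.
CATEGORY_RE = re.compile(r"^[A-Z\s&,-]+$")

# Stand-in for the scraped master list; real data would come from the site.
master_list = [
    "Accounting & Bookkeeping Services",
    "Adoption Services",
    "Aircraft Dealers & Services",
]

# Normalize case so upper-case input lines can match mixed-case entries.
categories = set(name.upper() for name in master_list)

# Invented input lines: two categories, one all-caps non-category, one other.
lines = [
    "ACCOUNTING & BOOKKEEPING SERVICES",
    "ACME PLUMBING",
    "Joe's Tax Shop",
    "ADOPTION SERVICES",
]

by_regex = [line for line in lines if CATEGORY_RE.match(line)]
by_lookup = [line for line in lines if line.upper() in categories]
```

With this sample data the regex also keeps "ACME PLUMBING", since any all-caps line passes, while the lookup keeps only the two real categories.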
Thanks