Whittle it on down

alister alister.ware at ntlworld.com
Fri May 6 06:01:04 EDT 2016


On Thu, 05 May 2016 19:31:33 -0400, DFS wrote:

> On 5/5/2016 1:39 AM, Stephen Hansen wrote:
> 
>> Given:
>>
>>>>> input = [u'Espa\xf1ol', 'Health & Fitness Clubs (36)', 'Health Clubs
>>>>> & Gymnasiums (42)', 'Health Fitness Clubs', 'Name', 'Atlanta city
>>>>> guide', 'edit address', 'Tweet', 'PHYSICAL FITNESS CONSULTANTS &
>>>>> TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'HEALTH CLUBS & GYMNASIUMS',
>>>>> 'www.custombuiltpt.com/', 'RACQUETBALL COURTS PRIVATE',
>>>>> 'www.lafitness.com', 'GYMNASIUMS', 'HEALTH & FITNESS CLUBS',
>>>>> 'www.lafitness.com', 'HEALTH & FITNESS CLUBS', 'www.lafitness.com',
>>>>> 'PERSONAL FITNESS TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'EXERCISE
>>>>> & PHYSICAL FITNESS PROGRAMS', 'FITNESS CENTERS', 'HEALTH CLUBS &
>>>>> GYMNASIUMS', 'HEALTH CLUBS & GYMNASIUMS', 'PERSONAL FITNESS
>>>>> TRAINERS', '5', '4', '3', '2', '1', 'Yellow Pages', 'About Us',
>>>>> 'Contact Us', 'Support', 'Terms of Use', 'Privacy Policy',
>>>>> 'Advertise With Us', 'Add/Update Listing', 'Business Profile Login',
>>>>> 'F.A.Q.']
>>
>> Then:
>>
>>>>> import re
>>>>> pattern = re.compile(r"^[A-Z\s&]+$")
>>>>> output = [x for x in input if pattern.match(x)]
>>>>> output
> 
>> ['PHYSICAL FITNESS CONSULTANTS & TRAINERS', 'HEALTH CLUBS &
>> GYMNASIUMS',
>> 'HEALTH CLUBS & GYMNASIUMS', 'RACQUETBALL COURTS PRIVATE',
>> 'GYMNASIUMS',
>> 'HEALTH & FITNESS CLUBS', 'HEALTH & FITNESS CLUBS', 'PERSONAL FITNESS
>> TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'EXERCISE & PHYSICAL FITNESS
>> PROGRAMS', 'FITNESS CENTERS', 'HEALTH CLUBS & GYMNASIUMS', 'HEALTH
>> CLUBS & GYMNASIUMS', 'PERSONAL FITNESS TRAINERS']
> 
> 
> Should've looked earlier.  Their master list of categories
> http://www.usdirectory.com/cat/g0 shows a few commas, a bunch of dashes,
> and the ampersands we talked about.
> 
> "OFFICE SERVICES, SUPPLIES & EQUIPMENT" gets removed because of the
> comma.
> 
> "AUTOMOBILE - DEALERS" gets removed because of the dash.
> 
> I updated your regex and it seems to have fixed it.
> 
> orig: (r"^[A-Z\s&]+$")
> new : (r"^[A-Z\s&,-]+$")
> 
> 
> Thanks again.
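
For anyone following along, here is a quick check of the updated
pattern. It is only a sketch: the sample strings are taken from this
thread, and the names "samples" and "kept" are made up for the
illustration.

```python
import re

# Updated pattern from the quoted message: uppercase letters, whitespace,
# ampersands, commas and hyphens only. The hyphen is last in the character
# class, so it is read as a literal rather than as a range.
pattern = re.compile(r"^[A-Z\s&,-]+$")

# Sample strings taken from the thread ("samples" is an illustrative name).
samples = [
    "OFFICE SERVICES, SUPPLIES & EQUIPMENT",
    "AUTOMOBILE - DEALERS",
    "HEALTH CLUBS & GYMNASIUMS",
    "Health Fitness Clubs",
    "www.lafitness.com",
    "5",
]

# Keep only the strings that consist entirely of allowed characters.
kept = [s for s in samples if pattern.match(s)]
print(kept)
```

The two category names that were being dropped now pass, while the
mixed-case, URL and digit entries are still filtered out.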

It looks to me like this system is trying to prevent SQL injection
attacks by blacklisting certain characters.
This is not the correct way to block such attacks, and it is probably not
a good indicator of the quality of the rest of the application.
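
The usual alternative is to pass values to the database as query
parameters instead of filtering characters. Below is a minimal sketch
using Python's sqlite3 module. I have no idea what database the site
actually uses, so the "listings" table, its "category" column and the
hostile string are all hypothetical and only there to show the idea.

```python
import sqlite3

# In-memory database purely for illustration; the table and column names
# are assumptions, not anything known about the site's real schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE listings (category TEXT)")

# A value containing quotes, a semicolon and a comment marker, i.e. the
# kind of input a character blacklist is meant to keep out.
hostile = "GYMNASIUMS'); DROP TABLE listings; --"

for category in ["HEALTH CLUBS & GYMNASIUMS", "AUTOMOBILE - DEALERS", hostile]:
    # The "?" placeholder sends the value separately from the SQL text,
    # so it is stored as plain data and never parsed as SQL.
    conn.execute("INSERT INTO listings (category) VALUES (?)", (category,))

# The same placeholder style works for lookups.
rows = conn.execute(
    "SELECT category FROM listings WHERE category = ?", (hostile,)
).fetchall()
count = conn.execute("SELECT COUNT(*) FROM listings").fetchone()[0]
print(rows, count)
```

With parameters there is no need to restrict which characters a
category name may contain; commas, dashes and quotes are all stored as
ordinary text.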



-- 
When love is gone, there's always justice.
And when justice is gone, there's always force.
And when force is gone, there's always Mom.
Hi, Mom!
		-- Laurie Anderson


