Generating list of unique search sub-phrases

Nick Mellor thebalancepro at gmail.com
Wed Jun 17 18:55:35 EDT 2015


On Saturday, 30 May 2015 06:39:44 UTC+10, Nick Mellor wrote:
> Hi all,
> 
> My own solution works but I'm sure it could be simpler or read better. How would you do it?
> 
> Say you've got a list of companies:
> 
> Aerosonde Ltd
> Amcor
> ANCA
> Austal Ships
> Australia Post
> Australian Air Express
> Australian Defence Industries
> Australian Railroad Group
> Australian Submarine Corporation
> 
> and you need to extract, from each company name, the phrases that uniquely identify that company. The results for the above list of companies should be:
> 
> Company: 'Aerosonde Ltd'
>  Aliases: Aerosonde,Ltd,Aerosonde Ltd
> 
> Company: 'Amcor'
>  Aliases: Amcor
> 
> Company: 'ANCA'
>  Aliases: ANCA
> 
> Company: 'Austal Ships'
>  Aliases: Austal,Ships,Austal Ships
> 
> Company: 'Australia Post'
>  Aliases: Post,Australia Post
> 
> Company: 'Australian Air Express'
>  Aliases: Air,Express,Australian Air,Air Express,Australian Air Express
> 
> Company: 'Australian Defence Industries'
>  Aliases: Defence,Industries,Australian Defence,Defence Industries,Australian Defence Industries
> 
> Company: 'Australian Railroad Group'
>  Aliases: Railroad,Group,Australian Railroad,Railroad Group,Australian Railroad Group
> 
> Company: 'Australian Submarine Corporation'
>  Aliases: Submarine,Corporation,Australian Submarine,Submarine Corporation,Australian Submarine Corporation
> 
> Here's my solution:
> 
> from itertools import combinations, chain
> 
> companies = [
>     "Aerosonde Ltd",
>     "Amcor",
>     "ANCA",
>     "Austal Ships",
>     "Australia Post",
>     "Australian Air Express",
>     "Australian Defence Industries",
>     "Australian Railroad Group",
>     "Australian Submarine Corporation",
> ]
> 
> def flatten(i):
>     return list(chain.from_iterable(i))
> 
> companies_as_text_stream = ' '.join(companies)
> for company in companies:
>     words = company.split()
>     # r runs over word counts (1..len(words)), not over the characters in the name
>     word_combinations = [list(combinations(words, r)) for r in range(1, len(words) + 1)]
>     phrases = [' '.join(phrase) for phrase in flatten(word_combinations)]
>     # keep only phrases that occur exactly once in the joined text of all names
>     unique_phrases = [phrase for phrase in phrases if companies_as_text_stream.count(phrase) == 1]
>     aliases = ','.join(unique_phrases)
>     print("Company: '{0}'\n Aliases: {1}\n".format(company, aliases))
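
For comparison, here is a minimal sketch of the same idea wrapped in a function. It is only one possible variant, not necessarily what was suggested elsewhere in the thread. The function name unique_aliases is made up for illustration. It builds contiguous runs of words with slices instead of itertools.combinations, and it keeps the same substring test (str.count on the joined names), so "Australia" is still rejected because it occurs inside "Australian".

```python
def unique_aliases(companies):
    """Map each company name to the phrases that occur exactly once
    (as substrings) in the text formed by joining all company names.

    Hypothetical helper for illustration; it uses the same substring
    test as the quoted solution.
    """
    text = ' '.join(companies)
    result = {}
    for company in companies:
        words = company.split()
        n = len(words)
        # Contiguous runs of r words, shortest first, left to right.
        phrases = [' '.join(words[i:i + r])
                   for r in range(1, n + 1)
                   for i in range(n - r + 1)]
        result[company] = [p for p in phrases if text.count(p) == 1]
    return result


companies = [
    "Aerosonde Ltd",
    "Amcor",
    "ANCA",
    "Austal Ships",
    "Australia Post",
    "Australian Air Express",
    "Australian Defence Industries",
    "Australian Railroad Group",
    "Australian Submarine Corporation",
]

aliases = unique_aliases(companies)
print(aliases["Australia Post"])  # ['Post', 'Australia Post']
```

Non-contiguous combinations such as "Australian Express" never occur in the joined text, so the quoted version filters them out anyway. Generating only contiguous runs just skips that wasted work.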

Great reply, Peter, thank you. Lots to think about.

Cheers,

Nick


