splitting a long string into a list

Cameron Walsh cameron.walsh at gmail.com
Tue Nov 28 00:56:29 EST 2006


ronrsr wrote:
> I have a single long string - I'd like to split it into a list of
> unique keywords. Sadly, the database wasn't designed to do this, so I
> must do this in Python - I'm having some trouble using the .split()
> function, it doesn't seem to do what I want it to - any ideas?
> 
> thanks very much for your help.
> 
> r-sr-
> 
> 
> longstring = '<SNIP rather long string>'

What do you want it to do?  Split on each semicolon?

# Split on each semicolon and print the pieces one per line.
a = longstring.split(";")
for element in a:
    print(element)

Agricultural subsidies
 Foreign aidAgriculture
Sustainable Agriculture - Support
 Organic Agriculture
 Pesticides, US,Childhood Development, Birth Defects
 Toxic ChemicalsAntibiotics,AnimalsAgricultural Subsidies, Global
TradeAgriculturalSubsidiesBiodiversityCitizen
ActivismCommunityGardensCooperativesDietingAgriculture,
CottonAgriculture, GlobalTradePesticides, MonsantoAgriculture,
SeedCoffee, HungerPollution,Water, FeedlotsFood PricesAgriculture,
WorkersAnimal Feed,
Corn,PesticidesAquacultureChemicalWarfareCompostDebtConsumerismFearPesticides,
US, Childhood Development,Birth DefectsCorporate Reform,  Personhood
(Dem. Book)Corporate Reform, Personhood, Farming (Dem. Book)Crime Rates,
Legislation,EducationDebt, Credit CardsDemocracyPopulation,
WorldIncomeDemocracy,Corporate Personhood, Porter Township (Dem.
Book)DisasterReliefDwellings, SlumsEconomics, MexicoEconomy,
LocalEducation,ProtestsEndangered Habitat, RainforestEndangered
SpeciesEndangeredSpecies, Extinctionantibiotics, livestockAgricultural
subsidies
Foreign aid
Agriculture
 Sustainable Agriculture - Support
 OrganicAgriculture
 Pesticides, US, Childhood Development, Birth Defects
Toxic Chemicals
<etc.>

I think the trouble arises because your string has the following problems:

1.)  Inconsistent spacing between words (in places there is none at all)
2.)  Inconsistent separators between elements (sometimes semicolons,
sometimes commas, although the commas appear to belong to the elements
themselves, and sometimes no clear separator at all)

Basically, this problem cannot be solved reliably by a program working
from the string alone.  Neither Python nor anything else can know which
words are meant to be together and which are not when there are no
separators between elements and no separators between words within
those elements.
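
To illustrate the ambiguity, here is a minimal sketch of one tempting
heuristic: cut wherever a lowercase letter is immediately followed by an
uppercase one.  The function name split_on_case_change is made up for this
example, and the rule itself is only a guess about how the elements got
glued together.

```python
import re

def split_on_case_change(text):
    """Hypothetical heuristic: cut between a lowercase letter and the
    uppercase letter that directly follows it."""
    # Insert a semicolon at every lowercase-to-uppercase boundary,
    # then split on the semicolons just inserted.
    marked = re.sub(r"([a-z])([A-Z])", r"\1;\2", text)
    return marked.split(";")

print(split_on_case_change("Foreign aidAgriculture"))
print(split_on_case_change("TradeAgriculturalSubsidiesBiodiversityCitizen"))
```

The first call gives 'Foreign aid' and 'Agriculture', which looks right.
The second cuts 'AgriculturalSubsidies' into two pieces even though it is
probably meant as one keyword, and the heuristic has no way to tell the
difference.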

You need to find a new way of generating the string, or do it by hand.
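
If the string can be regenerated with a semicolon after every element,
getting the unique keywords becomes straightforward.  A sketch under that
assumption (the function name unique_keywords and the sample string are
invented for illustration):

```python
def unique_keywords(longstring, sep=";"):
    """Split on sep, trim whitespace, drop empty pieces, and keep only
    the first occurrence of each keyword, in original order."""
    seen = set()
    result = []
    for element in longstring.split(sep):
        keyword = element.strip()
        if keyword and keyword not in seen:
            seen.add(keyword)
            result.append(keyword)
    return result

# Assumed input: every element is terminated by a semicolon.
sample = "Agricultural subsidies; Foreign aid;Agriculture; Foreign aid; "
print(unique_keywords(sample))
```

Commas inside an element, as in 'Pesticides, US', are left alone because
only the semicolon is treated as a separator.  The comparison is
case-sensitive, so 'antibiotics' and 'Antibiotics' would count as two
different keywords.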

How did you get the string?

Cameron.


