extracting substrings from a file

Mon Sep 11 10:33:58 EDT 2006

On 11 Sep 2006 05:29:17 -0700, sofiafig at gmail.com <sofiafig at gmail.com> wrote:
> Hi,
>
> I have a file with several entries in the form:
>
> AFFX-BioB-5_at   E. coli  /GEN=bioB  /gb:J04423.1  NOTE=SIF
> corresponding to nucleotides 2032-2305 of /gb:J04423.1  DEF=E.coli
> 7,8-diamino-pelargonic acid (bioA), biotin synthetase (bioB),
> 7-keto-8-amino-pelargonic acid synthetase (bioF), bioC protein, and
> dethiobiotin synthetase (bioD), complete cds.
>
> 1415785_a_at     /gb:NM_009840.1 /DB_XREF=gi:6753327 /GEN=Cct8 /FEA=FLmRNA
> /CNT=482 /TID=Mm.17989.1 /TIER=FL+Stack /STK=281 /UG=Mm.17989 /LL=12469
> /DEF=Mus musculus chaperonin subunit 8 (theta) (Cct8), mRNA.
> /PROD=chaperonin subunit 8 (theta) /FL=/gb:NM_009840.1 /gb:BC009007.1
>
> and I would like to create a file that has only the following:
>
> AFFX-BioB-5_at  /GEN=bioB  /gb:J04423.1
>
> 1415785_a_at     /gb:NM_009840.1 /GEN=Cct8
>
> Could anyone please tell me how can I do it?

If each entry is on a single line, the following should give you some
ideas.  It is not robust enough for "production" use, though.

Note that the 2nd input line has 2 /gb fields, so your script would
need some way of knowing which one to pick.

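>>> # s is assumed to hold the file contents as one string
>>> for x in s.splitlines():
...     data = x.split()
...     if not data:
...         continue  # skip blank lines between entries
...     output = [data[0]]  # the first column is always kept
...     for z in data[1:]:
...         # keep /GEN and /gb fields, dropping exact duplicates
...         if z.startswith(('/GEN', '/gb')) and z not in output:
...             output.append(z)
...     print(' '.join(output))
...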
AFFX-BioB-5_at /GEN=bioB /gb:J04423.1
1415785_a_at /gb:NM_009840.1 /GEN=Cct8 /gb:BC009007.1
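
One possible way to deal with the extra /gb field is to keep only the
first field seen for each prefix.  This is just a sketch: "first one
wins" is my assumption, not something stated in your post, and the names
extract_fields and PREFIXES are made up for illustration.  It also still
assumes one entry per line.

```python
# Prefixes of the fields to keep. At most one field per prefix is kept.
PREFIXES = ('/GEN=', '/gb:')


def extract_fields(line):
    """Return the first column plus the first field found for each prefix.

    Assumption: when a prefix occurs more than once on a line,
    the first occurrence is the one that is wanted.
    """
    tokens = line.split()
    if not tokens:
        return ''  # blank line: nothing to extract
    output = [tokens[0]]
    seen = set()  # prefixes that have already been matched
    for token in tokens[1:]:
        for prefix in PREFIXES:
            if token.startswith(prefix) and prefix not in seen:
                seen.add(prefix)
                output.append(token)
    return ' '.join(output)


# Shortened versions of the two sample entries, one entry per line.
line1 = ("AFFX-BioB-5_at   E. coli  /GEN=bioB  /gb:J04423.1  NOTE=SIF "
         "corresponding to nucleotides 2032-2305 of /gb:J04423.1  DEF=E.coli")
line2 = ("1415785_a_at     /gb:NM_009840.1 /DB_XREF=gi:6753327 /GEN=Cct8 "
         "/FEA=FLmRNA /FL=/gb:NM_009840.1 /gb:BC009007.1")

for line in (line1, line2):
    print(extract_fields(line))
```

Fields are printed in the order they appear on the line, which matches
the output you asked for.  If you would rather have the last /gb field,
or all of them, the `seen` check is the part to change.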

HTH :)