extracting substrings from a file

Mon Sep 11 10:02:20 EDT 2006

sofiafig at gmail.com wrote:
> Hi,
> 
> I have a file with several entries in the form:
> 
> AFFX-BioB-5_at	 E. coli  /GEN=bioB  /gb:J04423.1  NOTE=SIF
> corresponding to nucleotides 2032-2305 of /gb:J04423.1  DEF=E.coli
> 7,8-diamino-pelargonic acid (bioA), biotin synthetase (bioB),
> 7-keto-8-amino-pelargonic acid synthetase (bioF), bioC protein, and
> dethiobiotin synthetase (bioD), complete cds.
> 
> 1415785_a_at	 /gb:NM_009840.1 /DB_XREF=gi:6753327 /GEN=Cct8 /FEA=FLmRNA
> /CNT=482 /TID=Mm.17989.1 /TIER=FL+Stack /STK=281 /UG=Mm.17989 /LL=12469
> /DEF=Mus musculus chaperonin subunit 8 (theta) (Cct8), mRNA.
> /PROD=chaperonin subunit 8 (theta) /FL=/gb:NM_009840.1 /gb:BC009007.1
> 
> and I would like to create a file that has only the following:
> 
> AFFX-BioB-5_at  /GEN=bioB  /gb:J04423.1
> 
> 1415785_a_at	 /gb:NM_009840.1 /GEN=Cct8
> 
> Could anyone please tell me how can I do it?
> 
> Many thanks in advance
> Sofia
> 
What have you tried so far?

Hint: split each line on whitespace.  The first piece is the first item
you want; then iterate over the remaining pieces looking for the /GEN=
and /gb: pieces that you are interested in keeping.  I am assuming that
the /GEN= and /gb: values don't have any spaces in them.  If they do,
you will need to use regular expressions instead of split.
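
A minimal sketch of that hint, under a few assumptions that are not in
the original post: each entry sits on a single line in the real file
(the wrapping above is from the mail client), and only the first /GEN=
and the first /gb: piece on a line should be kept, which is what the
desired output suggests.  The function name extract_fields is made up
for illustration.

```python
def extract_fields(line, prefixes=("/GEN=", "/gb:")):
    """Return the ID plus the first piece found for each wanted prefix."""
    pieces = line.split()          # no argument: splits on tabs and spaces
    if not pieces:
        return ""                  # blank line between entries
    kept = [pieces[0]]             # the first piece is the probe ID
    seen = set()                   # prefixes already matched on this line
    for piece in pieces[1:]:
        for prefix in prefixes:
            # Assumption: later repeats of a prefix are not wanted.
            if piece.startswith(prefix) and prefix not in seen:
                kept.append(piece)
                seen.add(prefix)
                break
    return " ".join(kept)


if __name__ == "__main__":
    sample = ("1415785_a_at\t /gb:NM_009840.1 /DB_XREF=gi:6753327 "
              "/GEN=Cct8 /FEA=FLmRNA /FL=/gb:NM_009840.1 /gb:BC009007.1")
    print(extract_fields(sample))
    # prints: 1415785_a_at /gb:NM_009840.1 /GEN=Cct8
```

Pieces are kept in the order they appear on the line, so the first
sample entry would come out as /GEN= followed by /gb:.  A piece such as
/FL=/gb:NM_009840.1 is skipped because it does not start with /gb:.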

-Larry Bates