extracting substrings from a file

Mon Sep 11 10:12:51 EDT 2006

<sofiafig at gmail.com> wrote in message 
news:1157977756.841188.8550 at p79g2000cwp.googlegroups.com...
> Hi,
>
> I have a file with several entries in the form:
>
> AFFX-BioB-5_at E. coli  /GEN=bioB  /gb:J04423.1  NOTE=SIF
> corresponding to nucleotides 2032-2305 of /gb:J04423.1  DEF=E.coli
> 7,8-diamino-pelargonic acid (bioA), biotin synthetase (bioB),
> 7-keto-8-amino-pelargonic acid synthetase (bioF), bioC protein, and
> dethiobiotin synthetase (bioD), complete cds.
>
> 1415785_a_at /gb:NM_009840.1 /DB_XREF=gi:6753327 /GEN=Cct8 /FEA=FLmRNA
> /CNT=482 /TID=Mm.17989.1 /TIER=FL+Stack /STK=281 /UG=Mm.17989 /LL=12469
> /DEF=Mus musculus chaperonin subunit 8 (theta) (Cct8), mRNA.
> /PROD=chaperonin subunit 8 (theta) /FL=/gb:NM_009840.1 /gb:BC009007.1
>
> and I would like to create a file that has only the following:
>
> AFFX-BioB-5_at  /GEN=bioB  /gb:J04423.1
>
> 1415785_a_at /gb:NM_009840.1 /GEN=Cct8

Here's a pyparsing solution that addresses your immediate question and
also gives you some leeway for adding other "/" options to your search.
Pyparsing's home page is at pyparsing.wikispaces.com.

-- Paul

data = """
AFFX-BioB-5_at E. coli  /GEN=bioB  /gb:J04423.1  NOTE=SIF
corresponding to nucleotides 2032-2305 of /gb:J04423.1  DEF=E.coli
7,8-diamino-pelargonic acid (bioA), biotin synthetase (bioB),
7-keto-8-amino-pelargonic acid synthetase (bioF), bioC protein, and
dethiobiotin synthetase (bioD), complete cds.

1415785_a_at /gb:NM_009840.1 /DB_XREF=gi:6753327 /GEN=Cct8 /FEA=FLmRNA
/CNT=482 /TID=Mm.17989.1 /TIER=FL+Stack /STK=281 /UG=Mm.17989 /LL=12469
/DEF=Mus musculus chaperonin subunit 8 (theta) (Cct8), mRNA.
/PROD=chaperonin subunit 8 (theta) /FL=/gb:NM_009840.1 /gb:BC009007.1
"""

from pyparsing import *

# create expression we are looking for:
#   name [ junk word... ] /qualifier...
name = Word(alphanums,printables).setResultsName("name")
junkWord = ~(Literal("/")) + Word(printables)
qualifier = ("/" + Word(alphas+"_-.").setResultsName("key") +
            oneOf("= :") +
            Word(printables).setResultsName("value"))
expr = name + ZeroOrMore(junkWord) + \
            Dict(ZeroOrMore(qualifier)).setResultsName("quals")

# use parse action to repackage qualifier data to support "dict"-like
# access to qualifiers
qualifier.setParseAction( lambda t: (t.key,"".join(t)) )

# use this parse action instead if you just want whatever is
# after the '=' or ':' delimiter in the qualifier
# qualifier.setParseAction( lambda t: (t.key,t.value) )

# parse data strings, showing returned data structure
# (just to show what pyparsing results structure looks like)
for d in data.split("\n\n"):
    res = expr.parseString(d)
    print(res.dump())
    print()
print()

# now just do what the OP wanted in the first place
for d in data.split("\n\n"):
    res = expr.parseString(d)
    print(res.name, res.quals["gb"], res.quals["GEN"])

Gives these results:
['AFFX-BioB-5_at', 'E.', 'coli', [('GEN', '/GEN=bioB'), ('gb', 
'/gb:J04423.1')]]
- name: AFFX-BioB-5_at
- quals: [('GEN', '/GEN=bioB'), ('gb', '/gb:J04423.1')]
  - GEN: /GEN=bioB
  - gb: /gb:J04423.1

['1415785_a_at', [('gb', '/gb:NM_009840.1'), ('DB_XREF', 
'/DB_XREF=gi:6753327'), ('GEN', '/GEN=Cct8'), ('FEA', '/FEA=FLmRNA'), 
('CNT', '/CNT=482'), ('TID', '/TID=Mm.17989.1'), ('TIER', '/TIER=FL+Stack'), 
('STK', '/STK=281'), ('UG', '/UG=Mm.17989'), ('LL', '/LL=12469'), ('DEF', 
'/DEF=Mus')]]
- name: 1415785_a_at
- quals: [('gb', '/gb:NM_009840.1'), ('DB_XREF', '/DB_XREF=gi:6753327'), 
('GEN', '/GEN=Cct8'), ('FEA', '/FEA=FLmRNA'), ('CNT', '/CNT=482'), ('TID', 
'/TID=Mm.17989.1'), ('TIER', '/TIER=FL+Stack'), ('STK', '/STK=281'), ('UG', 
'/UG=Mm.17989'), ('LL', '/LL=12469'), ('DEF', '/DEF=Mus')]
  - CNT: /CNT=482
  - DB_XREF: /DB_XREF=gi:6753327
  - DEF: /DEF=Mus
  - FEA: /FEA=FLmRNA
  - GEN: /GEN=Cct8
  - LL: /LL=12469
  - STK: /STK=281
  - TID: /TID=Mm.17989.1
  - TIER: /TIER=FL+Stack
  - UG: /UG=Mm.17989
  - gb: /gb:NM_009840.1

AFFX-BioB-5_at /gb:J04423.1 /GEN=bioB
1415785_a_at /gb:NM_009840.1 /GEN=Cct8
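
For comparison, here is a rough standard-library sketch of the same
extraction using a regular expression instead of pyparsing. It is not part
of the solution above: the names QUALIFIER and extract are made up for
illustration, it is written for Python 3, and it assumes every qualifier is
a whitespace-delimited word of the form /key=value or /key:value. It keeps
the first occurrence of each key.

```python
import re

DATA = """
AFFX-BioB-5_at E. coli  /GEN=bioB  /gb:J04423.1  NOTE=SIF
corresponding to nucleotides 2032-2305 of /gb:J04423.1  DEF=E.coli
7,8-diamino-pelargonic acid (bioA), biotin synthetase (bioB),
7-keto-8-amino-pelargonic acid synthetase (bioF), bioC protein, and
dethiobiotin synthetase (bioD), complete cds.

1415785_a_at /gb:NM_009840.1 /DB_XREF=gi:6753327 /GEN=Cct8 /FEA=FLmRNA
/CNT=482 /TID=Mm.17989.1 /TIER=FL+Stack /STK=281 /UG=Mm.17989 /LL=12469
/DEF=Mus musculus chaperonin subunit 8 (theta) (Cct8), mRNA.
/PROD=chaperonin subunit 8 (theta) /FL=/gb:NM_009840.1 /gb:BC009007.1
"""

# Assumption: a qualifier starts a whitespace-delimited word with "/",
# followed by a key made of letters, "_", "-" or ".", then "=" or ":",
# then a run of non-whitespace characters as the value.
QUALIFIER = re.compile(r"(?<!\S)/([A-Za-z_.-]+)[=:](\S+)")

def extract(record, keys=("gb", "GEN")):
    """Return 'name /qualifier ...' for the requested keys, or None if empty."""
    words = record.split()
    if not words:
        return None
    name = words[0]  # the first word of a record is taken as its name
    found = {}
    for match in QUALIFIER.finditer(record):
        # setdefault keeps the first occurrence of each key
        found.setdefault(match.group(1), match.group(0))
    return " ".join([name] + [found[k] for k in keys if k in found])

# records are separated by blank lines, as in the sample data
for record in DATA.strip().split("\n\n"):
    print(extract(record))
```

On the two sample records this prints the same two lines as the last two
lines of the results above. One difference to keep in mind: the pyparsing
grammar stops collecting qualifiers at the first word that is not a
qualifier, while this sketch scans the whole record, so the two approaches
could disagree on entries where a key shows up again later with a different
value.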