extracting substrings from a file

John Machin sjmachin at lexicon.net
Mon Sep 11 09:57:10 EDT 2006


sofiafig at gmail.com wrote:
> Hi,
>
> I have a file with several entries in the form:
>
> AFFX-BioB-5_at	 E. coli  /GEN=bioB  /gb:J04423.1  NOTE=SIF
> corresponding to nucleotides 2032-2305 of /gb:J04423.1  DEF=E.coli
> 7,8-diamino-pelargonic acid (bioA), biotin synthetase (bioB),
> 7-keto-8-amino-pelargonic acid synthetase (bioF), bioC protein, and
> dethiobiotin synthetase (bioD), complete cds.
>
> 1415785_a_at	 /gb:NM_009840.1 /DB_XREF=gi:6753327 /GEN=Cct8 /FEA=FLmRNA
> /CNT=482 /TID=Mm.17989.1 /TIER=FL+Stack /STK=281 /UG=Mm.17989 /LL=12469
> /DEF=Mus musculus chaperonin subunit 8 (theta) (Cct8), mRNA.
> /PROD=chaperonin subunit 8 (theta) /FL=/gb:NM_009840.1 /gb:BC009007.1
>
> and I would like to create a file that has only the following:
>
> AFFX-BioB-5_at  /GEN=bioB  /gb:J04423.1
>
> 1415785_a_at	 /gb:NM_009840.1 /GEN=Cct8
>
> Could anyone please tell me how can I do it?
>
> Many thanks in advance
> Sofia

Here's my first iteration:
C:\junk>type sofia.py
prefixes = ['/GEN=', '/gb:']

def extract(fname):
    # Group whitespace-separated words into chunks; a blank line ends a chunk.
    f = open(fname, 'r')
    chunks = [[]]
    for line in f:
        words = line.split()
        if words:
            chunks[-1].extend(words)
        else:
            chunks.append([])
    f.close()
    for chunk in chunks:
        if not chunk:
            continue
        # The first word of a chunk is the entry ID; always keep it.
        output = [chunk[0]]
        for word in chunk:
            for prefix in prefixes:
                if word.startswith(prefix):
                    output.append(word)
                    break
        print(' '.join(output))

if __name__ == "__main__":
    import sys
    extract(sys.argv[1])

C:\junk>sofia.py sofia.txt
AFFX-BioB-5_at /GEN=bioB /gb:J04423.1 /gb:J04423.1
1415785_a_at /gb:NM_009840.1 /GEN=Cct8 /gb:BC009007.1

Before I fix the duplicate in the first line, you need to say whether you
really want the /gb:BC009007.1 in the second line thrown away -- in other
words, what's the rule? For each prefix, either (1) keep only the first
"word" that starts with that prefix, or (2) keep all unique such words.
You choose.
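
To make the two candidate rules concrete, here is a rough sketch of how each
might look. The function name select_words and the first_only flag are my own
placeholders, not part of the script above, and it is written with print()
called as a function so it runs on Python 3. It takes one chunk (the list of
words for a single entry) and skips the entry ID when matching prefixes.

```python
def select_words(chunk, prefixes, first_only=True):
    """Return the entry ID (chunk[0]) plus the words matching a prefix.

    first_only=True  -> rule (1): keep only the first word per prefix
    first_only=False -> rule (2): keep every distinct matching word
    """
    output = [chunk[0]]
    done = set()   # prefixes already satisfied (used by rule 1)
    seen = set()   # words already emitted (used by rule 2)
    for word in chunk[1:]:
        for prefix in prefixes:
            if word.startswith(prefix):
                if first_only:
                    if prefix not in done:
                        done.add(prefix)
                        output.append(word)
                elif word not in seen:
                    seen.add(word)
                    output.append(word)
                break
    return output


prefixes = ['/GEN=', '/gb:']
# Abbreviated version of the second sample entry from the question.
chunk = ('1415785_a_at /gb:NM_009840.1 /DB_XREF=gi:6753327 /GEN=Cct8 '
         '/FL=/gb:NM_009840.1 /gb:BC009007.1').split()

print(' '.join(select_words(chunk, prefixes, first_only=True)))
# 1415785_a_at /gb:NM_009840.1 /GEN=Cct8
print(' '.join(select_words(chunk, prefixes, first_only=False)))
# 1415785_a_at /gb:NM_009840.1 /GEN=Cct8 /gb:BC009007.1
```

Rule (1) reproduces the output you asked for on both sample entries; rule (2)
drops the repeated /gb:J04423.1 in the first entry but keeps /gb:BC009007.1
in the second. Note that /FL=/gb:NM_009840.1 matches neither prefix, because
startswith only looks at the beginning of the word.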

Cheers,
John



