Extracting lines from text files - script with a couple of 'side effects'
MRAB
python at mrabarnett.plus.com
Wed Sep 25 17:53:52 EDT 2013
On 25/09/2013 21:06, mstagliamonte wrote:
> Dear All,
>
> Here I am, with another newbie question. I am trying to extract some lines from a fasta (text) file which match the headers in another file. i.e:
> Fasta file:
>>header1|info1:info2_info3
> general text
>>header2|info1:info2_info3
> general text
>
> headers file:
> header1|info1:info2_info3
> header2|info1:info2_info3
>
> I want to create a third file, similar to the first one, but only containing headers and text of what is listed in the second file. Also, I want to print out how many headers were actually found from the second file to match the first.
>
> I have done a script which seems to work, but with a couple of 'side effects'
> Here is my script:
> -------------------------------------------------------------------
> import re
> class Extractor():
>
> def __init__(self,headers_file, fasta_file,output_file):
> with open(headers_file,'r') as inp0:
> counter0=0
> container=''
> inp0_bis=inp0.read().split('\n')
> for x in inp0_bis:
> container+=x.replace(':','_').replace('|','_')
> with open(fasta_file,'r') as inp1:
> inp1_bis=inp1.read().split('>')
> for i in inp1_bis:
> i_bis= i.split('\n')
> match = re.search(i_bis[0].replace(':','_').replace('|','_'),container)
> if match:
> counter0+=1
> with open(output_file,'at') as out0:
> out0.write('>'+i)
> print '{} sequences were found'.format(counter0)
>
> -------------------------------------------------------------------
> Side effects:
> 1) The very first header is written as >>header1 rather than >header1
> 2) the number of sequences found is 1 more than the ones actually found!
>
> Have you got any thoughts about causes/solutions?
>
> Thanks for your time!
>
Here's my version:
class Extractor():
def __init__(self, headers_file, fasta_file, output_file):
with open(headers_file) as inp:
headers = set('>' + line for line in inp)
counter = 0
accept = False
with open(fasta_file) as inp, open(output_file, 'w') as out:
for line in inp:
if line.startswith('>'):
accept = line in headers
if accept:
counter += 1
if accept:
out.write(line)
print '{} sequences were found'.format(counter)
More information about the Python-list
mailing list