[Tutor] R: Tutor Digest, Vol 125, Issue 49

Wed Jul 16 22:18:00 CEST 2014

On 16.07.2014 10:04, jarod_v6 at libero.it wrote:
> Hi there!!!
> I have a file  with this data
> ['uc002uvo.3 ', 'uc001mae.1']
> ['uc010dya.2 ', 'uc001kko.2']
> ['uc003ejx.2 ', 'uc010yfr.1']
> ['uc001bhk.2 ', 'uc003eib.2']
> ['uc001znc.2 ', 'uc001efn.2']
> ['uc002ycq.2 ', 'uc001vnh.2']
> ['uc001odf.1 ', 'uc002mwd.2']
> ['uc010jkn.1 ', 'uc010luk.1']
> ['uc003uhf.3 ', 'uc010tqd.1']
> ['uc002rue.3 ', 'uc001tex.2']
> ['uc011dtt.1 ', 'uc001lkv.1']
> ['uc003yyt.2 ', 'uc003mkl.2']
> ['uc003pkv.2 ', 'uc003ytw.2']
> ['uc010bhz.2 ', 'uc002kbt.1']
> ['uc001wnj.2 ', 'uc009wtj.1']
> ['uc011lyh.1 ', 'uc003jvb.2']
> ['uc002awj.1 ', 'uc009znm.1']
> ['uc010bft.2 ', 'uc002cxz.1']
> ['uc011mar.1 ', 'uc001lvb.1']
> ['uc001oxl.2 ', 'uc002lvx.1']
>
> I want to replace of the things after the dots, so I want to have  a file with
> this output:
>
> ['uc002uvo ', 'uc001mae']
> ['uc010dya ', 'uc001kko']
> ...
>
> I try to use regular expression but I have  a strange output
>
> with open("non_annotati.csv") as p:
>      for i in p:
>          lines= i.rstrip("\n").split("\t")

lines is not the best variable name here. Why not unpack directly:
            gene1, gene2 = i.rstrip("\n").split("\t")

>          mit = re.sub(r'(\.\d$)','',lines[0])
>          mit2 = re.sub(r'(\.\d$)','',lines[1])
>          print mit,mit2
>

While Danny has pointed out the actual reason why your code is not
working with this specific input data, it's generally not a good idea to
make overly specific assumptions about input formatting by specifying
'\n' and '\t' explicitly when all you want to do is eliminate whitespace:

 >>> help(s.split)
Help on built-in function split:

split(...) method of builtins.str instance
     S.split(sep=None, maxsplit=-1) -> list of strings

     Return a list of the words in S, using sep as the
     delimiter string.  If maxsplit is given, at most maxsplit
     splits are done. If sep is not specified or is None, any
     whitespace string is a separator and empty strings are
     removed from the result.

 >>> s='uc002uvo.3 \tuc001mae.1\r\n'  # Windows line breaks
 >>> s.split()
['uc002uvo.3', 'uc001mae.1']

I also agree with Joel that re is overkill here. In fact, your current
regexp will fail with two-digit numbers after the dot, because \d matches
exactly one digit (\d+ would match one or more), though I don't know
whether such names can occur in your data.
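
To make that concrete, here is a minimal sketch of how the suffix could
be dropped without re. It assumes every name carries at most one trailing
'.<digits>' version suffix; strip_version and strip_versions are names
made up for this illustration, not anything from your script:

```python
import re


def strip_version(name):
    """Drop a trailing '.<digits>' suffix: 'uc002uvo.3' -> 'uc002uvo'."""
    # rpartition splits at the LAST dot; if there is no dot, dot == ''.
    base, dot, version = name.rpartition(".")
    # Only strip when the part after the dot is purely digits.
    return base if dot and version.isdigit() else name


def strip_versions(line):
    """Split a line on any whitespace and strip each field's suffix."""
    return [strip_version(field) for field in line.split()]


# Trailing space, tab separator and Windows line break are all handled
# by the bare split().
print(strip_versions("uc002uvo.3 \tuc001mae.1\r\n"))

# The single-digit pattern leaves a two-digit version untouched,
# while \d+ and strip_version both remove it.
print(re.sub(r"\.\d$", "", "uc002uvo.12"))
print(re.sub(r"\.\d+$", "", "uc002uvo.12"))
print(strip_version("uc002uvo.12"))
```

Names without a dot, or with something other than digits after the last
dot, are returned unchanged, which may or may not be what you want for
your data.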

Best,
Wolfgang