extracting a substring

Tue Apr 18 20:46:08 EDT 2006

On Tue, 2006-04-18 at 17:25 -0700, b83503104 at yahoo.com wrote:
> Hi,
> I have a bunch of strings like
> a53bc_531.txt
> a53bc_2285.txt
> ...
> a53bc_359.txt
> 
> and I want to extract the numbers 531, 2285, ...,359.

Some ways:

1) Regular expressions, as you said:
>>> from re import compile
>>> find = compile("a53bc_([0-9]+)\\.txt").findall
>>> find('a53bc_531.txt\na53bc_2285.txt\na53bc_359.txt')
['531', '2285', '359']
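
If the names arrive as separate strings rather than one newline-joined string, the same idea can be applied per name. This is only a minimal sketch, assuming Python 3; the helper name `extract_number` is made up for illustration and returns None when a name does not match:

```python
import re

# Hypothetical helper, not part of the original post.
# Fixed prefix, one or more digits (captured), then a literal ".txt".
PATTERN = re.compile(r"a53bc_([0-9]+)\.txt")

def extract_number(name):
    """Return the digits between 'a53bc_' and '.txt', or None if no match."""
    match = PATTERN.fullmatch(name)
    return match.group(1) if match else None

names = ["a53bc_531.txt", "a53bc_2285.txt", "a53bc_359.txt"]
numbers = [extract_number(n) for n in names]
```

Using `[0-9]+` instead of `[1-9]*` means a name such as a53bc_105.txt is matched too, and an empty number is rejected.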

2) Using ''.split:
>>> [x.split('.')[0].split('_')[1] for x in 'a53bc_531.txt\na53bc_2285.txt\na53bc_359.txt'.splitlines()]
['531', '2285', '359']
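
The split above takes the first '.'-separated piece and then the second '_'-separated piece, so it depends on the names having exactly one underscore before the number. A possible variation, sketched here under the assumption that the number always follows the last underscore and precedes the last dot (`number_part` is a hypothetical name), splits from the right instead:

```python
def number_part(name):
    """Return the text between the last '_' and the last '.' using rsplit."""
    stem = name.rsplit('.', 1)[0]   # drop only the final extension
    return stem.rsplit('_', 1)[1]   # keep what follows the last underscore

plain = number_part("a53bc_531.txt")
tricky = number_part("my_data_v1.2_77.txt")
# The original expression picks the wrong piece for the tricky name.
original = "my_data_v1.2_77.txt".split('.')[0].split('_')[1]
```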

3) Using indexes (be careful: this assumes the prefix is always 6 characters and the suffix always 4!):
>>> [x[6:-4] for x in 'a53bc_531.txt\na53bc_2285.txt\na53bc_359.txt'.splitlines()]
['531', '2285', '359']
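
To illustrate the "be careful" part: the fixed slice silently returns the wrong text as soon as the prefix length changes. One way around it, shown as a rough sketch only (the helper `between` and the second file name are invented for the example), is to compute the indexes from the delimiters:

```python
def between(name, start="_", end="."):
    """Return the text between the last `start` and the last `end`.

    Raises ValueError (from str.rindex) if either delimiter is missing.
    """
    i = name.rindex(start) + len(start)
    j = name.rindex(end)
    return name[i:j]

fixed = "a53bc_531.txt"[6:-4]       # prefix is 6 characters, so this works
shifted = "a530bc_531.txt"[6:-4]    # prefix is 7 characters, slice starts too early
safe = between("a530bc_531.txt")    # indexes found from '_' and '.'
```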

Measuring speeds:

$ python2.4 -m timeit -s 'from re import compile; find = compile("a53bc_([1-9]*)\\.txt").findall; s = "a53bc_531.txt\na53bc_2285.txt\na53bc_359.txt"' 'find(s)'
100000 loops, best of 3: 3.03 usec per loop

$ python2.4 -m timeit -s 's = "a53bc_531.txt\na53bc_2285.txt\na53bc_359.txt\n"[:-1]' "[x.split('.')[0].split('_')[1] for x in s.splitlines()]"
100000 loops, best of 3: 7.64 usec per loop

$ python2.4 -m timeit -s 's = "a53bc_531.txt\na53bc_2285.txt\na53bc_359.txt\n"[:-1]' "[x[6:-4] for x in s.splitlines()]"
100000 loops, best of 3: 2.47 usec per loop

$ python2.4 -m timeit -s 'from re import compile; find = compile("a53bc_([1-9]*)\\.txt").findall; s = ("a53bc_531.txt\na53bc_2285.txt\na53bc_359.txt\n"*1000)[:-1]' 'find(s)'
1000 loops, best of 3: 1.95 msec per loop

$ python2.4 -m timeit -s 's = ("a53bc_531.txt\na53bc_2285.txt\na53bc_359.txt\n" * 1000)[:-1]' "[x.split('.')[0].split('_')[1] for x in s.splitlines()]"
100 loops, best of 3: 6.51 msec per loop

$ python2.4 -m timeit -s 's = ("a53bc_531.txt\na53bc_2285.txt\na53bc_359.txt\n" * 1000)[:-1]' "[x[6:-4] for x in s.splitlines()]"
1000 loops, best of 3: 1.53 msec per loop
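
The same comparison can also be run from a script with the timeit module. This is a sketch assuming Python 3, not the commands measured above; the function names are made up, the regexp uses `[0-9]+`, and the numbers it prints will differ by machine and Python version:

```python
import re
import timeit

find = re.compile(r"a53bc_([0-9]+)\.txt").findall
# Same 3000-line input as the larger runs: repeat, then drop the last newline.
s = ("a53bc_531.txt\na53bc_2285.txt\na53bc_359.txt\n" * 1000)[:-1]

def by_regex():
    return find(s)

def by_split():
    return [x.split('.')[0].split('_')[1] for x in s.splitlines()]

def by_slice():
    return [x[6:-4] for x in s.splitlines()]

# The three approaches must agree before their speeds are worth comparing.
assert by_regex() == by_split() == by_slice()

LOOPS = 100  # small loop count to keep the run short
for func in (by_regex, by_split, by_slice):
    seconds = timeit.timeit(func, number=LOOPS)
    print(f"{func.__name__}: {seconds / LOOPS * 1e3:.3f} msec per loop")
```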

Summary: using indexes is less powerful than regexps, but faster (about 20% in the runs above). The split version is the slowest of the three.

HTH,

-- 
Felipe.