newbe question about removing items from one file to another file

Mon Aug 28 02:03:29 EDT 2006

Eric_Dexter at msn.com wrote:
> def simplecsdtoorc(filename):
>     file = open(filename,"r")
>     alllines = file.read_until("</CsInstruments>")
>     pattern1 = re.compile("</")
>     orcfilename = filename[-3:] + "orc"
>     for line in alllines:
>         if not pattern1
>              print >>orcfilename, line
>
> I am pretty sure my code isn't close to what I want.  I need to be able
> to skip html like commands from <defined> to <undefined> and to key on
> another word in adition to </CsInstruments> to end the routine
>
> I was also looking at se 2.2 beta but didn't see any easy way to use it
> for this or for that matter search and replace where I could just add
> it as a menu item and not worry about it.
>
> thanks for any help in advance

If you're dealing with HTML or HTML-like files, do check out
BeautifulSoup.  I had reason to use it the other day and man is it ever
useful!

Meantime, there are a few minor points about the code you posted:

1) open() defaults to mode 'r', so you can leave the mode out when you
open a file for reading.

2) 'file' is a builtin type (it's the type of file objects returned by
open()) so you shouldn't use it as a variable name.

3) file objects don't have a read_until() method.  You could say
something like:

f = open(filename)
lines = []
for line in f:
    lines.append(line)
    # stop once the closing tag has been read (that line is kept)
    if '</CsInstruments>' in line:
        break
f.close()
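
Here is the same loop wrapped up so you can try it without a real
file.  This is only a rough sketch: read_until() is a helper name I
made up (it is not a file method), and it accepts any iterable of
lines, which includes an open file.

```python
def read_until(lines, marker):
    """Collect lines up to and including the first one containing marker."""
    collected = []
    for line in lines:
        collected.append(line)
        if marker in line:
            break
    return collected

# made-up sample data standing in for the contents of a file
sample = ["<CsInstruments>\n", "instr 1\n", "</CsInstruments>\n", "<CsScore>\n"]
print(read_until(sample, "</CsInstruments>"))
```

If the marker never shows up, the helper simply returns every line.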

4) filename[-3:] will give you the last 3 chars in filename.  I'm
guessing that you want all but the last 3 chars, which is
filename[:-3].  Better still, see the os.path.splitext() function, and
indeed the other functions in os.path too:
http://docs.python.org/lib/module-os.path.html
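
A quick illustration of the difference, as a sketch only; the name
"example.csd" is made up for the demonstration:

```python
import os.path

filename = "example.csd"   # hypothetical input name
print(filename[-3:])       # last three characters: csd
print(filename[:-3])       # everything but the last three: example.

# splitext() splits off the extension, dot included
base, ext = os.path.splitext(filename)
orcfilename = base + ".orc"
print(orcfilename)         # example.orc
```

splitext() also keeps working when the extension isn't exactly three
characters long, which slicing does not.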

5) the regular expression objects returned by re.compile() always
evaluate as true, so "if not pattern1" can never succeed.  You want to
call their search() method on the data to search:

if not pattern1.search(line):
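
To see the difference for yourself, here's a small sketch (the two
sample strings are made up):

```python
import re

pattern1 = re.compile("</")

# the compiled pattern object itself is always true
print(bool(pattern1))                               # True

# search() returns None (false) when nothing matches,
# and a match object (true) when something does
print(bool(pattern1.search("instr 1")))             # False
print(bool(pattern1.search("</CsInstruments>")))    # True
```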

But, 6) using re for a pattern as simple as "</" is way overkill.  Just
use 'in' or the find() method of strings:

if "</" not in line:

or:

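# orcfile must be an open file object, not a filename (see point 7)
pos = line.find("</")
if pos == -1:
    # no closing tag; the trailing comma stops print adding a second
    # newline, since line already ends with one
    print >>orcfile, line,
else:
    # keep only the text before the tag
    print >>orcfile, line[:pos]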
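
The find() approach is easy to test if you pull it out into a small
function.  A minimal sketch; cut_at_closing_tag() is just a name I
picked for illustration:

```python
def cut_at_closing_tag(line):
    """Return line with everything from the first "</" onward removed."""
    pos = line.find("</")
    if pos == -1:
        return line        # no closing tag, keep the line as it is
    return line[:pos]      # keep only the text before the tag

print(repr(cut_at_closing_tag("instr 1\n")))
print(repr(cut_at_closing_tag("endin </CsInstruments>\n")))
```

Note that when a tag is cut off, the line's trailing newline goes with
it, so you may need to add one back before writing the result out.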

7) the "print >> file" usage requires a file or file-like object
(anything with a write() method), not a string.  You need to use it
like this:

orcfile = open(orcfilename, 'w')
#...
print >> orcfile, line
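
To show that any object with a write() method will do, here's a sketch
that prints into an in-memory buffer instead of a real file.  Note that
it uses the Python 3 spelling, print(..., file=...), rather than the
"print >>" statement above; io.StringIO is only a stand-in for the
output file.

```python
import io

buf = io.StringIO()            # file-like: it has a write() method
print("instr 1", file=buf)     # same idea as: print >> buf, "instr 1"
print("endin", file=buf)

print(repr(buf.getvalue()))    # 'instr 1\nendin\n'
```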

8) If you have a list of lines anyway, you can use the writelines()
method of files to write them in one go:

open(orcfilename, 'w').writelines(lines)

after first stripping the unwanted data from the last line in the list,
using find() as shown above.
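
Putting the points above together, a function along these lines might
be a starting point.  It's only a sketch of what I'm guessing you want:
the name csd_to_orc() and the end_marker parameter are my own
inventions, and it does nothing about skipping opening tags or whole
tagged sections.

```python
import os.path

def csd_to_orc(filename, end_marker="</CsInstruments>"):
    """Copy lines from filename into a sibling .orc file, up to end_marker.

    Text from "</" onward is dropped from each line, and lines left
    empty by that are skipped.  Returns the name of the file written.
    """
    orcfilename = os.path.splitext(filename)[0] + ".orc"
    lines = []
    with open(filename) as f:
        for line in f:
            done = end_marker in line       # check before cutting the line
            pos = line.find("</")
            if pos != -1:
                line = line[:pos]
            if line.strip():
                # make sure every kept line ends with exactly one newline
                lines.append(line.rstrip("\n") + "\n")
            if done:
                break
    with open(orcfilename, "w") as orcfile:
        orcfile.writelines(lines)
    return orcfilename
```

Calling csd_to_orc("something.csd") would write something.orc next to
the input file.  To stop on another word as well, you could extend the
"done" test to check for a second marker.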

I hope this helps.

Check out the docs on file objects:
http://docs.python.org/lib/bltin-file-objects.html
But like I said, if you're dealing with HTML or HTML-like files, be
sure to check out BeautifulSoup.  Also, there's the ElementTree package
for parsing XML that could help here too.
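
For what it's worth, here's roughly what the ElementTree route could
look like.  Treat it as a sketch under assumptions: the sample text is
made up, the import path shown is the one used since ElementTree joined
the standard library in Python 2.5, and it only works if the file is
well-formed XML.  If it isn't, fromstring() raises an error.

```python
import xml.etree.ElementTree as ET

# made-up sample; a real file would be read from disk instead
doc = (
    "<CsoundSynthesizer>"
    "<CsInstruments>\ninstr 1\nendin\n</CsInstruments>"
    "<CsScore>e</CsScore>"
    "</CsoundSynthesizer>"
)

root = ET.fromstring(doc)
instruments = root.find("CsInstruments")   # first child element with that tag
print(instruments.text)                    # the text between the tags
```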

~Simon