[Tutor] removing line ends from Word text files
Lloyd Kvam
pythonTutor at venix.com
Sat Jul 17 21:25:08 CEST 2004
On Sat, 2004-07-17 at 12:54, David Rock wrote:
> * Michael Janssen <Janssen at rz.uni-frankfurt.de> [2004-07-17 15:55]:
> > On Fri, 9 Jul 2004, Christian Meesters wrote:
> >
> > > Right now I have the problem that I want to remove the MS Word line end
> > > token from text files: When saving a text file as 'text only' line ends
> > > are displayed as '^M' in a shell (SGI IRIX (tcsh) and Mac (tcsh or
> > > bash)). I want to get rid of these elements for further processing of
> > > the file and have no idea how to access them in a Python script. Any
> > > idea how to replace the '^M' against a simple '\n'? (I already tried
> > > '\r\n' and various other combinations of characters, but apparently all
> > > aren't '^M'.) '^M' is one character.
> >
> > You can always ask Python when you want to know how it will represent
> > this character: Read one line with "readline" and print its repr-string:
> >
> > fo = open("filename")
> > line = fo.readline()
> > print repr(line)
> >
> > repr gives you an alternative string representation of any object. For a
> > string, repr shows control characters as backslash escapes such as \n or
> > \r instead of emitting them raw. As you are on a Mac, I would guess your
> > newline character is a simple "\r".
> >
> > You can also ask Python for the character's ordinal:
> > print ord(line[-2]) # just in case one newline consists of two chars
> > print ord(line[-1])
> >
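If readline() hands back the whole file as one huge line, the file probably contains only '\r' line ends, and line[-1] will not be a line end at all. Here is a minimal sketch for checking which endings are present. It assumes the file is read in binary mode so that nothing gets translated, and a Python version that accepts b"..." literals. count_line_endings is only an illustrative name, not a library function.

```python
def count_line_endings(data):
    # data is a byte string, e.g. open(path, "rb").read()
    crlf = data.count(b"\r\n")
    return {
        "crlf": crlf,                      # Windows style '\r\n'
        "cr": data.count(b"\r") - crlf,    # lone '\r' (classic Mac, shown as ^M)
        "lf": data.count(b"\n") - crlf,    # lone '\n' (Unix)
    }
```

For a 'text only' file saved by Word on a Mac I would expect only the "cr" count to be non-zero, but that is a guess until you run it on the real file.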
> > It's probably best to do such investigations with an interactive Python
> > session. But now since I've realized that readline is Unix-only, I don't
> > think interactive mode is that much fun on MAC/Win: without readline you
> > can't repeat your commands (without having to type them again and again).
> > You can't use the cursor keys. Perhaps Idle offers elaborate line editing
> > even on those systems.
>
> OK, a couple things...
> readline is NOT a Unix-only thing. I just tried it on my XP box and it's
> fine. open is also an older way of doing things with opening files, as
> of 2.2, file is probably what you want.
I too was shifting from open(...) to file(...). However, Guido is
recommending a change to the documentation and continued use of open:
http://mail.python.org/pipermail/python-dev/2004-July/045931.html
>
> http://www.python.org/doc/current/lib/built-in-funcs.html#l2h-25
>
> and for the sake of completeness, here is the info about built-in file
> objects:
> http://www.python.org/doc/current/lib/bltin-file-objects.html
>
> So this:
> fo = open("filename")
> line = fo.readline()
> print repr(line)
>
> becomes this:
> fo = file("filename")
> line = fo.readline()
> print repr(line)
>
> as for interactive Python, I have recently been introduced to ipython
> and it's great. It has a LOT of features that aren't in the normal
> shell:
> http://ipython.scipy.org/
>
> And finally, ^M is decimal 13 (hex 0D), \n is 10, and \r is 13 ...
> hmm, I guess that means ^M == \r
>
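So replacing '\r' is what the original question needs. Here is a rough sketch, assuming the whole file fits in memory and a Python version that accepts b"..." literals. normalize_newlines and convert_file are names made up for illustration.

```python
def normalize_newlines(data):
    # Convert the two-character '\r\n' first; otherwise the second
    # replace would turn it into two newlines.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def convert_file(src_path, dst_path):
    # Binary mode keeps Python from translating line ends on its own.
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(normalize_newlines(data))
```

Usage would be something like convert_file("word_export.txt", "clean.txt"), where both file names are hypothetical. Writing to a second file rather than overwriting the original is deliberate, so a mistake does not destroy the input.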
> One thing that I have used over the years to strip newline chars off
> lines is this, it's not the prettiest, but you'll get the idea:
>
> if '\n' in line:
>     line = line[:-1]
> if '\r' in line:
>     line = line[:-1]
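I think
    for c in "\n\r":
        if line.endswith(c):
            line = line[:-1]
is safer: it tests only the end of the line, and checking '\n' before
'\r' lets a '\r\n' ending lose both characters.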
>
> basically, it's assuming (in the case of Windows) that the file ends
> with '\r\n', and strips them off one at a time.
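If the goal is just to drop whatever line end is there, the string method rstrip can do it in one call. This is a small sketch, and strip_line_end is only an illustrative name.

```python
def strip_line_end(line):
    # rstrip removes every trailing character that appears in the given
    # set, so '\r\n', '\n' and '\r' endings are all handled in one call.
    return line.rstrip("\r\n")
```

One difference from slicing off a single character: rstrip keeps going, so a line ending in "\n\n" loses both newlines. A '\r' in the middle of the line is left alone.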
--
Lloyd Kvam
Venix Corp.
1 Court Street, Suite 378
Lebanon, NH 03766-1358
voice: 603-653-8139
fax: 801-459-9582