Detecting line endings
Fuzzyman
fuzzyman at gmail.com
Mon Feb 6 16:56:08 EST 2006
Sybren Stuvel wrote:
> Fuzzyman enlightened us with:
> > My worry is that if '\n' *doesn't* signify a line break on the Mac,
> > then it may exist in the body of the text - and trigger ``ending =
> > '\n'`` prematurely ?
>
> I'd count the number of occurrences of '\r\n', '\n' without a preceding
> '\r' and '\r' without following '\n', and let the majority decide.
>
This is what I came up with. As you can see from the docstring, it
attempts to do sensible(-ish) things in the event of a tie, or no line
endings at all.

Comments/corrections welcomed. I know the tests aren't very useful
(because they make no *assertions*, they won't tell you if it breaks),
but you can see what's going on:

import re
import os

rn = re.compile('\r\n')        # Windows-style ending
r = re.compile('\r(?!\n)')     # bare '\r' (old Mac), not followed by '\n'
n = re.compile('(?<!\r)\n')    # bare '\n' (Unix), not preceded by '\r'

# Sequence of (regex, literal, priority) for each line ending
line_ending = [(n, '\n', 3), (rn, '\r\n', 2), (r, '\r', 1)]

def find_ending(text, default=os.linesep):
    r"""
    Given a piece of text, use a simple heuristic to determine the line
    ending in use.

    Returns the value assigned to default if no line endings are found.
    This defaults to ``os.linesep``, the native line ending for the
    machine.

    If there is a tie between two endings, the priority chain is
    ``'\n', '\r\n', '\r'``.
    """
    # Tuples sort by count first, then priority, so the winner ends up last.
    results = [(len(exp.findall(text)), priority, literal) for
               exp, literal, priority in line_ending]
    results.sort()
    print(results)
    if not sum([m[0] for m in results]):
        return default
    else:
        return results[-1][-1]

if __name__ == '__main__':
    tests = [
        'hello\ngoodbye\nmy fish\n',
        'hello\r\ngoodbye\r\nmy fish\r\n',
        'hello\rgoodbye\rmy fish\r',
        'hello\rgoodbye\n',
        '',
        '\r\r\r \n\n',
        '\n\n \r\n\r\n',
        '\n\n\r \r\r\n',
        '\n\r \n\r \n\r',
    ]
    for entry in tests:
        print(repr(entry))
        print(repr(find_ending(entry)))
        print('')

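
Since the tests above make no assertions, here is a minimal sketch of how
the same inputs could be turned into checks that fail loudly. It restates
the heuristic in a self-contained form without the debugging print; the
names LINE_ENDINGS, CASES and run_checks are made up for this
illustration, and the expected values were worked out by hand from the
counting rule and the priority chain in the docstring, so treat them as
assumptions to verify rather than as established results.

```python
import os
import re

# (regex, literal, priority); a higher priority wins a tie.
LINE_ENDINGS = [
    (re.compile(r'(?<!\r)\n'), '\n', 3),   # bare '\n'
    (re.compile(r'\r\n'), '\r\n', 2),      # '\r\n' pair
    (re.compile(r'\r(?!\n)'), '\r', 1),    # bare '\r'
]

def find_ending(text, default=os.linesep):
    # Same majority-vote heuristic as the posted code.
    results = [(len(exp.findall(text)), priority, literal)
               for exp, literal, priority in LINE_ENDINGS]
    # max() compares count first, then priority, like sort()[-1].
    count, _, literal = max(results)
    if count == 0:
        return default
    return literal

# Hypothetical sentinel so the "no endings" case does not depend
# on the platform's os.linesep.
NO_ENDING = '<none>'

# (input text, expected result), derived by hand from the heuristic.
CASES = [
    ('hello\ngoodbye\nmy fish\n', '\n'),
    ('hello\r\ngoodbye\r\nmy fish\r\n', '\r\n'),
    ('hello\rgoodbye\rmy fish\r', '\r'),
    ('hello\rgoodbye\n', '\n'),        # 1-1 tie, '\n' has priority
    ('', NO_ENDING),                   # nothing found, default returned
    ('\r\r\r \n\n', '\r'),             # three '\r' beat two '\n'
    ('\n\n \r\n\r\n', '\n'),           # 2-2 tie between '\n' and '\r\n'
    ('\n\n\r \r\r\n', '\n'),           # 2-2 tie between '\n' and '\r'
    ('\n\r \n\r \n\r', '\n'),          # 3-3 tie between '\n' and '\r'
]

def run_checks():
    for text, expected in CASES:
        found = find_ending(text, default=NO_ENDING)
        assert found == expected, (text, found, expected)

if __name__ == '__main__':
    run_checks()
```

Passing an explicit default in the checks keeps the empty-string case
independent of the machine the tests run on.
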
All the best,
Fuzzyman
http://www.voidspace.org.uk/python/index.shtml
> Sybren
> --
> The problem with the world is stupidity. Not saying there should be a
> capital punishment for stupidity, but why don't we just take the
> safety labels off of everything and let the problem solve itself?
> Frank Zappa