[issue31561] difflib pathological behavior with mixed line endings

Mahmoud Al-Qudsi report at bugs.python.org
Sun Sep 24 15:27:19 EDT 2017


Mahmoud Al-Qudsi added the comment:

@tim.peters

No, `icdiff` is not part of core and probably should be omitted from the remainder of this discussion.

I just checked, and it's actually not a mix of line endings within each file; one file uses \n and the other uses \r\n.

You can reproduce this bug by taking _any_ file and copying it, then executing `unix2dos file1; dos2unix file2`. You'll have two perfectly "correct" files that difflib will struggle to handle.
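For anyone without `unix2dos`/`dos2unix` at hand, here is a rough in-memory sketch of the same setup. The sample text and variable names are made up for illustration, and it only shows that the two inputs share no equal lines once line endings are kept; it does not measure the slowdown itself.

```python
import difflib

# Hypothetical sample content; any text file would do.
text = "".join(f"line {i}\n" for i in range(20))

# Same content, two line-ending conventions (roughly dos2unix vs. unix2dos).
lf_lines = text.splitlines(keepends=True)
crlf_lines = text.replace("\n", "\r\n").splitlines(keepends=True)

# With line endings kept, no line in one list equals any line in the other.
matcher = difflib.SequenceMatcher(None, lf_lines, crlf_lines, autojunk=False)
equal_lines = sum(size for _, _, size in matcher.get_matching_blocks())

# A unified diff therefore reports every line as removed and re-added.
diff = list(difflib.unified_diff(lf_lines, crlf_lines, "file2", "file1"))
removed = [d for d in diff if d.startswith("-") and not d.startswith("---")]
added = [d for d in diff if d.startswith("+") and not d.startswith("+++")]

print(equal_lines, len(removed), len(added))
```

Every line ends up in one big "replace" block, which is the situation described above.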

(as a preface to what follows, I've written a binary diff and incremental backup utility, so I'm familiar with the intricacies and pitfalls when it comes to diffing. I have not looked at difflib's source code, however. Looking at the documentation for difflib, it's not clear whether or not it should be considered a naive binary diffing utility, since it does seem to have the concept of "lines".)

Given that _both_ input files are "correct", with no line ending errors, I think the right optimization here would be for difflib to "realize" that two chunks are "identical" apart from their line endings (that is, just plain different; I'm not asking for this to be treated as a special case). Instead of going on to search for a better match for either buffer, it should assume that no better match will be found later on and simply move on to the next block/chunk.
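To make the idea a bit more concrete, here is a rough sketch written on top of difflib's public API rather than inside it. The helper names `strip_eol` and `classify_replace_blocks` are hypothetical, and I'm assuming the simple case where a "replace" block has the same number of lines on both sides. It is only an illustration of "accept the obvious pairing and move on", not how difflib works today.

```python
import difflib


def strip_eol(line):
    # Drop trailing CR/LF characters so "x\n" and "x\r\n" compare equal.
    return line.rstrip("\r\n")


def classify_replace_blocks(a_lines, b_lines):
    """Hypothetical helper: pair up lines in equal-length 'replace' blocks
    in order, without searching for a better match elsewhere."""
    matcher = difflib.SequenceMatcher(None, a_lines, b_lines, autojunk=False)
    result = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            # Walk both sides in lockstep and accept each pairing as-is.
            for a, b in zip(a_lines[i1:i2], b_lines[j1:j2]):
                kind = "eol-only" if strip_eol(a) == strip_eol(b) else "changed"
                result.append((kind, a, b))
        else:
            # Leave every other opcode untouched.
            result.append((tag, a_lines[i1:i2], b_lines[j1:j2]))
    return result


blocks = classify_replace_blocks(["x\n", "y\n", "z\n"], ["x\r\n", "y\r\n", "z\n"])
for block in blocks:
    print(block)
```

The lockstep pairing is deliberately greedy: it never looks ahead for a "better" partner.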

Of course, if file2 contains a line from file1 first with a different line ending and then again with the same line ending, difflib will not choose the correct line, but that's probably not something worth fretting over (like you said, mixed line endings == recipe for disaster).

Of course I can understand if all this is out of the scope of difflib and not an endeavor worth taking up.

----------

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue31561>
_______________________________________

