How to ignore white space changes using difflib?

Wed Apr 8 13:16:13 EDT 2009

Grant Edwards <invalid at invalid> wrote:

> Apparently that "filtering out" characters doesn't mean that
> they're ignored when doing the comparison.  (A bit of a "WTF?"
> if you ask me).  After some more googling, it appears that I'm
> far from the first person who interpreted "filtered out" as
> "ignored when comparing lines". I'd submit a fix for the doc
> page, but you apparently have to be a lot smarter than me to
> figure out what "filters out" means in this context.

So far as I can see from looking at the code:

Once the matcher has identified one block of lines as having been replaced 
by another, it can give you additional information by marking up the 
changes within each line. However, it only makes sense to do that if the 
lines are still somewhat similar.

'charjunk' is used to disregard junk characters when scanning the lines 
within a replacement block. The most similar pair of lines (if they are 
sufficiently similar) is then chosen for this extra step of comparing the 
character changes within the line.
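
A rough way to look at that scoring step is to call difflib.SequenceMatcher 
directly and pass the junk predicate as its first argument. This is only a 
sketch of the idea, not a copy of what Differ does internally; the helper 
name score and the sample lines are my own choices for illustration.

```python
from difflib import SequenceMatcher

def score(original, candidate, isjunk=None):
    # Similarity between 0 and 1. SequenceMatcher applies isjunk to the
    # characters of its second sequence, here the candidate line.
    return SequenceMatcher(isjunk, original, candidate).ratio()

original = 'two\n'
candidates = ['wot\n', 'too\n']
is_w = lambda c: c == 'w'   # treat the letter w as junk

plain = [score(original, c) for c in candidates]
junked = [score(original, c, is_w) for c in candidates]

for c, p, j in zip(candidates, plain, junked):
    print(repr(c), 'without junk:', p, 'with w as junk:', j)
```

Marking 'w' as junk should lower the score of the candidate that contains 
it, which is enough to change which line gets picked as the closest match.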

Here's an example. If I do this:

>>> from difflib import Differ
>>> print(''.join(Differ().compare('one\ntwo\nthree\n'.splitlines(True),
                                   'one\nwot\ntoo\nthree\n'.splitlines(True))))
  one
- two
? -
+ wot
?   +
+ too
  three

The comparison detected that "two" was replaced by two lines, "wot" and 
"too". It decided the first of these was the best match for the original 
line, so it shows the character-level differences between the original and 
the first replacement line.

>>> print(''.join(Differ(charjunk=lambda c: c == 'w')
                  .compare('one\ntwo\nthree\n'.splitlines(True),
                           'one\nwot\ntoo\nthree\n'.splitlines(True))))
  one
+ wot
- two
?  ^
+ too
?  ^
  three

This time we told the system that we don't care about 'w' in either the 
original or the replacement text. That means that instead of seeing which 
of "wot" and "too" is closest to "two", it looks to see which of "ot" and 
"too" is closest to "to". "ot" has two changes but "too" has only one, so 
this time it does the detailed comparison between the original line and the 
second replacement line. N.B. The junk function is only used to decide 
which lines to use for the detailed comparison: the original lines are 
still used for the comparison itself.
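
Since charjunk does not make the comparison ignore characters, one possible 
way to get a diff that really ignores whitespace changes is to normalize 
the lines yourself before handing them to Differ. The sketch below assumes 
that "ignore whitespace" means collapsing runs of whitespace and trimming 
the ends; the names normalize and compare_ignoring_whitespace are 
hypothetical helpers, not part of difflib.

```python
from difflib import Differ

def normalize(line):
    # Collapse each run of whitespace to a single space and trim both ends.
    return ' '.join(line.split())

def compare_ignoring_whitespace(a_text, b_text):
    # Compare the normalized lines, so whitespace-only edits never show up.
    a = [normalize(line) for line in a_text.splitlines()]
    b = [normalize(line) for line in b_text.splitlines()]
    return list(Differ().compare(a, b))

result = compare_ignoring_whitespace('one\ntwo  words\n',
                                     'one\n  two words\n')
print('\n'.join(result))
```

Note that the output shows the normalized lines rather than the originals, 
so this suits a "did anything other than whitespace change?" check better 
than producing a patch.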