Using difflib to compare text ignoring whitespace differences

Wed Dec 20 23:52:41 EST 2006

On 19 Dec, 11:53, Neilen Marais <nmar... at sun.ac.za> wrote:
> Hi
>
> I'm trying to compare some text to find differences other than whitespace.
> I seem to be misunderstanding something, since I can't even get a basic
> example to work:
>
> In [104]: d =difflib.Differ(charjunk=difflib.IS_CHARACTER_JUNK)
>
> In [105]: list(d.compare(['  a'], ['a']))
> Out[105]: ['-   a', '+ a']
>
> Surely if whitespace characters are being ignored those two strings should
> be marked as identical? What am I doing wrong?

The docs for Differ are a bit terse and misleading.
compare() does a two-level matching: first, on a *line* level,
considering only the linejunk parameter. Then, for each pair of
similar lines found in the first stage, it does an intraline match
considering only the charjunk parameter. In your example the two lines
are too different to be paired as similar, so the charjunk stage is
never reached.
Also note that junk != ignored: the algorithm tries to "find the longest
contiguous matching subsequence that contains no ``junk'' elements".
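
To make the "similar lines" test concrete, here is a small sketch. It
assumes (from my reading of the CPython difflib source; this is not in
the docs) that Differ only pairs up lines whose SequenceMatcher ratio
reaches roughly 0.75. The helper name similarity is mine.

```python
import difflib

def similarity(a, b):
    # Build the same kind of matcher Differ uses for its intraline stage:
    # the charjunk function is passed as the junk predicate.
    matcher = difflib.SequenceMatcher(difflib.IS_CHARACTER_JUNK, a, b)
    return matcher.ratio()

print(similarity('  a', 'a'))  # 0.5
print(round(similarity('   a larger line', 'a longer line'), 2))  # 0.76
```

With a ratio of 0.5, '  a' and 'a' are reported as a plain delete plus
insert, while a longer pair of lines can clear the threshold.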

Using a slightly longer text gets closer to what you want, I think:

import difflib

d = difflib.Differ(charjunk=difflib.IS_CHARACTER_JUNK)
for delta in d.compare(['   a larger line'], ['a longer line']):
    print(delta)

-    a larger line
? ---   ^^

+ a longer line
?    ^^
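
If what you really want is for whitespace to be ignored completely, one
option (a sketch, certainly not the only way) is to normalize the lines
yourself before handing them to Differ. The helper names normalize and
diff_ignoring_whitespace are made up for this example.

```python
import difflib

def normalize(line):
    # Collapse every run of whitespace to one space and strip both ends.
    return ' '.join(line.split())

def diff_ignoring_whitespace(a_lines, b_lines):
    # Compare the normalized copies; the originals are left untouched.
    d = difflib.Differ()
    return list(d.compare([normalize(line) for line in a_lines],
                          [normalize(line) for line in b_lines]))

print(diff_ignoring_whitespace(['  a'], ['a']))  # ['  a']
```

Note that the deltas then show the normalized lines, not the original
ones, so this only fits if you don't need the exact original spacing in
the output.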

-- 
Gabriel Genellina