[Python-bugs-list] [Bug #124051] ndiff bug: "?" lines are out-of-sync

Thu, 7 Dec 2000 02:38:14 -0800

Bug #124051, was updated on 2000-Dec-01 07:17
Here is a current snapshot of the bug.

Project: Python
Category: demos and tools
Status: Closed
Resolution: Invalid
Bug Group: Not a Bug
Priority: 5
Submitted by: flight
Assigned to : tim_one
Summary: ndiff bug: "?" lines are out-of-sync

Details: I wonder if this result (the "?" line) of ndiff is intentional:

clapton:1> cat a
Millionen für so 'n Kamelrennen sind
clapton:2> cat b
Millionen für so "n Kamelrennen sind
clapton:3> /tmp/ndiff.py -q a b
- Millionen für so 'n Kamelrennen sind
+ Millionen für so "n Kamelrennen sind
?                  ^

clapton:4> cat c
Millionen deren für so "n Kamelrennen sind
clapton:5> /tmp/ndiff.py -q a c
- Millionen für so 'n Kamelrennen sind
+ Millionen deren für so "n Kamelrennen sind
?           ++++++ -     +

Instead of a - and a subsequent +, I would expect to find a ^ here, too.

Follow-Ups:

Date: 2000-Dec-03 19:11
By: tim_one

Comment:
A caret means that the character in the line two above and in the same column was replaced by the character in the line one above and in the same column.  That's why you get a caret in the first example but not the second:  the replacement involves two distinct columns.

If you did get a caret in the second example, where would it go?  If under the single quote from the line two above, it would look like the single quote got replaced by the ü in für; if under the double quote from the line one above, like the first e in Kamelrennen got replaced by a double quote.  Both readings would be wrong.

Edit sequences aren't unique, and in the absence of an obvious and non-ambiguous way to show replacements across columns, ndiff settles for a *correct* sequence ("deren " was inserted, "'" was deleted, '"' was inserted).  In this respect ndiff is functioning as designed, so it's not a bug.
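
A minimal sketch of the edit sequences discussed above, assuming the
standard-library difflib module (the descendant of ndiff.py; the 2000-era
script may differ in detail). The helper name edit_ops is hypothetical, and
it calls SequenceMatcher without the character-junk filter that ndiff
applies, so it only approximates what ndiff computes internally:

```python
from difflib import SequenceMatcher

old = "Millionen für so 'n Kamelrennen sind"
new_same_column = 'Millionen für so "n Kamelrennen sind'
new_shifted = 'Millionen deren für so "n Kamelrennen sind'

def edit_ops(a, b):
    """Return the non-equal opcodes as (tag, removed_text, added_text)."""
    matcher = SequenceMatcher(None, a, b)  # None: no junk filter (assumption)
    return [(tag, a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]

same_column_ops = edit_ops(old, new_same_column)
shifted_ops = edit_ops(old, new_shifted)

print(same_column_ops)  # [('replace', "'", '"')]
print(shifted_ops)      # [('insert', '', 'deren '), ('replace', "'", '"')]
```

In the second case the matcher still pairs the two quote characters as a
replace, but they sit in different columns (17 in the old line, 23 in the
new one), which is the situation a single caret cannot express.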

-------------------------------------------------------

Date: 2000-Dec-07 02:38
By: flight

Comment:
[Is such a long comment still appropriate for the SF BTS?]

Tim, could you please explain the meaning of the remaining symbols (plus,
minus) as well? I think their meaning is far from intuitive.

> A caret means that the character in the line two above and in the same
> column was replaced by the character in the line one above and in the same
> column.

How about this example, then? Why is there a caret?

freefly;44> cat a
1 2 3 5
freefly;45> cat b
1 3 4 5
freefly;46> ./ndiff.py -q a b
- 1 2 3 5
+ 1 3 4 5
?   -^+

Sorry, but I have the impression that the format used in the edit lines is
indeed ambiguous by definition.

> That's why you get a caret in the first example but not the
> second: the replacement involves two distinct columns.

> Edit sequences aren't unique, and in the absence of an obvious and
> non-ambiguous way to show replacements across columns, ndiff settles for a
> *correct* sequence ("deren " was inserted, "'" was deleted, '"' was
> inserted).  In this respect ndiff is functioning as designed, so it's not a
> bug.

Please describe the intended meaning of '+' and '-', and I will give you a
counter-example that ndiff.py doesn't output a correct sequence for.

I think it's especially annoying that the edit line doesn't reflect the
information that the algorithm used in fancy_replace generates (if you run
my first example, the algorithm will in fact record a 'replace' event, but
the output routine will degrade this into an 'insert' and a 'delete'
event).

Regarding uniqueness and ambiguity: It depends on the definition of an edit
line. You won't find a definition that keeps the edit line in sync
(column-wise) with both the pre and the post lines.

If you try to keep the edit line in sync (column-wise) with the pre line,
that's fine for '^' (meaning: character in this column has been changed) and
'-' (meaning: character in this column has been removed), but you won't be
able to record '+' events, since there's no column in the pre line where a
'+' event might be recorded.

(Similarly, if you tried to keep the edit line in sync with the post line.)

- one two three four five six seven
+ one three fxur 123456 five 987 six seven
?    ----    +  +^+++++      ++++

One way to work around this would be to output two edit lines: A pre-edit
line would be synced (column-wise) with the pre line, and it would record
all '-' and '^' events. A post-edit line would record all '+' and '^'
events, and would be in sync with the post line. Unambiguous and quite
intuitive:

  - one two three four five six seven
  ?    ----        ^                 
  + one three fxur 123456 five 987 six seven
  ?            ^  +++++++     ++++
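
The two-edit-line idea can be sketched as follows. This is only an
illustration under stated assumptions: it uses difflib.SequenceMatcher
without a junk filter, the function name two_edit_lines is hypothetical,
and a replace of unequal length is shown as carets for the shared width
followed by '-' or '+' for the surplus (a choice made here, not something
the proposal specifies).

```python
from difflib import SequenceMatcher

def two_edit_lines(pre, post):
    """Return (pre_edit, post_edit): '-'/'^' under pre, '+'/'^' under post."""
    pre_marks = [" "] * len(pre)
    post_marks = [" "] * len(post)
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, pre, post).get_opcodes():
        if tag == "delete":
            pre_marks[i1:i2] = "-" * (i2 - i1)
        elif tag == "insert":
            post_marks[j1:j2] = "+" * (j2 - j1)
        elif tag == "replace":
            shared = min(i2 - i1, j2 - j1)
            # Carets for the columns that pair up, -/+ for whatever is left.
            pre_marks[i1:i2] = "^" * shared + "-" * (i2 - i1 - shared)
            post_marks[j1:j2] = "^" * shared + "+" * (j2 - j1 - shared)
    return "".join(pre_marks).rstrip(), "".join(post_marks).rstrip()

pre_edit, post_edit = two_edit_lines("1 2 3 5", "1 3 4 5")
print("- 1 2 3 5")
print("? " + pre_edit)   # "?  --"
print("+ 1 3 4 5")
print("? " + post_edit)  # "?     ++"
```

Each edit line is indexed by exactly one of the two text lines, so every
mark sits under the character it describes.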

A second way to define an unambiguous edit line format (but not really
friendly to eyeball inspection) would be to use the pre-edit line described
above, and, in a second step to merge the '+' sequences at the respective
places. This format would allow for easy automatic extraction of all the
information generated by fancy_replace. In fact this is what I expected to
see.

- one two three four five six seven
+ one three fxur 123456 five 987 six seven
?    ----        ^  +++++++     ++++          

A third way would be to insert spaces or some other placeholder in the pre
line in the columns with 'insert' events and in the post line in the columns
with 'delete' events. Easy for eyeball inspection, but it doesn't output the
original lines.

- one two three four_______ five____ six seven
+ one     three fxur 123456 five 987 six seven
?    ----        ^  +++++++     ++++
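
A rough sketch of this padding approach, again assuming
difflib.SequenceMatcher without a junk filter. The name padded_alignment
and the single pad character used for both lines are assumptions made for
illustration (the example above pads the post line with spaces instead).

```python
from difflib import SequenceMatcher

def padded_alignment(pre, post, pad="_"):
    """Pad both lines so that every edit occupies the same columns in each."""
    pre_out, post_out, marks = [], [], []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, pre, post).get_opcodes():
        old, new = pre[i1:i2], post[j1:j2]
        width = max(len(old), len(new))
        shared = min(len(old), len(new))
        pre_out.append(old.ljust(width, pad))
        post_out.append(new.ljust(width, pad))
        if tag == "equal":
            marks.append(" " * width)
        else:
            # '^' where both lines have a character, -/+ where one is padded.
            surplus = "-" if len(old) > len(new) else "+"
            marks.append("^" * shared + surplus * (width - shared))
    return "".join(pre_out), "".join(post_out), "".join(marks).rstrip()

pre_line, post_line, edit_line = padded_alignment("1 2 3 5", "1 3 4 5")
print("- " + pre_line)   # "- 1 2 3 __5"
print("+ " + post_line)  # "+ 1__ 3 4 5"
print("? " + edit_line)  # "?  --   ++"
```

The padded lines always have equal length, but the originals can only be
recovered by stripping the pad character, which fails if that character
occurs in the text itself.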

A final way would be to use a format like wdiff, where the insert and
replace tags are placed in the line:

one[- two-] three four{+ 123456+} five{+ 987+} six seven
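
A word-level sketch of such inline markup, assuming
difflib.SequenceMatcher over whitespace-split words and the [-...-] and
{+...+} delimiters shown above. The function name wdiff_line is
hypothetical, and the spacing differs slightly from the hand-written
sample line.

```python
from difflib import SequenceMatcher

def wdiff_line(pre, post):
    """Merge two lines into one, tagging deleted and inserted words inline."""
    a, b = pre.split(), post.split()
    out = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
        if tag == "equal":
            out.extend(a[i1:i2])
            continue
        if i2 > i1:  # words removed from the pre line
            out.append("[-" + " ".join(a[i1:i2]) + "-]")
        if j2 > j1:  # words added in the post line
            out.append("{+" + " ".join(b[j1:j2]) + "+}")
    return " ".join(out)

merged = wdiff_line("one two three four five six seven",
                    "one three fxur 123456 five 987 six seven")
print(merged)
# one [-two-] three [-four-] {+fxur 123456+} five {+987+} six seven
```

Because it compares whole words, this sketch reports "four" as deleted and
"fxur 123456" as inserted instead of marking the single changed character.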

If you ask me, any of these formats is better than the one currently
used, which is only reliable for short lines with small differences.

-------------------------------------------------------

For detailed info, follow this link:
http://sourceforge.net/bugs/?func=detailbug&bug_id=124051&group_id=5470