[issue46667] SequenceMatcher & autojunk - false negative
Jonathan
report at bugs.python.org
Mon Feb 7 10:01:52 EST 2022
Jonathan <bugreports at lightpear.com> added the comment:
I still don't get how UNIQUESTRING is the longest even with autojunk=True, but that's an implementation detail and I'll trust you that it's working as expected.
Given this, I'd suggest the following then:
* `autojunk=False` should be the default unless there's some reason to believe SequenceMatcher is mostly used for code comparisons.
* If - for whatever reason - the default can't be changed, I'd suggest a nice big docs "Warning" (at a minimum a "Note") saying something like "The default autojunk=True is not suitable for normal string comparison. See autojunk for more information".
* Human-friendly doc explanation for autojunk. The current explanation is only going to be helpful to the tiny fraction of users who understand the algorithm. Your explanation is a good start:
"Autojunk was introduced as a way to greatly speed comparing files of code, viewing them as sequences of lines. But it more often backfires when comparing strings (viewed as sequences of characters)"
Put simply: The current docs aren't helpful to users who don't have text matching expertise, nor do they emphasise the huge caveat that autojunk=True raises.
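To make the caveat concrete, here is a minimal sketch. It assumes the heuristic as currently documented for difflib: it only applies when the second sequence has at least 200 items, and it ignores items that make up more than roughly 1% of that sequence. The strings are made up for illustration and are not the ones from my original report.

```python
from difflib import SequenceMatcher

# Made-up inputs (not the strings from the original report).
# b is 301 characters long and consists almost entirely of 'a' and 'b',
# so with autojunk=True both characters count as "popular" and are
# ignored when the matcher looks for places to start a match.
a = "Q" + "ab" * 150
b = "ab" * 150 + "Z"

default = SequenceMatcher(None, a, b)  # autojunk=True is the default
no_autojunk = SequenceMatcher(None, a, b, autojunk=False)

# The strings share a 300-character run, but only the second matcher finds it.
print(default.find_longest_match(0, len(a), 0, len(b)).size)      # 0
print(no_autojunk.find_longest_match(0, len(a), 0, len(b)).size)  # 300

ratio_default = default.ratio()
ratio_no_autojunk = no_autojunk.ratio()
print(ratio_default)      # 0.0
print(ratio_no_autojunk)  # 600/602, about 0.9967
```

Two strings that are nearly identical to a human reader score 0.0 with the defaults, which is the false negative a non-expert user has no reason to expect.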
----------
_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue46667>
_______________________________________