Problem with difflib SequenceMatcher

Alain Ketterlin alain at universite-de-strasbourg.fr.invalid
Mon Sep 12 08:18:29 EDT 2016


Jay <jay.sridhar at gmail.com> writes:

> I am having an odd problem with difflib.SequenceMatcher. Sample code below:
>
> The strings "src" and "trg" differ only a little.

How exactly? (Please be precise, it helps testing.)

> The SequenceMatcher.ratio() for these strings is 0.0. Many other similar
> strings are working fine without problems (see below) with non-zero
> ratios depending on how much difference there is between strings (as
> expected).

Calling SM(..., trg[1:], src[1:]) gives a plausible result. See also the
result of .get_matching_blocks() on your strings (it returns no matching
blocks).
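
A rough illustration of the same symptom, on synthetic strings rather than
yours: the names a and b and the "ab" filler are made up for the example,
chosen so that the strings are longer than 200 characters and differ only
in their first character. It is a sketch written for Python 3 (print as a
function), not a reproduction of your exact data.

```python
from difflib import SequenceMatcher as SM

# Hypothetical stand-ins: 301 characters each, only the first one differs.
a = "x" + "ab" * 150
b = "y" + "ab" * 150

full = SM(None, a, b)
blocks_full = full.get_matching_blocks()
# The list always ends with a zero-length terminator block; here that
# terminator is the only entry, i.e. no real match was found.
print(blocks_full)

# Dropping the differing first character from both strings.
trimmed = SM(None, a[1:], b[1:])
blocks_trimmed = trimmed.get_matching_blocks()
print(blocks_trimmed)

print(full.ratio(), trimmed.ratio())
```

On these inputs the untrimmed comparison finds nothing, while the trimmed
one matches the whole string.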

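It is all due to the "autojunk" heuristic (see difflib's documentation
for details): in a sequence of 200 or more items, any item that accounts
for more than 1% of the sequence is treated as junk. Your strings are
that long and use only a handful of distinct characters, so the
heuristic hits them hard. Call SM(..., autojunk=False).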
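
A minimal sketch of the difference the flag makes, again on made-up
strings (a and b are hypothetical stand-ins, longer than 200 characters,
built from very few distinct characters, with different first characters).
It assumes Python 3 for print(); the autojunk keyword itself is also
available in recent 2.7 releases.

```python
from difflib import SequenceMatcher as SM

# Hypothetical inputs: long, repetitive, and differing in the first character.
a = "x" + "ab" * 150
b = "y" + "ab" * 150

# Default behaviour: the autojunk heuristic is enabled.
default_ratio = SM(None, a, b).ratio()

# Same comparison with the heuristic switched off.
no_autojunk_ratio = SM(None, a, b, autojunk=False).ratio()

print(default_ratio)
print(no_autojunk_ratio)
```

With the heuristic on, the ratio for these inputs is 0.0; with it off, the
ratio is close to 1, which is what one would expect for strings that differ
in a single character. Expect disabling autojunk to cost some speed on long
inputs.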

I have no idea why the maintainers made this stupid autojunk idea the
default. Complain to them.

-- Alain.

> Tested on Python 2.7 on Ubuntu 14.04
>
> Program follows:
> ---
> from difflib import SequenceMatcher as SM
>
> src = (u"N KPT T HS KMNST KNFKXNS AS H KLT FR 0 ALMNXN AF PRFT PRPRT AN "
>        u"RRL ARS T P RPLST P KMNS H ASTPLXT HS ANTSTRL KR0 PRKRM NN AS 0 KRT LP "
>        u"FRRT 0S PRKRM KLT FR 0 RPT TRNSFRMXN AF XN FRM AN AKRRN AKNM T A SSLST "
>        u"ANTSTRL SST")
> trg = (u"M KPT T HS KMNST KNFKXNS AS H KLT FR 0 ALMNXN AF PRFT PRPRT AN "
>        u"RRL ARS T P RPLST P KMNS H ASTPLXT HS ANTSTRL KR0 PRKRM NN AS 0 KRT LP "
>        u"FRRT 0S PRKRM KLT FR 0 RPT TRNSFRMXN AF XN FRM AN AKRRN AKNM T SSLST "
>        u"ANTSTRL SST")
> print src, '\n', trg, '\n', SM(None, trg, src).ratio()


