urllib unquote providing string mismatch between string found using os.walk (Python3)

Pieter van Oostrum pieter-l at vanoostrum.org
Sat Dec 21 17:23:17 EST 2019


Ben Hearn <benandrewhearn at gmail.com> writes:

> Hello all,
>
> I am having a bit of trouble with a string mismatch in a tool I am writing.
>
> I am comparing a database collection of URL-quoted paths to the paths on the user's drive.
>
> These 2 paths look identical, one from the drive & the other from an xml url:
> a = '/Users/macbookpro/Music/tracks_new/_NS_2018/J.Staaf - ¡Móchate! _PromoMix_.wav'
> b = '/Users/macbookpro/Music/tracks_new/_NS_2018/J.Staaf - ¡Móchate! _PromoMix_.wav'
>
> But after realising it was failing on them I ran a difflib and these differences popped up.
>
> import difflib
> print('\n'.join(difflib.ndiff([a], [b])))
> - /Users/macbookpro/Music/tracks_new/_NS_2018/J.Staaf - ¡Móchate! _PromoMix_.wav
> ? ^^
> + /Users/macbookpro/Music/tracks_new/_NS_2018/J.Staaf - ¡Móchate! _PromoMix_.wav
> ? ^
>
>
> What am I missing when it comes to unquoting the string, or should I do
> some other fancy operation on the drive string?
>

In [8]: len(a)
Out[8]: 79

In [9]: len(b)
Out[9]: 78

The difference is in the ó. In (b) it is a single character, U+00F3
LATIN SMALL LETTER O WITH ACUTE.
In (a) it is composed of two characters: the letter o followed by U+0301
COMBINING ACUTE ACCENT. That is why len(a) is one more than len(b).
So you would have to do Unicode normalisation before comparing.

For example:

In [16]: from unicodedata import normalize

In [17]: a == b
Out[17]: False

In [18]: normalize('NFC', a) == normalize('NFC', b)
Out[18]: True
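
Applied to the original problem, the same idea could be wrapped in a small
helper that unquotes the URL path and normalises both sides before comparing.
This is only a sketch: the names nfc and same_path and the sample paths are
made up for illustration, and which side holds the decomposed form is an
assumption.

```python
import unicodedata
from urllib.parse import quote, unquote


def nfc(text):
    """Return text in Unicode Normalization Form C (composed)."""
    return unicodedata.normalize('NFC', text)


def same_path(quoted_url_path, drive_path):
    """Compare a URL-quoted path with a filesystem path, ignoring
    differences in Unicode composition. (Hypothetical helper.)"""
    return nfc(unquote(quoted_url_path)) == nfc(drive_path)


# Hypothetical sample data: the decomposed form stands in for a name
# returned by os.walk, the composed form for the entry in the XML file.
drive_path = '/Music/Mo\u0301chate.wav'              # o + COMBINING ACUTE ACCENT
quoted_url_path = quote('/Music/M\u00f3chate.wav')   # precomposed o with acute

print(unquote(quoted_url_path) == drive_path)   # plain comparison
print(same_path(quoted_url_path, drive_path))   # normalised comparison

# List the code points of the decomposed sequence.
for ch in 'o\u0301':
    print(f'U+{ord(ch):04X}', unicodedata.name(ch))
```

Normalising both sides to the same form (NFC here, but NFD would work as
well) means it does not matter which source produced which representation.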

-- 
Pieter van Oostrum
www: http://pieter.vanoostrum.org/
PGP key: [8DAE142BE17999C4]

