.title() - annoying mistake

Chris Angelico rosuav at gmail.com
Mon Mar 22 14:58:24 EDT 2021


On Tue, Mar 23, 2021 at 5:16 AM Karen Shaeffer via Python-list
<python-list at python.org> wrote:
>
> Hi Chris,
> Thanks for your comment.
>
> > Python doesn't work with UTF-8 encoded code points; it works with
> > Unicode code points. Are you looking for something that checks whether
> > something is a palindrome, or locates palindromes within it?
> >
> > def is_palindrome(txt):
> >    return txt == txt[::-1]
> >
> > Easy.
>
> Of course, it’s easy. It’s a Pythonic idiom! But it doesn’t work. And you know that. You even explained a few reasons why it doesn’t work below. There are many more instances of strings that do not work. Here are two:
>
> idx = 6    A man, a plan, a canal: Panama   is_palindrome() = False
> idx = 17    ab́cdeedcb́a   is_palindrome() = False
>
> The palindrome isn’t worth any more time. It isn’t even a good example.
>
> In my experience processing unstructured, multilingual text, you encounter a wide array of variances in both the text and in the encoding details, including outright errors. You have to account for all of them, because 99.99% of that text is valuable to you.
>
> The key idea: If you care about the details, working with unstructured multilingual text is complicated. There are no easy solutions.
>
>
> >
> > Efficiently finding substring palindromes would be a bit harder, but
> > that'd be true even if you restricted it to ASCII. The advantage of
> > Python's way of doing it is that, if you have a method that would work
> > with ASCII bytes, the exact same thing will work with a Unicode
> > string.
> >
> > There's another big wrinkle not touched here, and that's what to do
> > with combining characters. Python makes it easy to normalize text as
> > much as is possible, and an NFC normalization would help a lot, but
> > it's not going to do everything. So you may want to first define a
> > proper way to split a string into whatever you're defining a character
> > to be, and that's a very difficult problem, regardless of programming
> > language. For example, Arabic text changes in visual shape when
> > letters are next to each other, and Greek has two different forms for
> > the letter sigma (U+03C2 and U+03C3) - should those distinctions
> > affect palindromminess? What about ligatures - is U+FB01 "fi" a single
> > character, or should it be matched by "if" on the other end?
> >
> > What part of this is trivial in Go?
>
> Go is simpler than Python. Both languages have the capabilities to solve any text processing problem. I’m still learning Go, so I can’t really say more.
>
> Personally, I like Python for text processing. You can usually get satisfactory results very quickly for most of the input space. And if you don’t care about all the gotchas, then you are good to go.
>
> I have no more time for this. Thanks for your comment. I learned a little reading the long thread dealing with .title(). (chuckles ;)
>

Hey, you're the one who brought up palindrome testing as a difficult
problem in Python :) Your post implied that it was easier in Go, and I
can't see that that's possible.
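
For what it's worth, here is a rough sketch of how your two
counterexamples could be handled with nothing but the standard library.
It is not a complete answer: the names graphemes_approx and
is_palindrome_loose are made up for illustration, ignoring case and
punctuation is an assumption about what "palindrome" should mean, and
grouping by unicodedata.combining() is only an approximation of proper
grapheme segmentation.

```python
import unicodedata


def graphemes_approx(txt):
    """Group each base character with the combining marks that follow it.

    Approximation only: marks whose canonical combining class is 0
    (for example some spacing marks) are not attached to their base.
    """
    clusters = []
    for ch in txt:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch      # attach the mark to the previous base
        else:
            clusters.append(ch)     # start a new cluster
    return clusters


def is_palindrome_loose(txt):
    """Palindrome test that ignores case, punctuation and whitespace.

    Hypothetical helper; the folding rules here are one possible choice.
    """
    # NFKC folds compatibility forms (e.g. U+FB01 becomes "fi");
    # casefold() removes case distinctions and maps U+03C2 to U+03C3.
    folded = unicodedata.normalize("NFKC", txt).casefold()
    # Keep only clusters whose base character is a letter or digit.
    units = [g for g in graphemes_approx(folded) if g[0].isalnum()]
    return units == units[::-1]


if __name__ == "__main__":
    for sample in ("A man, a plan, a canal: Panama",
                   "ab\u0301cdeedcb\u0301a",
                   "not a palindrome"):
        print(ascii(sample), is_palindrome_loose(sample))
```

Whether the sigma forms or the "fi" ligature ought to be folded like
this is exactly the kind of decision that has to be made up front, in
any language.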

ChrisA

