.title() - annoying mistake

Chris Angelico rosuav at gmail.com
Fri Mar 19 14:02:14 EDT 2021


On Sat, Mar 20, 2021 at 4:46 AM Karen Shaeffer via Python-list
<python-list at python.org> wrote:
>
>
>
> > On Mar 19, 2021, at 9:42 AM, Grant Edwards <grant.b.edwards at gmail.com> wrote:
> >
> > On 2021-03-19, Skip Montanaro <skip.montanaro at gmail.com> wrote:
> >>>
> >>> That's annoying. You have to roll your own solution!
> >>>
> >>
> >> Certainly seems like a known issue:
> >>
> >> https://bugs.python.org/issue12737
> >
> > While that is an issue with string.title(), I don't see how it's
> > related to what the OP is reporting. Issue 12737 is about Unicode
> > combining marks.
>
> Hi,
> I’ve been frustrated by my experiences processing unstructured multilingual text with Python. I’ve always assumed this was due to my insufficient experience with Python 3 text processing. I’ve recently begun coding in Go (I also continue to code in Python), and Go has an exceptionally crisp and clear capacity to process unstructured multilingual UTF-8 encoded text.
>
> In just a few days of working with text processing in Go, using the book “The Go Programming Language” by Donovan and Kernighan, along with the Go language specification and other free online help, I have acquired a clear and crisp understanding of how to work effectively with unstructured, multilingual UTF-8 encoded text (and emojis) and any Unicode code point, even invalid Unicode code points.
>
> To see some of these issues firsthand, write a palindrome detector that works with any sequence of UTF-8 encoded code points, including invalid code points. I’m sure it can be done in Python, although I’ve not done it. It’s a trivial exercise in Go.
>
> I’m not bashing Python here. I will continue to code with Python. It’s an exceptional language and community. Just commenting on my experience.
>

Python doesn't work with UTF-8 encoded code points; it works with
Unicode code points. Are you looking for something that checks whether
something is a palindrome, or locates palindromes within it?

def is_palindrome(txt):
    return txt == txt[::-1]

Easy.

Efficiently finding substring palindromes would be a bit harder, but
that'd be true even if you restricted it to ASCII. The advantage of
Python's way of doing it is that, if you have a method that would work
with ASCII bytes, the exact same thing will work with a Unicode
string.
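
As a rough illustration of that point (a sketch only, with sample strings I picked for the purpose), the same slice-and-compare function accepts both bytes and str. It also shows why comparing code points rather than UTF-8 bytes matters once the text is not ASCII:

```python
def is_palindrome(seq):
    # Works on any sliceable sequence: str, bytes, list, tuple.
    return seq == seq[::-1]

# ASCII bytes and the equivalent str behave identically.
print(is_palindrome(b"racecar"))
print(is_palindrome("racecar"))

# Non-ASCII text: as a str this is four code points, alpha beta beta alpha.
greek = "\u03b1\u03b2\u03b2\u03b1"
print(is_palindrome(greek))

# The UTF-8 encoding of the same text is eight bytes; reversing them
# scrambles the two-byte sequences, so the byte-level check fails.
print(is_palindrome(greek.encode("utf-8")))
```

The last call is the one that goes wrong if you treat the encoded bytes as the unit of comparison.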

There's another big wrinkle not touched here, and that's what to do
with combining characters. Python makes it easy to normalize text as
much as is possible, and an NFC normalization would help a lot, but
it's not going to do everything. So you may want to first define a
proper way to split a string into whatever you're defining a character
to be, and that's a very difficult problem, regardless of programming
language. For example, Arabic text changes in visual shape when
letters are next to each other, and Greek has two different forms for
the letter sigma (U+03C2 and U+03C3) - should those distinctions
affect palindrominess? What about ligatures - is U+FB01 "fi" a single
character, or should it be matched by "if" on the other end?
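
To make those wrinkles concrete, here is a small sketch using the standard unicodedata module. The helper name normalized() and the sample strings are my own choices for illustration, and whether casefolding or NFKC is the "right" answer depends entirely on what you decide a character is:

```python
import unicodedata

def is_palindrome(txt):
    return txt == txt[::-1]

def normalized(txt, form="NFC"):
    # Hypothetical helper: applies Unicode normalization and nothing else.
    return unicodedata.normalize(form, txt)

# "ete" with acute accents, spelled as e + U+0301 COMBINING ACUTE ACCENT.
decomposed = "e\u0301te\u0301"
print(is_palindrome(decomposed))              # False: reversal detaches the accents
print(is_palindrome(normalized(decomposed)))  # True: NFC composes each pair to U+00E9

# There is no precomposed "q with acute", so NFC leaves two code points
# per letter and the naive reversal still fails.
stubborn = "q\u0301aq\u0301"
print(is_palindrome(normalized(stubborn)))    # False

# Medial sigma (U+03C3) at the start, final sigma (U+03C2) at the end.
sigma = "\u03c3\u03b1\u03c2"
print(is_palindrome(sigma))                   # False
print(is_palindrome(sigma.casefold()))        # True: casefold() maps U+03C2 to U+03C3

# U+FB01 ligature followed by "if".
lig = "\ufb01if"
print(is_palindrome(normalized(lig, "NFC")))   # False: NFC keeps the ligature
print(is_palindrome(normalized(lig, "NFKC")))  # True: NFKC expands it to "fi"
```

None of this settles the question of what a character is; it only shows that each answer is a short, explicit step on top of the same slice-and-compare check.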

What part of this is trivial in Go?

ChrisA

