unicode direction control characters

Chris Angelico rosuav at gmail.com
Tue Jan 2 10:18:14 EST 2018


On Wed, Jan 3, 2018 at 1:30 AM, Robin Becker <robin at reportlab.com> wrote:
> I'm seeing some strange characters in web responses, e.g.
>
> u'\u200e28\u200e/\u200e09\u200e/\u200e1962'
>
> for a date of birth. The code \u200e is LEFT-TO-RIGHT MARK according to
> unicodedata.name.  I tried unicodedata.normalize, but it leaves those
> characters there. Is there any standard way to deal with these?
>
> I assume that some browser+settings combination is putting these in,
> e.g. perhaps the language is normally right-to-left but numbers are not.

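Unicode normalization is a different beast altogether: it converts
equivalent character sequences to one canonical form, and it does not
strip format characters such as U+200E. You could probably just remove
the LTR marks and run with the rest, though, as they don't seem to be
important in this string.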
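
Something along these lines might do for the removal. This is only a
sketch: it assumes that LEFT-TO-RIGHT MARK (U+200E) and RIGHT-TO-LEFT
MARK (U+200F) are the only marks you need to drop, and the helper name
strip_direction_marks is made up for illustration.

```python
import unicodedata

# Assumption: these two marks carry no meaning for this data and can go.
# U+200E = LEFT-TO-RIGHT MARK, U+200F = RIGHT-TO-LEFT MARK
DIRECTION_MARKS = u"\u200e\u200f"

# Mapping a code point to None makes str.translate() delete it.
_DELETE_TABLE = {ord(ch): None for ch in DIRECTION_MARKS}


def strip_direction_marks(text):
    """Return text with the marks listed in DIRECTION_MARKS removed."""
    return text.translate(_DELETE_TABLE)


raw = u'\u200e28\u200e/\u200e09\u200e/\u200e1962'

# Normalization leaves the marks alone, as observed in the question.
normalized = unicodedata.normalize("NFKC", raw)

cleaned = strip_direction_marks(raw)
print(cleaned)  # 28/09/1962
```

If other invisible controls turn up (embeddings, isolates and so on),
the same table can be extended, but check what each one does before
dropping it: some format characters affect how text is rendered.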

ChrisA


