Unicode direction control characters

Tue Jan 2 10:55:23 EST 2018

On Wed, Jan 3, 2018 at 2:36 AM, Robin Becker <robin at reportlab.com> wrote:
> On 02/01/2018 15:18, Chris Angelico wrote:
>>
>> On Wed, Jan 3, 2018 at 1:30 AM, Robin Becker <robin at reportlab.com> wrote:
>>>
>>> I'm seeing some strange characters in web responses eg
>>>
>>> u'\u200e28\u200e/\u200e09\u200e/\u200e1962'
>>>
>>> for a date of birth. The code \u200e is LEFT-TO-RIGHT MARK according to
>>> unicodedata.name.  I tried unicodedata.normalize, but it leaves those
>>> characters there. Is there any standard way to deal with these?
>>>
>>> I assume that some browser+settings combination is putting these in eg
>>> perhaps the language is normally right to left but numbers are not.
>>
>>
>> Unicode normalization is a different beast altogether. You could
>> probably just remove the LTR marks and run with the rest, though, as
>> they don't seem to be important in this string.
>>
>> ChrisA
>>
> I guess I'm really wondering whether the BIDI control characters have any
> semantic meaning. Most numbers seem to be LTR.
>
> If I saw u'\u200f12' it seems to imply that the characters should be
> displayed '21', but I don't know whether the number is 12 or 21.
>

In this particular situation, it's highly unlikely that they'll have
any influence, and even if they do, I don't think there's any way for
the *repeated* directionality markers to do anything. They look like
something added automatically for the sake of paranoia.
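
For what it's worth, here is a minimal sketch of the "just remove the marks"
approach, assuming the only characters you want to drop are the explicit
bidi formatting characters. The names BIDI_CONTROLS and strip_bidi_controls
are made up for illustration and are not part of any library. It also shows
that the marks affect display only: the stored (logical) order of the
digits is what you get back once they are stripped.

```python
import unicodedata

# Explicit bidi formatting characters: marks, embeddings, overrides, isolates.
# (Hypothetical helper table; extend or trim it to suit your data.)
BIDI_CONTROLS = (
    u"\u061c"                          # ARABIC LETTER MARK
    u"\u200e\u200f"                    # LEFT-TO-RIGHT MARK, RIGHT-TO-LEFT MARK
    u"\u202a\u202b\u202c\u202d\u202e"  # LRE, RLE, PDF, LRO, RLO
    u"\u2066\u2067\u2068\u2069"        # LRI, RLI, FSI, PDI
)
_STRIP_TABLE = {ord(c): None for c in BIDI_CONTROLS}


def strip_bidi_controls(text):
    """Return text with the explicit bidi formatting characters removed."""
    return text.translate(_STRIP_TABLE)


raw = u"\u200e28\u200e/\u200e09\u200e/\u200e1962"

# Normalization leaves the marks alone: they have no decomposition.
print(unicodedata.normalize("NFKC", raw) == raw)

# The marks are format characters (category Cf); digits are bidi class EN.
print(unicodedata.category(u"\u200e"), unicodedata.bidirectional(u"1"))

# Stripping gives back the plain date, in the order it was stored.
print(strip_bidi_controls(raw))

# u'\u200f12' is still the number 12; the mark only influences rendering.
print(int(strip_bidi_controls(u"\u200f12")))
```

Using a translate table keeps this to a single pass over the string. If you
would rather drop every format character, filtering on
unicodedata.category(ch) == "Cf" is an alternative, but that also removes
things like ZERO WIDTH JOINER, which may matter in other text.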

ChrisA