How do I display unicode-paths?

Mon Oct 6 05:53:17 EDT 2003

> From: Jp Calderone [mailto:exarkun at intarweb.us] 
> 
> On Sat, Oct 04, 2003 at 08:04:33PM -0500, Pettersen, Bjorn S wrote:
> > I've been trying to stay blissfully unaware of Unicode, 
> > however now it seems like it's my turn. From the outside it 
> > seems like a rather massive subject, so any pointers as to where 
> > I should _start_ reading would be appreciated. The usecase is a 
> > class:
> > 
> >   class path(object):
> >      ...
> >      def __str__(self):
> >         return self.pathstr.encode(???)
> > 
> > the question is what to put at ??? to be most useful to 
> > programmers/end users?
> 
> 
>     class Path(object):
>         encoding = sys.getdefaultencoding()
> 
>         def __str__(self):
>             return self.pathstr.encode(self.encoding)
> 
> 
>   is the path (ha ha) I would take.  Unsurprising default, and easily
> configurable.

I think you mean locale.getpreferredencoding(); however, that's not enough to get results that non-unicode-aware programmers would find unsurprising <wink>. A compromise might be to always use the 'unicode_escape' encoding or something similar, but that doesn't provide much visual feedback during development/debugging...
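
To make the trade-off concrete, here is a minimal sketch of what the always-safe options produce for one sample name. It is written for Python 3 (which this thread predates), with the u'' prefixes kept to match the thread; the sample name and the variable names are only for illustration.

```python
# A sample path component containing U+00E6 (æ).
name = u'f\xe6rje'

# 'unicode_escape' never fails, but shows an escape instead of a glyph.
escaped = name.encode('unicode_escape')                  # b'f\\xe6rje'

# ASCII with the 'replace' handler substitutes a '?' place-holder.
replaced = name.encode('ascii', 'replace')               # b'f?rje'

# ASCII with 'backslashreplace' gives the same escape as 'unicode_escape'.
backslashed = name.encode('ascii', 'backslashreplace')   # b'f\\xe6rje'
```

None of the three can raise "ordinal not in range(128)", but none of them displays [æ] either.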

(preface: if it seems like I know what I'm talking about below, it's probably copied from someone who actually does. If I'm off in semantics or terminology, feel free to correct me :-). Oh, and excuse the length... I'd trim it to what's relevant, if I only knew what that was... (*sigh*)).

I'll try to define my problem a little more... I'm getting path information from os.listdir(os.getcwdu()) as unicode strings, and I would like for:

  (a) the class to work like a regular class, i.e. __str__ should be 
      defined and give a pretty (presentable?) representation of the 
      path, at least during development and debugging. I don't mind 
      a separate routine to get something that can be displayed as 
      part of an end-user interface, and e.g. a 'place-holder' char
      would work fine, but a variable, mostly legitimate glyph that
      does not look like the glyph for the unicode code-point used by
      the file system (no matter how close their numeric values are)
      probably isn't.
  (b) it follows (at least for me :-) that defining __str__ shouldn't 
      cause tracebacks during development for programmers using the 
      module (almost anything is more useful to my non-unicode mindset
      than "ordinal not in range(128).." <wink>). This also means that
      I'm very willing to give up any ability to go from the __str__ 
      representation back to unicode.
  (c) mimicking the output of the "dir" command in cmd.exe and what is 
      shown in Windows Explorer for a given file would be ideal (i.e. 
      in this case u'\xe6' causes the grapheme [æ] to be displayed --
      a.k.a. "DOS does it, so it ought to be possible", to be proven
      naive again, I'm sure :-).
  (d) the class also needs to accept programmatically input pathnames,
      e.g. either of the following would be ideal:

        ferry = mydocs.relpath('../færje')
        ferry = mydocs.relpath(u'../færje')

      provided it could be converted to valid file system entries. 
      Whether the code is input using Python.exe in cmd.exe, or a 
      text editor and then run any way a Python program can, should 
      ideally not change behaviour. (I'm hoping a programmer that can 
      see the filename both in DOS and Explorer, and type it directly 
      in either place doesn't have to look up unicode code points?)
  (e) Translating user input to valid file system entities would also
      be nice but seems to be a bit premature...
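
Requirements (a) and (b) could be sketched roughly as follows. This is a Python 3 sketch under stated assumptions, not a finished design: the method name display_bytes and the class attribute errors are hypothetical, and in the Python 2 of this thread __str__ could simply return the same bytes.

```python
import locale


class Path(object):
    # Class-level defaults; override on a subclass or an instance.
    encoding = locale.getpreferredencoding(False) or 'ascii'
    errors = 'replace'          # unencodable characters become '?'

    def __init__(self, pathstr):
        self.pathstr = pathstr  # the path as text (unicode)

    def display_bytes(self, encoding=None):
        """Encode the path for display without raising UnicodeEncodeError."""
        return self.pathstr.encode(encoding or self.encoding, self.errors)
```

With this, Path(u'f\xe6rje').display_bytes('cp850') gives the byte cmd.exe shows as [æ], while an ASCII-only target gets a place-holder instead of a traceback. The result cannot be converted back to unicode in general, which (b) explicitly allows.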

The issues I'm seeing and/or have been made aware of:

0. Windows. In particular, cmd.exe seems to be an island of its own, and
   I care because that's where I do most of my interactive Python
   use. Using the charmap.exe utility (DOS: Western Europe and United
   States [Windows:* is ok], advanced view, statusbar), I've got:

     Latin Small Letter Ae[æ] U+00E6 (0x91) Keystroke: Alt+0230
     Micro Sign[µ]            U+00B5 (0xE6) Keystroke: Alt+0181

   Note the correspondence between the U+00E6 code-point[term.?] for 'Ae'
   and the "ascii numerical value on the current code page" (pure guess,
   I really have no idea what, exactly, the '0x..' values are) for 'Micro'
   (0xE6). You can probably guess the result.. cmd.exe/Python gives:

     u"æ" -> u'\x91'        # this value is not a unicode code point
     u"µ" -> u'\xe6'        # this is, but not for the grapheme µ
     hex(ord('æ')) == '0x91'
     hex(ord('µ')) == '0xe6'

   while PythonWin or IDLE gives (the correct):

     u"æ" -> u'\xe6'
     u"µ" -> u'\xb5'
     hex(ord('æ')) == '0xe6'
     hex(ord('µ')) == '0xb5'
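
The '0x..' values from charmap are the byte values of each character in the selected code page, which the codecs can confirm. A small Python 3 sketch (variable names are only for illustration):

```python
AE, MICRO = u'\xe6', u'\xb5'     # æ (U+00E6) and µ (U+00B5)

# Byte values in the DOS code page (cp850) ...
dos = {AE: AE.encode('cp850'), MICRO: MICRO.encode('cp850')}

# ... and in the Windows code page (cp1252).
win = {AE: AE.encode('cp1252'), MICRO: MICRO.encode('cp1252')}

# The same byte 0xE6 decodes to a different character in each code page.
as_dos = b'\xe6'.decode('cp850')     # µ
as_win = b'\xe6'.decode('cp1252')    # æ
```

So the byte 0xE6 is [µ] to cmd.exe and [æ] to PythonWin or IDLE; it only happens to equal the code point U+00E6 because cp1252 and Latin-1 agree with Unicode in that position.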

1. __str__ must return a string, and specifically not a unicode string,
   since the latter would go through a default encode ('ascii'?) causing
   an exception:

    >>> class foo(object):
    ...   def __str__(self):
    ...    return u'\xe6'
    ...
    >>> print foo()
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeEncodeError: 'ascii' codec can't encode character '\ue6' in position 0: ordinal not in range(128)

2. At least some unicode strings can be printed. It seems sys.stdout
   receives [u'\xe6', '\n'] for the example below, but since I now know
   that cmd.exe uses cp850 and u'\xe6'.encode('cp850') == '\x91', i.e.
   [æ] in the dos window, I'm assuming sys.stdout (or something at an
   even lower level?) is somehow doing an encoding under the covers...?:

    >>> print u'\xe6'
    æ

   however,

    >>> print u'æ'
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
      File "e:\python23\lib\encodings\cp437.py", line 18, in encode
        return codecs.charmap_encode(input,errors,encoding_map)
    UnicodeEncodeError: 'charmap' codec can't encode character '\u91' in position 0: character maps to <undefined>

   I'm guessing because u'æ' == u'\x91' is not a valid code point?,
   but I'm not sure I understand where cp437 comes in... oh, well :-)
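
One plausible reading of the traceback, sketched in Python 3: the console delivers the byte 0x91 for a typed æ, the interpreter turns that byte into u'\x91' as if it were Latin-1, and print then fails to encode U+0091 for the console. That the literal is decoded as Latin-1 is an assumption on my part; the codec behaviour itself can be checked directly.

```python
typed = b'\x91'                    # byte the console sends for æ (cp437/cp850)

wrong = typed.decode('latin-1')    # u'\x91': a C1 control character, not æ
right = typed.decode('cp437')      # u'\xe6': the character that was typed

# Printing the wrongly decoded value means encoding it for the console,
# and cp437 has no byte for U+0091.
try:
    wrong.encode('cp437')
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True
```

If that reading is right, cp437 comes in because it is the code page the console reported at the time, regardless of what I assumed about cp850.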

The outstanding issues seem to be:

 Craft a __str__ method that converts u'\xe6' to '\x91' or '\xe6'
 depending on ???
 i.  What is ??? above? Whether run in cmd.exe? Is this just too
     evil to contemplate, and should I quit while ahead and go with
     'unicode_escape' or something similar?
 ii. Is it possible to determine if running in cmd.exe (if not, are
     there any good heuristics?)
 iii. Should cp850 always be used (I searched HKCU, HKLM, HKCC, and HKU
     in the registry without finding a single reference to cp850..), or 
     is that locale dependent? E.g. I noticed that (and I'm assuming 
     there is a reason for):

       print u'\xe6'.encode(cpXXX)

     gives (WinXP, en_US)

       æ (Ae) for cp437, 775, 850, 857, 861, and 865
       £ (Br.Pound) for cp500, 1026, and 1140
       µ (Micro) for cp1252, 1254, and 1258
       ┐ (dos gr. chr.) for cp1257
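
The grouping above falls out of two steps: each code page assigns æ its own byte, and the console then draws whatever glyph its own code page has at that byte. A Python 3 sketch, assuming the console is cp437 (as the earlier traceback suggests) and taking one code page per group, plus cp1257, which places æ at 0xBF:

```python
AE = u'\xe6'
CONSOLE = 'cp437'    # assumption: the code page the console draws with

seen = {}
for cp in ('cp850', 'cp500', 'cp1252', 'cp1257'):
    raw = AE.encode(cp)                  # the byte that code page uses for æ
    seen[cp] = raw.decode(CONSOLE)       # the glyph the console draws for it

# seen['cp850']  -> æ      (byte 0x91)
# seen['cp500']  -> £      (byte 0x9C)
# seen['cp1252'] -> µ      (byte 0xE6)
# seen['cp1257'] -> ┐      (byte 0xBF)
```

So the output is locale dependent in the sense that it depends on the console's active code page, and only an encoding that matches it gives [æ].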

 Convert input arguments ['æ'|u'æ'], which arrive as either '\x91' or
 u'\x91' (cmd.exe) or as either '\xe6' or u'\xe6' (everywhere else),
 to the correct (at least for the file system) u'\xe6'.
 i.  I'm not even sure this is possible in the general case(?) E.g.
     I'm assuming there would be an interaction with source file
     encodings, although I'm about as familiar with that as with 
     unicode...
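
A rough Python 3 sketch of the input conversion, assuming the caller somehow knows which encoding the input arrived in (which is exactly the part I'm unsure about). Both helper names are hypothetical.

```python
def to_text(arg, encoding):
    """Return arg as text; byte strings are decoded with the given encoding."""
    if isinstance(arg, bytes):
        return arg.decode(encoding)
    return arg


def repair_latin1(text, real_encoding):
    """Undo a Latin-1 decode of bytes that were really in real_encoding."""
    return text.encode('latin-1').decode(real_encoding)
```

For example, to_text(b'\x91', 'cp850') and repair_latin1(u'\x91', 'cp850') both yield u'\xe6'. Note that repair_latin1 is only safe when the text is known to have been mis-decoded; applied to a correct u'\xe6' with 'cp850' it would produce [µ].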

 Display u'\xe6' (gotten from os.listdir or input arguments) as [æ]:
 i.  Doesn't seem possible, in general, even if converted to '\x91'
     since the read-eval-print loop displays \x?? codes for the 
     extended ascii range. Using "print" ['\xe6'|u'\xe6'] displays 
     [æ] in all of cmd.exe, PythonWin, and IDLE, however, "print 
     ['\x91'|u'\x91']" displays [‘] or [‘] respectively in PythonWin...
     (i.e. __str__ better not do that :-)
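
For the "???" in the first outstanding issue, one heuristic is to ask the output stream for its encoding rather than trying to detect cmd.exe as such. A minimal Python 3 sketch; the helper names and the FakeConsole stand-in are hypothetical, while the stream attribute `encoding` and locale.getpreferredencoding() are real.

```python
import locale


def stream_encoding(stream):
    """Best guess at the encoding a stream displays with."""
    enc = getattr(stream, 'encoding', None)   # may be missing or None
    return enc or locale.getpreferredencoding(False) or 'ascii'


def encode_for(stream, text):
    """Encode text for the stream, using '?' for anything it cannot show."""
    return text.encode(stream_encoding(stream), 'replace')


class FakeConsole(object):
    """Stand-in for a console stream reporting a DOS code page."""
    encoding = 'cp850'
```

Here encode_for(FakeConsole(), u'\xe6') gives the '\x91' that cmd.exe needs, and a stream reporting cp1252 would get '\xe6'. Calling stream.isatty() could serve as an extra hint about whether output goes to a console at all, though it says nothing about which console.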

Is this really confusing, am I really dense, or should we blame it all on Microsoft and wait for cmd.exe to die <wink>? Perhaps I'd be better off to just push the data around and forget about the 'AI'?

wonder-how-my-linux-machine-is-doing...
-- bjorn

ps: "os.system('mkdir æ')" vs. the same from the dos prompt, followed by "os.listdir('.')" or "os.listdir(u'.')", and the relevance of \u2018 left for the truly dedicated <sigh>.