[Python-Dev] fun with unicode, part 1

Thu, 27 Apr 2000 16:21:20 GMT

On Thu, 27 Apr 2000 11:23:50 -0400, you wrote:

>> >>> filename = u"gröt"
>> 
>> >>> file = open(filename, "w")
>> >>> file.close()
>> 
>> >>> import glob
>> >>> print glob.glob("gr*")
>> ['gr\303\266t']
>> 
>> >>> print glob.glob(u"gr*")
>> [u'gr\366t']
>> 
>> >>> import os
>> >>> os.system("dir gr*")
>> ...
>> GRÇôT                    0  01-02-03  12.34 grÇôt
>>          1 fil(es)              0 byte
>>          0 dir         12 345 678 byte free
>> 
>> hmm.
>
>I presume that Fredrik's gripe is that the filename has been converted
>to UTF-8, while the encoding used by Windows to display his directory
>listing is Latin-1.  (Not Microsoft's own 8-bit character set???)
>
>I'd like to solve this problem, but I have some questions: what *IS*
>the encoding used for filenames on Windows? 

[This is just for inspiration]

JDK "solves" this by running the filename through a CharToByteConverter
(a codec) which is set up as the default encoding for the platform.
On my Danish W2K this encoding happens to be called 'Cp1252'.

The codec name is chosen based on the user's language and region, with
a fallback to Cp1252. The mapping table is:

    "ar", "Cp1256",
    "be", "Cp1251",
    "bg", "Cp1251",
    "cs", "Cp1250",
    "el", "Cp1253",
    "et", "Cp1257",
    "iw", "Cp1255",
    "hu", "Cp1250",
    "ja", "MS932",
    "ko", "MS949",
    "lt", "Cp1257",
    "lv", "Cp1257",
    "mk", "Cp1251",
    "pl", "Cp1250",
    "ro", "Cp1250",
    "ru", "Cp1251",
    "sh", "Cp1250",
    "sk", "Cp1250",
    "sl", "Cp1250",
    "sq", "Cp1250",
    "sr", "Cp1251",
    "th", "MS874",
    "tr", "Cp1254",
    "uk", "Cp1251",
    "zh", "GBK",
    "zh_TW", "MS950",
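
As a rough illustration of the lookup described above, here is a minimal
Python sketch (Python 3 assumed). The names CONVERTERS and converter_for
are made up for this example, and the rule that a language_region key
such as "zh_TW" wins over the bare language key is my assumption; the
table itself does not say how the JDK orders the lookup.

```python
# Java-style converter names keyed by locale, copied from the table above.
CONVERTERS = {
    "ar": "Cp1256", "be": "Cp1251", "bg": "Cp1251", "cs": "Cp1250",
    "el": "Cp1253", "et": "Cp1257", "iw": "Cp1255", "hu": "Cp1250",
    "ja": "MS932", "ko": "MS949", "lt": "Cp1257", "lv": "Cp1257",
    "mk": "Cp1251", "pl": "Cp1250", "ro": "Cp1250", "ru": "Cp1251",
    "sh": "Cp1250", "sk": "Cp1250", "sl": "Cp1250", "sq": "Cp1250",
    "sr": "Cp1251", "th": "MS874", "tr": "Cp1254", "uk": "Cp1251",
    "zh": "GBK", "zh_TW": "MS950",
}
FALLBACK = "Cp1252"


def converter_for(language, region=None):
    """Return the converter name for a locale, falling back to Cp1252.

    Assumption: a language_region entry takes precedence over the
    entry for the bare language.
    """
    if region:
        name = CONVERTERS.get(language + "_" + region)
        if name is not None:
            return name
    return CONVERTERS.get(language, FALLBACK)


print(converter_for("da"))        # Cp1252 (Danish is not in the table)
print(converter_for("zh", "TW"))  # MS950
print(converter_for("zh", "CN"))  # GBK
```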

>This may differ per
>Windows version; perhaps it can differ drive letter?  Or per
>application or per thread?  On Windows NT, filenames are supposed to
>be Unicode.  (I suppose also on Windows 2000?) 

JDK only uses GetThreadLocale() for the starting thread. It does not
appear to check for Windows versions at all.

>How do I open a file
>with a given Unicode string for its name, in a C program?  I suppose
>there's a Win32 API call for that which has a Unicode variant.

The JDK does not make use of the Unicode API, even if it exists on the
platform.

>On Windows 95/98, the Unicode variants of the Win32 API calls don't
>exist.  So what is the poor Python runtime to do there?
>
>Can Japanese people use Japanese characters in filenames on Windows
>95/98?  Let's assume they can.  Since the filesystem isn't Unicode
>aware, the filenames must be encoded.  Which encoding is used?  Let's
>assume they use Microsoft's multibyte encoding.  If they put such a
>file on a floppy and ship it to Linköping, what will Fredrik see as
>the filename?  (I.e., is the encoding fixed by the disk volume, or by
>the operating system?)
>
>Once we have a few answers here, we can solve the problem.  Note that
>sometimes we'll have to refuse a Unicode filename because there's no
>mapping for some of the characters it contains in the filename
>encoding used. 

JDK silently replaced the offending character with a '?', which caused
an exception when attempting to open the file:

  The filename, directory name, or volume label syntax is incorrect
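
To make the two options concrete (silently substituting '?' versus
refusing the name), here is a small sketch using Python's own codecs,
not the JDK converter. Python 3 is assumed, and encode_filename is a
hypothetical helper, not an existing API. Since '?' is a wildcard and
not a legal filename character on Windows, the substituted name cannot
be opened, so refusing up front at least gives a clearer error.

```python
def encode_filename(name, encoding):
    """Encode a filename, refusing it if any character has no mapping."""
    try:
        # The default error handler is "strict": unmappable characters raise.
        return name.encode(encoding)
    except UnicodeEncodeError as exc:
        raise ValueError(
            "cannot represent %r in %s" % (name, encoding)
        ) from exc


name = "An eurosign \u20ac"

# Silent substitution: Latin-1 has no euro sign, so it becomes '?'.
print(name.encode("latin-1", "replace"))  # b'An eurosign ?'

# Cp1252 does have a euro sign (byte 0x80), so strict encoding succeeds.
print(encode_filename(name, "cp1252"))    # b'An eurosign \x80'

# Refusal: the strict helper raises for Latin-1 instead of writing '?'.
try:
    encode_filename(name, "latin-1")
except ValueError as err:
    print("refused:", err)
```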

>Question: how does Fredrik create a file with a Euro
>character (u'\u20ac') in its name?

import java.io.*;

public class x {
  public static void main(String[] args) throws Exception {
    String filename = "An eurosign \u20ac";
    System.out.println(filename);
    new FileOutputStream(filename).close();
  }
}

The resulting file contains a euro sign when shown in FileExplorer. The
output of the program also contains a euro sign when shown with
notepad. But the filename/program output does *not* contain a euro when
dir'ed/type'd in my DOS box.
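
One plausible explanation is a code page mismatch between the GUI
programs and the console. The sketch below (Python 3 assumed) supposes
that the bytes are written as Cp1252 and that the DOS box reads them as
cp850; I have not checked which code page my DOS box actually uses, so
treat cp850 as an assumption. The helper name reinterpret is made up
for this example.

```python
def reinterpret(text, written_as, read_as):
    """Encode text in one code page and decode the bytes in another."""
    return text.encode(written_as).decode(read_as)


euro = "\u20ac"

# Same code page on both sides: the euro sign survives.
print(ascii(reinterpret(euro, "cp1252", "cp1252")))  # '\u20ac'

# Cp1252 stores the euro as byte 0x80, which cp850 shows as
# U+00C7 (capital C with cedilla).
print(ascii(reinterpret(euro, "cp1252", "cp850")))   # '\xc7'

# cp850 has no euro sign at all, so it cannot be encoded there.
try:
    euro.encode("cp850")
    has_euro = True
except UnicodeEncodeError:
    has_euro = False
print(has_euro)  # False
```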

regards,
finn