Problem with japanese characters in filenames

jay.krell at cornell.edu
Tue Oct 17 00:51:49 EDT 2000


Yes, backslash (byte 0x5C) is the "trailing character" in one or more
"multibyte" sequences.
When scanning "multibyte" strings on Windows you are not supposed to just
add or subtract one, but use functions like CharNextA or CharPrevA, or the
Microsoft C library's similar but different _mbsinc and _mbsdec, which the
generic-text macros _tcsinc and _tcsdec map to in multibyte (_MBCS) builds.
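
To illustrate the point above, here is a minimal Python sketch, not the code Python or Windows actually uses. It assumes the strings are in code page 932 (Shift-JIS as used on Japanese Windows) and that lead bytes fall in the ranges 0x81-0x9F and 0xE0-0xFC. The helper names is_lead_byte and split_path are made up for this example; split_path plays the role that a CharNextA-style loop would play in C.

```python
def is_lead_byte(b):
    """True if b starts a two-byte sequence (assumed cp932 lead-byte ranges)."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def split_path(raw):
    """Split a cp932 byte string at real backslashes only."""
    parts, start, i = [], 0, 0
    while i < len(raw):
        if is_lead_byte(raw[i]):
            i += 2  # step over lead byte and trail byte together
        else:
            if raw[i] == 0x5C:  # a backslash that is a character of its own
                parts.append(raw[start:i])
                start = i + 1
            i += 1
    parts.append(raw[start:])
    return parts


# U+8868 encodes as the bytes 0x95 0x5C in cp932, so its trail byte
# looks exactly like a backslash to byte-at-a-time code.
path = "C:\\表\\file.txt".encode("cp932")

naive = path.split(b"\\")   # breaks the character in half
aware = split_path(path)    # keeps it intact
```

The naive split produces four pieces, one of them a stray 0x95 byte, while the lead-byte-aware scan produces the three path components you would expect.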

Yes, Unicode helps. If all your strings are Unicode and you ignore the fact
that Unicode also has "multibyte" sequences (surrogate pairs, where two
16-bit units encode one character), then you can go back to just
adding or subtracting one. Unicode "multibyte" sequences are apparently a
newer/rarer thing than 8-bit multibyte sequences, so, though it seems wrong,
you can probably get away with this, at least for a few years.
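
A small Python sketch of what this means in practice, assuming the 16-bit representation is UTF-16. The helper utf16_units is a made-up name for this example. It shows that a character outside the 16-bit range takes two units, and that a unit equal to 0x005C is always a real backslash, since both halves of a two-unit sequence lie in 0xD800-0xDFFF.

```python
def utf16_units(text):
    """Return the 16-bit code units of text's UTF-16 encoding."""
    raw = text.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little")
            for i in range(0, len(raw), 2)]


one_unit = utf16_units("表")            # a single 16-bit unit
two_units = utf16_units("\U00010000")   # needs two 16-bit units

# Scanning unit by unit finds only the two genuine separators.
path_units = utf16_units("C:\\表\\file.txt")
backslashes = [i for i, u in enumerate(path_units) if u == 0x5C]
```

So stepping one unit at a time is safe for finding backslashes, even though it can still land in the middle of a two-unit character.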

At least with Unicode there is sort of only one, or at least fewer, "code
pages". (You could consider UTF-7, UTF-8 and Java's modified UTF-8 as Unicode
code pages...and I'm a bit ignorant, but the full set of encoding forms is
probably UTF-7, UTF-8, Java's modified UTF-8, UTF-16 and UTF-32, with UTF-16
being the normal "big/wide/large/etc." representation. Java's modified UTF-8
changes the representation of 0, encoding it as two bytes, how nice of
Sun to follow standards...)
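
For the remark about Java and 0, a rough Python illustration. Python has no built-in codec for Java's variant, so the two-byte form 0xC0 0x80 is produced here by a plain byte replacement. That covers only the zero case and is an assumption-laden stand-in, not a full encoder.

```python
standard = "a\x00b".encode("utf-8")  # zero is the single byte 0x00

# Java-style form: zero becomes the two bytes 0xC0 0x80 (zero case only).
java_style = standard.replace(b"\x00", b"\xc0\x80")

# A strict UTF-8 decoder treats 0xC0 0x80 as an invalid sequence.
try:
    java_style.decode("utf-8")
    strict_decoder_accepts = True
except UnicodeDecodeError:
    strict_decoder_accepts = False
```

The practical effect is that the encoded string never contains a 0x00 byte, at the cost of not being valid standard UTF-8.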

Yes I know "multibyte" isn't the right term when talking about Unicode,
since even a "single byte" Unicode sequence is multiple bytes.

And Unicode "code page" isn't really right either, since the main variable
isn't really what "page" is used to decode the values, but rather what size
the values are.

But see Neil Hodgson's response too: even though Unicode "helps", there
aren't necessarily bugs without it.

 - Jay

-----Original Message-----
From: Jan Wender <ian at leo.science-computing.de>
Newsgroups: comp.lang.python
To: python-list at python.org <python-list at python.org>
Date: Monday, October 16, 2000 12:07 PM
Subject: Problem with japanese characters in filenames


>With Python 1.5.2 I had problems with japanese characters in filenames
>on MS Windows. In the native character encoding (932, I believe) the
>backslash is an allowed second character of a multibyte sequence. This
>broke os.listdir, because it splits directory and file names at
>backslashes.
>Has somebody experience with this with newer Python versions? Does
>unicode help here in any way? As far as I know, the windows
>interfacing part of Python is written with the 8bit functions.
>Thanks for any light on this,
>
>Cheerio,
>--
>J.Wender at science-computing.de - Fon +4970719457-257 Fax-27
>Living With Other People: Civilization is a religion. (Talking Heads)
>--
>http://www.python.org/mailman/listinfo/python-list




