[Python-Dev] Filename as byte string in python 2.6 or 3.0?
Victor Stinner
victor.stinner at haypocalc.com
Sat Sep 27 14:04:25 CEST 2008
Hi,
I read that Python 2.6 is planned for Wednesday. One bug is still open and
important for me: Python 2.6/3.0 are unable to use byte strings as filenames.
http://bugs.python.org/issue3187
The problem
===========
On Windows, all filenames are unicode strings (I guess UTF-16-LE), but on UNIX,
for historical reasons, filenames are byte strings. On Linux, you can expect
valid UTF-8 filenames, but sometimes (e.g. after copying from a FAT32 USB key
to an ext3 filesystem) you get an invalid filename: a byte string in a charset
other than your default filesystem encoding (UTF-8).
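To make "invalid" concrete, here is a minimal sketch in Python 3 syntax (the
variable names are only for illustration). It takes the byte string used as a
test filename later in this mail and shows that a strict UTF-8 decode fails,
while the same bytes are a valid name in a one-byte charset such as ISO-8859-1.

```python
# The byte string used as a test filename later in this mail.
raw_name = b"a\xffb"

try:
    raw_name.decode("utf-8")  # strict error handling is the default
    decoded_ok = True
except UnicodeDecodeError as exc:
    decoded_ok = False
    print("cannot decode:", exc.reason)

# The same bytes decode without error in a one-byte charset.
latin1_name = raw_name.decode("iso-8859-1")
print(repr(latin1_name))
```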
Python functions using filenames
================================
In Python, you have (incomplete list):
- filename producer: os.listdir(), os.walk(), glob.glob()
- filename manipulation: os.path.*()
- access file: open(), os.unlink(), shutil.rmtree()
If you give unicode to a producer, it returns unicode _or_ byte strings (the
type may change for each filename :-/). Guido proposed to break this behaviour:
raise an exception if the unicode conversion fails. We may consider an option
like "skip invalid".
If you give bytes to a producer, it only returns byte strings. Great.
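A small sketch of the "bytes in, bytes out" rule for one producer,
os.listdir(). This is an illustration run against a recent Python 3, not
against the 2.6/3.0 releases discussed here; the temporary directory and the
name example.txt are made up for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create one ordinary file so that the directory is not empty.
    with open(os.path.join(tmp, "example.txt"), "w"):
        pass

    # Pass the directory as bytes: every returned name is bytes too.
    names = os.listdir(os.fsencode(tmp))

print(names)
```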
Filename manipulation: in Python 2.6/3.0, os.path.*() is not compatible with
the type "bytes". So you cannot use os.path.join(<your unicode path>, <bytes
filename>) *nor* os.path.join(<your bytes path>, <bytes filename>) because
os.path.join() (e.g. the posix version) uses path.endswith('/').
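The failure can be reproduced without touching the filesystem. The sketch
below, in Python 3 syntax, first shows that testing a bytes path against a
character separator raises TypeError, then gives a hypothetical helper
(join_same_type is my name for it, it is not the code from the patch) that
picks a separator of the same type as its arguments. It is deliberately
simplified and ignores absolute second components.

```python
def join_same_type(directory, name):
    """Hypothetical helper: join two POSIX path parts of the same type."""
    if type(directory) is not type(name):
        raise TypeError("cannot mix bytes and str path components")
    # Pick a separator of the same type as the arguments.
    sep = b"/" if isinstance(directory, bytes) else "/"
    if not directory or directory.endswith(sep):
        return directory + name
    return directory + sep + name


# Testing a bytes path against a character separator is what breaks.
try:
    b"dir\xffname".endswith("/")
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(mixed_ok)
print(join_same_type(b"dir\xffname", b"file"))
print(join_same_type("dir", "file"))
```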
Access file: open() rejects the type bytes (it's just a type check; open()
supports bytes if you remove the check). As far as I remember, unlink() is
compatible with bytes. But rmtree() fails because it uses os.path.join() (even
if you give a bytes directory, join() fails).
Solutions
=========
- producer: unicode => *only* unicode // bytes => bytes
- manipulation: support both unicode and bytes, but avoid mixing bytes and
characters whenever possible
- open(): allow bytes
I implemented these solutions as a patch set attached to the issue #3187:
* posix_path_bytes.patch: fix posixpath.join() to support bytes
* io_byte_filename.patch: open() allows bytes filename
* fnmatch_bytes.patch: patch fnmatch.filter() to accept bytes filenames
* glob1_bytes.patch: fix glob.glob() to accept invalid directory name
Mmmh, there is no patch yet to make os.listdir() stop on an invalid filename.
Priority
========
I think that the problem is important because it's a regression from 2.5 to
2.6/3.0. Python 2.5 uses bytes filenames, so it was possible to
open/unlink "invalid" unicode strings (since they are not unicode but bytes).
Well, if it's too late for the final versions, this problem should at least be
fixed quickly.
Test the problem
================
Example to create invalid filenames on Linux:
$ mkdir /tmp/test
$ cd /tmp/test
$ touch $(echo -e "a\xffb")
$ mkdir $(echo -e "dir\xffname")
$ touch $(echo -e "dir\xffname/file")
$ find
.
./a?b
./dir?name
./dir?name/file
Python 2.5:
>>> import os, shutil
>>> os.listdir('.')
['a\xffb', 'dir\xffname']
>>> open(os.listdir('.')[0]).close() # open file: ok
>>> os.unlink(os.listdir('.')[0]) # remove file: ok
>>> os.listdir('.')
['dir\xffname']
>>> shutil.rmtree(os.listdir('.')[0]) # remove dir: ok
Wrong solutions
===============
New type
--------
I proposed an ugly type "InvalidFilename" mixing bytes and characters. As
everybody using unicode knows, it's a bad idea :-) (and it introduces a new
type).
Convert bytes to unicode (replace)
----------------------------------
unicode_filename = unicode(bytes_filename, charset, "replace")
OK, you will get valid unicode strings which can be used in os.path.join() &
friends, but open() or unlink() will fail because this filename doesn't
exist!
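A minimal sketch of why the converted name no longer matches the file, written
in Python 3 syntax (bytes.decode() plays the role of unicode() here, and UTF-8
is assumed as the charset). The invalid byte becomes U+FFFD, and encoding the
result again gives different bytes from the ones stored on disk.

```python
raw_name = b"a\xffb"

# Python 3 spelling of unicode(raw_name, "utf-8", "replace")
text_name = raw_name.decode("utf-8", "replace")

# Encoding the result again does not give back the bytes on disk.
round_trip = text_name.encode("utf-8")

print(repr(text_name))
print(round_trip == raw_name)
```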
--
Victor Stinner aka haypo
http://www.haypocalc.com/blog/