[Tutor] UTF-8 filenames encountered in os.walk

William O'Higgins Witteman hmm at woolgathering.cx
Wed Jul 4 18:00:08 CEST 2007


On Wed, Jul 04, 2007 at 11:28:53AM -0400, Kent Johnson wrote:

>FWIW, I'm pretty sure you are confusing Unicode strings and UTF-8
>strings, they are not the same thing. A Unicode string uses 16 bits to
>represent each character. It is a distinct data type from a 'regular'
>string. Regular Python strings are byte strings with an implicit
>encoding. One possible encoding is UTF-8 which uses one or more bytes to
>represent each character.
>
>Some good reading on Unicode and utf-8:
>http://www.joelonsoftware.com/articles/Unicode.html
>http://effbot.org/zone/unicode-objects.htm

The problem is that the Windows filesystem uses UTF-8 as the encoding
for filenames, but the os module doesn't seem to have a UTF-8 mode, just
an ASCII mode and a Unicode mode.

>If you pass a unicode string (not utf-8) to os.walk(), the resulting 
>lists will also be unicode.
>
>Again, it would be helpful to see the code that is getting the error.

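For reference, the behaviour described in the quote above can be checked
with a short sketch.  It assumes Python 3, where text strings are str and
byte strings are bytes (in the Python 2 of this thread the corresponding
types were unicode and str).  The helper name walk_names and the file
name are made up for illustration.

```python
import os
import tempfile


def walk_names(top):
    """Collect the file names found under top (hypothetical helper)."""
    names = []
    for _dirpath, _dirnames, filenames in os.walk(top):
        names.extend(filenames)
    return sorted(names)


with tempfile.TemporaryDirectory() as tmp:
    # Create one file with a non-ASCII name (illustrative name only).
    open(os.path.join(tmp, "caf\u00e9.txt"), "w").close()

    # A text path in gives text names out.
    text_names = walk_names(tmp)
    # A bytes path in gives bytes names out.
    byte_names = walk_names(os.fsencode(tmp))
```

The type of the names follows the type of the path passed to os.walk(),
so passing a text path avoids having to decode each name by hand.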
The code is quite complex, for reasons not relevant to this problem.  The
gist is that I walk the filesystem and get filenames, some of which get
written to an XML file.  If I leave the output alone, I get errors on
reading the XML file.  If I try to change the output so that it is all
Unicode, I get errors because my UTF-8 data sometimes looks like ASCII,
and I don't see a UTF-8-to-Unicode converter in the docs.
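The conversion asked about here is usually done with the decode method
of a byte string.  A minimal sketch, assuming Python 3 (in Python 2 the
same call was available on str, and returned a unicode object); the
sample file names are invented for illustration:

```python
# UTF-8 bytes for a name containing one non-ASCII character (e-acute).
utf8_bytes = b"caf\xc3\xa9.xml"
text = utf8_bytes.decode("utf-8")

# Bytes that happen to be pure ASCII decode with the same call,
# because ASCII is a subset of UTF-8.
ascii_bytes = b"plain.xml"
same_call = ascii_bytes.decode("utf-8")
```

Because the same decode call handles both cases, names that "look like
ASCII" should not need to be treated separately.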

>>I suspect that my program will have to make sure to recast all
>>equivalent-to-ascii strings as UTF-8 while leaving the ones that are
>>already extended alone.
>
>It is nonsense to talk about 'recasting' an ascii string as UTF-8; an 
>ascii string is *already* UTF-8 because the representation of the 
>characters is identical. OTOH it makes sense to talk about converting an 
>ascii string to a unicode string.

Then what does mystring.encode("UTF-8") do?
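As a hedged illustration of the question: encode goes from text to
bytes, the opposite direction from decode.  The sketch below assumes
Python 3.  In Python 2, calling encode on a byte string first decoded it
implicitly as ASCII, which fails on non-ASCII bytes.  The sample strings
are invented.

```python
# Text containing one non-ASCII character (i with diaeresis, U+00EF).
text = "na\u00efve"

# encode turns text into bytes; here the non-ASCII character
# becomes the two bytes C3 AF.
encoded = text.encode("utf-8")

# decode reverses the step and gives back the original text.
round_trip = encoded.decode("utf-8")

# For pure-ASCII text, the UTF-8 and ASCII encodings are byte-identical.
ascii_matches = "plain".encode("utf-8") == "plain".encode("ascii")
```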
-- 

yours,

William

