[Tutor] UTF-8 filenames encountered in os.walk

Terry Carroll carroll at tjc.com
Thu Jul 5 01:43:51 CEST 2007


On Wed, 4 Jul 2007, Kent Johnson wrote:

> Terry Carroll wrote:
> > Now, superficially, s and e8 are equal, because for plain old ascii 
> > characters (which is all I've used in this example), UTF-8 is equivalent 
> > to ascii.  And they compare the same:
> > 
> >>>> s == e8
> > True
> 
> They are equal in every sense, I don't know why you consider this 
> superficial. And if your original string was not ascii the encode() 
> would fail with a UnicodeDecodeError.

Superficial in the sense that I was using only characters in the ascii
character set, which have the same byte encoding in UTF-8.

so: 

>>> 'abc'.decode("UTF-8")
u'abc'

works

But UTF-8 can hold other characters, too; for example

>>> '\xe4\xba\xba'.decode("UTF-8")
u'\u4eba'

(Chinese character for "person")

I'm just saying that UTF-8 encodes ascii characters to themselves; but 
UTF-8 is not the same as ascii.
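
To make that concrete, here is a small sketch of my own (not from the earlier messages). I've used explicit b'' and u'' literals so it should run under Python 2.6+ as well as Python 3; the variable names are just for illustration.

```python
# Illustrative names only; nothing here comes from the original thread.
ascii_text = u'abc'
person = u'\u4eba'  # Chinese character for "person"

# For pure-ascii text, the ascii and UTF-8 encodings are byte-for-byte identical.
same_bytes = ascii_text.encode('ascii') == ascii_text.encode('utf-8')

# Outside ascii, UTF-8 uses multi-byte sequences...
person_utf8 = person.encode('utf-8')

# ...which the ascii codec has no way to produce.
try:
    person.encode('ascii')
    ascii_can_encode = True
except UnicodeEncodeError:
    ascii_can_encode = False
```

So the two codecs agree on the ascii range and nowhere else.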

I think we're ultimately saying the same thing. To merge both our ways of
putting it: ascii always maps to UTF-8 identically; but UTF-8 maps back to
ascii only if it holds nothing but ascii characters, and otherwise the
attempt raises UnicodeDecodeError.
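
A rough sketch of the "map back" direction, assuming you start with UTF-8 bytes and want to know whether they are also plain ascii. The helper name to_ascii_or_none is hypothetical, made up for this example; it should behave the same under Python 2.6+ and Python 3.

```python
def to_ascii_or_none(utf8_bytes):
    """Return the decoded text if the bytes are valid ascii, else None.

    Hypothetical helper for illustration; not part of any library.
    """
    try:
        return utf8_bytes.decode('ascii')
    except UnicodeDecodeError:
        # Any byte >= 0x80 is outside ascii, so the ascii codec rejects it.
        return None

plain = to_ascii_or_none(b'abc')             # ascii-only bytes decode fine
person = to_ascii_or_none(b'\xe4\xba\xba')   # UTF-8 for u'\u4eba'; not ascii
```

Here plain comes back as the text 'abc', while person is None because the decode raised UnicodeDecodeError.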

I just didn't want to leave the impression "Yeah, UTF-8 & ascii, they're
the same thing."


