[Tutor] UTF-8 filenames encountered in os.walk
Terry Carroll
carroll at tjc.com
Thu Jul 5 01:43:51 CEST 2007
On Wed, 4 Jul 2007, Kent Johnson wrote:
> Terry Carroll wrote:
> > Now, superficially, s and e8 are equal, because for plain old ascii
> > characters (which is all I've used in this example), UTF-8 is equivalent
> > to ascii. And they compare the same:
> >
> >>>> s == e8
> > True
>
> They are equal in every sense, I don't know why you consider this
> superficial. And if your original string was not ascii the encode()
> would fail with a UnicodeDecodeError.
Superficial in the sense that I was using only characters in the ascii
character set, so they have the same byte encoding in UTF-8.
So this works:
>>> 'abc'.decode("UTF-8")
u'abc'
But UTF-8 can hold other characters, too; for example
>>> '\xe4\xba\xba'.decode("UTF-8")
u'\u4eba'
(Chinese character for "person")
I'm just saying that UTF-8 encodes ascii characters to themselves; but
UTF-8 is not the same as ascii.
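Here is a minimal sketch of that point. I've written it in Python 3 syntax,
which is an assumption on my part (the sessions above are Python 2, where
u'...' is the unicode literal); the names ascii_text and person are made up
for illustration:

```python
# Python 3 sketch: text is unicode (str); encoded data is bytes.
ascii_text = "abc"
person = "\u4eba"  # the Chinese character for "person"

# ASCII-only text yields the same bytes under either codec.
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# A non-ASCII character takes several bytes in UTF-8 ...
person_utf8 = person.encode("utf-8")

# ... and has no ASCII encoding at all.
try:
    person.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(same_bytes, person_utf8, ascii_ok)
```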
I think we're ultimately saying the same thing. To merge both our ways of
putting it: ascii always maps to UTF-8 identically, but UTF-8 may or may
not map back to ascii; when it doesn't, decoding it as ascii raises
UnicodeDecodeError.
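A rough way to check which case you're in, again sketched in Python 3 syntax
(my assumption; in Python 2 the b prefix would be dropped). The helper name
decodes_as_ascii is hypothetical, not anything from the standard library:

```python
def decodes_as_ascii(data):
    """Return True if the byte string `data` is valid ASCII."""
    try:
        data.decode("ascii")
        return True
    except UnicodeDecodeError:
        # At least one byte is outside the 0-127 range.
        return False

# UTF-8 bytes that happen to be pure ascii map back fine.
print(decodes_as_ascii(b"abc"))

# The UTF-8 encoding of u'\u4eba' does not.
print(decodes_as_ascii(b"\xe4\xba\xba"))
```

This matters for something like os.walk because a filename that decodes as
ascii is safe to treat either way, while one that doesn't needs to be decoded
as UTF-8 (or whatever encoding the filesystem actually uses).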
I just didn't want to leave the impression "Yeah, UTF-8 & ascii, they're
the same thing."