Best ways of managing text encodings in source/regexes?
tvn
tinkerbarbet at gmail.com
Sun Dec 9 12:36:31 EST 2007
Please see the correction from Cliff pasted here after this excerpt.
Tim
> the byte string is ASCII which is a subset of Unicode (ISO-8859-1
> isn't).)
The one comment I'd make is that ASCII and ISO-8859-1 are both subsets
of Unicode (which relates to the abstract code points), but ASCII is
also a subset of UTF-8 on the bytestream level, while ISO-8859-1 is not
a subset of UTF-8, nor, as far as I can tell, of any other Unicode
*encoding*.
Thus a file encoded in ASCII *is* in fact a UTF-8 file. From the bytes
alone there is no way to distinguish the two. But an ISO-8859-1 file
that contains any character outside the ASCII range is not the same (on
the bytestream level) as a file with identical content in UTF-8 or any
other Unicode encoding.
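
To make the bytestream point concrete, here is a small sketch. It is not
from the original thread: it assumes Python 3 (where str holds code
points and bytes holds encoded data), and the sample strings and the
helper name same_bytes are made up for illustration.

```python
# Assumption: Python 3 semantics (str = code points, bytes = encoded data).

def same_bytes(text, enc_a, enc_b):
    """Return True if text encodes to identical bytes under both encodings."""
    return text.encode(enc_a) == text.encode(enc_b)

ascii_text = "plain text"    # every character is below U+0080
latin1_text = "caf\u00e9"    # U+00E9 exists in ISO-8859-1 but not in ASCII

# An ASCII file and a UTF-8 file with the same content are byte-identical.
ascii_is_utf8 = same_bytes(ascii_text, "ascii", "utf-8")

# The same is not true of ISO-8859-1 once a non-ASCII character appears.
latin1_is_utf8 = same_bytes(latin1_text, "iso-8859-1", "utf-8")

latin1_bytes = latin1_text.encode("iso-8859-1")  # one byte for U+00E9
utf8_bytes = latin1_text.encode("utf-8")         # two bytes for U+00E9

# The ISO-8859-1 bytes here are not even well-formed UTF-8.
try:
    latin1_bytes.decode("utf-8")
    latin1_decodes_as_utf8 = True
except UnicodeDecodeError:
    latin1_decodes_as_utf8 = False

# Both byte strings still decode to the same abstract code points,
# provided each is decoded with the encoding that produced it.
same_code_points = (
    latin1_bytes.decode("iso-8859-1") == utf8_bytes.decode("utf-8")
)
```

The last check is the "subset of Unicode" half of the argument: the code
points agree, and only the byte-level representation differs.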